DeepSeek R1: Optimization as a Necessity

Pratima Rathore
6 min read · Jan 29, 2025


In the ever-evolving world of artificial intelligence, where cutting-edge hardware and massive computational resources often dominate the conversation, one Chinese company has managed to turn the tables.

With its groundbreaking DeepSeek-R1 model, the company has become a poster child for efficiency, creativity, and resilience in the face of adversity. But how did they do it? Let’s dive into the juicy details, technical breakthroughs, and a sprinkle of industry gossip that make DeepSeek’s story so compelling.

“DeepSeek didn’t just find a way to survive — they found a way to thrive.”

The Secret Sauce: DeepSeek-R1’s Technical Breakthroughs

So, what makes DeepSeek-R1 so special? How did they manage to achieve impressive performance without breaking the bank on high-end GPUs? The answer lies in a combination of clever techniques, relentless optimization, and a willingness to think outside the box.

Let’s Deep-Seek and Dive for the Details!

1. Multi-Stage Training: Building a Smarter Model Step by Step

DeepSeek-R1’s training process is a masterclass in efficiency. Instead of throwing everything at the model at once, they broke the training into four distinct stages:

  • Stage 1: Cold Start ❄️⚡

Starting from the pre-trained DeepSeek-V3-Base model, the model undergoes supervised fine-tuning on a compact dataset of results collected from DeepSeek-R1-Zero, carefully validated for high quality and readability. The dataset is relatively small, containing only a few thousand samples. This fine-tuning phase on curated data helps DeepSeek-R1 tackle the readability issues observed in the initial model.
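As an illustration, here is a minimal sketch of what this cold-start fine-tuning step could look like with Hugging Face tooling. The dataset file, field names, and hyperparameters are assumptions for the example, not DeepSeek’s actual setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "deepseek-ai/DeepSeek-V3-Base"  # the base checkpoint named above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# A few thousand validated, readable reasoning samples (hypothetical file and fields).
dataset = load_dataset("json", data_files="cold_start.jsonl")["train"]

def tokenize(example):
    # Concatenate prompt, validated reasoning trace, and answer into one training string.
    text = example["prompt"] + example["reasoning"] + example["answer"]
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-cold-start", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=tokenized,
    # Causal-LM collator: labels are the input ids, no masked-language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```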

  • Stage 2: Reasoning-Oriented Reinforcement Learning (RL) 🏆

Here’s where things get interesting. DeepSeek-R1 uses reinforcement learning to enhance its reasoning capabilities. To run reinforcement learning at large scale and keep training costs down, an in-house, rule-based reinforcement learning method is employed instead of the standard reinforcement learning with human or AI feedback. This approach is not only more scalable but also allows the model to develop a deeper understanding of complex tasks.

Here’s a summary of popular RL techniques, each with a real-life analogy:

• PPO (Proximal Policy Optimization): Like teaching a student basketball through gradual improvements, PPO encourages small, manageable policy changes at each step.

• DPO (Direct Preference Optimization): DPO is like showing a student two ways to shoot and asking which one they prefer, learning directly from pairwise preferences.

• GRPO (Group Relative Policy Optimization): In GRPO, a class plays basketball together, learning by watching and comparing against the best players. The model samples a group of answers to the same prompt and learns by comparison within that group, aiming to outperform the group average, without a separate critic model assigning value estimates.
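To make the GRPO intuition concrete, here is a minimal sketch (not DeepSeek’s code) of the group-relative advantage: each response sampled for the same prompt is scored, and its advantage is simply its reward relative to the group mean, normalized by the group’s standard deviation, so no learned value network is needed.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled response is scored against its own group.

    rewards: 1-D array of scalar rewards for G responses to the same prompt.
    Returns one advantage per response; no learned value/critic model is involved.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()        # the group itself acts as the baseline
    scale = rewards.std() + 1e-8     # avoid division by zero for identical rewards
    return (rewards - baseline) / scale

# Example: 4 responses to one prompt, scored by a rule-based reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers get positive advantage
```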

This rule-based mechanism, which does not use a neural model to generate rewards, not only simplifies the training process but also reduces its cost, making it feasible at large scale. Moreover, the researchers found that neural reward models can suffer from reward hacking, where the model discovers a loophole or unintended way to maximize the reward that does not align with the desired goal.
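As an illustration of what such a rule-based reward can look like, here is a hypothetical sketch for a math task (the exact rules and tags are assumptions, not DeepSeek’s implementation): one reward for the correct final answer, a smaller one for following the expected output format.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward for a math task: no neural reward model involved.

    +1.0 if the final boxed answer matches the known ground truth,
    +0.2 if the reasoning is wrapped in the expected <think>...</think> format.
    """
    reward = 0.0

    # Accuracy rule: extract the last \boxed{...} expression and compare it to the answer key.
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    if boxed and boxed[-1].strip() == ground_truth.strip():
        reward += 1.0

    # Format rule: reasoning must appear inside think tags.
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.2

    return reward

print(rule_based_reward("<think>2 + 2 = 4</think> The answer is \\boxed{4}", "4"))  # 1.2
```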

  • Stage 3: Rejection Sampling and Fine-Tuning ❌🎯

In this phase, the model checkpoint from stage 2 generates multiple samples, and only correct and readable ones are retained through rejection sampling. DeepSeek-V3 is used as a generative reward model to help decide which samples to keep. Some of DeepSeek-V3’s training data is also included. The model is then fine-tuned on this enriched dataset, expanding its abilities beyond reasoning tasks to include writing, role-playing, and other general-purpose tasks across domains.
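Conceptually, the rejection-sampling step can be sketched like this (illustrative code; generate, is_correct, and is_readable are hypothetical stand-ins for the model call and the filtering checks described above):

```python
def rejection_sample(prompts, generate, is_correct, is_readable, n_samples=16):
    """For each prompt, sample several candidate responses and keep only those that
    pass the correctness and readability filters; the survivors become supervised
    fine-tuning data for the next round of training."""
    sft_data = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        kept = [c for c in candidates if is_correct(prompt, c) and is_readable(c)]
        sft_data.extend((prompt, c) for c in kept)
    return sft_data
```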

  • Stage 4: Diverse Reinforcement Learning Phase 🔄🌍

In this final phase, a variety of tasks are tackled. Rule-based rewards are applied to tasks where applicable, such as math. For other tasks, a large language model (LLM) provides feedback to help align the model with human preferences.
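A tiny sketch of how rewards can be routed in this phase (hypothetical helpers, not DeepSeek’s code): verifiable tasks such as math keep a deterministic rule-based reward, while open-ended tasks are scored by an LLM judge so the model stays aligned with human preferences.

```python
def final_stage_reward(task, prompt, response, rule_reward=None, llm_judge=None):
    """Illustrative reward routing for the diverse RL phase.

    rule_reward: callable scoring a response against a checkable rule (e.g. a math answer key).
    llm_judge:   callable where a large language model scores helpfulness and preference alignment.
    """
    if task in {"math", "code"} and rule_reward is not None:
        return rule_reward(response)        # deterministic, verifiable reward
    return llm_judge(prompt, response)      # model-based feedback for open-ended tasks
```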

2. Mixture of Experts (MoE): Doing More with Less 👥🔧

One of the most talked-about aspects of DeepSeek-R1 is its use of the Mixture of Experts (MoE) architecture. This technique activates only a subset of specialized “expert” sub-models for each input, rather than using the entire model for every task. By engaging only the most relevant experts, DeepSeek significantly reduces computational load without sacrificing performance. For example, the V3 model has 671 billion parameters in total but activates only about 37 billion per token, sharply reducing computational cost.
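A toy sketch of the top-k routing idea behind MoE (illustrative only; DeepSeek-V3’s actual router is more elaborate, with shared experts and load-balancing mechanisms):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through only the top-k experts.

    x:        (d,) token representation
    router_w: (d, n_experts) router weights
    experts:  list of callables, each mapping a (d,) vector to a (d,) vector
    Only k of the experts are evaluated, which is how a very large model can
    activate only a fraction of its parameters for each token.
    """
    logits = x @ router_w                    # one routing score per expert
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                       # softmax over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Tiny usage example with random "experts".
rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.standard_normal((d, d)): v @ W for _ in range(n_experts)]
out = moe_forward(rng.standard_normal(d), rng.standard_normal((d, n_experts)), experts)
print(out.shape)  # (8,)
```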

3. Adaptive Strategies: Thriving Under Constraints

DeepSeek’s success is remarkable for turning hardware limitations into an advantage. Working with less powerful GPUs, they developed efficient algorithms and training processes that allowed DeepSeek-R1 to perform well on modest hardware. The final training run reportedly cost only about $5.6M, thanks to innovations like Multi-Token Prediction (MTP), Multi-Head Latent Attention (MLA), and extensive hardware optimization.
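Multi-Token Prediction, loosely sketched: instead of learning only from the next token, extra heads are also trained to predict the tokens one and two steps further ahead, so each position yields more learning signal per pass. This is a simplified illustration; the actual MTP module described for DeepSeek-V3 is more involved than plain linear heads.

```python
import torch
import torch.nn.functional as F

def mtp_loss(hidden, heads, tokens):
    """hidden: (seq, d) hidden states; heads: list of projection layers to the
    vocabulary, where head k predicts the token k+1 positions ahead; tokens:
    (seq,) token ids. Returns the average cross-entropy across prediction depths."""
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:-k])   # predictions from positions that have a token k steps ahead
        targets = tokens[k:]         # the tokens k steps ahead of those positions
        loss = loss + F.cross_entropy(logits, targets)
    return loss / len(heads)

# Toy usage: 16-token sequence, hidden size 32, vocab 100, predict up to 2 steps ahead.
seq, d, vocab = 16, 32, 100
hidden = torch.randn(seq, d)
tokens = torch.randint(0, vocab, (seq,))
heads = [torch.nn.Linear(d, vocab) for _ in range(2)]
print(mtp_loss(hidden, heads, tokens))
```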

4. Leveraging Open-Source Tools: Standing on the Shoulders of Giants

DeepSeek didn’t start from scratch. By building on existing open-source frameworks and tools, they were able to reduce development time and costs. This approach allowed them to benefit from community-driven advancements while focusing their efforts on optimization and innovation. It’s a classic example of why the open-source movement is so powerful — and why it continues to drive progress in AI.

Gossip Section🤫💬

DeepSeek’s recent model reportedly cost just $5.6 million for its final training run, a 95% reduction compared to OpenAI’s GPT-4, which cost over $100 million. DeepSeek seems to have just upended our idea of how much AI costs, with potentially enormous implications across the industry.

Even if DeepSeek’s GPU claims are questioned, Hugging Face’s Leandro von Werra believes the open-source community will soon verify the truth, as his team is already working to replicate and open-source the R1 model. Researchers will quickly determine if the numbers add up.

Nvidia, boosted by the AI boom, saw its stock skyrocket in 2023, driven by the theory that AI companies would always need its chips. However, DeepSeek’s success challenges this, as more efficient chip use could reduce demand for Nvidia’s high-end products.

The Future Implications

The success of DeepSeek’s approach raises interesting questions about the future of AI development. While companies with massive computational resources continue to push the boundaries of scale, DeepSeek’s story suggests there might be smarter ways to build powerful AI systems. Their success demonstrates that innovation often flourishes under constraints, and sometimes the path to breakthrough isn’t through more power, but through smarter application of existing resources.

This story continues to evolve as DeepSeek faces the challenge of scaling their services to more users. But one thing is clear: they’ve shown that in the world of AI, clever engineering and innovative thinking can sometimes outmaneuver raw computational power.

And with that, the future of AI just got a whole lot more interesting. Happy exploring the open-source possibilities!

If you liked the article, show your support by clapping for it. Follow me and let’s unravel the mysteries and unleash the potential of AI. Feel free to connect with me on LinkedIn.

References:

https://aipapersacademy.com/deepseek-r1/

https://www.wsj.com/articles/how-deepseeks-ai-stacks-up-against-openais-model-e938c3d6

https://www.theverge.com/ai-artificial-intelligence/598846/deepseek-big-tech-ai-industry-nvidia-impact

https://www.youtube.com/watch?v=XMnxKGVnEUc
