DeepSeek's reasoning AI demonstrates the power of small models and efficient training

Written by
Audrey Miles
Updated on: July 17, 2025
Recommendation

The rise of efficient reasoning AI: how DeepSeek-R1 stands out in the global AI competition.

Key content:
1. Comparison of DeepSeek-R1 and OpenAI o1 in AI benchmarks
2. Training costs are about 96% lower, and the impact on chip manufacturers and financial markets
3. Chinese AI companies narrow the gap with US competitors through open source

Yang Fangxian
Founder of 53AI / Most Valuable Expert of Tencent Cloud (TVP)




In the view of IBM Fellow Kush Varshney, geopolitical differences in the global AI competition may not be as important as people think. He said: "Once the model is open source, where it comes from is no longer important in many ways."

(Beijing, February 10, 2025) DeepSeek-R1 is an artificial intelligence model launched by Chinese startup DeepSeek. Shortly after it was released on the open source AI platform Hugging Face, it jumped to the top of the platform's list of most downloaded and most active models within a few hours. It also sent shockwaves through financial markets, prompting investors to reconsider the valuations of chip manufacturers such as NVIDIA and the huge sums AI giants have invested in scaling up their AI businesses.


Why the fuss? DeepSeek-R1 is a so-called "reasoning model," a digital assistant that performs on par with OpenAI's o1 on certain AI benchmarks for math and coding tasks, yet the company says it was trained using far fewer chips at a cost roughly 96% lower.


"DeepSeek is undoubtedly reshaping the AI ​​landscape, challenging the giants with its open source ambitions and state-of-the-art innovations," said Kaoutar El Maghraoui, a principal research scientist and manager at IBM AI Hardware.


Meanwhile, TikTok's parent company, Chinese tech giant ByteDance, recently released its own reasoning agent, UI-TARS, which it claims outperforms OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini on certain benchmarks. ByteDance's agent can read graphical interfaces, perform reasoning, and take autonomous, step-by-step actions.


From startups to established giants, Chinese AI companies appear to be closing the gap with their American rivals, thanks in large part to their willingness to open source, or share, underlying software code with other businesses and software developers. "DeepSeek has been able to promote some pretty powerful models across the community," said Abraham Daniels, senior technical product manager for IBM's Granite models. "DeepSeek really has the potential to accelerate the democratization of AI." DeepSeek-R1 is available on Hugging Face under the MIT license, which permits unrestricted commercial use.


Last summer, Chinese company Kuaishou released a video generation tool similar to OpenAI's Sora, but one that was available to the public. Sora was unveiled last February but wasn't officially released until December, and even then only users with a ChatGPT Pro subscription could access its full functionality. Developers on Hugging Face have also snapped up new open source models from Chinese tech giants Tencent and Alibaba. While Meta has open sourced its Llama models, both OpenAI and Google have taken a largely closed-source approach to model development.


In addition to the benefits of open source, DeepSeek's engineers used only a fraction of the highly specialized NVIDIA chips their U.S. competitors rely on to train their systems. In the research paper accompanying the DeepSeek-V3 model, for example, the engineers reported needing only about 2,000 GPUs (graphics processing units) to train the model.




Reasoning Models



"What's really impressive is the reasoning capabilities of the DeepSeek model," said IBM Fellow Kush Varshney. Reasoning models essentially verify or check themselves, representing a kind of "metacognition," or "thinking about thinking." "We're starting to build intelligence into these models, and that's a huge step forward," Varshney said.


Reasoning models became a hot topic last September when OpenAI previewed its o1 reasoning model. Unlike earlier AI models, which give answers without explaining their reasoning, o1 solves complex problems by breaking them into multiple steps. Reasoning models may take a few extra seconds or minutes to answer a question because they work through their analysis step by step, in a "chain of thought."
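
To make the "chain of thought" idea concrete, here is a minimal sketch, in Python, of the difference between a direct prompt and a step-by-step prompt. The prompts and the worked arithmetic in the comments are illustrative assumptions for this article, not text taken from DeepSeek's or OpenAI's documentation.

```python
# Illustrative only: contrast a direct-answer prompt with a chain-of-thought
# prompt. No model API is called here; the point is the prompt structure.

direct_prompt = "What is 17 * 24? Answer with a number only."

cot_prompt = (
    "What is 17 * 24?\n"
    "Reason step by step, then give the final answer on its own line."
)

# A reasoning model prompted the second way tends to show intermediate steps,
# for example:
#   17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408
# before stating the final answer. Those visible steps are the "chain of
# thought" described above, and they are what makes self-checking possible.
print(direct_prompt)
print(cot_prompt)
```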




Reinforcement Learning



DeepSeek-R1 combines chain-of-thought reasoning with reinforcement learning, in which an autonomous agent learns to perform a task through trial and error, without any instructions from a human user. Reinforcement learning is distinct from the more commonly used forms of learning: supervised learning, which uses manually labeled data to make predictions or classifications, and unsupervised learning, which aims to discover and learn hidden patterns in unlabeled data.


DeepSeek-R1 challenges the assumption that a model's reasoning abilities can only be improved by training it on labeled examples of correct or incorrect behavior, or by extracting information from hidden patterns. "The core hypothesis is simple, but not so simple: can we teach the model to answer correctly using only reward signals, and let it figure out the best way to think on its own?" said Yihua Zhang, a doctoral student at Michigan State University who has written dozens of papers on machine learning.
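
To see the "reward signals only" idea at toy scale, here is a minimal Python sketch in which a three-option "policy" is updated from nothing but a correct/incorrect reward, in the style of a simple REINFORCE update. It is a deliberately tiny illustration of the principle, not DeepSeek-R1's actual training procedure, and every number in it is an arbitrary assumption.

```python
import numpy as np

# Reward-only learning on a toy problem: the policy never sees a labeled
# target, only a scalar reward of 1 for a correct answer and 0 otherwise.

rng = np.random.default_rng(0)
logits = np.zeros(3)        # the policy's preference for 3 candidate answers
correct_answer = 2          # known to the "environment", never shown as a label
learning_rate = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(500):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)               # sample an answer
    reward = 1.0 if action == correct_answer else 0.0

    # REINFORCE-style update: nudge probability mass toward actions that
    # earned reward; unrewarded actions leave the policy unchanged here.
    grad = -probs
    grad[action] += 1.0
    logits += learning_rate * reward * grad

print(np.round(softmax(logits), 3))  # mass concentrates on the rewarded answer
```

The same principle, scaled up with verifiable rewards for math and code, is what the quoted hypothesis describes: the model is rewarded for correct final answers and left to discover for itself how to reason toward them.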


For him and other experts accustomed to traditional supervised fine-tuning, Zhang said, "it is really amazing to see that large language models like DeepSeek can learn to 'think better' just by reinforcement learning rewards," especially watching "a real 'aha moment' in the model, where it can take a step back, find its mistake and correct itself."




Cost Calculation



Part of the buzz around DeepSeek stems from its low price. According to a technical report released by the company, DeepSeek-V3, released on Christmas Day, cost about $5.5 million to train, and it is also far cheaper for developers who want to try it. "What they've done on the cost of the model, and the time it took them to train the model, is really impressive," said Chris Hay, a distinguished engineer at IBM.
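
For context on how a headline training cost like this is typically derived, here is a rough back-of-envelope calculation: total accelerator hours multiplied by an assumed hourly rental price. The GPU-hour count and hourly rate below are illustrative assumptions chosen to land near the reported order of magnitude; they are not figures taken from this article.

```python
# Back-of-envelope: headline training cost ~= GPU-hours x assumed rental price.
# Both inputs below are illustrative assumptions, not reported numbers.

gpu_hours = 2.8e6            # assumed total accelerator hours for pre-training
price_per_gpu_hour = 2.0     # assumed USD rental price per GPU-hour

training_cost = gpu_hours * price_per_gpu_hour
print(f"${training_cost / 1e6:.1f}M")   # ~$5.6M, the order of magnitude cited above
```

As the next paragraph notes, a figure computed this way covers only the pre-training compute itself, not the experiments around it.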


However, Kate Soule, director of product management for Granite technologies at IBM Research, said the low price tag may not be the whole story. The $5.5 million figure "represents only a small fraction of the compute required," she said. Nor does it include costs that companies tend to keep proprietary even when they open source their models, such as "the computational costs of reinforcement learning, data reduction, and hyperparameter search."


There is no doubt that DeepSeek achieved greater cost-effectiveness by using a mixture of experts (MoE) architecture, which significantly reduces the resources required for training. The MoE architecture divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data. Instead of activating the entire neural network, the model activates only the specific experts a given task requires. As a result, the MoE architecture significantly reduces computational cost during pre-training and delivers faster performance at inference time. Over the past year, companies around the world, including Mistral, a leading French AI company, and IBM, have promoted the MoE architecture and achieved greater efficiency by combining MoE with open source. (For example, IBM and Red Hat announced InstructLab at the Think 2024 conference, a large model alignment method that drives open source innovation for large models.)
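
A minimal sketch of the routing idea behind mixture of experts, assuming a toy PyTorch layer (illustrative only, not DeepSeek's or IBM Granite's actual architecture): a small router scores the experts for each token, and only the top-k experts are actually run for that token, so far fewer parameters are active per token than the layer contains in total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: route each token to its top-k experts."""

    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)     # scores experts per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, dim)
        scores = self.router(x)                        # (num_tokens, num_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e            # tokens routed to expert e
                if mask.any():                         # only those tokens pay for expert e
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoE()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)   # torch.Size([10, 64])
```

With 8 experts and top-2 routing, each token touches roughly a quarter of the expert parameters, which is the source of the pre-training and inference savings described above.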


In the case of IBM's open source Granite series of models (developed using the MoE architecture), enterprises can achieve leading-edge model performance at very low cost, because they can adapt large pre-trained models to specific applications or use cases, effectively creating smaller, fit-for-purpose models. Packing powerful capabilities into smaller, dense models means those models can run on smartphones and other devices at the edge, such as automotive computers or smart sensors on a factory floor.


This process of distilling larger models into smaller, less resource-intensive ones has also contributed to DeepSeek's success. Alongside its flagship R1 model, the Chinese startup released a series of smaller models suited to different purposes. Interestingly, DeepSeek reported that distilling the large model into smaller ones yields better reasoning than applying reinforcement learning to those smaller models directly.
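
The following is a minimal sketch of the generic distillation recipe that passage alludes to, assuming the standard "soft label" formulation: a small student model is trained to match the temperature-softened output distribution of a larger teacher. It is a textbook illustration under those assumptions, not DeepSeek's published distillation setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage: a batch of 4 positions over a vocabulary of 10 tokens.
student_logits = torch.randn(4, 10, requires_grad=True)   # from the small model
teacher_logits = torch.randn(4, 10)                       # from the large model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                            # train only the student
print(float(loss))
```

In practice a student is often also trained on reasoning traces generated by the teacher rather than on logits alone, but the core idea is the same: a cheap model learns to imitate an expensive one.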




A global AI reshuffle?



As these new models match or surpass older-generation competitors on certain benchmarks, how will they affect the global AI landscape? "The global AI landscape is not just about raw performance on benchmarks, but whether these models can be integrated end-to-end in a safe and ethical way," El Maghraoui said. For that reason, El Maghraoui added, it is too early to tell whether DeepSeek-R1 and similar products will "transform human interactions, technology, and enterprise applications."


Ultimately, “developer adoption will determine how popular the DeepSeek model is,” Daniels said, adding that he looks forward to “seeing what kinds of use cases they come up with for the model.”


In the view of IBM Fellow Kush Varshney, geopolitical differences in the global AI race may not be as important as people think. He said: "Once the model is open source, where it comes from is no longer important in many ways."