QwQ summarization-ability evaluation: can the 32B small model really surpass DeepSeek?

Alibaba's QwQ-32B small model challenges DeepSeek; a performance comparison tells the story.
Core content:
1. Performance comparison between Alibaba QwQ-32B and DeepSeek-R1
2. QwQ's cold-start-based reinforcement learning strategy
3. Analysis of the content summarization test results
Recently, Alibaba's Tongyi QwQ-32B was put through a series of benchmark tests, and its capabilities came close to those of DeepSeek-R1: a 32B model performing almost on par with a 671B model.
What kind of magic did Alibaba use? Let's first take a look at how they did it.
QwQ performs large-scale reinforcement learning starting from a cold start. In the initial stage, RL training targets math and programming tasks specifically. Rather than relying on a traditional reward model, feedback for math problems comes from verifying the correctness of the generated answers, and feedback for code comes from a code-execution server that checks whether the generated code passes the test cases.
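The two feedback signals described above can be sketched as simple outcome-based reward functions. This is only an illustration of the idea, not Alibaba's actual implementation; the function names and the exact-match answer check are assumptions.

```python
# Sketch of outcome-based reward signals: 1.0 for a verified-correct
# outcome, 0.0 otherwise. Not QwQ's real training code.
import subprocess
import sys


def math_reward(generated_answer: str, reference_answer: str) -> float:
    """Reward 1.0 if the generated final answer matches the reference."""
    return 1.0 if generated_answer.strip() == reference_answer.strip() else 0.0


def code_reward(generated_code: str, test_code: str) -> float:
    """Reward 1.0 if the generated code passes the test cases when executed."""
    program = generated_code + "\n" + test_code
    result = subprocess.run(
        [sys.executable, "-c", program],
        capture_output=True,
        timeout=10,
    )
    # Exit code 0 means every assertion in the test cases passed
    return 1.0 if result.returncode == 0 else 0.0


print(math_reward("42", " 42 "))  # 1.0
```

Because both signals are verifiable (an answer either matches or it doesn't; tests either pass or they don't), they avoid the reward-hacking risks of a learned reward model.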
The following Python method from the test script streams a model's reply, printing the reasoning trace and the final answer in separate panels:

def _stream_query(self, model_name, question):
    """Stream a query to the model, separating reasoning from the answer."""
    reasoning_content = ""
    answer_content = ""
    is_answering = False
    completion = self.client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    self.console.print(Panel.fit(
        "[bold blue]Thinking process[/bold blue]",
        border_style="blue",
        padding=(1, 2),
    ))
    for chunk in completion:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # Reasoning models expose the thinking trace via `reasoning_content`
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            self.console.print(delta.reasoning_content, end='', highlight=False)
            reasoning_content += delta.reasoning_content
        else:
            # On the first answer token, print the reply header once
            if delta.content != "" and is_answering is False:
                self.console.print(Panel.fit(
                    "[bold green]Full reply[/bold green]",
                    border_style="green",
                    padding=(1, 2),
                ))
                is_answering = True
            self.console.print(delta.content, end='', highlight=False)
            answer_content += delta.content
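To see how that loop separates the reasoning trace from the final answer without making a real API call, here is the same accumulation logic run against fake stream chunks. The `Fake*` classes are stand-ins I introduce for the SDK's chunk types; they are not part of any library.

```python
# Offline sketch of the delta-accumulation logic from the streaming method.
# The Fake* dataclasses mimic the shape of streamed chat-completion chunks.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeDelta:
    content: str = ""
    reasoning_content: Optional[str] = None


@dataclass
class FakeChoice:
    delta: FakeDelta


@dataclass
class FakeChunk:
    choices: List[FakeChoice]


def collect_stream(chunks):
    """Accumulate reasoning text and answer text from streamed deltas."""
    reasoning, answer = "", ""
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # Deltas carrying `reasoning_content` belong to the thinking trace
        if delta.reasoning_content is not None:
            reasoning += delta.reasoning_content
        elif delta.content:
            answer += delta.content
    return reasoning, answer


stream = [
    FakeChunk([FakeChoice(FakeDelta(reasoning_content="Let me think..."))]),
    FakeChunk([FakeChoice(FakeDelta(content="The answer is 42."))]),
]
print(collect_stream(stream))  # ('Let me think...', 'The answer is 42.')
```

The key design point is that reasoning tokens and answer tokens arrive interleaved in one stream, so the client must route each delta by which field it carries.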
Finally, the conclusion: there is still a gap between QwQ and DeepSeek-R1, which is natural given how different the parameter counts are.
However, if server resources are limited and you are considering deploying the 70B distilled version of DeepSeek-R1, QwQ is the better choice: it requires fewer resources and delivers better results.