QwQ Summarization Ability Evaluation: Can a 32B Small Model Really Surpass DeepSeek?

Written by
Clara Bennett
Updated on: July 13th, 2025
Recommendation

Alibaba's QwQ-32B small model challenges DeepSeek; a performance comparison shows how it stacks up.

Core content:
1. Performance comparison between Alibaba's QwQ-32B and DeepSeek-r1
2. QwQ's cold-start-based reinforcement learning strategy
3. Analysis of the content-summarization test results


Recently, Alibaba's Tongyi QwQ-32B was put through a series of benchmark tests, and its capabilities came close to those of DeepSeek-r1: a 32B model performing almost on par with a 671B model.

 

What kind of magic did Alibaba use? Let’s first take a look at how they did it.

 

QwQ carries out large-scale reinforcement learning on top of a cold start. In the initial stage, RL training targets math and programming tasks specifically. Rather than relying on a traditional reward model, feedback for math problems comes from verifying the correctness of the generated final answers, and feedback for code comes from a code-execution server that checks whether the generated code passes the test cases.
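
To make that concrete, here is a minimal sketch of what such rule-based feedback could look like: an exact-match check for math answers and a subprocess-based test runner for code. The function names and scoring details below are my own assumptions for illustration; the actual QwQ training pipeline has not been published.

# Hypothetical rule-based rewards: exact match for math, test-case execution for code.
# These helpers are illustrative assumptions, not part of any released QwQ code.
import subprocess
import tempfile


def math_reward(generated_answer: str, reference_answer: str) -> float:
    """1.0 if the model's final answer matches the verified reference, else 0.0."""
    return 1.0 if generated_answer.strip() == reference_answer.strip() else 0.0


def code_reward(generated_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected stdout) test cases the generated program passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code)
        path = f.name
    passed = 0
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                ["python", path],  # or "python3", depending on the environment
                input=stdin_text, capture_output=True, text=True, timeout=5,
            )
            if result.stdout.strip() == expected_stdout.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # a hanging program earns no credit for this case
    return passed / len(test_cases) if test_cases else 0.0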


The method doesn't sound all that different from the usual approach, so is it really that powerful? Talk is cheap, so let's look at the comparison results directly.
I mainly tested its content-summarization ability here. Since the original text is long I won't post it all, but it contains several key pieces of information: 400,000 people come to Chongqing for employment and entrepreneurship (including 60,000 from outside the city); the employment rate before leaving school is no less than 75%, and the year-end assistance rate for the unemployed exceeds 90%; the city provides about 16,000 jobs in government agencies and public institutions and more than 10,000 jobs in municipal and district/county state-owned enterprises; grassroots service programs such as "Three Supports and One Assistance" and the "Western Plan" expand grassroots employment service positions, providing more than 10,000 jobs; and entrepreneurial support is strengthened with 1.3 billion in entrepreneurial loans.
In comparison, DeepSeek-r1 gives the most comprehensive summary, and it automatically merges the numbers of jobs provided (16,000 + 10,000 + 10,000) into 36,000, which is still correct in meaning.
QwQ's summary is also good, but it misses the total number of jobs offered, which makes it slightly worse than DeepSeek-r1.
As for the DeepSeek distilled 70B and 32B models, even more information is lost, and basically no key figures are retained.
The output results can be seen in the figure below:
Some of the comparison code:
def _stream_query(self, model_name, question):
    """Query a model with streaming output, printing reasoning and answer separately."""
    reasoning_content = ""
    answer_content = ""
    is_answering = False
    completion = self.client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    self.console.print(Panel.fit(
        "[bold blue]Thinking process[/bold blue]",
        border_style="blue",
        padding=(1, 2),
    ))
    for chunk in completion:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # Reasoning tokens arrive in a separate `reasoning_content` field
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            self.console.print(delta.reasoning_content, end='', highlight=False)
            reasoning_content += delta.reasoning_content
        elif delta.content:
            # Print the "Full reply" header once, before the first answer token
            if not is_answering:
                self.console.print(Panel.fit(
                    "[bold green]Full reply[/bold green]",
                    border_style="green",
                    padding=(1, 2),
                ))
                is_answering = True
            self.console.print(delta.content, end='', highlight=False)
            answer_content += delta.content
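
For context, here is a sketch of how this method could be wired up and called for the summarization test. The class name ModelComparator, the DashScope-compatible base_url, and the model identifiers are my own assumptions (the original post only shows the _stream_query method), so adapt them to your environment.

from openai import OpenAI
from rich.console import Console
from rich.panel import Panel


class ModelComparator:
    """Hypothetical wrapper class; the _stream_query method above lives on it."""

    def __init__(self, api_key: str, base_url: str):
        # Any OpenAI-compatible endpoint that streams a separate `reasoning_content`
        # field (e.g. DashScope's compatible mode) should work here.
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.console = Console()

    # _stream_query(self, model_name, question), as shown above, goes here.


if __name__ == "__main__":
    policy_text = "..."  # the full Chongqing employment-policy text goes here
    question = "Please summarize the key figures in the following text:\n" + policy_text
    comparator = ModelComparator(
        api_key="YOUR_API_KEY",
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    )
    for model_name in ("qwq-32b", "deepseek-r1"):  # model IDs depend on your provider
        comparator._stream_query(model_name, question)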
Finally, the conclusion: there is still a certain gap between QwQ and DeepSeek-r1, which is only natural given the huge difference in parameter count.
However, if server resources are limited and you were planning to deploy the so-called 70B distilled version of DeepSeek, QwQ is the better choice: it not only needs fewer resources but also gives better results.