A team of researchers from Sina Weibo, the Chinese social media giant, has published a technical report that is stirring up the AI community. The model, named VibeThinker-3B, has only 3 billion parameters, yet according to the report it matches or exceeds much larger systems from Google DeepMind, OpenAI, and Anthropic on several reasoning benchmarks. The news has shaken the industry because it challenges the assumption, drawn from scaling laws, that bigger models are always better.
The model scored 94.3 on the AIME 2026, the American Invitational Mathematics Examination, one of the hardest standardized math competitions in the world. That result places it alongside DeepSeek V3.2, which has 671 billion parameters, and ahead of Google's Gemini 3 Pro at 91.7. Using a technique called Claim-Level Reliability Assessment, VibeThinker-3B reaches 97.1, surpassing virtually every public system. But criticism has followed: many experts believe that benchmarks are now easily gameable and that the model does not perform as well in real-world settings.
The paper, posted on arXiv, quickly gathered 62 upvotes on Hugging Face, 130 likes on the model repository, and 685 stars on GitHub. However, social media erupted in debate. User @orcus108 wrote on X: 'WHAT THE HELL is happening in AI? A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely don't know if this is a breakthrough or if the benchmarks are broken.'
The Weibo researchers introduce what they call the Parametric Compression-Coverage Hypothesis: verifiable reasoning capabilities, such as those tested by math and coding problems, can be compacted into a dense parameter core, while open-domain knowledge requires many more parameters. If the hypothesis holds, it would explain why VibeThinker-3B excels on reasoning tests but scores only 70.2 on GPQA-Diamond, a graduate-level science knowledge benchmark, far behind Gemini 3 Pro's 91.9.
The four-stage training pipeline
VibeThinker-3B is not built from scratch; it is post-trained on Alibaba's Qwen2.5-Coder-3B following what the team calls the Spectrum-to-Signal Principle. Training unfolds in four phases. The first is two-stage supervised fine-tuning with curriculum learning: the model trains on a broad mix of data, then focuses on harder problems. The second uses reinforcement learning with the MGPO algorithm, which prioritizes problems at the model's current capability boundary. The third extracts high-quality reasoning trajectories and distills them back into the model via supervised fine-tuning. The final phase, which the team calls Instruct RL, applies reinforcement learning to instruction-following tasks, using rule-based validators and reward models to score outputs.
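To make "the model's current capability boundary" concrete, here is a minimal Python sketch of one way such prioritization could work. It is not the MGPO algorithm, whose details the article does not give. It assumes each problem has an empirical pass rate from sampled attempts and scores it by binary entropy, which peaks when the model succeeds about half the time. The function names and the example pass rates are hypothetical.

```python
import math


def boundary_weight(pass_rate: float) -> float:
    """Binary entropy (in bits) of an empirical pass rate.

    Illustrative assumption, not the paper's formula: the score is 1.0 when
    the model solves a problem half the time and 0.0 when it always or
    never solves it.
    """
    if pass_rate <= 0.0 or pass_rate >= 1.0:
        return 0.0
    fail_rate = 1.0 - pass_rate
    return -(pass_rate * math.log2(pass_rate) + fail_rate * math.log2(fail_rate))


def prioritize(pass_rates: dict[str, float]) -> list[tuple[str, float]]:
    """Rank problems so that those nearest the capability boundary come first."""
    scored = [(name, boundary_weight(rate)) for name, rate in pass_rates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pass rates measured over several sampled attempts per problem.
example = {"easy": 1.0, "boundary": 0.5, "hard": 0.125, "unsolved": 0.0}
ranking = prioritize(example)

for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Under this weighting, problems the model always or never solves get a score of zero, so training effort goes to problems where the outcome is still uncertain.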
Francesco Bertolotti, an AI researcher, commented on X: 'These results were achieved primarily through post-training refinements on Qwen2.5-Coder. The paper doesn't provide many details, but it appears they distill from RL checkpoints and then do a final RL-based instruct RL.' His post garnered over 161,000 views.
Real-world testing reveals the gap
Despite the impressive numbers, many users who downloaded the model reported disappointing experiences. User @politilols wrote: 'It doesn't even know what a uv script is (the most popular Python dev tool). Haven't seen that in a single LLM in at least a year now. Benchmaxxed.' Others criticized the choice of benchmarks, asking why tests like DeepSWE were not used. The researchers say they decontaminated their training data, but the community remains skeptical.
The paper itself acknowledges that the model does not replace large generalists but shows that high performance on verifiable reasoning tasks is achievable with few parameters. This could have huge implications for AI accessibility: a 3-billion-parameter model can run on a consumer laptop, drastically reducing costs.
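The laptop claim can be checked with back-of-the-envelope arithmetic. The Python sketch below estimates the memory needed for the weights alone at several numeric precisions. The precisions are illustrative assumptions, not formats the article says the model ships in, and the figures exclude the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 3e9  # nominal 3 billion parameters, as stated in the article

# Precisions chosen for illustration; real footprints depend on the runtime.
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

At 16-bit precision the weights take about 6 GB, and at 4-bit quantization about 1.5 GB. Both fit within the RAM of a typical consumer laptop.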
The debate sparked by VibeThinker-3B is fundamental: the AI industry has spent billions scaling parameters, but perhaps a part of intelligence could have been compressed all along. The question now is whether these results are reproducible and useful in the real world. As discussed in our article on Silicon Valley strategies, innovation sometimes comes from unexpected places. For a broader understanding of benchmarks, check Wikipedia.