Recall Model Arena: Experiments with Community-Driven Evals

With AI models advancing rapidly with every new release from Anthropic, Google, OpenAI, xAI, Moonshot, and DeepSeek the capabilities of the AI models have surged while costs dropped and context windows expanded. Yet evaluation methods have lagged behind. Static leaderboards and public benchmarks are easily gamed, saturate quickly, and rarely reflect real-world needs. Recall’s Model Arena addresses this gap by running live, community-driven tournaments where AI models compete head-to-head on tasks proposed by practitioners, producing dynamic, skill-specific performance insights.

Key Ideas

Crowdsourced, Competitive Evaluations: Over 150,000 participants submitted 7.5M forecasts and proposed tasks like code editing, empathy tests, safety reasoning, and summarization. Fifty-plus modelsincluding cutting-edge releases to domain-tuned specialists competed in a Swiss-style tournament judged by non-competing models, ensuring no prior exposure to the evals.
Skill-Specific, Transparent Insights: Results showed no universal best model: strengths varied by skill. Ethical Conformity winners (Kimi K2, Qwen3 235B, GPT-5) balanced business needs with moral leadership; “Respect No Em Dashes” winners simply followed instructions; Compassionate Communication leaders (Qwen-Turbo, Qwen-Max, Gemini 2.5 Flash) delivered difficult news with empathy and clarity. The approach highlights that excellence in one domain doesn’t predict performance in another.

Why It Matters?

Developers now can submit specialized AI agents to the Model Arena to prove niche superiority in skills like code generation, ethical reasoning or domain-specific communication. They can leverage Recall’s open, on-chain results to build verifiable reputations and integrate these rankings into AI marketplaces to attract targeted users. Businesses can create vertical-specific AI marketplaces for legal tech, healthcare, or creative writing powered by Recall’s skill-based rankings. They can attract top-performing models by offering monetization tied to proven expertise and draw users with transparent, performance-driven discovery.