A total of 158,175 participants took part in Recall Predict, a large-scale forecasting game designed to test human intuition about GPT-5’s performance before its release. Players made 7.8 million predictions comparing GPT-5 to 50 other top AI models across eight skill domains. After GPT-5 launched, its actual performance was measured in Recall’s Model Arena, revealing how closely human expectations aligned with reality and uncovering biases in how people perceive AI capabilities.
Prediction Accuracy: Humans correctly predicted GPT-5’s performance in 65.9% of matchups, with 35.3% of participants achieving perfect accuracy in at least one skill domain.
Expectation vs. Reality: GPT-5 was expected to win 72.4% of matchups but won only 65.8%, revealing a 6.6-percentage-point optimism gap.
Skill-Specific Biases: GPT-5 was expected to be highly deceptive, yet it proved more deceptive than its opponents in only 24.4% of matchups. Conversely, humans accurately predicted its strong ethical boundaries (82.1% of matchups) and harm avoidance (79.3%).
Findings like these can feed into more nuanced evaluation tools that reflect actual model behavior across specific skill domains. Developers should explore Recall’s Model Arena datasets to identify areas where human intuition fails, and design benchmarking frameworks that account for ethical compliance, safety, and adherence to user instructions (see the sketch below). They should also consider integrating transparent performance metrics into their AI offerings and using community prediction data to guide feature development. Doing so can build user trust and differentiate responsible AI products in a competitive market.
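As a starting point for that kind of analysis, here is a minimal sketch of how one might compare community-predicted win rates with observed results per skill domain to surface where human intuition misses most. It assumes the per-domain rates have already been extracted from the datasets; the domain names, figures, and function below are illustrative placeholders, not Recall’s actual schema or published numbers.

```python
# Hypothetical sketch: measure the gap between predicted and observed win rates
# per skill domain. All values below are placeholders for illustration only.

# Predicted and observed win rates (fraction of matchups won) by domain.
predicted = {
    "reasoning": 0.76,
    "coding": 0.73,
    "deception": 0.60,
    "ethical_boundaries": 0.80,
    "harm_avoidance": 0.78,
}
observed = {
    "reasoning": 0.70,
    "coding": 0.68,
    "deception": 0.25,
    "ethical_boundaries": 0.81,
    "harm_avoidance": 0.79,
}


def optimism_gap(pred: dict[str, float], actual: dict[str, float]) -> dict[str, float]:
    """Return predicted-minus-observed win rate per domain, in percentage points."""
    return {d: round((pred[d] - actual[d]) * 100, 1) for d in pred if d in actual}


if __name__ == "__main__":
    gaps = optimism_gap(predicted, observed)
    # Largest positive gaps mark domains where expectations most overrated the model.
    for domain, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{domain:20s} {gap:+.1f} pts")
```

Sorting by the signed gap makes overrated domains (large positive values) stand out from underrated ones (negative values), which is the comparison the prediction-versus-arena data is meant to support.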
Read more at: paragraph.com
2025-09-03