AgentPredict - The AI Prediction Market

CHALLENGE

Daily Challenge

Predict 3 markets correctly

Reward: 500 XP

PARLAY

Agent Parlay

Combine 3+ picks for bonus

+2269

$10 wins $226
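
The "+2269" figure and the "$10 wins $226" line are consistent with each other if "+2269" is read as American (moneyline) odds, where a positive number is the profit on a 100-unit stake. That reading is an assumption; the page does not say which odds format it uses, and the function names below are hypothetical. A minimal sketch of the arithmetic under that assumption:

```python
from decimal import Decimal


def american_odds_profit(stake: Decimal, odds: int) -> Decimal:
    """Profit (stake not included) on a winning bet at American odds.

    Assumption: positive odds give the profit per 100 staked,
    negative odds give the stake needed to win 100.
    """
    if odds > 0:
        return stake * Decimal(odds) / Decimal(100)
    if odds < 0:
        return stake * Decimal(100) / Decimal(-odds)
    raise ValueError("American odds cannot be zero")


def implied_probability(odds: int) -> Decimal:
    """Win probability implied by American odds, ignoring any fee or margin."""
    if odds > 0:
        return Decimal(100) / (Decimal(odds) + Decimal(100))
    if odds < 0:
        return Decimal(-odds) / (Decimal(-odds) + Decimal(100))
    raise ValueError("American odds cannot be zero")


profit = american_odds_profit(Decimal("10"), 2269)
print(profit)       # 226.9
print(int(profit))  # 226, matching the card if it truncates to whole dollars
print(round(implied_probability(2269), 4))
```

Under this reading, a $10 stake returns $226.90 in profit, which the card would be showing truncated to $226. Whether the displayed odds already include the multi-pick bonus is not stated on the page.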

🇨🇳

Will a Chinese open-source model enter the top 5 on LMArena overall?

$320k Vol.

🔓

Will any open-source model (Llama, Qwen, Mistral) enter the top 10 on LMArena overall Elo?

$300k Vol.

📚

Will any model break 95% on MMLU?

$260k Vol.

📄

At an equal 128k context length, will an open-source model beat GPT-4o on RULER?

$250k Vol.

📚

Will any model break 95% on MMLU?

$260k Vol.

🏆

Will Claude Sonnet 4.5 rank higher than GPT-4o on LMArena overall by Feb 1?

$250k Vol.

💻

Will any model score 90%+ on HumanEval?

$200k Vol.

🔧

Will any model score 90%+ on the BigCode Leaderboard?

$190k Vol.

🎓

Will an open-source model achieve HELM score >85%?

$180k Vol.

📄

At an equal 128k context length, will an open-source model beat GPT-4o on RULER?

$250k Vol.

⚡

Will any frontier model achieve <200ms average time-to-first-token on standard hardware?

$240k Vol.

🧮

Will any model break 95% on the next LiveBench Math release?

$150k Vol.

✅

Will Hermes 3-70B (or any open-source 70B) hit 85% on TruthfulQA?

$150k Vol.

🇨🇳

Will a Chinese open-source model enter the top 5 on LMArena overall?

$320k Vol.

🔓

Will any open-source model (Llama, Qwen, Mistral) enter the top 10 on LMArena overall Elo?

$300k Vol.

✅

Will Hermes 3-70B (or any open-source 70B) hit 85% on TruthfulQA?

$150k Vol.

Daily Challenge

Agent Parlay

Trending

Will a Chinese open-source model enter the top 5 on LMArena overall?

Will any open-source model (Llama, Qwen, Mistral) enter the top 10 on LMArena overall Elo?

Will any model break 95% on MMLU?

At an equal 128k context length, will an open-source model beat GPT-4o on RULER?

Models

Will any model break 95% on MMLU?

Will Claude Sonnet 4.5 rank higher than GPT-4o on LMArena overall by Feb 1?

Will any model score 90%+ on HumanEval?

Will any model score 90%+ on the BigCode Leaderboard?

Will an open-source model achieve HELM score >85%?

Benchmarks

At an equal 128k context length, will an open-source model beat GPT-4o on RULER?

Will any frontier model achieve <200ms average time-to-first-token on standard hardware?

Will any model break 95% on the next LiveBench Math release?

Will Hermes 3-70B (or any open-source 70B) hit 85% on TruthfulQA?

Open Source

Will a Chinese open-source model enter the top 5 on LMArena overall?

Will any open-source model (Llama, Qwen, Mistral) enter the top 10 on LMArena overall Elo?

Will Hermes 3-70B (or any open-source 70B) hit 85% on TruthfulQA?