Prophet Arena

73 posts

Prophet Arena

@ProphetArena

The AI benchmark for predictive intelligence | SIGMA Lab @UChicagoCS @DSI_UChicago Not affiliated to any tokens or crypto protocols.

شامل ہوئے Ağustos 2025

15 فالونگ2.1K فالوورز

Prophet Arena@ProphetArena·27 Oca

Happy to share that the first paper about Prophet Arena (yea, it's me!) will appear at #ICLR2026 ! We believe AI for forecasting should be a North Star of #Machine_Intelligence -- way to go! See our paper here arxiv.org/pdf/2510.17638

Prophet Arena@ProphetArena

Can LLMs predict the future? Hey X! its been awhile, we’ve got some exciting updates to share over the next few days. First, we want to share our technical report on the concept of “LLM-as-a-prophet”, where we analyze predictive intelligence of frontier models using live prediction markets. Check out our findings on arXiv 👇

English

2.7K

Prophet Arena@ProphetArena·1 Oca

Happy New Year! Here are some AI Forecasts for 2026🔮 Most likely World Cup winner: 🇪🇸 Spain Spotify #1 artist: Taylor Swift (Qwen 3 235B says 100%) 75% - GTA 6 releases before end of 2026 (Grok-4) 65% - One Battle After Another wins Best Picture (Claude Sonnet 4) 55% - U.S. tariff stimulus checks issued (Gemini) 11% - Bitcoin hits $150k before March 2026 (GPT-5) Let’s see how this ages! 👀

English

1.8K

Prophet Arena@ProphetArena·27 Ara

@avinashj_ @ahall_research @karpathy Love this line of thinking. We’re actively exploring multi-agent strategies at @ProphetArena, and hierarchical approaches like this are very much on our radar. We’re already testing these ideas in prediction markets, so stay tuned for results over the next few weeks!

English

Avinash@avinashj_·25 Ara

@ahall_research @karpathy Would love to see this kind of study applied to prediction markets - multi agent strategies could be next on @ProphetArena

English

255

Andy Hall@ahall_research·24 Ara

Recently @karpathy built an "LLM council" to provide advice to users. This got me thinking: what are the best governance rules for the council? So I built a little experiment testing four different procedures: (1) The LLMs do a simple majority vote without communicating to each other (2) LLMs vote, then deliberate, then can update their vote. (3) LLMs vote, deliberate, update their vote, and then a Chairman makes the decision. (4) LLMs vote, evaluate each other's answers, and then the Chairman makes the decision (like Karpathy's procedure). I evaluated how well these procedures do compared to individual models on two very basic eval sets with known answers (GSM8K and TruthfulQA). I focused on very cheap models to spare my wallet. For the very particular setup I created---and I don't claim this generalizes---it looks like all four approaches do a little bit better than the best model on its own, but option 2, voting with deliberation, does the best. (I didn't run the study long enough to get "statistical significance" but if someone wants to they probably could). It turns out, unbeknownst to me, there's a cool literature in cs already exploring some of these procedural questions, so I don't think I've discovered anything new here (one main example: a Du et al 2023 paper showing that councils outperform individual models). That being said, this is a cool opportunity for people in the social sciences who study collective decision-making mechanisms and governance. In the future there will be tons of situations where multiple agents have to interact and make decisions together, and we should develop a science of how to make optimal collective decisions with AI agents.

English

416

70.8K

Prophet Arena@ProphetArena·22 Ara

🚀 Introducing the Prophet Arena Agent Leaderboard Prophet Arena is now benchmarking end-to-end forecasting agents. Previously, we evaluated LLMs forecasting when given a fixed global context. This new leaderboard tests end-to-end forecasting agents that use their own web search, reasoning process and tool use in any creative way they please. Excited to welcome our first two agents, from two fascinating startups: @lightningrodai and @ag2oss – check them out! 🤖 Add your agent? Just provide an OpenAI-compatible endpoint and we’ll start benchmarking automatically: 👉 prophetarena.co/onboarding 📜 Rules & guidelines: 👉 prophetarena.co/research/agent…

English

2.1K

Prophet Arena@ProphetArena·19 Ara

Link 👉 arxiv.org/abs/2510.17638

English

346

Prophet Arena@ProphetArena·19 Ara

English

1.9K

Prophet Arena ری ٹویٹ کیا

Ben Turtel@BTurtel·6 Ara

Cool to see our tiny 32B model hanging with top frontier models on the @ProphetArena leaderboard! ProphetArena is run by @haifengxu0 at UChicago. They don't use our standard prompts or prediction flow, and they select from their own distribution of questions. So when we agreed to participate, we had no idea how our model would perform Foresight-32B is beating almost every model released before ours was trained, as well as the market baseline. We're doing even better in Sports. Nice 3rd-party validation of our results!

English

597

Prophet Arena@ProphetArena·6 Eyl

📊Dataset Release📊 huggingface.co/datasets/proph… Prophet-Arena-Subset-100, a compact dataset for evaluating calibration and forecasting reasoning abilities. Shipped with - 100 events and data from Prophet Arena benchmark - Plug-and-play predictor & evaluator scripts

English

1.4K

Prophet Arena@ProphetArena·25 Ağu

Thank you for tuning in on Prophet Arena! We’re super grateful for all the feedback and suggestions from the community… You asked and we listened! This week, we are excited to roll out some exciting new updates: -🚀 Check out the 5 NEW models that have been added to the benchmark (winners have changed!!) prophetarena.co/leaderboard -🎯 We've streamlined our onboarding process, visit our onboarding page and reach out if you have questions! prophetarena.co/onboarding -📰 Check us out on Yahoo Finance :) finance.yahoo.com/news/ai-now-ma…

English

2.2K

Prophet Arena@ProphetArena·22 Ağu

Thoughtful comments! Yes, there is indeed positive correlation, but clearly with variance. We think this is intuitively due to different models have very different characteristics. Some tends to be conservative with close to middle prediction whereas some are aggressive, some tends to be good at sports whereas some are good at politics, etc. Meanwhile, prediction market are more efficient in some category but less in others, so these factors caused variance, despite the generally positive correlation between accuracy and return. We also discussed this here #accuracy-vs-returns-which-metric-is-most-appropriate-for-forecasting" target="_blank" rel="nofollow noopener">prophetarena.co/blog/welcome#a….

English

Cherenedene@Cherenedene·22 Ağu

After reviewing the blog, I understand how adding the market maker variable affects average returns However, the difference of "edge/discrepancy" for individual bets should be expected to average out stochastically over time, so more accurate models should correlate precisely with higher average returns Do you expect a higher correlation to occur over time, and are you observing any evidence of this? Or alternatively, do you anticipate a persistent discrepancy for other reasons?

English

Prophet Arena@ProphetArena·17 Ağu

🔮 Introducing Prophet Arena — the AI benchmark for general predictive intelligence. That is, can AI truly predict the future by connecting today’s dots? 👉 What makes it special? - It can’t be hacked. Most benchmarks saturate over time, but here models face live, unseen future events. You can’t memorize tomorrow (unless you’ve cracked time travel). - It’s interpretable. Strong performance = real foresight, which translates into real investment gains. 👉 Check it out: prophetarena.co

English

140

1.2K

451.7K

Prophet Arena@ProphetArena·22 Ağu

@corygabrielsen Yes, we disabled Internet search when testing on recently resolved events. By default, all knowledge used for prediction is always before the event's resolution, but great suggestion -- we will update the UI to make this more explicit in upcoming releases!

English

Cory Gabrielsen@corygabrielsen·22 Ağu

My point is that you should say when the prediction was made or it’s not very useful. Also the models can search internet. Are you disabling that for predictions? My feedback is about making the UI more useful/interprettable. We don’t know how you’re promoting the models unless the UI tells us.

English

Prophet Arena@ProphetArena·22 Ağu

They were made in the past month or so. This particular one is due to the fact that both the models and news sources do not have up-to-date ETH price information while making the prediction -- exactly the reason that forecasting future events, as a benchmark, cannot be saturated.

English

138

Cory Gabrielsen@corygabrielsen·18 Ağu

@ProphetArena There's no information about when the predictions were made. For instance, ETH traded above $4500 THIS WEEK, but every model is predicting below 100%

English

340

Prophet Arena@ProphetArena·22 Ağu

Great question! Our blog here has a concrete example to illustrate the discrepancy #example-when-would-absolute-and-relative-metrics-differ" target="_blank" rel="nofollow noopener">ai-prophet.github.io/pm_ranking/blo…. But intuitively, this is because accuracy is an absolute measure of prediction (using scoring rules) having nothing to do with the prediction market itself, whereas average profit is a relative measure -- a model has high gain when it can pick up "market opportunity", i.e., do better (though not necessarily perfectly accurate) when the market did badly.

English

Cherenedene@Cherenedene·18 Ağu

@ProphetArena What's the reason for the discrepancy between accuracy and average profit? If average profit is analogous to expected value then there should be a tighter correlation, as optimizing a bet based on a certain outcome should be less difficult than making a correct prediction

English

353

Prophet Arena@ProphetArena·22 Ağu

Great comment! This is exactly why we allows AI-human interaction on the platform, where humans can provide and rate data/news sources. For events that have sufficient human interest with enough data, AI only need the capability to comprehend, critically think and reason about these sources -- we believe it is very possible to utilize such human-AI collaboration to generate great predictions.

English

Solder AI@0xSolderAI·18 Ağu

@ProphetArena Interesting take. Prediction benchmarks like this highlight raw capability but in real-world use, context retention often matters just as much as foresight. Without memory continuity, even the best predictions risk being disconnected from long-term reality.

English

229

Prophet Arena@ProphetArena·22 Ağu

1. The model directly gives percentage prediction 2. That depends on the event and the model -- not always extremal, but some models do tend to have "extremal opinions" and some events are easier to lead to extremal predictions. 3. Currently only a few categories are supported, but we are actively working on implementing a "search functionality" which you can use to filter topics -- stay tuned!

English

Marz@saipienorg·20 Ağu

@ProphetArena 1. How do you arrive at the percentages? Do you query a model multiple times? 2. Why are the predictions so .. not ambitious.. 3. In the leaderboard any way to filter topics?