Introduction
Predicting the future is hard, but it's also incredibly important.
Let's say someone starts making predictions about important events. How much should you believe them when they say the world will end tomorrow? What about when they say there's a 70% chance the world will end in 50 years?
Quantified Predictions
In this situation the predictor is making a prediction with a certain confidence. Rather than just saying "it's likely", they've chosen a number to represent how confident they are in that statement.
People make predictions every day, but most don't choose a specific number to assign to their confidence. This would be wildly impractical for most things! If you're driving and the car in front of you slows down, you make a prediction about what it's going to do; if its turn signal comes on, you make that prediction with even more confidence. You usually don't need to state which outcomes you're anticipating, which you think is most likely, or how much confidence you'd place on each - but you're already doing it!
Explicit predictions are most useful when trying to communicate about important, uncertain events. When you hear the morning news say there's a 70% chance of rain today, they've given you a useful data point! You can use that information to make decisions: Should I take an umbrella? Should I wear a jacket? Probably!
Predictions, quantified or not, are ultimately only useful as tools that you can use to make decisions. If a prediction is not particularly relevant to a decision you're making, or it won't affect you much either way, then "probably" is fine! If someone tells you they will "probably" be home in twenty minutes, that's usually enough information for any decision you need to make.
On the other hand, predictions that would affect something significant in your life or require you to make a bigger decision should probably be taken more seriously.
- Will it rain today? You may have to change your plan to go for a hike.
- How is the economy doing? Should you invest or save?
- Who is going to win the election? Will they pass that law they've been talking about?
- Will COVID cases rise again? Should you stock up on masks or toilet paper?
These are the sorts of questions where it's helpful to have quantified predictions.
Grading Calibration
One of the best ways we can measure how good a person is at predicting is by looking at how often they were right. If our Nostradamus was wrong about every prediction they've made so far, we should probably ignore them. If they have been right every time, we should probably take them seriously.
To grade simple predictions, we can put all of the YES predictions in one bucket, and all of the NO predictions in another. We'll count how many times those predictions came true - ideally everything in the NO bucket will resolve NO, and everything in the YES bucket will resolve YES.
Prediction | NO | YES |
---|---|---|
Resolved No | 15 | 7 |
Resolved Yes | 3 | 10 |
Average Resolution | 3 / 18 = 16.7% | 10 / 17 = 58.8% |
Well, it looks like our Nostradamus was decently accurate whenever they predicted NO - those events only happened 17% of the time. But their YES predictions weren't so good - those only happened slightly more often than chance! It seems like this predictor isn't very well-calibrated.
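If you want to check that arithmetic yourself, here's a minimal Python sketch of the same calculation, using the made-up counts from the table:

```python
# Counts from the example table: {prediction: (resolved_no, resolved_yes)}
counts = {"NO": (15, 3), "YES": (7, 10)}

for prediction, (resolved_no, resolved_yes) in counts.items():
    total = resolved_no + resolved_yes
    # Average resolution: what fraction of predictions in this bucket came true
    print(f"{prediction}: {resolved_yes / total:.1%}")
# NO: 16.7%
# YES: 58.8%
```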
Anyway, we're more interested in forecasters who don't just say yes or no. We're looking at people who assign a probability to their statement. In the example at the top of the page, our doomsayer claimed a 70% chance that the world would end within a specific timeframe. How would we judge that after the fact? (Assuming the world did not end, that is.)
Instead of two buckets (YES and NO), let's break their predictions up into eleven buckets - 0%, 10%, 20%, and so on to 100%. If our Nostradamus said there's a 0% chance that the sky will fall and a 70% chance there will be a snowy Christmas this year, then we can sort those into the right buckets and then evaluate each one.
Prediction | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Resolved No | 10 | 15 | 18 | 15 | 20 | 18 | 14 | 7 | 7 | 3 | 0 |
Resolved Yes | 1 | 2 | 7 | 7 | 14 | 19 | 21 | 14 | 17 | 13 | 9 |
Average Resolution | 9.1% | 11.8% | 28.0% | 31.8% | 41.2% | 51.4% | 60.0% | 66.7% | 70.8% | 81.3% | 100.0% |
This looks a lot better! Now that we have more granularity, we can differentiate between things like "unlikely", "probably not", and "definitely not". When this predictor said something had a 10% chance to occur, it actually happened only 11.8% of the time. And when they gave something a 60% chance, it actually happened 60% of the time! It seems like this predictor is much better calibrated.
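If you want to build this kind of table for your own predictions, here's a minimal sketch of the bucketing logic, assuming your predictions are (probability, outcome) pairs with outcomes recorded as 0 or 1 (the function name here is just for illustration):

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (probability, outcome) pairs into 0%, 10%, ..., 100% buckets
    and return the average resolution for each bucket."""
    buckets = defaultdict(list)
    for probability, outcome in predictions:
        # Round to the nearest 10% to pick a bucket: 0.0, 0.1, ..., 1.0
        bucket = round(probability * 10) / 10
        buckets[bucket].append(outcome)
    return {
        bucket: sum(outcomes) / len(outcomes)
        for bucket, outcomes in sorted(buckets.items())
    }

# A 10% prediction that resolved NO and a 70% prediction that resolved YES
print(calibration_table([(0.1, 0), (0.7, 1)]))  # {0.1: 0.0, 0.7: 1.0}
```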
Let's plot these on a chart for convenience. Across the bottom we'll have a list of all our buckets - 0% to 100%. Along the side we'll have a percentage - how often those predicted events came true. If our predictor is well-calibrated, these points should line up in a row from the bottom-left to the top-right. We'll call this a calibration plot, but it's also known as a reliability diagram.
This is very good! Now we can see visually where our predictor is calibrated or where they're over- or under-confident. If our forecaster keeps making predictions like this, we could expect them to be well-calibrated in most cases - especially when they make predictions between 30% and 70%.
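Here's a rough sketch of how you could draw one with matplotlib, plugging in the bucket values from the table above:

```python
import matplotlib.pyplot as plt

# Buckets along the bottom, observed resolution rates from the example table up the side
buckets = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
observed = [0.091, 0.118, 0.280, 0.318, 0.412, 0.514, 0.600, 0.667, 0.708, 0.813, 1.000]

plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
plt.scatter(buckets, observed, label="Observed resolution")
plt.xlabel("Predicted probability")
plt.ylabel("Actual resolution rate")
plt.title("Calibration plot")
plt.legend()
plt.show()
```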
Grading Accuracy
Calibration plots can tell you plenty, but they're hard to compare and they don't give you a single numeric score. For that, let's look into accuracy scoring. Accuracy is an intuitive measure, but it has some important caveats.
We have a few ways to calculate accuracy, but let's focus on the most popular one: Brier scores.
For each prediction, we take the "distance" it was from the outcome: if we predict 10% but it resolved NO, the distance is 0.1 — but if we predict 10% and the answer is YES, the distance would be 0.9. We always want this number to be low! Once we have these distances, we square each one. This has the effect of "forgiving" small errors while punishing larger ones.
After we have done this for all predictions, we take the average of these scores. This gives us the Brier score for the prediction set.
Prediction | Resolution | "Distance" | Score |
---|---|---|---|
10% | NO (0) | 0.10 | 0.0100 |
35% | NO (0) | 0.35 | 0.1225 |
42% | YES (1) | 0.58 | 0.3364 |
60% | NO (0) | 0.60 | 0.3600 |
75% | YES (1) | 0.25 | 0.0625 |
95% | YES (1) | 0.05 | 0.0025 |
Average Brier Score | | | 0.1490 |
The most important thing to note here is the fact that smaller is better! This score is actually measuring the amount of error in our predictions, so we want it to be as low as possible. In fact, an ideal score in this system is 0 while the worst possible score is 1.
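The calculation itself only takes a few lines. Here's a minimal sketch, using the six predictions from the table above (the function name is just for illustration):

```python
def brier_score(predictions):
    """Average squared distance between predicted probabilities and outcomes (0 or 1)."""
    return sum((p - outcome) ** 2 for p, outcome in predictions) / len(predictions)

# The six predictions from the table: (probability, resolution)
example = [(0.10, 0), (0.35, 0), (0.42, 1), (0.60, 0), (0.75, 1), (0.95, 1)]
print(f"{brier_score(example):.4f}")  # 0.1490
```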
Calibration is about how good you are at quantifying your own confidence, not always about how close you are to the truth. If you make a lot of predictions that are incorrect, but are honest about your confidence in those predictions, you can be more well-calibrated than someone who makes accurate but over- or under-confident predictions.
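To make that concrete, here's a small, entirely made-up illustration: a predictor who says 50% on a series of coin flips is perfectly calibrated but earns a mediocre Brier score, while a predictor who says 100% on everything and is only right 80% of the time scores better on accuracy despite being badly overconfident.

```python
# Predictor A says 50% on 10 coin flips; 5 resolve YES. Perfectly calibrated,
# but every squared distance is 0.25, so the Brier score is 0.25.
predictor_a = [(0.5, outcome) for outcome in [1, 0] * 5]
# Predictor B says 100% on 10 questions and is right on 8. More accurate on
# average, but the 100% bucket only resolved YES 80% of the time.
predictor_b = [(1.0, outcome) for outcome in [1] * 8 + [0] * 2]

for name, predictions in [("A", predictor_a), ("B", predictor_b)]:
    score = sum((p - outcome) ** 2 for p, outcome in predictions) / len(predictions)
    print(f"Predictor {name}: Brier score {score:.2f}")
# Predictor A: Brier score 0.25
# Predictor B: Brier score 0.20
```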
Ultimately, nothing prevents a predictor from padding their record with easy, near-certain predictions. It's very important to check what sorts of predictions someone is making and whether they're relevant to you. It's especially important when looking at user-generated content on prediction market sites, where extremely easy questions can be added for profit or to manipulate calibration stats.
This is especially relevant when comparing different predictors or platforms. Just because someone has a lower Brier score does not mean they are inherently better! The only way to directly compare is if the corpus of questions is the same for all participants.
Prediction Markets
Prediction markets are based on a simple concept: if you're confident about something, you can place a bet on it. If someone else disagrees with you, agree on terms, and whoever wins takes the money. By aggregating the odds of these trades, you can gain insight into the "wisdom of the crowds".
Imagine a stock exchange, but instead of trading shares, you trade on the likelihood of future events. Each prediction market offers contracts tied to specific events, like elections, economic indicators, or scientific breakthroughs. You can buy or sell these contracts based on your belief about the outcome - if you are very confident about something, or you have specialized information, you can make a lot of money from a market.
Markets give participants a financial incentive to be correct, encouraging researchers and skilled forecasters to spend time investigating events. Individuals with insider information or niche skills can profit by trading, which also updates the market's probability. Prediction markets have outperformed polls and revealed insider information, making them a useful tool for information gathering or profit.
There are a number of popular prediction market platforms, and they differ quite a bit in how they operate.
While prediction markets have existed in various capacities for decades, their use in the U.S. is currently limited by the CFTC. Modern platforms either submit questions for approval to the CFTC, use reputation or "play-money" currencies, restrict usage to non-U.S. residents, or utilize cryptocurrencies. Additionally, sites will often focus on a particular niche or community in order to increase trading volume and activity on individual questions.
Calibration City
All of this brings us to this very site, Calibration City - a project to answer the question:
"How much can we trust prediction markets?"
The way we approach this question is to look at each platform as a whole. We can take each market on the platform and treat it like an individual prediction, using the market's estimated probability as the prediction value. Once the market resolves, we can look at how accurate the market was and how calibrated the site was overall using the same methods outlined above. Just like before, these are our two keys: predictions and resolutions.
A market's listed probability is the aggregated prediction of hundreds of people, which means it does change over time as people make trades or news comes out. By default we use a market's probability at its midpoint as the prediction value - this is far enough from the market's start that traders have had time to settle on a consensus range, and far enough from the end that the outcome is uncertain.
When we calculate the calibration chart, for example, we can choose to use the probability at any point throughout the duration of the market. If you're more interested in how well-calibrated markets are near their beginning, you can choose to look at the calibration at 10% of the way through the market's duration instead. You could also choose to use the average probability over the course of the market, which takes the entire market history into account.
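As a rough sketch of what that looks like in practice - with made-up field names and data, not the site's actual schema - you could pick out the probability at a given fraction of a market's lifetime like this:

```python
from datetime import datetime, timedelta

def probability_at_fraction(history, open_time, close_time, fraction=0.5):
    """Return the most recent recorded probability at a given fraction of the
    market's duration. `history` is a list of (timestamp, probability) pairs
    sorted by timestamp."""
    target = open_time + (close_time - open_time) * fraction
    probability = history[0][1]
    for timestamp, prob in history:
        if timestamp > target:
            break
        probability = prob
    return probability

# Hypothetical market that drifted from 40% to 80% over four days
start = datetime(2024, 1, 1)
history = [(start + timedelta(days=d), p) for d, p in [(0, 0.4), (1, 0.55), (2, 0.6), (3, 0.8)]]
print(probability_at_fraction(history, start, start + timedelta(days=4)))  # 0.6
```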
We also have an accuracy plot, where you can calculate the total accuracy of a platform and compare it against another attribute, such as the market's volume or length. A common refrain is that prediction markets get more accurate as more people participate - is that true? You can check by changing the x-axis to "number of traders" to see how the accuracy changes based on how many people are involved.
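A sketch of that kind of comparison might look like the following - the records and the cutoff are entirely made up, purely to show the mechanics, so these numbers say nothing about any real platform:

```python
# Hypothetical resolved markets: (number_of_traders, midpoint_probability, resolution)
markets = [
    (5, 0.7, 1), (8, 0.2, 1), (12, 0.6, 1),
    (150, 0.8, 1), (200, 0.1, 0), (320, 0.9, 1),
]

def brier(group):
    """Brier score for a group of markets."""
    return sum((p - r) ** 2 for _, p, r in group) / len(group)

small = [m for m in markets if m[0] < 100]
large = [m for m in markets if m[0] >= 100]
print(f"Fewer than 100 traders: {brier(small):.3f}")
print(f"100 or more traders:    {brier(large):.3f}")
```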
Each chart has more options, such as searching based on title, picking individual categories, limiting based on date, or filtering based on any other attribute in the database. If you have a specific query, you can also see the markets used in the calculations on the list page.
Obviously every platform is different - each has made design decisions that make it unique, and not all platforms are designed in a way that shines when graded like this. Some platforms encourage shorter-term markets with more participants, while others have systems that incentivize making as many predictions as possible. Just because a platform seems less calibrated under some conditions, or its accuracy line is worse than another's, does not mean it shouldn't be taken seriously. This site is designed to make it easier to find objective information and present it in a standardized way, not to say which platform is better.