A forecast is a prediction of the form “event $a$ will happen with probability $f$”. Roughly, a forecast is calibrated when the long-run frequency of $a$ is actually $f$. Calibration is useful for communicating with others: if it rains only 60% of the time when we forecast rain with 90% confidence (so $f(\text{rain})=0.9$), then others have to know that in order to interpret our predictions properly. If, on the other hand, our forecasts are always calibrated, then others can safely interpret them as probabilities and (for instance) use them to decide whether to take an umbrella.
Perhaps surprisingly, calibration is always achievable, regardless of what is being forecasted. In fact, you can train online to become calibrated at narrow tasks such as answering multiple-choice trivia or guessing the correlation coefficient of a cloud of points.
In this post, I reproduce a neat proof, due to Sergiu Hart, that calibration is possible. But first, let’s state the exact result we want to prove.
For concreteness, the rest of the post will be concerned with weather forecasts, but of course what we’ll say applies more generally. Let $a_t=1$ if it rains at time $t$, $a_t=0$ otherwise. Let $f_t$ be the forecasted probability of rain at time $t$. We will suppose that the set of possible forecasts is discrete: $f_t \in \lbrace f_i \rbrace_{1\leq i \leq n}$. Let $A(f_i) := \lbrace a_t \mid f_t = f_i\rbrace$ be the collection of outcomes (counted with multiplicity) in the periods up to $T$ where the forecast was $f_i$. Let also $n_i=|A(f_i)|$. We will sometimes treat $A(f_i)$ as a random variable and e.g. write $\bar{a}(f_i)=\sum_{a_t\in A(f_i)} a_t/n_i$ for the empirical frequency of rain in those periods. We are now ready to formally define calibration. We say that a forecast $\lbrace f_t\rbrace_{1\leq t \leq T}$ is calibrated if $$ \begin{aligned} \forall f_i, \lim_{T\to \infty}\bar{a}(f_i) = f_i. \end{aligned} $$ In other words, a forecast is calibrated if, in periods where rain is forecasted to happen with probability $f_i$, the long-run frequency of rain is $f_i$.
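To make the definition concrete, here is a minimal Python sketch that computes $n_i$ and $\bar{a}(f_i)$ for each forecast value from a finite record. The function name `bin_frequencies` and the ten-day record are made up for illustration; the data is chosen so that the 90% forecasts come true only 60% of the time.

```python
from collections import defaultdict

def bin_frequencies(forecasts, outcomes):
    """Return {f_i: (n_i, a_bar(f_i))} for every forecast value that was used."""
    counts = defaultdict(int)  # n_i: number of periods with forecast f_i
    rain = defaultdict(int)    # number of rainy periods among those
    for f, a in zip(forecasts, outcomes):
        counts[f] += 1
        rain[f] += a
    return {f: (counts[f], rain[f] / counts[f]) for f in counts}

# Hypothetical record: forecast f_t and outcome a_t (1 = rain, 0 = no rain).
forecasts = [0.9, 0.9, 0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.2, 0.2]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

freqs = bin_frequencies(forecasts, outcomes)
for f, (n, a_bar) in sorted(freqs.items()):
    print(f"forecast {f}: n_i = {n}, observed frequency = {a_bar}")
```

On this record the 0.2 bin is calibrated (it rained on one day out of five), while the 0.9 bin is not (it rained on three days out of five). Calibration proper is a statement about the limit $T\to\infty$, so a finite record like this one can only suggest it.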
As we’re going to see, calibration is guaranteed not only in this sense, but also in the following stricter sense. Define the calibration score as follows: $$ \begin{aligned} K_T := \frac{1}{T}\sum_{t=1}^T \left(\bar{a}(f_t)-f_t\right)^2 \end{aligned} $$ where $\bar{a}(f_t)$ is the frequency of rain over all the periods that received the same forecast as period $t$. This score is indeed connected to calibration: grouping the periods by forecast value gives $$ \begin{aligned} K_T=\sum_{i=1}^n \frac{n_i}{T} \left(\bar{a}(f_i)-f_i\right)^2 \end{aligned} $$
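As a rough illustration, the binned form of the score, $\sum_i \frac{n_i}{T}(\bar{a}(f_i)-f_i)^2$, can be computed directly from a record of forecasts and outcomes. This is only a sketch: `calibration_score` is a hypothetical helper name, and both records below are invented.

```python
from collections import defaultdict

def calibration_score(forecasts, outcomes):
    """Binned score: sum over used forecast values of (n_i / T) * (a_bar(f_i) - f_i)^2."""
    T = len(forecasts)
    counts = defaultdict(int)  # n_i
    rain = defaultdict(int)    # rainy periods in bin i
    for f, a in zip(forecasts, outcomes):
        counts[f] += 1
        rain[f] += a
    return sum(counts[f] / T * (rain[f] / counts[f] - f) ** 2 for f in counts)

# Hypothetical record where the 0.9 forecasts come true only 60% of the time.
miscalibrated = calibration_score(
    [0.9] * 5 + [0.2] * 5,
    [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
)

# Hypothetical record where a constant 0.5 forecast matches the observed frequency.
calibrated = calibration_score([0.5] * 4, [1, 0, 1, 0])

print(miscalibrated, calibrated)
```

In the first record only the 0.9 bin contributes: it holds half of the periods and is off by 0.3, giving $0.5 \times 0.3^2 = 0.045$ up to floating-point rounding. In the second record the single bin matches its forecast exactly, so the score is zero.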
Calibration theorem (informal version): For any forecasting set $\lbrace f_i \rbrace_{1\leq i \leq n}$, there exists a forecasting strategy such that $$ \begin{aligned} \lim_{T\to \infty} K_T = 0 \end{aligned} $$
If I know the correct frequency of rain (say, I know that it rains 10% of the time on average), then if I always predict a 10% chance of rain, my forecast will be perfectly calibrated. This is, however, a very coarse forecast: it uses the minimal number of forecasting bins, $n=1$, and says nothing about which days are rainier than others. The theorem above says we can do way better than this: for any $n$, a calibrated forecasting strategy exists.