My track record

My binary calibration on Metaculus questions: [figure: Binary Calibration on Metaculus]

I am peter_m (click for a list of my comments) on Metaculus.

So far, not enough questions (9 binary, 2 real-valued) have resolved to prove beyond doubt that I am well-calibrated or sharp (i.e., not too over- or underconfident; for statisticians: by analogy to point estimators, calibration is like unbiasedness and sharpness is like precision, i.e., inverse variance).

Attempts at finding “arbitrage opportunities”

Since Metaculus is not a market but only an information aggregator, “arbitrage opportunity” is not an obviously well-defined concept here. Thinking of arbitrage as logical inconsistency (the only kind of inconsistency that lets you extract risk-free money from a market, regardless of how the bets resolve), however, we may interpret it as the question of whether the community prediction (the median or mean of Metaculus users’ predictions at a given time) is “consistent over time”. Before diving into the data, we need a small detour on what “being consistent over time” implies for a forecast:

Optional stopping bounds on big updates

Denote by \(\mathcal F=(\mathcal F_t)_t\) a filtration of our filtered probability space \((\Omega,\mathcal A,\mathcal F,\mathbb P)\). Informally, \(\mathcal F_t\) is the collection of events whose occurrence you are aware of at "time" \(t\) (i.e., you know whether or not they happened); in other words, it models the information available at that point in time.

Assume we are interested in whether some political party gets elected in the upcoming election; call this event \(A\in\mathcal A\). Our subjective probability \(p_t\) of \(A\), given the information available at time \(t\), can be written as the conditional expectation \[p_t=\mathbb E[1_A\mid\mathcal F_t].\] It is well known that \(t\mapsto p_t\) is a martingale.

Upon learning more about the world (e.g., as time goes on and new polls are published), we need to update our forecasts. Clearly, if at some time \(s\) we have \(p_s=0.01\), then the conditional probability that \(p_t=1\) for some \(t>s\) (i.e., that \(A\) ever comes true) is \(0.01\). But martingales give us more! We can also bound the probability of \(\sup_{t>s}p_t\geq x\) for any \(x\). Intuitively, once a well-calibrated forecaster has assigned a very low probability to some event, their forecast should hardly ever climb to a significantly higher probability again. The optional stopping theorem makes this rigorous: for any \([0,1]\)-valued martingale \((X_t)_t\) with \(X_T\in\{0,1\}\) (i.e., the question is resolved by time \(T\)) and the stopping time \(\tau=T\land \inf\{t>0:X_t\geq p\}\), note that on \(\{X_\tau<p\}\) we must have \(\tau=T\) and hence \(X_\tau=0\), so \[\mathbb E X_0 = \mathbb E X_\tau = \underbrace{\mathbb P(X_\tau\geq p)}_{=:q}\,\underbrace{\mathbb E[X_\tau\mid X_\tau\geq p]}_{\in[p,1]}+(1-q)\cdot 0 \;\Rightarrow\; \mathbb P\Big(\sup_{0<t\leq T}X_t\geq p\Big)=q\in\big[X_0,\,X_0/p\big]. \]
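As a quick numerical sanity check, consider the following toy simulation (a made-up example, not Metaculus data): for the event \(A\) = “at least 15 heads in 20 fair coin flips”, the exact forecast \(X_t=\mathbb P(A\mid\text{first }t\text{ flips})\) is a \([0,1]\)-valued martingale with \(X_T\in\{0,1\}\), so the empirical frequency of \(\sup_t X_t\geq p\) should land in \([X_0,X_0/p]\):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

n, k, p = 20, 15, 0.5          # event A: at least k=15 heads in n=20 fair flips
x0 = binom.sf(k - 1, n, 0.5)   # X_0 = P(A), roughly 0.021

def running_max_forecast(flips):
    """Exact forecast martingale X_t = P(A | first t flips); returns sup_t X_t."""
    heads = np.cumsum(np.concatenate(([0], flips)))  # heads after t flips, t = 0..n
    t = np.arange(n + 1)
    # P(at least k - heads[t] successes among the remaining n - t flips)
    return binom.sf(k - heads - 1, n - t, 0.5).max()

trials = 20_000
hits = sum(running_max_forecast(rng.integers(0, 2, n)) >= p for _ in range(trials))
print(f"bound: [{x0:.4f}, {x0 / p:.4f}], empirical: {hits / trials:.4f}")
```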

Comparing theory with data

Now we might, for example, look at questions where the community forecast (consisting of at least 10 different forecasts, say) was below 10% at some point and see how often it exceeded 90% afterwards. By the above bound, this ought to happen with probability between 0.1 and 0.1/0.9≈0.11, i.e., for roughly 1 in 10 such questions. This is not very informative, though, since it is essentially a question of whether the community prediction is well-calibrated. Replacing 90% with 50%, say, gives the less trivial bound that about 10%–20% of such questions should go from ≤10% to ≥50%.
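In code, the interval from the optional stopping argument is simply the following (a trivial, hypothetically named helper):

```python
def stopping_bound(x0: float, p: float) -> tuple[float, float]:
    """Bounds on P(the forecast ever reaches >= p), given a current forecast of x0."""
    return x0, x0 / p

print(stopping_bound(0.10, 0.9))  # (0.1, 0.111...): the "roughly 1 in 10" case
print(stopping_bound(0.10, 0.5))  # (0.1, 0.2): the less trivial 10%-20% case
```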

So what does the data say? I fetched data from the (unofficial?) JSON API, which takes a while since we can’t download everything in one go. I restricted myself to

  • binary questions (entry['possibilities']['type'] == 'binary'),
  • that are resolved (entry['resolution']==True),
  • with predictions by at least 15 different users (entry['distribution']['num']>=15).

From these datasets (unfortunately only 329 at the time) I took the ['community_prediction'] field of each time series and appended the ['resolution'] as the “final community prediction”.
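A minimal sketch of such a fetch-and-extract loop, assuming the (unofficial) api2 endpoint, a "next" paging link, and a prediction_timeseries container, all of which may differ from the actual API:

```python
import time
import requests

URL = "https://www.metaculus.com/api2/questions/"  # assumed endpoint of the JSON API

def fetch_questions(min_predictors=15):
    """Page through the question list, keeping resolved binary questions
    with predictions by at least `min_predictors` different users."""
    url, kept = URL, []
    while url:
        page = requests.get(url, timeout=30).json()
        for entry in page.get("results", []):
            if (entry['possibilities']['type'] == 'binary'
                    and entry['resolution'] == True
                    and entry['distribution']['num'] >= min_predictors):
                kept.append(entry)
        url = page.get("next")  # assumed link to the next page of results
        time.sleep(1)           # be gentle; this is why it takes a while
    return kept

def extract_series(entry):
    """Community prediction over time, with the resolution appended
    as the 'final community prediction'."""
    series = [pt['community_prediction'] for pt in entry['prediction_timeseries']]
    series.append(float(entry['resolution']))
    return series
```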

What I found was mostly in line with what the above martingale bound predicts, with some outliers that may well be due to chance, since the sample size was not big enough.
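The check itself then boils down to counting threshold crossings; a sketch, reusing the hypothetical fetch_questions and extract_series from above:

```python
def crosses(series, lo=0.10, hi=0.50):
    """True if the forecast drops to <= lo and later climbs back to >= hi."""
    dipped = False
    for x in series:
        if x <= lo:
            dipped = True
        elif dipped and x >= hi:
            return True
    return False

all_series = [extract_series(e) for e in fetch_questions()]
low = [s for s in all_series if min(s) <= 0.10]
frac = sum(crosses(s) for s in low) / len(low)
print(f"{frac:.1%} of questions that dipped to <=10% later reached >=50%")
# the martingale bound says this should be roughly 10%-20%
```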