
Block I · Foundations of Knowledge & Reasoning · Day 004 / 180

Probability as Extended Logic

A game show host opens a door. Your gut says it can’t matter. Your gut is about to lose two out of three times.

● you picked 1 · host opened 3 · should you switch to 2?

The whole of Bayesian reasoning, hiding inside a 1970s game show.

You pick Door 1. Somewhere behind these three doors sits a car; behind the other two, goats. The host — who knows exactly where the car is — strolls over to Door 3, swings it open to reveal a goat, and asks, almost kindly: would you like to switch to Door 2? Two doors left, one car. Fifty-fifty, surely. Switching can’t possibly matter.

It matters enormously. Stay, and you win the car one time in three. Switch, and you win two times in three — you double your odds by doing nothing but changing your mind. This is the Monty Hall problem, and when it ran in a magazine in 1990 it triggered one of the great public meltdowns in the history of mathematics. Today we’ll see why the answer is not just correct but inevitable — and how the same machine that solves it turns out to be the deepest available theory of what it means to reason under uncertainty at all.

Where we are

On Day 1 we met credence — belief as a dial from 0 to 1 — and the Dutch book argument showing that incoherent dials can be turned into a guaranteed loss. Today we learn the law that says how the dial must move when evidence arrives: Bayes’ theorem. On Day 2 we watched science struggle to draw the line between signal and noise, and saw the replication crisis as that struggle under live fire; today’s frontier — a quiet revolution replacing the p-value with a bet — is aimed squarely at fixing it. Threads lit today: information (evidence as bits that update belief), computation (the mind and the lab as inference engines), and a flicker of energy when the “Bayesian brain” returns.

The meltdown

The smartest people in the country, all wrong at once

In September 1990, Marilyn vos Savant — listed in the Guinness Book for the highest recorded IQ, writing the “Ask Marilyn” column in Parade magazine — answered a reader’s question about a game show. Switch doors, she wrote; you’ll win two-thirds of the time. The answer is correct. The response was apocalyptic.

Monty Hall, Carol Merrill, Jay Stewart, and contestants on the Let's Make a Deal set in a 1974 publicity photo. Monty Hall’s actual stage makes the puzzle less like a parlor trick: the host was never a random door-opener, but a knowledgeable agent whose action carried information.

By her own count she received some 10,000 letters, the overwhelming majority telling her she was wrong — and roughly 1,000 of them signed by people with PhDs. Mathematicians wrote in to scold her. One professor offered the immortal line:

“You blew it, and you blew it big! … There is enough mathematical illiteracy in this country, and we don’t need the world’s highest IQ propagating more. Shame!” — Scott Smith, Ph.D., University of Florida, in a letter to Parade (1990)

He was the one who’d blown it. So had, by the strict statistics of the thing, most of his colleagues. Vos Savant held her ground across three more columns, eventually asking schoolteachers across the country to run the experiment with paper cups and a coin. They did. The data came back exactly as she’d said: switching wins twice as often. The professors, slowly and not always graciously, retreated.

The man who needed to see it to believe it

Even Paul Erdős — one of the most prolific mathematicians who ever lived, a man who proved theorems most of us can’t even read — refused to accept the answer. When his friend Andrew Vázsonyi laid out the logic, Erdős was unconvinced. Only when Vázsonyi ran a computer simulation, playing the game hundreds of times and watching switching win about two-thirds of the rounds, did Erdős concede. And even then he was annoyed: the simulation showed him that it was true without showing him why. (Recounted in Paul Hoffman’s biography The Man Who Loved Only Numbers, 1998.) If it tripped Erdős, you are in excellent company.

Here’s the thing the meltdown reveals. The Monty Hall problem isn’t a trick or a word game — its answer is provably, simulation-confirmably true. What it exposes is that human intuition about uncertainty is systematically miscalibrated, and that we badly need a formal tool to override it. That tool is the subject of today’s descent. But first, let’s actually break our intuition on the rocks — and then rebuild it.

Interactive · play it yourself

The Monty Hall Machine

Pick a door. The host opens a different one — always a goat, never your pick, never the car. Then click your original door to stay, or click the other closed door to switch. Play a few by hand, then hit auto-run 1,000× and watch the two strategies separate. The tallies don't lie.

Stay keeps the first pick: 1/3

Switch takes the unchosen pair: 2/3


Run the experiment at scale (the schoolteachers' method, automated)

Monty Hall Outcomes

Under the standard host rule, switching wins exactly when the first pick was wrong.

First pick	Host action	Stay	Switch
Car, probability 1/3	Opens either goat door	Win	Lose
Goat, probability 2/3	Forced to open the other goat door	Lose	Win

So staying keeps the original 1/3 chance; switching captures the 2/3 chance that the first choice was wrong.
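The schoolteachers' experiment can be sketched in a few lines of Python. This is a minimal simulation under the standard host rule described above; the function names, the round count, and the fixed seed are illustrative choices, not part of the original column.

```python
import random

def play_round(rng, switch):
    """Play one round under the standard host rule; return True if the player wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    # If the first pick is the car, he chooses between the two goat doors at random.
    openable = [d for d in doors if d != pick and d != car]
    opened = rng.choice(openable)
    if switch:
        # Move to the single door that is neither the first pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, rounds=100_000, seed=0):
    """Fraction of rounds won when the player always stays or always switches."""
    rng = random.Random(seed)
    wins = sum(play_round(rng, switch) for _ in range(rounds))
    return wins / rounds

if __name__ == "__main__":
    print(f"always stay:   {win_rate(switch=False):.3f}")
    print(f"always switch: {win_rate(switch=True):.3f}")
```

With enough rounds the two rates should settle near 1/3 and 2/3, which is the whole content of the table: switching wins exactly when the first pick was wrong.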

Why it works

The host is doing you a favor (and leaking information)

The cleanest way to feel the answer: your first pick is right one time in three. That number never changes. When you pointed at Door 1, there was a 1/3 chance the car was behind it and a 2/3 chance it was behind “one of the other two.” The host then opens a goat door — but crucially, the host is not choosing at random. He knows where the car is, and he is required to reveal a goat. So all of that 2/3 probability, which used to be smeared across two doors, gets concentrated onto the single door he didn’t open.

The host’s reveal isn’t noise. It’s information — the first appearance of one of our five recurring threads in hard quantitative form. Day 1’s stopped clock taught that being right by luck is not knowledge; here, the host’s constrained, knowledgeable action is evidence that moves the credence dial. Switch, and you’re betting on that fat 2/3. Stay, and you’re clinging to your original lonely 1/3.

If your intuition still resists, blow the problem up. Imagine a thousand doors. You pick one — a 1-in-1,000 shot. The host, who knows, then opens 998 other doors, every single one a goat, leaving just your door and one other. Do you really still think it’s a coin flip? Almost certainly the car is behind that one door the host so pointedly avoided. The three-door version is the same logic, merely too small to feel.
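The many-doors version is easy to check numerically. The sketch below assumes the host always opens every other door except one and never reveals the car; `exact_switch_win` and `simulated_switch_win` are hypothetical helper names introduced for illustration.

```python
import random
from fractions import Fraction

def exact_switch_win(n_doors):
    """Switching wins whenever the first pick was wrong: (n - 1) / n."""
    return Fraction(n_doors - 1, n_doors)

def simulated_switch_win(n_doors, rounds=20_000, seed=1):
    """Estimate the switch win rate when the host opens n - 2 goat doors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(rounds):
        car = rng.randrange(n_doors)
        pick = rng.randrange(n_doors)
        if pick != car:
            # The host cannot open the car's door, so it is the one left closed.
            left_closed = car
        else:
            # The first pick was right; the host leaves a random goat door closed.
            left_closed = rng.choice([d for d in range(n_doors) if d != pick])
        wins += (left_closed == car)
    return wins / rounds

if __name__ == "__main__":
    for n in (3, 10, 1000):
        print(n, exact_switch_win(n), round(simulated_switch_win(n), 4))
```

The three-door case is just `n_doors = 3` in the same formula, which is why the thousand-door picture is a fair guide to it.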

Older than the game show

The puzzle didn’t start with Monty Hall. The statistician Steve Selvin posed it in a 1975 letter to The American Statistician — and his follow-up was the first place the phrase “the Monty Hall problem” ever appeared in print. Its skeleton is older still: it’s identical to Bertrand’s box paradox (Joseph Bertrand, 1889) and Martin Gardner’s Three Prisoners problem (1959). Mathematicians call this a veridical paradox — an answer that looks impossible but is provably true. Convergent re-discovery again, exactly like the Gettier case on Day 1: when minds keep tripping over the same stone for a century, the stone is real.

The model

Bayes’ theorem: the law of belief revision

What we did to those doors by hand has a name and a formula. The formula looks colder than the idea. Bayes’ theorem is just a rule for reweighing live possibilities after evidence arrives:

Bayes as a sieve. Evidence $E$ = “the host opened Door 3.” Each hypothesis starts with the same prior, then gets shrunk by how well it predicted that exact reveal.

$$P(H \mid E) = \frac{P(H)\, P(E \mid H)}{P(E)}$$

$H$ this row’s car-location claim	$P (H)$ prior chance this row is true	$P (E ∣ H)$ chance host opens Door 3 if this row is true	$P (H) P (E ∣ H)$ row’s surviving weight	$P (E)$ overall chance host opens Door 3	$P (H ∣ E)$ chance this row is true after Door 3 opens
$H :$ Car behind your Door 1	$1/3$	$1/2$	$1/6$ possible, but only half as expected	$1/2$	$1/3$
$H :$ Car behind Door 2	$1/3$	$1$	$1/3$ best survivor: the reveal was forced	$1/2$	$2/3$
$H :$ Car behind Door 3	$1/3$	$0$	$0$ ruled out: the host cannot open the car	$1/2$	$0$

$P (E)$ is the total weight that survived the evidence: $1/6 + 1/3 + 0 = 1/2$ . Divide each surviving weight by that half, and the unopened Door 2 gets $2/3$ of the remaining probability.


posterior (belief after evidence) = prior (belief before) × likelihood (how well $H$ predicts $E$ ), normalized by evidence

In words: your posterior belief in a hypothesis $H$ after seeing evidence $E$ equals your prior belief, multiplied by the likelihood — how strongly $H$ predicted that you’d see $E$ — divided by how expected $E$ was overall. Strong evidence is evidence your hypothesis predicts and rival hypotheses don’t. That’s the entire engine. Belief flows toward whatever best predicted what actually happened.

In Monty Hall, $H$ = “the car is behind Door 2” and $E$ = “the host opened Door 3.” If the car really is behind Door 2, the host is forced to open Door 3 (he can’t open your door or the car’s), so the likelihood is $1$ . If the car is behind your Door 1, he could’ve opened either 2 or 3, so the likelihood of opening 3 is only $1/2$ . That asymmetry in the likelihoods is exactly what tips the posterior to $2/3$ in favor of switching. The formula just does the bookkeeping our intuition botches.
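The bookkeeping can be written out as a short Python sketch. It uses exact fractions so the numbers match the sieve table; the `posterior` helper and the dictionary keys are names invented here for illustration.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' theorem over a finite set of hypotheses.

    priors[h] is P(H), likelihoods[h] is P(E | H); returns P(H | E) for each h.
    """
    # Surviving weight of each hypothesis: P(H) * P(E | H).
    weights = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(weights.values())  # P(E), the total surviving weight
    if evidence == 0:
        raise ValueError("the evidence is impossible under every hypothesis")
    return {h: w / evidence for h, w in weights.items()}

# You picked Door 1; E = "the host opened Door 3".
priors = {
    "car behind door 1": Fraction(1, 3),
    "car behind door 2": Fraction(1, 3),
    "car behind door 3": Fraction(1, 3),
}
likelihoods = {
    "car behind door 1": Fraction(1, 2),  # host could have opened door 2 or door 3
    "car behind door 2": Fraction(1),     # host was forced to open door 3
    "car behind door 3": Fraction(0),     # host never reveals the car
}
post = posterior(priors, likelihoods)

if __name__ == "__main__":
    for hypothesis, probability in post.items():
        print(hypothesis, probability)
```

Changing the likelihoods is how one would model a different host; for example, a host who opens a door at random changes the first two entries, and the 2/3 advantage for switching no longer follows.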

The trap that fools doctors

Bayes’ theorem doesn’t only rescue game-show contestants. It catches a mistake that, in one famous study, most physicians got wrong.


The table walks through the medical-test case, where prevalence matters as much as test accuracy. The same arithmetic governs spam filters and airport screening.

Interactive · the base-rate trap

A 99%-sensitive test for a rare disease

A disease affects a small slice of the population. The test is excellent. You test positive. What's the chance you're actually sick? Drag the sliders and watch the grid of 1,000 people sort itself into the four outcomes. The answer is almost always far lower than people guess.

Defaults: disease prevalence (base rate) 0.1% · test sensitivity (catches true cases) 99% · false-positive rate (flags healthy people) 5%

Result: $P (sick ∣ positive test)$ = 1.9%

Legend: true positive (sick, +) · false positive (healthy, +) · negative test

In a 1978 study (Casscells, Schoenberger & Graboys, New England Journal of Medicine), 60 Harvard medical staff and students were handed essentially this problem. The most common answer was 95%. The correct answer was about 2%. Only 11 of 60 got it right. The culprit is mixing up $P (positive ∣ sick)$ with $P (sick ∣ positive)$ — and forgetting that when a disease is rare, the few real cases drown in a sea of false alarms.

The base-rate trap

For the default assumptions: 0.1% prevalence, 99% sensitivity, and a 5% false-positive rate.

Group	Count out of 1,000	Positive tests
Sick people	1	About 1 true positive
Healthy people	999	About 50 false positives
All positive tests	About 51	Only about 1 is truly sick

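Without rounding, 1,000 people yield $0.99$ expected true positives and $999 \times 0.05 = 49.95$ expected false positives, so $50.94$ positive tests in all. The posterior is therefore about $0.99/50.94$, or $1.9\%$: the Casscells result with the sensitivity made explicit.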
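The same arithmetic as a small Python function, which makes it easy to see how the answer moves with the base rate. The function name and the 10% comparison case are illustrative assumptions; only the 0.1% / 99% / 5% defaults come from the example above.

```python
def p_sick_given_positive(prevalence, sensitivity, false_positive_rate):
    """Posterior probability of disease after one positive test (Bayes' theorem)."""
    true_positives = prevalence * sensitivity                  # sick and flagged
    false_positives = (1 - prevalence) * false_positive_rate   # healthy but flagged
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    rare = p_sick_given_positive(0.001, 0.99, 0.05)
    common = p_sick_given_positive(0.10, 0.99, 0.05)
    print(f"prevalence 0.1%: {rare:.1%}")
    print(f"prevalence 10%:  {common:.1%}")
```

The test itself is identical in both calls; only the prior changes, and the posterior changes with it.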

The deep idea

Why “extended logic,” not just a formula

Here’s the claim that gives today its title. Ordinary deductive logic — the syllogisms of Day 3 — is the logic of certainty: if all men are mortal and Socrates is a man, then Socrates is mortal, full stop. But almost nothing in real life is certain. We need a logic for the vast middle ground between “definitely true” (probability 1) and “definitely false” (probability 0). An influential result is that, under specified consistency and regularity conditions, numerical degrees of plausibility can be represented by the probability calculus.

This was made precise by the physicist R. T. Cox in 1946. Cox asked: suppose you want to attach a number to “how plausible is this, given what I know?” and you insist on a set of structural rules: plausibilities are ordered by real numbers; equivalent ways of evaluating the same proposition must agree (consistency); and the plausibility of “not-$A$” should depend only on the plausibility of “$A$.” Under suitable regularity assumptions, Cox-style representation theorems show that plausibilities satisfying those desiderata can be rescaled to obey the standard rules of probability. Negation then behaves like $1 - P (A)$, conjunction obeys the product rule, and taking evidence $E$ as given uses conditional probability. A graded-belief system satisfying those conditions is probability theory in another coordinate system.
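For readers who want the step from those rules to Bayes’ theorem spelled out, here is a short derivation sketch. It takes the sum and product rules as given (it does not reproduce Cox’s argument for them), and the background information $C$ is notation added here for illustration.

```latex
% Negation (sum rule) and conjunction (product rule), given background information C:
P(\lnot A \mid C) = 1 - P(A \mid C), \qquad
P(A \land B \mid C) = P(A \mid C)\, P(B \mid A \land C).

% Conjunction is symmetric, so expanding P(H \land E) in both orders
% (background C suppressed) gives
P(H)\, P(E \mid H) = P(E)\, P(H \mid E)
\quad\Longrightarrow\quad
P(H \mid E) = \frac{P(H)\, P(E \mid H)}{P(E)}, \qquad P(E) > 0.
```

Seen this way, Bayes’ theorem is not an extra postulate: it is the product rule read in two directions.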

The physicist E. T. Jaynes built his great posthumous book Probability Theory: The Logic of Science (2003) on exactly this foundation. His slogan: deductive logic is just the special case of probability theory where all the probabilities happen to be 0 or 1. Probability is logic extended to handle uncertainty — which is to say, extended to handle reality. Notice this is the third independent road to the same destination: the Dutch book argument (Day 1) got there from “don’t be exploitable,” and we’ll see decision theory get there from “don’t make dominated choices.” Coherence, no-sure-loss, and consistent reasoning all point at one calculus.

The necessary footnote

Cox’s original proof was too quick. In 1999 the computer scientist Joseph Halpern showed that Cox’s stated assumptions do not establish the result on finite domains and may also be insufficient on infinite ones. Cox-style theorems can be proved with stronger assumptions, though their exact strength and naturalness matter. So the right thing to say is not “probability is the only conceivable logic of uncertainty” but “under specified regularity conditions, consistent graded belief can be represented by the probability axioms.” Cox’s program survives, but not as a proof from bare common sense alone.

The debate

Two tribes, one equation

If probability is this beautiful and unified, why has it been the site of a century-long civil war? Because the equation is agreed on; what’s fought over is what the numbers mean. Both tribes use the very same calculus — the axioms Andrey Kolmogorov wrote down in 1933, which deliberately decline to say what probability is and only fix how it must behave. Onto that neutral skeleton, two interpretations are draped.

Frequentist

probability = long-run frequency

A probability is the frequency of an event in infinitely many repetitions. “The coin is fair” means it lands heads half the time over endless flips.
Parameters are fixed unknown constants; the data are random. You reason about how often your method would mislead you.
Tools: p-values, confidence intervals, Type I/II error (Fisher; Neyman & Pearson, 1920s–30s).
Can’t coherently say “70% chance there was life on Mars” — Mars either had life or it didn’t; there’s no repetition to count.

Bayesian

probability = degree of belief

A probability is a credence — your rational degree of confidence given what you know (straight from Day 1’s dial).
Parameters get probability distributions; you update them with Bayes’ theorem as data arrive.
Tools: priors, posteriors, Bayes factors. Lineage: Laplace → Jeffreys → Ramsey → de Finetti → Savage.
Happily says “70% chance of past life on Mars” — a one-off claim with no repetitions is exactly what credence is for.

Frequentism dominated the 20th century partly for a good reason and partly for an accident. The good reason: its founders craved objectivity and distrusted the Bayesian prior as a smuggled-in opinion. (Fisher dismissed “inverse probability” as something that “must be wholly rejected.”) The accident: Bayesian methods need heavy computation, which didn’t exist until cheap computers arrived. The central Bayesian sore point remains the prior — where does your “before” belief come from, and why should anyone trust yours? Objective Bayesians (Jeffreys, Jaynes) hunt for rule-based priors; subjective Bayesians shrug and say all reasoning starts somewhere.

“Probability does not exist”

The Italian Bruno de Finetti opened his treatise with those four words, in capitals. His point was deliberately counterintuitive: there is no probability “out there” in the world like mass or charge — there is only the coherent betting behavior of a reasoning agent. He backed the slogan with a real theorem (his 1937 representation theorem): if you treat a sequence of observations as exchangeable — order doesn’t matter to you — then you are mathematically obliged to act as if there’s some fixed unknown frequency with a prior over it. Subjective belief and objective-looking parameters turn out to be two views of one structure. A truce, written in math.
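A small Python sketch can make exchangeability concrete. It shows only the easy direction (a prior over a fixed unknown frequency produces beliefs in which order doesn't matter); de Finetti's theorem is the converse. The uniform prior and the function name are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations

def sequence_probability(flips):
    """Probability of an exact H/T sequence under a uniform prior on the heads frequency.

    Built one flip at a time with Laplace's rule of succession:
    P(next is heads | h heads in m flips) = (h + 1) / (m + 2).
    """
    probability = Fraction(1)
    heads = 0
    for seen, flip in enumerate(flips):
        p_head = Fraction(heads + 1, seen + 2)
        if flip == "H":
            probability *= p_head
            heads += 1
        elif flip == "T":
            probability *= 1 - p_head
        else:
            raise ValueError(f"expected 'H' or 'T', got {flip!r}")
    return probability

if __name__ == "__main__":
    # Every reordering of two heads and one tail gets the same probability.
    for ordering in sorted(set(permutations("HHT"))):
        print("".join(ordering), sequence_probability(ordering))
```

The probability depends only on how many heads and tails were seen, not on their order, which is exactly what treating the sequence as exchangeable means.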

And note the practical wisdom that falls out: Cromwell’s rule (named by Dennis Lindley after Oliver Cromwell’s 1650 plea, “think it possible that you may be mistaken”). Never set a prior to exactly 0 or 1, because Bayes’ theorem can never budge it afterward — a belief held with absolute certainty is, by construction, unteachable. Leave a sliver of doubt for the moon being green cheese, Lindley wrote, or no returning astronaut’s cheese samples will ever move you. Calibration, again — the through-line of this whole block.
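Cromwell's rule is easy to see in code. The sketch below repeatedly applies Bayes' theorem for a single hypothesis; the likelihoods (0.99 and 0.01) and the starting priors are made-up numbers for illustration only.

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """One Bayesian update of P(H) after evidence with the given likelihoods."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1 - prior) * likelihood_if_false
    return numerator / denominator

def after_repeated_evidence(prior, observations=10):
    """Belief after several independent observations, each favoring H by 99 to 1."""
    belief = prior
    for _ in range(observations):
        belief = update(belief, likelihood_if_true=0.99, likelihood_if_false=0.01)
    return belief

if __name__ == "__main__":
    print("prior exactly 0:    ", after_repeated_evidence(0.0))
    print("prior one in a million:", after_repeated_evidence(1e-6))
```

A prior of exactly zero multiplies every likelihood by zero, so the belief never moves; a prior of one in a million is tiny but teachable, and strong evidence drives it close to 1.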

The frontier · 2026

The quiet mutiny against the p-value

For a century, the frequentist p-value has been science’s gatekeeper: get below $0.05$ and you may call your result “significant.” On Day 2 we saw the bill come due — the replication crisis, in which mountains of “significant” findings simply evaporated on re-testing. A big culprit is structural: the p-value is fragile. Peek at your data midway and stop the moment you hit $p < 0.05$ , and you’ve quietly inflated your false-positive rate — a sin so common it has a name, “optional stopping.” A new framework now circulating through statistics rebuilds testing from the ground up to fix exactly this. Its central object isn’t a probability. It’s a bet.

Edge 01 · E-value math

The e-value: test a hypothesis by betting against it

An e-value is the payoff of a bet against the null hypothesis. You wager \$1 that the null is false, under a betting contract designed to be fair if the null is true, meaning that if the null really holds, you can’t expect to grow your money (in symbols, the expected value of an e-value under the null is at most $1$). So if you walk away having multiplied your stake twentyfold, something is off with the null: either it’s false, or you got lucky in a way that, by Markov’s inequality, a bet fair under the null allows at most one time in twenty. A large e-value is literally money won against the null, and your accumulated wealth is your evidence. The reciprocal $1/E$ of an e-value $E$ behaves like a conservative p-value, but the betting picture is the point.

In the coin example, the null is concrete: the coin is fair, $P (heads) = 0.5$. The e-value is the wealth from two likelihood-ratio tickets. One ticket bets on a heads-heavy coin, $P (heads) = 0.60$: a head multiplies that ticket by $0.60/0.50 = 1.2$, while a tail multiplies it by $0.40/0.50 = 0.8$. The mirror ticket bets on a tails-heavy coin, $P (heads) = 0.40$, with the multipliers reversed. Split the starting \$1 evenly between those two tickets, and either kind of sustained bias can make wealth grow. If the coin is actually fair, each ticket has expected multiplier $1$ on every flip; the game is fair under the null. In this toy game, winning means your wealth gets large enough to reject “fair coin”; losing means the wealth stalls or shrinks, so you have not earned evidence against fairness.
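The two-ticket ledger can be sketched directly. The multipliers come from the paragraph above; the function name and the string encoding of flips ("H" and "T") are conventions assumed here for illustration.

```python
def two_ticket_wealth(flips, p_alt=0.60, p_null=0.50):
    """E-value (total wealth) after a sequence of 'H'/'T' flips, starting from 1.

    Half the stake rides on a heads-heavy ticket and half on its tails-heavy mirror;
    each ticket is multiplied by a likelihood ratio on every flip.
    """
    up = p_alt / p_null                # 1.2 for the outcome the ticket favors
    down = (1 - p_alt) / (1 - p_null)  # 0.8 for the other outcome
    heads_ticket = 0.5
    tails_ticket = 0.5
    for flip in flips:
        if flip == "H":
            heads_ticket *= up
            tails_ticket *= down
        elif flip == "T":
            heads_ticket *= down
            tails_ticket *= up
        else:
            raise ValueError(f"expected 'H' or 'T', got {flip!r}")
    return heads_ticket + tails_ticket

if __name__ == "__main__":
    for flips in ("HT" * 10, "H" * 10, "H" * 21):
        print(f"{len(flips):2d} flips, {flips.count('H'):2d} heads: "
              f"wealth {two_ticket_wealth(flips):.2f}")
```

Balanced sequences slowly lose money, while a long one-sided run compounds on one ticket. Averaged over all equally likely sequences of a fair coin, the wealth stays at 1, which is what "fair under the null" means.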

This isn’t loose metaphor; it’s a rigorous program — “game-theoretic statistics,” built over two decades by Glenn Shafer and Vladimir Vovk and now carried forward by Aaditya Ramdas, Peter Grünwald, Ruodu Wang and others. Shafer’s manifesto, “Testing by Betting,” was read before the Royal Statistical Society in 2020 and published in its Journal (Series A) in 2021. His complaint about the p-value is partly that it’s too confusing to communicate; “I won $20 betting against this hypothesis” is something a human can actually grasp.

Edge 02 · Peeking · P-value replacement

Why a bet beats a p-value: you can peek all you want

Bets compound. If you make a fair bet against the null, then another, then another, your running wealth forms what mathematicians call a martingale, and a classical result, Ville’s inequality, guarantees that if the null is true, the probability that your wealth ever reaches $1/\alpha$ is at most $\alpha$: a twentyfold gain at any point along the way happens at most 5% of the time. This gives e-values an almost magical property the p-value lacks: anytime validity. You may watch the experiment unfold, stop whenever you like, collect more data if it looks promising, peek as often as you want, and your error guarantee still holds. Grünwald, de Heide & Koolen call this “safe testing” (published in the RSS Journal, Series B, 2024); the broader machinery, including confidence intervals that are valid at every moment, is “safe anytime-valid inference” (Ramdas, Grünwald, Vovk & Shafer, Statistical Science, 2023). E-values also combine trivially: multiply independent ones, or even average dependent ones, and you still have a valid e-value, which makes pooling studies clean where p-values turn into a multiple-comparisons minefield.


The table contrasts a peeking p-value with an anytime-valid e-value and shows why their guarantees differ.

The toy task is intentionally narrow: it is trying to reject one claim, “this coin is fair,” not estimate the exact bias or prove unfairness with certainty.
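The contrast can be simulated on a truly fair coin. This is a sketch under stated assumptions: the p-value is a two-sided normal-approximation test recomputed after every flip (my choice, not a method named in the text), the e-value is the two-ticket wealth from the coin example, and the run counts, flip limit, and seed are arbitrary.

```python
import math
import random

def z_test_p_value(heads, n):
    """Two-sided normal-approximation p-value for the null P(heads) = 0.5."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def one_run(rng, p_heads=0.5, max_flips=300, alpha=0.05, first_look=10):
    """Peek after every flip; report whether each method ever rejected the null."""
    heads = 0
    heads_ticket = tails_ticket = 0.5
    p_rejected = e_rejected = False
    for n in range(1, max_flips + 1):
        is_head = rng.random() < p_heads
        heads += is_head
        heads_ticket *= 1.2 if is_head else 0.8
        tails_ticket *= 0.8 if is_head else 1.2
        # Optional stopping with a p-value: declare victory the first time p < alpha.
        if n >= first_look and z_test_p_value(heads, n) < alpha:
            p_rejected = True
        # Anytime-valid rule: reject only if wealth ever reaches 1 / alpha.
        if heads_ticket + tails_ticket >= 1 / alpha:
            e_rejected = True
    return p_rejected, e_rejected

def false_positive_rates(runs=1000, seed=0):
    """Share of fair-coin streams on which each method wrongly rejected fairness."""
    rng = random.Random(seed)
    results = [one_run(rng) for _ in range(runs)]
    p_rate = sum(p for p, _ in results) / runs
    e_rate = sum(e for _, e in results) / runs
    return p_rate, e_rate

if __name__ == "__main__":
    p_rate, e_rate = false_positive_rates()
    print(f"peeking p-value false-positive rate: {p_rate:.3f}")
    print(f"e-value false-positive rate:         {e_rate:.3f}")
```

On a fair coin the peeking p-value should reject far more often than its nominal 5%, while the wealth threshold stays within the bound that Ville's inequality promises.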

What does this look like in science? In a living clinical meta-analysis, the null might be “BCG vaccination has no clinically relevant effect on COVID-19 infection in healthcare workers.” New randomized trials report at different times, and researchers want to update the synthesis whenever fresh data arrive without letting the false-positive risk creep upward every time they look. The ALL-IN meta-analysis framework was built for exactly that kind of setting: it lets evidence from successive trials be added while preserving type-I error and interval-coverage guarantees. In one BCG/COVID application, “winning” for the evidence process would have meant accumulating strong enough evidence for a clinically relevant benefit; the anytime-valid analysis instead found no clinically relevant reduction in infections, and left hospitalization too sparse for a firm conclusion. That is the same structure as the coin toy, with medical endpoints and trial streams replacing heads and tails.

Interactive · the gambler's ledger

Betting against a "fair" coin

The claim being tested is the null: the coin is fair, $P (heads) = 0.5$ . The slider sets the simulator's actual data-generating $P (heads)$ , hidden from the test in a real experiment. At $0.50$ the null is true; at 0.65 or 0.35 the null is false. This widget is asking only whether the data give strong evidence against fairness, not what the exact bias is. The e-value strategy bets against fairness flip by flip. If wealth reaches 20, the bet has won enough to reject fairness at level 0.05; if wealth stays near 1 or falls, the bet has not found evidence against fairness.

Actual $P (heads)$: 0.50 · null: coin is fair ($p = 0.5$)

Current wealth (e-value): 1.00 (\$1 split across two tickets) · Flips so far: 0 heads · Verdict: collecting… (need wealth ≥ 20)

Set actual $P (heads)$ to exactly 0.50 and the null is true: most wealth paths hover or sag, and only rare lucky streaks reach 20. Nudge it to 0.65 or 0.35 and the data now come from a biased coin, so one of the two tickets tends to compound. The upward march is a win for the betting strategy and evidence against fairness; a flat or falling line is a loss for that bet, meaning "keep collecting" or "do not reject," not "the coin is proven fair."
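The slider experiment can be approximated offline. This sketch reuses the two-ticket strategy and counts how often wealth reaches 20 for a few true values of $P (heads)$; the 300-flip limit, 1,000 runs, and seed are arbitrary assumptions, and the function names are hypothetical.

```python
import random

def reaches_threshold(p_heads, rng, max_flips=300, threshold=20.0):
    """Simulate one wealth path; return True if it ever reaches the threshold."""
    heads_ticket = tails_ticket = 0.5
    for _ in range(max_flips):
        if rng.random() < p_heads:
            heads_ticket *= 1.2
            tails_ticket *= 0.8
        else:
            heads_ticket *= 0.8
            tails_ticket *= 1.2
        if heads_ticket + tails_ticket >= threshold:
            return True
    return False

def rejection_rate(p_heads, runs=1000, seed=0):
    """Share of simulated paths that reject 'fair coin' within the flip limit."""
    rng = random.Random(seed)
    return sum(reaches_threshold(p_heads, rng) for _ in range(runs)) / runs

if __name__ == "__main__":
    for p in (0.50, 0.65, 0.35):
        print(f"true P(heads) = {p:.2f}: reached 20 in {rejection_rate(p):.1%} of runs")
```

At 0.50 only rare lucky streaks cross the line; at 0.65 or 0.35 one ticket compounds and most paths do. A path that never reaches 20 still says nothing more than "do not reject."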

The e-value ledger

An e-value is a nonnegative payoff whose expectation under the null is at most 1.

Quantity	Meaning	Use
$E = 1$	No net betting gain against the null	Starting point
Coin-demo ticket	A likelihood-ratio payoff: $1.2$ for the favored outcome, $0.8$ for the other	Fair in expectation if the coin is truly $P (heads) = 0.5$
$E = 20$	A twentyfold payoff from a bet fair under the null	Level- $0.05$ rejection threshold because $1/20 = 0.05$
Running wealth	A test martingale or e-process	Can be monitored continuously while preserving Type I error control

The tradeoff is conservatism: an anytime-valid ledger can need stronger or more sustained evidence than a fixed-horizon test when all modeling assumptions are exactly right.

Edge 03 · Adoption · Replacement hype

How far has the mutiny actually spread?

Here’s where the hype filter earns its keep. The mathematics of e-values is settled and elegant: peer-reviewed in the field’s very best journals (Annals of Statistics, both RSS Journals, Statistical Science), and gathered into a 390-page Foundations and Trends monograph by Ramdas & Wang after its 2024 preprint. That part is beyond dispute.

Real-world adoption is a narrower and more accurate story. The clearest uptake is in tech-company A/B testing, where “peeking” is the entire business model: Optimizely rebuilt its platform around “always-valid inference” (Johari, Koomen, Pekelis & Walsh), and Netflix and Adobe publicly run anytime-valid confidence sequences so product teams can monitor experiments continuously without cheating the statistics. That’s genuine production use — but it’s a long way from the world’s biostatistics, psychology, and physics communities, where the p-value remains entrenched.

And the new tool is no free lunch. In fixed-horizon comparisons, e-values can need more extreme data than p-values to reach the same rejection threshold; Shafer’s reply is that this is the cost of making the evidential scale stricter rather than a simple defect. The efficiency of your bet depends on choosing a good betting strategy, arguably the same modeling judgment a Bayesian makes in choosing a prior, reappearing in new clothes. Critics including Samuel Pawel and Leonhard Held warn that branding tests as “safe” or “always valid” can mislead, since the guarantees still rest on assumptions (a correctly specified model, no publication bias) that can fail like any other. The careful verdict: a rigorous, genuinely useful complement to the p-value with real promise, and emphatically not its science-wide replacement, at least not yet.

What would move the needle? If a drug regulator like the FDA or EMA blessed e-value designs for confirmatory clinical trials, or a top general-science journal wrote them into its author guidelines, the “replacement” claim could graduate from hype to hint to reality. Watch those two signals.

Open questions

What’s genuinely unsettled

What is a probability, really? A frequency in the world, a degree of belief in a mind, or a fair betting rate? Three centuries on, the interpretation war has truces (de Finetti) but no surrender.
Where do priors come from? Is there a principled, objective way to set your “before” belief, or does all reasoning rest on a choice no math can justify?
Will betting-based statistics actually take over? Or settle in as a specialist tool for sequential experiments while the p-value rules on — and is “choose your bet” any less subjective than “choose your prior”?
Is the brain literally running Bayes? Day 1’s predictive-processing thread says perception is Bayesian inference in neural tissue. Today gives that claim its normative backbone — but “the brain approximates Bayes” and “the brain is Bayesian” are very different bets, and we’ll return to them on Day 119.
Does Cox’s theorem truly force probability on any rational agent — including an artificial one — or only on agents that already accept his consistency axioms? (A question with teeth for the AI block, Days 138–145.)

The day in three sentences

Big idea: Probability isn’t merely a tool for dice and coins. Under Cox-style consistency and regularity conditions, it represents graded reasoning under uncertainty; Bayes’ theorem then updates those grades toward hypotheses that better predicted what you saw.
Best analogy: Monty Hall opening a goat door — a knowledgeable agent’s choice pours 2/3 of the probability onto one remaining door — and the gambler’s ledger, where evidence against a hypothesis is literally money won betting against it.
Live controversy: The frequentist–Bayesian split over what probability means, now joined by a 2020s mutiny that would replace the fragile, peek-sensitive p-value with the e-value — established as math, adopted in tech, but not (yet) the science-wide revolution its boldest fans promise.

Threads today › information (the host’s reveal and the e-value both as evidence that updates belief) · computation (mind and lab as inference engines) · energy (a light callback to the Bayesian brain) — with calibration carried straight from Day 1 and Day 2, and the intervention question opening into Causation.

Tomorrow → Day 5

Causation

Probability says how beliefs should move with evidence — but patterns alone never say why. Tomorrow: interventions versus observations, causal graphs, and what a correlation has to earn before it deserves the word because.

Sources

Sources & further reading

Selvin, S. (1975). “A Problem in Probability” (Letter to the Editor). The American Statistician 29(1): 67. doi:10.1080/00031305.1975.10479121. Follow-up: “On the Monty Hall Problem,” The American Statistician 29(3): 134, doi:10.1080/00031305.1975.10477398, the first print use of the name.
vos Savant, M. “Ask Marilyn.” Parade (Sept 9, 1990, and follow-ups 1990–91). marilynvossavant.com/game-show-problem. The column, reader letters, and the ~10,000-letter / ~1,000-PhD estimates (vos Savant’s own).
Tierney, J. (July 21, 1991). “Behind Monty Hall’s Doors: Puzzle, Debate and Answer?” The New York Times. nytimes.com. Includes Monty Hall and Persi Diaconis on the host-protocol caveat.
Hoffman, P. (1998). The Man Who Loved Only Numbers. Hyperion. The Erdős / Vázsonyi simulation anecdote.
Bertrand, J. (1889). Calcul des probabilités. Gauthier-Villars. Bertrand’s box paradox, the structural ancestor. See also Gardner, M. (1959), “Mathematical Games,” Scientific American (Three Prisoners).
Casscells, W., Schoenberger, A. & Graboys, T. B. (1978). “Interpretation by Physicians of Clinical Laboratory Results.” New England Journal of Medicine 299(18): 999–1001. doi:10.1056/NEJM197811022991808. Only 11 of 60 clinicians gave the ~2% answer.
Cox, R. T. (1946). “Probability, Frequency and Reasonable Expectation.” American Journal of Physics 14(1): 1–13. doi:10.1119/1.1990764. Conditions used to derive probability rules from graded plausibility.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press (ed. G. L. Bretthorst). doi:10.1017/CBO9780511790423. Probability as extended logic.
Halpern, J. Y. (1999). “A Counterexample to Theorems of Cox and Fine.” Journal of Artificial Intelligence Research 10: 67–85. doi:10.1613/jair.536. The rigor caveat on Cox’s theorem.
Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability). Springer. The interpretation-neutral axioms.
de Finetti, B. (1937 / 1974). “La prévision…”; Theory of Probability (Eng. trans.). “PROBABILITY DOES NOT EXIST”; the representation theorem.
Lindley, D. V. (1991). Making Decisions, 2nd ed. Wiley. Cromwell’s rule (p. 104).
Shafer, G. (2021). “Testing by Betting: A Strategy for Statistical and Scientific Communication.” Journal of the Royal Statistical Society Series A 184(2): 407–431. doi:10.1111/rssa.12647. With published discussion (incl. Vovk’s comment, JRSS-A 184(2): 445–446).
Vovk, V. & Wang, R. (2021). “E-values: Calibration, combination, and applications.” The Annals of Statistics 49(3): 1736–1754. doi:10.1214/20-AOS2020.
Grünwald, P., de Heide, R. & Koolen, W. (2024). “Safe Testing.” Journal of the Royal Statistical Society Series B 86(5): 1091–1128. doi:10.1093/jrsssb/qkae011. Read paper, with discussion incl. Shafer, Pawel & Held.
Ramdas, A., Grünwald, P., Vovk, V. & Shafer, G. (2023). “Game-Theoretic Statistics and Safe Anytime-Valid Inference.” Statistical Science 38(4): 576–601. doi:10.1214/23-STS894. arXiv:2210.01948.
Ramdas, A. & Wang, R. (2025; first posted 2024). “Hypothesis Testing with E-values.” Foundations and Trends in Statistics 1(1–2): 1–390. doi:10.1561/3600000002. The comprehensive monograph.
ter Schure, J., Ly, A., Belin, L. et al. (2022). “Bacillus Calmette-Guérin vaccine to reduce COVID-19 infections and hospitalisations in healthcare workers.” Prospective ALL-IN meta-analysis preprint. Amsterdam UMC. Exact e-value logrank tests and anytime-valid CIs in a living clinical meta-analysis.
Johari, R., Koomen, P., Pekelis, L. & Walsh, D. (2022). “Always Valid Inference: Continuous Monitoring of A/B Tests.” Operations Research 70(3): 1806–1821. doi:10.1287/opre.2021.2135. Optimizely’s deployment; cf. Netflix Research on anytime-valid inference and Adobe’s Experience Platform confidence sequences.
Wasserstein, R. L. & Lazar, N. A. (2016). “The ASA Statement on p-Values.” The American Statistician 70(2): 129–133. doi:10.1080/00031305.2016.1154108. See also Amrhein, Greenland & McShane (2019), “Retire statistical significance,” Nature 567: 305–307, doi:10.1038/d41586-019-00857-9.

End of Day 004 · 176 descents remain