
Block I · Foundations of Knowledge & Reasoning · Day 005 / 180

Causation

Ice cream and drowning rise and fall together. That correlation alone does not show causation. So what is the difference between a pattern and a cause?

A schematic toy scatter plot: the dots march upward together — a textbook correlation. Their color is the variable the chart never shows: the season. Heat sells cones and sends people to the water.

Every summer, two numbers climb in lockstep. As ice-cream sales rise, so do drownings; as the cones stop selling in autumn, the drownings taper off too. Plot them and you get a clean, confident upward line — the kind of correlation that would make a careless analyst reach for a headline. Ban ice cream, save lives. And yet the relevant point is narrower: that correlation alone does not show that eating ice cream causes drowning.

What you know in your bones is one of the hardest things to write down in science. In this toy graph there is a third character lurking off-stage — summer — and it is pulling both strings at once. Heat drives people to buy ice cream, and heat drives people into lakes and oceans where some of them drown. The two visible numbers dance together only because an invisible one is conducting. Today is about the machinery for catching that invisible conductor — and the modern discovery that causation is not just a stronger correlation. It lives on a higher rung of a ladder that observational distributions do not determine without additional causal assumptions.

Where we are

Four days in, the toolkit is starting to interlock. Day 1 warned us about beliefs that are true but only by luck — the stopped clock. A spurious correlation is exactly that at population scale: a number that’s “right” for entirely the wrong reason. Day 2 introduced Hume and the problem of induction; today we meet the same Hume because his accounts of causation and induction are tightly linked: both challenge our right to move from repeated observations to necessity or future regularity. Day 3 sorted reasoning into deduction, induction, and abduction — and causal discovery, as an analogy, often feels like abduction with teeth: inference to the best causal explanation under assumptions. And Day 4 gave us the do-versus-see distinction’s parent, probability: today’s punchline is that $P (y ∣ do (x))$ and $P (y ∣ x)$ answer different questions and need not be equal.

The oldest problem

Hume kicks out the leg

Start where the trouble starts. In 1739, in A Treatise of Human Nature, David Hume asked a deceptively simple question: when one billiard ball strikes another and the second rolls away, what exactly do you see? You see the first ball move. You see them touch. You see the second ball move. What you never see — look as hard as you like — is the causing itself: the necessary connection, the hidden force, the little arrow of because linking the two events.

All we ever actually observe, Hume argued, is constant conjunction: events of this type are reliably followed by events of that type. Add that the cause comes first (priority) and that the two events touch in space and time (contiguity), and you have everything experience delivers. On a standard Humean reading, the sense of necessity — the feeling that the second ball had to move — comes from customary expectation rather than from perceiving a necessary connection. Hume found two definitions of “cause” tangled together in our heads: one about the world (constant conjunction) and one about us (the mind’s practiced leap from one to the other).

This should feel familiar. It is closely related to the problem of induction from Day 2, wearing a different coat. If causation is just “this has always been followed by that,” then claiming the next collision will behave like the last is precisely the unprovable bet on the uniformity of nature that Hume argued can never be justified non-circularly. The challenge generated competing regularity, counterfactual, probabilistic, and interventionist accounts; no single account is generally accepted as resolving every major class of difficult case.

Why “the cement of the universe”?

The phrase gets attached to Hume, but it is better treated as Mackie’s title and image: J. L. Mackie’s 1974 book is called The Cement of the Universe. This “cement” is the one thing we can never see. We infer the glue only from the fact that the bricks keep ending up stuck together.

Four repairs

Saying what “more” there is

If causation is more than constant conjunction, the obvious move is to say what the “more” is. The twentieth century produced several serious answers — different ways to finish the sentence “C causes E means…”. They are not simple rivals so much as lenses that modern causal modeling keeps borrowing from.

Lewis · 1973

Counterfactual. In simple cases, E counterfactually depends on C: had C not happened, E would not have happened. Lewis then analyzes causation more generally through chains of such causal dependence. Clean and intuitive, but it has to wrestle with backup causes and double-killings (preemption and overdetermination).

Reichenbach · Suppes · Cartwright

Probabilistic. These are different probabilistic programs, but the shared slogan is that causes raise the probability of their effects. Reichenbach’s common-cause principle: if A and B are correlated but neither causes the other, a shared cause C “screens them off” — hold C fixed and the correlation dissolves in the model. Summer is the intended common-cause explanation in the toy example.
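Screening off can be checked by direct enumeration. The sketch below assumes a fork (ice cream ← summer → drowning) with made-up probabilities; the numbers and the names `P_ICE`, `P_DROWN`, and `p_drown_given` are illustrative assumptions, not values from the lesson or any data set.

```python
from itertools import product

# Hypothetical toy probabilities (assumed for illustration only).
P_SUMMER = 0.5
P_ICE = {1: 0.8, 0: 0.2}     # P(high ice-cream sales | summer = s)
P_DROWN = {1: 0.3, 0: 0.05}  # P(high drowning count | summer = s)

def bern(p, v):
    """Probability that a Bernoulli(p) variable takes value v."""
    return p if v == 1 else 1.0 - p

# Joint P(s, i, d) implied by the fork: ice cream <- summer -> drowning.
joint = {
    (s, i, d): bern(P_SUMMER, s) * bern(P_ICE[s], i) * bern(P_DROWN[s], d)
    for s, i, d in product((0, 1), repeat=3)
}

def prob(pred):
    return sum(p for k, p in joint.items() if pred(*k))

def p_drown_given(i, s=None):
    """P(d = 1 | i) or, when s is given, P(d = 1 | i, s)."""
    def match(S, I, D):
        return I == i and (s is None or S == s)
    return prob(lambda S, I, D: D == 1 and match(S, I, D)) / prob(match)

pooled_gap = p_drown_given(1) - p_drown_given(0)            # association
summer_gap = p_drown_given(1, s=1) - p_drown_given(0, s=1)  # summer held fixed
winter_gap = p_drown_given(1, s=0) - p_drown_given(0, s=0)  # winter held fixed
print(round(pooled_gap, 3), round(summer_gap, 3), round(winter_gap, 3))
```

In this toy model the pooled gap is positive while both season-specific gaps vanish, which is what "hold C fixed and the correlation dissolves" means in practice.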

Woodward · 2003

Interventionist. $C$ is a cause of $E$ when an appropriate intervention on $C$ (one that overrides $C$’s usual causes and has no independent route to $E$) would change $E$. No human needed: volcanoes cause ash even with no one to push the button. This is closely aligned with the interventionist semantics of structural causal models: below, Pearl turns the idea into $do (C)$ and graph surgery.

Mackie · 1974

INUS conditions. A cause can be an insufficient but non-redundant part of an unnecessary but sufficient condition. A short circuit is not enough by itself to burn a house down; it matters because it is a non-redundant part of one sufficient package: short circuit plus oxygen plus flammable material plus no timely suppression.

One question, several lenses: counterfactual = “what if it hadn’t?”; probabilistic = “does it raise the odds, holding rivals fixed?”; INUS = “what role did this part play in a sufficient package?”; interventionist = “what changes under the right kind of intervention?”. Structural causal models provide a precise language for many counterfactual and interventionist questions, though they do not settle every philosophical account of causation. The statistical frameworks that follow ask a narrower operational question: once an intervention and target effect have been specified, what assumptions let data identify that effect? Notice how Cartwright (1979) sharpened the probabilistic story, because her fix is the hinge of the whole day. Causes raise the probability of effects, yes — but only inside a “causally homogeneous” background, with the relevant other causes held fixed. Forget that proviso and you walk straight into the most beautiful trap in statistics.

The trap

Simpson’s paradox: when the numbers literally reverse

Here is a fact that sounds impossible until you write the counts down: a treatment can have a higher observed success rate among small stones, a higher observed success rate among large stones, and yet a lower observed success rate overall. The kidney-stone data below make the reversal concrete.

A Simpson reversal occurs because the groups being combined have different mixtures of subgroups. The reversal itself does not tell you which comparison is causally relevant. That depends on what the third variable is: a confounder that should be adjusted for; a mediator, where adjustment changes a total-effect question into a direct-effect question; a collider or selection variable, where adjustment can create bias; or merely a descriptive partition involved in algebraic or non-collapsibility behavior. Causal knowledge, not the reversal alone, decides whether pooling or stratifying answers the right question.

In the kidney-stone data (Charig et al., BMJ, 1986), this was a historical, nonrandomized comparison of open surgery A and percutaneous nephrolithotomy B. The B cases contained a much larger share of small stones. B therefore has the higher pooled success rate, while A has the higher observed success rate within each reported stone-size stratum. Those observational strata demonstrate unequal case mix and a reversal; by themselves, they do not establish that open surgery would have produced better outcomes for the same patients.

The table below makes the reversal concrete: pooled data favor B, while each stone-size stratum favors A. Stone size is a plausible baseline severity variable here, so stratification is informative; in other causal structures, the same adjustment can answer a different question or create bias.

Interactive · watch it flip

The Reversal Machine

Real 1986 kidney-stone data, 350 patients per treatment. The cases stay fixed; the toggle changes which comparison frame is drawn around them.

Legend: Treatment A is open surgery; Treatment B is PCNL (minimally invasive); each case is marked effective or not effective.

Kidney-Stone Simpson Reversal

Treatment B has the higher pooled observed rate, while Treatment A has the higher observed rate inside each stone-size group. This demonstrates a reversal, not a treatment recommendation.

View	Treatment A	Treatment B	Higher observed rate
All patients	273 / 350 = 78.0%	289 / 350 = 82.6%	B
Small stones	81 / 87 = 93.1%	234 / 270 = 86.7%	A
Large stones	192 / 263 = 73.0%	55 / 80 = 68.8%	A

The pooled comparison reverses because the treatment groups contain different mixtures of small- and large-stone cases.
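The arithmetic behind the reversal can be reproduced from the four counts in the table. The sketch below also reweights each treatment's stratum rates to the combined stone-size mix of all 700 cases; that standardization step is an illustration added here, and it has a causal reading only if stone size is a sufficient adjustment set, which these historical data do not establish.

```python
# Counts from the table above: (successes, patients) per treatment and stratum.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def pooled_rate(treatment):
    """Success rate ignoring stone size."""
    successes = sum(s for s, _ in counts[treatment].values())
    patients = sum(n for _, n in counts[treatment].values())
    return successes / patients

def stratum_weight(stratum):
    """Share of all cases (both treatments) that fall in this stratum."""
    total = sum(n for t in counts.values() for _, n in t.values())
    return sum(counts[t][stratum][1] for t in counts) / total

def standardized_rate(treatment):
    """Stratum rates averaged over the combined stone-size mix."""
    return sum(stratum_weight(k) * s / n
               for k, (s, n) in counts[treatment].items())

for t in ("A", "B"):
    print(t, f"pooled={pooled_rate(t):.3f}",
          f"standardized={standardized_rate(t):.3f}")
```

The pooled rates favor B, while the standardized rates favor A, because B's cases were mostly small stones and A's were mostly large ones.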

The lesson lands hard: you cannot read causation off a table of numbers. The very same digits can support different summaries depending on a variable that may not even be in the spreadsheet. Which raises the question that organizes everything after this — if the data alone won’t tell you, what will? The answer, from a computer scientist who spent the 1980s building machines that reason under uncertainty, is that you need to add something the data doesn’t contain: a model of which arrows point where.

The causal revolution

Pearl’s ladder, and the verb that changed everything

Judea Pearl won the 2011 Turing Award “for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning.” His central image, popularized in The Book of Why (2018), is the influential Ladder of Causation: association, intervention, and counterfactuals. It is a powerful conceptual hierarchy, not an uncontested taxonomy every causal-inference tradition states in exactly the same form.

Climb it yourself:

Interactive · climb the ladder

The Three Rungs of Causation

Tap a rung. Each adds a verb, a piece of notation, and a question the rung below cannot answer. The jump from Rung 1 to Rung 2 is the entire subject of today.

Pearl’s Ladder of Causation

Rung	Verb	Notation	Question
1. Association	Seeing	$P (Y ∣ X)$	How do outcomes differ among cases where $X$ is observed?
2. Intervention	Doing	$P (Y ∣ do (X))$	What would happen if $X$ were set by intervention?
3. Counterfactuals	Imagining	$P (Y_{x} ∣ X^{'}, Y^{'})$	What would have happened in a different world, given what actually happened?

In the last line, $Y_{x}$ means “$Y$ in the world where $X$ is set to $x$”; $X^{'}$ and $Y^{'}$ are the actual facts you condition on.

The do-operator: seeing is not doing

Here is the conceptual hinge of the entire modern field, and it is worth saying slowly. There are two very different things you can do with a variable X.

You can condition on it — written $P (Y ∣ X = x)$ . This means: among all the cases where X happened to equal x, what’s the distribution of Y? You’re filtering an existing population. This is seeing.

Or you can intervene — written, in Pearl’s notation, $P (Y ∣ do (X = x))$ . This means: consider a regime that sets X to x in the target population, snipping X away from whatever usually causes it, and then watch Y. This is doing. In an ideal randomized trial, assignment is designed to approximate that severing operation.

The two may coincide, but equality requires a design or causal assumptions; it does not follow from their numerical similarity. $P (Y ∣ X = x)$ describes outcomes among cases in which X was observed to equal x. $P (Y ∣ do (X = x))$ describes outcomes in a hypothetical regime that sets X to x. The worked example below makes the difference auditable by calculating both quantities from an explicit structural model.

The potential-outcomes lens

The other major language of modern causal inference writes the same idea at the level of units. For unit $i$ , potential outcomes are $Y_{i} (1)$ and $Y_{i} (0)$ : the outcome if treated and the outcome if untreated. A common causal estimand is the average treatment effect in a specified target population and time horizon, $A T E = E [Y (1) - Y (0)]$ . The catch is the fundamental problem of causal inference: for any one unit, only one of those two potential outcomes is observed. The other is counterfactual.

Designs and assumptions connect this target to data. Consistency links observed outcomes to potential outcomes under the treatment actually received. Exchangeability says the groups being compared are comparable with respect to the relevant potential outcomes. Positivity means every relevant covariate group has a chance of receiving each treatment level. No interference means one unit’s treatment does not alter another unit’s outcome, unless spillovers are part of the model.
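A small enumeration can make exchangeability concrete. The sketch below invents six units whose potential outcomes are both known (something real data never provide), then compares the difference in means averaged over every equally likely 3-of-6 randomized assignment with a single assignment that treats the units with the highest baseline outcomes. The unit values and function names are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical units: (Y(0), Y(1)). In real data only one entry per unit is seen.
units = [(1, 2), (2, 4), (3, 3), (5, 6), (6, 8), (7, 7)]
n = len(units)

def avg(values):
    values = list(values)
    return sum(values) / len(values)

true_ate = avg(y1 - y0 for y0, y1 in units)

def diff_in_means(treated):
    """Observed contrast when the units in `treated` receive treatment.

    Consistency: treated units reveal Y(1), the others reveal Y(0).
    """
    treated_mean = avg(units[i][1] for i in range(n) if i in treated)
    control_mean = avg(units[i][0] for i in range(n) if i not in treated)
    return treated_mean - control_mean

# Complete randomization: every 3-of-6 assignment is equally likely.
assignments = [set(c) for c in combinations(range(n), n // 2)]
randomized_expectation = avg(diff_in_means(a) for a in assignments)

# A confounded assignment: the units with the highest baseline Y(0) get treated.
confounded_estimate = diff_in_means({3, 4, 5})

print(true_ate, round(randomized_expectation, 6), confounded_estimate)
```

Averaged over the randomization, the simple contrast matches the true average effect in this toy population; the confounded assignment overstates it because the compared groups differ in baseline outcomes.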

A safe causal workflow is: specify the unit, exposure, outcome, timing, and intervention; define the estimand; state the causal assumptions; determine whether the estimand is identified; estimate it statistically; then assess uncertainty, measurement, attrition, selection, overlap, and sensitivity to violations.

Interactive · seeing vs doing

The do-operator as a calculation

A toy structural model: rain makes umbrellas more likely and wet clothes more likely; umbrellas themselves reduce wetness. The sliders set logistic-coefficient strengths, not risk differences. Every number below is computed from this stated model.

Rain→umbrella coefficient strength b=+2.80

Umbrella→wetness coefficient d=−1.49

Rung 1 · Seeing

$P (wet ∣ umbrella) - P (wet ∣ no umbrella)$

+0.23 observed risk difference

Rung 2 · Doing

$P (wet ∣ do (umbrella)) - P (wet ∣ do (no umbrella))$

−0.16 interventional risk difference

Observed table	Wet	Dry	Risk wet
Umbrella	189	187	50.2%
No umbrella	170	454	27.2%

Model parameter	Default
$P (S = 1)$	0.45
`a`	−2.00
`b`	4 × slider = +2.80
`c`	−2.20
`d`	−3.3 × slider = −1.49
`e`	+4.10

Sign reversal

Seeing says umbrellas are associated with wetter clothes. Doing says umbrellas reduce wetness. Rain changed the mix of people in the observed umbrella and no-umbrella groups.

The model is: $S \sim Bernoulli (0.45)$ ; $P (X = 1 ∣ S = s) = logit^{- 1} (a + b s)$ ; $P (Y = 1 ∣ X = x, S = s) = logit^{- 1} (c + d x + es)$ . Here S is rain, X is umbrella use, Y is wet clothes, and the fixed defaults are a=-2.0, c=-2.2, e=4.1.

Worked example

Seeing Is Not Doing

In the default toy structural model, rain makes umbrellas more common and wet clothes more common. Umbrellas reduce wetness, but umbrella users are wetter in the observed data because they are disproportionately standing in rain.

Quantity	What it asks	Default reading
$P (wet ∣ umbrella) - P (wet ∣ no umbrella)$	Seeing: compare existing umbrella users with non-users.	+0.23 observed risk difference.
$P (wet ∣ do (umbrella)) - P (wet ∣ do (no umbrella))$	Doing: set umbrella use directly and sever its usual causes.	−0.16 interventional risk difference.
Back-door adjusted contrast	Average the rain-specific umbrella contrasts using the population rain mix.	−0.16, matching the intervention contrast under this model.

Default parameters: $P (S = 1) = 0.45$ , $a = - 2.0$ , $b = + 2.8$ , $c = - 2.2$ , $d = - 1.485$ , $e = + 4.1$ . Formula: $P (Y = 1 ∣ do (X = x)) = \sum_{s} P (Y = 1 ∣ X = x, S = s) P (S = s)$ . The observed contrast instead uses $P (S = s ∣ X = x)$ , which changes when rain affects umbrella use.
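Both contrasts can be recomputed from the stated model. The sketch below uses only the default parameters and the formulas given above; the function names (`expit`, `p_wet_seeing`, `p_wet_doing`) are choices made here for illustration.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Default parameters stated above.
P_RAIN = 0.45
a, b, c, d, e = -2.0, 2.8, -2.2, -1.485, 4.1

def p_s(s):
    return P_RAIN if s == 1 else 1.0 - P_RAIN

def p_x_given_s(x, s):
    p1 = expit(a + b * s)          # P(X=1 | S=s)
    return p1 if x == 1 else 1.0 - p1

def p_wet(x, s):
    return expit(c + d * x + e * s)  # P(Y=1 | X=x, S=s)

def p_wet_seeing(x):
    """P(Y=1 | X=x): rain is weighted by P(S=s | X=x)."""
    num = sum(p_wet(x, s) * p_x_given_s(x, s) * p_s(s) for s in (0, 1))
    den = sum(p_x_given_s(x, s) * p_s(s) for s in (0, 1))
    return num / den

def p_wet_doing(x):
    """P(Y=1 | do(X=x)): rain is weighted by its population mix P(S=s)."""
    return sum(p_wet(x, s) * p_s(s) for s in (0, 1))

seeing_gap = p_wet_seeing(1) - p_wet_seeing(0)
doing_gap = p_wet_doing(1) - p_wet_doing(0)
print(f"seeing: {seeing_gap:+.2f}, doing: {doing_gap:+.2f}")
```

The only difference between the two functions is the weight placed on rain: `p_wet_seeing` uses the rain mix among people who chose that umbrella status, and `p_wet_doing` uses the population rain mix.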

That severed arrow in the right-hand graph is the do-operator made visual. Intervening doesn’t just look at X — it deletes the arrows pointing into X and replaces them with your hand. The rain→umbrella link is cut, so the observational path umbrella ← rain → wetness is no longer open in the umbrella contrast. Under this graph, the back-door-adjusted contrast recovers the intervention contrast.

Pearl gave us a grammar for doing this on paper. The recursive or acyclic structural causal models used in this lesson are represented by directed acyclic graphs: boxes and arrows with no directed loops. Cyclic and dynamic structural models also exist. In a DAG, a fork (X ← Z → Y) can be a confounder; a chain (X → Z → Y) can be a mediator; and a collider (X → Z ← Y) is the trap. A path that enters a node through two arrowheads is normally closed at that collider. Conditioning on the collider — or sometimes one of its descendants — can open that path and induce an association along that route. This is why adjusting for every measured variable can introduce serious bias.

Fork / confounder: Summer changes both exposure and outcome. Condition on summer before comparing.

Chain / mediator: Tar carries part of smoking's effect. Conditioning on tar changes the question.

Collider: Skill and luck meet at admission. Conditioning on admission induces a noncausal association.
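The collider case is easy to verify by hand or by enumeration. The sketch below assumes skill and luck are independent fair coin flips and that admission requires at least one of them; both the probabilities and the admission rule are assumptions made for this illustration.

```python
from itertools import product

# Hypothetical toy: skill and luck are independent fair coin flips,
# and admission happens when an applicant has at least one of them.
P_SKILL, P_LUCK = 0.5, 0.5

worlds = []  # (probability, skill, luck, admitted)
for skill, luck in product((0, 1), repeat=2):
    p = (P_SKILL if skill else 1 - P_SKILL) * (P_LUCK if luck else 1 - P_LUCK)
    worlds.append((p, skill, luck, int(skill or luck)))

def p_skill_given(luck, admitted=None):
    """P(skill = 1 | luck) or, when admitted is given, P(skill = 1 | luck, admitted)."""
    rows = [w for w in worlds
            if w[2] == luck and (admitted is None or w[3] == admitted)]
    return sum(w[0] for w in rows if w[1] == 1) / sum(w[0] for w in rows)

# Whole applicant pool: luck says nothing about skill.
gap_everyone = p_skill_given(1) - p_skill_given(0)

# Admitted applicants only: unlucky admits must have been skilled.
gap_admitted = p_skill_given(1, admitted=1) - p_skill_given(0, admitted=1)

print(gap_everyone, gap_admitted)
```

Among admitted applicants, luck and skill become negatively associated even though neither affects the other, which is the bias that conditioning on a collider introduces.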

The front-door trick

Pearl’s front-door criterion is one of the framework’s neatest moves: sometimes you can identify a total effect even when the exposure-outcome path is confounded, by using a measured mediator. The usual textbook smoking → tar → cancer story is a stylized graph, not a settled description of the biological system. The standard criterion requires three graphical conditions: the mediator intercepts all directed paths from treatment to outcome; there is no unblocked back-door path from treatment to mediator; and treatment blocks all back-door paths from mediator to outcome. You still also need well-defined interventions, consistency, adequate support, and the usual measurement and selection cautions.

When those conditions hold, the calculation has three moves: estimate how treatment changes the mediator; estimate how the mediator changes the outcome while accounting for treatment; then average those pieces over the observed treatment mix.

$P (y ∣ do (x)) = \sum_{m} P (m ∣ x) \sum_{x^{'}} P (y ∣ m, x^{'}) P (x^{'})$

It identifies an interventional distribution from observational data under the assumed graph — but only because you supplied a graph strong enough to license the expression.
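To see the front-door expression at work, one can build a toy model with a hidden confounder, discard the confounder, and check that the formula still recovers the interventional quantity. The sketch below assumes the graph U → X → M → Y with U → Y and U unmeasured; all probabilities and names are hypothetical.

```python
from itertools import product

def bern(p, v):
    return p if v == 1 else 1.0 - p

# Hypothetical parameters; U is the unmeasured confounder of X and Y.
P_U = 0.5
P_X = {0: 0.2, 1: 0.8}             # P(X=1 | U=u)
P_M = {0: 0.1, 1: 0.9}             # P(M=1 | X=x)
P_Y = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | M=m, U=u)
       (1, 0): 0.4, (1, 1): 0.8}

# Observational joint over the measured variables only: P(x, m, y).
obs = {}
for u, x, m, y in product((0, 1), repeat=4):
    p = bern(P_U, u) * bern(P_X[u], x) * bern(P_M[x], m) * bern(P_Y[(m, u)], y)
    obs[(x, m, y)] = obs.get((x, m, y), 0.0) + p

def p_obs(**fixed):
    """Marginal probability of the given x/m/y values under `obs`."""
    return sum(p for (x, m, y), p in obs.items()
               if all({"x": x, "m": m, "y": y}[k] == v for k, v in fixed.items()))

def front_door(x):
    """P(Y=1 | do(X=x)) computed from observational quantities only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = p_obs(x=x, m=m) / p_obs(x=x)
        inner = sum(p_obs(x=xp, m=m, y=1) / p_obs(x=xp, m=m) * p_obs(x=xp)
                    for xp in (0, 1))
        total += p_m_given_x * inner
    return total

def truth(x):
    """P(Y=1 | do(X=x)) computed from the full model, hidden U included."""
    return sum(bern(P_U, u) * bern(P_M[x], m) * P_Y[(m, u)]
               for u, m in product((0, 1), repeat=2))

def naive(x):
    """P(Y=1 | X=x), the confounded observational quantity."""
    return p_obs(x=x, y=1) / p_obs(x=x)

print(round(naive(1) - naive(0), 3),
      round(front_door(1) - front_door(0), 3),
      round(truth(1) - truth(0), 3))
```

In this toy model the naive contrast is roughly twice the true effect, while the front-door contrast matches it. The agreement depends entirely on the assumed graph; if U also affected M, the same formula would be biased.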

The frontier · 2026

Three live edges — and the hype filter

Now the question that has launched a thousand papers and at least one industry: can you infer causation from observation alone? The precise answer is “partly — and there’s a proven wall.” Each claim below is tagged for how much weight it can bear. Evidence represented through 2024; page editorially reviewed June 2026.

Edge 01 · do-calculus · Markov ceiling

The two theorems that fence the field

This is the most solid ground in the whole day: not empirical findings that could be overturned, but mathematical proofs. First, do-calculus is complete for a scoped problem. Given an assumed nonparametric causal graph and the available observational distribution, its rules derive an observational expression for the specified intervention distribution whenever one exists. If identification fails in that problem, no alternative manipulation of the same observational information can identify it. Additional experiments, longitudinal information, parametric restrictions, valid proxies, or stronger functional assumptions change the information available and can change the answer.

What a do-calculus rewrite looks like

For a simple back-door graph, where Z blocks every back-door path from X to Y and no element of Z is caused by X:

$P (y ∣ do (x)) = \sum_{z} P (y ∣ do (x), z) P (z ∣ do (x)) = \sum_{z} P (y ∣ x, z) P (z)$

The first equality is ordinary total probability over Z. The second uses the graph twice: once to replace the intervention on X with observation of X after conditioning on the sufficient adjustment set Z, and once to say intervening on X does not change the pre-treatment distribution of Z. The graph licenses exactly which symbols may be erased or exchanged.
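For readers who want the two graph-licensed steps spelled out, here is a sketch in the standard do-calculus notation, where $G_{\underline{X}}$ is the graph with arrows out of $X$ removed and $G_{\overline{X}}$ is the graph with arrows into $X$ removed. The rule numbering follows Pearl's usual presentation; the side conditions are stated for this simple back-door case only.

```latex
\begin{align*}
P(y \mid do(x))
  &= \sum_{z} P(y \mid do(x), z)\, P(z \mid do(x))
     && \text{total probability} \\
  &= \sum_{z} P(y \mid x, z)\, P(z \mid do(x))
     && \text{Rule 2, since } (Y \perp X \mid Z) \text{ in } G_{\underline{X}} \\
  &= \sum_{z} P(y \mid x, z)\, P(z)
     && \text{Rule 3, since } (Z \perp X) \text{ in } G_{\overline{X}}
\end{align*}
```

The Rule 2 condition is the back-door requirement that $Z$ blocks every path into $X$; the Rule 3 condition holds because no element of $Z$ is caused by $X$.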

Second, the wall on the other side: the Markov-equivalence ceiling. Under causal Markov, faithfulness, and a specified treatment of latent variables and selection, conditional-independence patterns can rule out some graphs but not all. X→Y→Z, X←Y←Z, and X←Y→Z share a skeleton and have no differing unshielded colliders, so they imply the same conditional-independence pattern: X and Z are independent once you know Y. They form a Markov equivalence class. The collider X→Y←Z differs because X and Z start out independent and conditioning on Y can open the path through Y. The takeaway is stark but scoped: without further assumptions, observational independence information alone cannot single out one causal graph, only a class of candidates. Both results carry the strongest tag available: Theorems.
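The equivalence can be checked numerically: a joint distribution generated by a chain can be refactored exactly as a fork, so no observational sample could tell them apart. The sketch below uses binary variables and hypothetical parameters; the helper names are illustrative.

```python
from itertools import product

def bern(p, v):
    return p if v == 1 else 1.0 - p

# Hypothetical chain X -> Y -> Z with binary variables.
P_X1 = 0.3
P_Y1 = {0: 0.2, 1: 0.7}   # P(Y=1 | X=x)
P_Z1 = {0: 0.1, 1: 0.6}   # P(Z=1 | Y=y)

chain = {(x, y, z): bern(P_X1, x) * bern(P_Y1[x], y) * bern(P_Z1[y], z)
         for x, y, z in product((0, 1), repeat=3)}

def marg(joint, **fixed):
    """Marginal probability of the named values, e.g. marg(chain, x=1, y=0)."""
    names = ("x", "y", "z")
    return sum(p for k, p in joint.items()
               if all(k[names.index(n)] == v for n, v in fixed.items()))

# Refit the same joint as a fork X <- Y -> Z: P(y) P(x | y) P(z | y).
fork = {(x, y, z): marg(chain, y=y)
        * marg(chain, x=x, y=y) / marg(chain, y=y)
        * marg(chain, y=y, z=z) / marg(chain, y=y)
        for x, y, z in product((0, 1), repeat=3)}
max_gap = max(abs(chain[k] - fork[k]) for k in chain)

# A collider X -> Y <- Z (hypothetical parameters) leaves a different footprint.
collider = {(x, y, z): bern(0.3, x) * bern(0.5, z)
            * bern(0.9 if (x or z) else 0.1, y)
            for x, y, z in product((0, 1), repeat=3)}

def dependence_given_y(joint, y):
    """P(x=1, z=1 | y) - P(x=1 | y) P(z=1 | y); zero means independence here."""
    py = marg(joint, y=y)
    return (marg(joint, x=1, y=y, z=1) / py
            - marg(joint, x=1, y=y) / py * marg(joint, y=y, z=1) / py)

print(max_gap, dependence_given_y(chain, 1), dependence_given_y(collider, 1))
```

The chain and its fork refit agree to floating-point precision, and X and Z are independent given Y in both. In the collider, X and Z are independent marginally but become dependent once Y is fixed.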

Edge 02 · Direction · RCT benchmark

Squeezing direction out of still data — and why randomized assignment is still the benchmark

So is the ceiling the end? Not quite: you can climb over it by importing extra assumptions that bare independence tests don’t use. LiNGAM (Shimizu et al., JMLR, 2006) showed that if relationships are linear, acyclic, have no hidden confounders, have mutually independent disturbances, and the disturbances are non-Gaussian, the full direction becomes identifiable: the asymmetry of non-Gaussian noise breaks the X→Y / Y→X tie. Additive-noise models (Hoyer, Janzing, Mooij, Peters & Schölkopf, NIPS 2008 proceedings, published 2009) extended this to nonlinear cause-effect pairs. On the December 2017 Tübingen Cause-Effect Pairs benchmark version, using the eligible pairwise tasks with expert-assigned directions, Causal Mosaic reported about 83% unweighted accuracy (Wu & Fukumizu 2020). But note the shape of the trick: you escape the ceiling only by assuming non-Gaussianity, additivity, independence, or related structure. Some implications of these assumptions can be checked against data, but they cannot be fully established from finite observational samples. Tag: Assumption-bound.
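The asymmetry that non-Gaussian noise creates can be seen in a toy simulation. The sketch below is not the LiNGAM algorithm (which relies on independent component analysis); it is a crude stand-in that regresses in each direction and compares how strongly the squared residual tracks the squared regressor. The data-generating model, the dependence score, and the seed are all assumptions made for illustration.

```python
import random

def mean(vs):
    return sum(vs) / len(vs)

def corr(us, vs):
    """Pearson correlation."""
    mu, mv = mean(us), mean(vs)
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / (suu * svv) ** 0.5

def dependence_score(cause, effect):
    """Crude dependence check between a regressor and its OLS residual.

    OLS makes them uncorrelated by construction, so compare squares instead.
    """
    mc, me = mean(cause), mean(effect)
    centred = [c - mc for c in cause]
    slope = (sum(c * (e - me) for c, e in zip(centred, effect))
             / sum(c * c for c in centred))
    resid = [(e - me) - slope * c for c, e in zip(centred, effect)]
    return abs(corr([c * c for c in centred], [r * r for r in resid]))

def guess_direction(xs, ys):
    """Prefer the direction whose residual looks more independent."""
    forward = dependence_score(xs, ys)
    backward = dependence_score(ys, xs)
    return "X->Y" if forward < backward else "Y->X"

rng = random.Random(0)  # fixed seed so the run is repeatable
x = [rng.uniform(-1, 1) for _ in range(5000)]
y = [xi + rng.uniform(-1, 1) for xi in x]   # true model: X -> Y, uniform noise
direction = guess_direction(x, y)
print(direction)
```

With uniform noise, the residual from the wrong-direction regression stays visibly dependent on its regressor, so the score separates the two directions. With Gaussian noise both scores would hover near zero and the tie would stand, which is the assumption-bound character the paragraph above describes.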

Which is why randomized assignment remains the benchmark design for internal validity when the intervention is assignable. A randomized controlled trial makes assignment independent of pre-assignment causes in the study population, up to chance imbalance. In an ideal trial with adherence, no interference, a well-defined treatment, complete follow-up, and appropriate analysis, assignment and treatment coincide closely enough to estimate intervention effects directly. With noncompliance, the cleanest identified quantity may instead be the effect of assignment; effects of treatment received require additional assumptions. When you can’t randomize, economics’ credibility revolution hunts for designs that approximate random assignment: instruments, discontinuities, policy changes. That program earned the 2021 Nobel in economics for David Card, Joshua Angrist, and Guido Imbens (we’ll spend real time here on Day 152). Tag: Credibility turn.

Edge 03 · Causal reps · LLM causality

Can the machines do it? Causal ML and the “causal parrot”

The hottest and haziest edge. Causal representation learning (Schölkopf et al., Proceedings of the IEEE, May 2021) asks the deep question: classical causal discovery assumes the variables are handed to you, but the real world arrives as pixels and words. Can a network learn the high-level causal variables, and would that buy robustness to distribution shift, which today’s models often lack? It’s a serious, active program whose biggest promises are not yet delivered at scale. Tag: Program.

Then the lightning rod: can large language models reason about cause and effect? Kıcıman, Ness, Sharma & Tan (TMLR 2024; initial arXiv preprint 2023) reported GPT-4 hitting 97% on a natural-language Tübingen pairwise causal-direction task, using textual variable descriptions rather than the numerical observations, plus strong counterfactual scores. A contemporaneous critique (Zečević et al. 2023) argued that LLMs talk causality without being causal: they recite causal facts marinating in their training text rather than performing Pearl-style inference. Results are strongly task-dependent. Models can retrieve familiar causal facts and achieve high scores on semantically rich benchmarks, yet perform poorly when linguistic and world-knowledge shortcuts are removed or when causal mechanisms must be composed and transferred to novel settings. Recall the Day 1 Gettier trap in a new guise: an LLM that outputs a true causal claim for reasons that have nothing to do with the causal structure is right, but does it know? Verdict on “LLMs reason causally in Pearl’s sense”: useful for setting up a causal analysis; not established as a general causal reasoner. DoWhy and EconML are genuine tools in the modern ecosystem: DoWhy emphasizes graph-based identification, assumptions, and refutation; EconML emphasizes treatment-effect estimation with orthogonal machine learning methods. The sales pitch that causal AI will soon replace ordinary correlational ML is still ahead of the evidence.

Open questions

What’s genuinely unsettled

Is there a "right" theory of causation at all, or are counterfactual, probabilistic, INUS, mechanistic, and interventionist accounts each capturing a different facet — with none reducible to the others? No single account is generally accepted as resolving every major class of counterexample, including early and late preemption, overdetermination, omissions, and prevention.
How far can we trust faithfulness? The assumption that real systems do not hide causal paths through exact cancellations is convenient. Some implications can be checked against data, but faithfulness cannot be fully established from finite observational samples — and feedback, regulation, and homeostasis can make it questionable in some biological systems.
Can causal variables be learned from raw data (pixels, language) rather than handed to the algorithm — and is that even well-posed, since the “right” carve-up of the world into variables may itself be perspective-dependent?
Do large models build internal causal world-models, or only ever statistics of causal talk? The answer reaches straight into Days 138–145 and the question of whether prediction can ever amount to understanding.
Where do the arrows come from? Every method here needs some causal input — a graph, an assumption, an experiment. Hume’s ghost still asks whether that input is ever read off the world, or always brought to it.

The day in three sentences

Big idea: Causal effects are contrasts between outcomes under specified interventions. Observational and interventional quantities may coincide, but their equality cannot be assumed merely from observational data; it requires a design or causal assumptions encoded in randomization, a graph, a potential-outcomes model, or another defensible identification strategy.
Best analogy: Ice cream and drowning marching upward together while summer, off-stage, pulls both strings — and the do-operator as a pair of scissors that cuts the strings into a variable so you can see what it really drives.
Live controversy: Whether causal structure can be inferred from observation alone — often only up to a Markov-equivalence class under standard assumptions, partly recoverable with stronger assumptions (LiNGAM, additive noise) — and the current dispute over whether LLMs do general causal reasoning or mostly retrieve causal talk.

Threads today › information (a graph is the extra information data lacks; the do-operator distinguishes evidence from intervention) · computation (identification as an algorithmic problem; causal discovery as search under assumptions) · emergence (causal structure as a higher-level layer over raw correlation) — with callbacks to Day 1 (true-by-luck), Day 2 (Hume), Day 3 (abduction), and Day 4 ( $P (y ∣ x)$ ), before the statistical machinery continues in Statistics & the Art of Not Fooling Yourself.

Tomorrow → Day 6

Statistics & the Art of Not Fooling Yourself

Causal inference told us which comparisons deserve trust. Tomorrow the microscope turns on the numbers themselves: what a p-value does and does not say, confidence intervals, effect sizes, p-hacking — and the multiverse of analyses hiding behind every published result.

Sources

Sources & further reading

Hume, D. (1739–40). A Treatise of Human Nature, Book I, Part III; and (1748) An Enquiry Concerning Human Understanding, §VII. (Constant conjunction; no observable necessary connection.)
Mackie, J. L. (1974/1980). The Cement of the Universe: A Study of Causation. Oxford University Press. doi:10.1093/0198246420.001.0001. (INUS conditions; the title’s phrase.)
Lewis, D. (1973). “Causation.” Journal of Philosophy 70(17): 556–567. doi:10.2307/2025310. See also Lewis, Counterfactuals (Blackwell, 1973); revised “influence” account (2000).
Reichenbach, H. (1956). The Direction of Time. University of California Press. (The common-cause principle and screening off.)
Suppes, P. (1970). A Probabilistic Theory of Causality. North-Holland. (Prima facie vs spurious causes.)
Cartwright, N. (1979). “Causal Laws and Effective Strategies.” Noûs 13(4): 419–437. doi:10.2307/2215337. (Probability-raising only within causally homogeneous contexts.)
Woodward, J. (2003). Making Things Happen: A Theory of Causal Explanation. Oxford University Press. doi:10.1093/0195155270.001.0001. (The interventionist/manipulationist theory.)
Charig, C. R., Webb, D. R., Payne, S. R. & Wickham, J. E. A. (1986). “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.” British Medical Journal 292: 879–882. doi:10.1136/bmj.292.6524.879. (The kidney-stone data; observational and historical, not randomized evidence of treatment superiority.)
Simpson, E. H. (1951). “The Interpretation of Interaction in Contingency Tables.” JRSS B 13: 238–241. doi:10.1111/j.2517-6161.1951.tb00088.x. Blyth, C. R. (1972), JASA 67: 364–366, doi:10.1080/01621459.1972.10482387 (coins “Simpson’s paradox”). Yule, G. U. (1903) on spurious correlation, doi:10.1093/biomet/2.2.121.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press. doi:10.1017/CBO9780511803161. And Pearl, J. & Mackenzie, D. (2018). The Book of Why. Basic Books. (The Ladder of Causation; do-calculus; back-door/front-door.)
Holland, P. W. (1986). “Statistics and Causal Inference.” JASA 81(396): 945–960. doi:10.1080/01621459.1986.10478354. (Potential outcomes and the fundamental problem of causal inference.) Rubin, D. B. (1974). “Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology 66(5): 688–701. doi:10.1037/h0037350.
Imbens, G. W. & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press. doi:10.1017/CBO9781139025751. Hernán, M. A. & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC. Maintained edition consulted June 2026. (Estimands, consistency, exchangeability, positivity, interference, front-door/back-door.)
Shpitser, I. & Pearl, J. (2006). “Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models.” AAAI. Huang, Y. & Valtorta, M. (2006). “Pearl’s Calculus of Intervention Is Complete.” UAI. (Completeness of do-calculus within the stated identification problem.)
ACM (2012). 2011 A.M. Turing Award — Judea Pearl. amturing.acm.org/award_winners/pearl
Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press. doi:10.7551/mitpress/1754.001.0001. (The PC and FCI algorithms; Markov equivalence.)
Shimizu, S., Hoyer, P. O., Hyvärinen, A. & Kerminen, A. (2006). “A Linear Non-Gaussian Acyclic Model for Causal Discovery.” JMLR 7: 2003–2030. (LiNGAM; linearity, acyclicity, no hidden confounders, mutually independent non-Gaussian disturbances.)
Hoyer, P., Janzing, D., Mooij, J., Peters, J. & Schölkopf, B. (2008/2009). “Nonlinear causal discovery with additive noise models.” Advances in Neural Information Processing Systems 21. Mooij, J. et al. (2016). “Distinguishing Cause from Effect Using Observational Data.” JMLR 17(32): 1–102. Wu, P. & Fukumizu, K. (2020). “Causal Mosaic: Cause-Effect Inference via Nonlinear ICA and Ensemble Method.” Proceedings of Machine Learning Research 108: 1157–1167. (Additive-noise models, the Tübingen pairs benchmark, and Mosaic’s reported result on eligible expert-labeled pairwise tasks.)
Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A. & Bengio, Y. (2021). “Toward Causal Representation Learning.” Proceedings of the IEEE 109(5): 612–634. doi:10.1109/JPROC.2021.3058954.
Kıcıman, E., Ness, R., Sharma, A. & Tan, C. (2024). “Causal Reasoning and Large Language Models.” TMLR. Initial arXiv preprint 2023; reports 97% on a natural-language pairwise causal-direction task using textual metadata rather than the numerical observations. Zečević, M., Willig, M., Dhami, D. S. & Kersting, K. (2023). “Causal Parrots: Large Language Models May Talk Causality But Are Not Causal.” TMLR. arXiv:2308.13067. Jin, Z. et al. (2024). “Can Large Language Models Infer Causation from Correlation?” ICLR.
The Royal Swedish Academy of Sciences (2021). The Sveriges Riksbank Prize in Economic Sciences — Card, Angrist & Imbens. nobelprize.org/prizes/economic-sciences/2021
Stanford Encyclopedia of Philosophy. “Causation.” plato.stanford.edu/entries/causation-metaphysics
Stanford Encyclopedia of Philosophy. “Counterfactual Theories of Causation.” plato.stanford.edu/entries/causation-counterfactual
Stanford Encyclopedia of Philosophy. “Causal Models.” plato.stanford.edu/entries/causal-models
Stanford Encyclopedia of Philosophy. “Probabilistic Causation.” plato.stanford.edu/entries/causation-probabilistic

End of Day 005 · 175 descents remain