
Block I · Foundations of Knowledge & Reasoning · Day 003 / 180

Logic & Valid Inference

Three ways to move from premises to conclusions — and only one of them is safe.

Deduction keeps you safe but locked in the room. Induction and abduction get you out — at the price of certainty.

In a hospital laboratory in London, a stranger is introduced to Sherlock Holmes. Within seconds Holmes announces, to the newcomer’s astonishment, that he has been in Afghanistan. The newcomer is Dr. Watson, and Holmes later lays out the reasoning: a medical man with a military bearing, so an army doctor. The skin is tanned but the wrists are pale, a colour picked up abroad rather than at the seaside. He holds his arm stiffly, war-wounded. The face is haggard with hardship and fever. Holmes calls this deduction, and the word has clung to him for over a century. He is wrong about the word. What Holmes performs, and what made him immortal, is not deduction at all. It is a humbler, riskier, far more creative thing.

That mislabeling is the perfect way in, because the whole of today turns on a distinction almost everyone blurs: there is more than one way to reason, and they do not come with the same guarantees. Some inferences are airtight — if you grant the premises, the conclusion cannot escape. Others are fertile but fallible — they reach past the evidence and can be overturned by tomorrow’s surprise. Mistaking one for the other is the root of a startling fraction of human error. So let’s draw the lines carefully.

Where we are

Two days in, we have circled reasoning from the outside. On Day 1 we asked what turns a true belief into knowledge — and hit the Agrippan trilemma, the worry that every justification either regresses forever, loops in a circle, or stops somewhere arbitrary. On Day 2, Hume’s problem of induction showed that no pile of observations can ever prove a universal law, which is why Popper told us to falsify rather than verify. Today we open the engine itself. Those two earlier puzzles were really about the limits of two specific inference modes; now we name all three, watch logic become mathematics, and follow it to the strangest frontier in the course — machines that check proofs with zero tolerance for error. The thread that lights up brightest today is computation.

Sidney Paget illustration of Sherlock Holmes and Dr. Watson, with Holmes holding a watch. — Paget’s Holmes became the public face of “deduction,” even though the famous diagnostic leaps are usually abductive: clues first, best explanation second.

The model

Three engines, three guarantees

If you remember one thing from today, make it this trichotomy. Reasoning is not one activity but three, and they are sorted by how much they promise.

Deduction is the truth-preserving engine. The conclusion is already folded inside the premises; valid deduction merely unfolds it. Grant that all men are mortal and that Socrates is a man, and you cannot avoid the conclusion that Socrates is mortal — to deny it is to contradict yourself. The price of this safety is that deduction is non-ampliative: the conclusion adds nothing not already entailed by the premises, though it can reveal consequences we had not noticed. Mathematics is the deductive art carried to its limit, which is exactly why mathematicians can be so certain and why their certainty never, by itself, settles a question about this universe.

Induction is the generalizing engine. You have seen the sun rise ten thousand times; you infer it will rise tomorrow. Every swan anyone had ever logged was white, so — until 1697 — “all swans are white” looked secure. Induction is ampliative: it adds content, reaching beyond the cases in hand. And for precisely that reason it is not truth-preserving: a single counterexample can break the pattern. This is Hume’s bomb from Day 2, still ticking: no finite run of observations can logically guarantee the next one. Induction is how empirical knowledge actually grows, and it comes with no warranty.

Abduction is the explaining engine — and the one most people were never taught to name. You meet a surprising fact, and you cast around for a hypothesis that, if true, would make the surprise dissolve. The American polymath Charles Sanders Peirce (1839–1914) singled it out as the only genuinely creative mode, the one that generates new ideas rather than just testing or unpacking them. “Every plank of [science’s] advance,” he wrote, “is first laid by retroduction alone.” Deduction and induction work over hypotheses you already have; abduction is where the hypotheses come from in the first place.

Now back to Holmes. The tan, the stiff arm, the haggard face — these are surprising facts, and Holmes leaps to the explanation that best accounts for all of them at once: a war-wounded military doctor home from a hot campaign. But notice the leap is not guaranteed. The man could be an actor who summers in Morocco and sprained his shoulder playing tennis. Holmes’s conclusion is the best explanation, not the only one — which is the signature of abduction, not deduction. Conan Doyle gave his detective the wrong word, and a century of readers inherited the mistake. (This will matter again on Day 4, when we ask how to make “best explanation” precise using probability.)

The shape of a great misnomer

Holmes is not alone. We say a doctor “diagnoses” — that’s abduction, reasoning from symptoms to the disease most likely to produce them. A mechanic listening to an engine, a detective at a crime scene, a scientist staring at an anomalous reading: all abducting, all leaping to the explanation that would render the strange ordinary. Even this sentence relies on it — you’re inferring a mind behind these words because that’s the best explanation for their orderly arrangement, not because a theorem forces it. Abduction is the water we swim in; we just rarely call it by name.

The distinction everyone fumbles

Valid is not the same as true

Inside the deductive engine lives the single most misunderstood idea in all of logic, and getting it straight is worth more than a dozen memorized fallacies. It is the difference between validity and soundness.

An argument is valid when its form guarantees that true premises would force a true conclusion. Validity is a property of the shape, not the content. The Internet Encyclopedia of Philosophy puts it cleanly: an argument is valid “if and only if it takes a form that makes it impossible for the premises to be true and the conclusion nevertheless to be false.” Soundness asks for more — an argument is sound only if it is valid and all its premises are actually true.

Here is the part that trips people: a valid argument can have a wildly false conclusion. Watch.

All birds can fly. A penguin is a bird. Therefore, a penguin can fly.

The form is flawless — “All M are P; s is M; therefore s is P,” the very mould Socrates was poured into. If the premises were true, the conclusion would have to follow. So the argument is perfectly valid. It is also, obviously, unsound, because the first premise is false: not all birds fly. Validity certifies the plumbing; soundness asks whether you also pumped in clean water. A valid argument with a false premise is a beautifully engineered pipe carrying sewage.

This is not hair-splitting. It is the working principle behind reductio ad absurdum, one of the sharpest tools in mathematics: to prove a premise false, assume it, reason validly to a conclusion you know is false, and the falsehood flows backward to indict the premise. The whole technique depends on a valid argument deliberately producing a false conclusion. Validity is the carrier; truth is the cargo; learn to track them separately and a fog lifts from every argument you’ll ever read.

When the form breaks

The two fallacies hiding in every “if”

If valid forms are the safe paths, fallacies are the trapdoors that look just like them. The most treacherous live in conditional reasoning — statements of the form P → Q — because the broken versions sit one letter away from the sound ones.

The two valid moves are old friends. Modus ponens: if P then Q; P is true; therefore Q. Modus tollens: if P then Q; Q is false; therefore ¬P. Both are airtight. Now meet their evil twins.

Affirming the consequent runs: if P then Q; Q is true; therefore P. It grabs the wrong end. “If someone lives in San Diego, they live in California. Joe lives in California. Therefore Joe lives in San Diego.” But California is large; Joe could be in Sacramento. The conclusion might be true, which is exactly what makes the fallacy so seductive — it sometimes lands the right answer for no good reason, and a true conclusion reached by a broken argument is the Gettier trap from Day 1 wearing a logician’s coat.

Denying the antecedent is its mirror: if P then Q; P is false; therefore Q is false. “If it’s raining, the ground is wet. It isn’t raining. Therefore the ground isn’t wet.” Sprinklers, dear reader. Burst pipes. A spilled bucket. Knocking out one cause does not knock out the effect, because effects can have more than one cause.

There’s a teaching classic that makes the structure unforgettable: If an animal is a dog, it has four legs. This animal has four legs. Therefore it is a dog. Cats, horses, and tables object. The absurdity is the point: it’s the same broken form as the San Diego argument, just with the silliness turned up so you can see the gears slip. (Eugène Ionesco built an entire scene of his play Rhinoceros on a close cousin of this mistake: the Logician gravely reasons that since cats have four paws, a dog with four paws must be a cat.)

These are formal fallacies — broken shapes. Their cousins, the informal fallacies, are flaws not in form but in content: post hoc ergo propter hoc (the rooster crows, the sun rises, therefore the rooster summons the dawn), the ad hominem, the equivocation that quietly swaps a word’s meaning mid-argument. Formal fallacies you catch by checking the skeleton; informal ones you catch by reading what the words actually do.

Conditional Forms

Conditional arguments can be sorted by form before their content is judged.

Form	Pattern	Verdict	Why
Modus ponens	If P then Q; P; therefore Q	Valid	Affirming the sufficient condition forces the consequent.
Modus tollens	If P then Q; not-Q; therefore not-P	Valid	If Q must follow from P, the absence of Q rules P out.
Affirming the consequent	If P then Q; Q; therefore P	Invalid	Q may have other causes: Joe can live in California without living in San Diego.
Denying the antecedent	If P then Q; not-P; therefore not-Q	Invalid	Removing one sufficient cause does not remove every route to Q: sprinklers can wet the ground.
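
The four verdicts in the table can be checked mechanically. The sketch below is one way to do it, assuming "if P then Q" is read as the material conditional of classical logic; the names `FORMS`, `counterexamples`, and `is_valid` are illustrative, not from any library. A form counts as valid when no assignment of truth values makes every premise true and the conclusion false.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material conditional (an assumption): false only when p is true and q is false."""
    return (not p) or q

# Each form maps a valuation (p, q) to (list of premise values, conclusion value).
FORMS = {
    "modus ponens":             lambda p, q: ([implies(p, q), p],     q),
    "modus tollens":            lambda p, q: ([implies(p, q), not q], not p),
    "affirming the consequent": lambda p, q: ([implies(p, q), q],     p),
    "denying the antecedent":   lambda p, q: ([implies(p, q), not p], not q),
}

def counterexamples(form):
    """Every valuation where all premises are true and the conclusion is false."""
    found = []
    for p, q in product([True, False], repeat=2):
        premises, conclusion = form(p, q)
        if all(premises) and not conclusion:
            found.append((p, q))
    return found

def is_valid(form) -> bool:
    """Valid means the search for a counterexample comes back empty."""
    return not counterexamples(form)

if __name__ == "__main__":
    for name, form in FORMS.items():
        bad = counterexamples(form)
        print(name, "valid" if not bad else f"invalid, counterexamples (P, Q): {bad}")
```

Both invalid forms fail on the same row, P false and Q true: Joe in Sacramento, or dry skies and a running sprinkler.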

The lineage

How logic became mathematics

The machinery you’ve been using has a deep history, and it bends in a surprising direction: over twenty-three centuries, the study of good argument slowly turned into a branch of algebra. The story has four landmarks.

Aristotle (4th century BCE) built the first formal system in his Prior Analytics. His genius was to use letters as placeholders — “all A are B” — and so to study argument forms apart from their content. This is term logic: it relates terms like “man” and “mortal.” Medieval logicians lovingly catalogued the valid syllogistic moods with mnemonic names — Barbara, Celarent, Darii. The names are codes, not people: their vowels mark proposition types, where A means “all S are P,” E means “no S are P,” I means “some S are P,” and O means “some S are not P.” So Barbara is AAA, Celarent is EAE, and Darii is AII; Barbara, for example, means all M are P; all S are M; therefore all S are P. For nearly two thousand years, this was logic.
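
The mood codes can be read as claims about sets, which makes them easy to test. The sketch below assumes the modern set reading of the four proposition types and searches every choice of S, M, and P over a three-element universe for a countermodel. That is a bounded sanity check rather than a proof, and all names here are illustrative. An invalid mood, the "undistributed middle" behind the dog argument, is included for contrast.

```python
from itertools import combinations, product

def subsets(universe):
    """All subsets of a small universe, as frozensets."""
    items = list(universe)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

# The four categorical proposition types, read as claims about sets.
def A(x, y): return x <= y           # all x are y
def E(x, y): return not (x & y)      # no x are y
def I(x, y): return bool(x & y)      # some x are y
def O(x, y): return bool(x - y)      # some x are not y

# Each mood: (major premise, minor premise, conclusion) as functions of S, M, P.
MOODS = {
    "Barbara":  (lambda S, M, P: A(M, P), lambda S, M, P: A(S, M), lambda S, M, P: A(S, P)),
    "Celarent": (lambda S, M, P: E(M, P), lambda S, M, P: A(S, M), lambda S, M, P: E(S, P)),
    "Darii":    (lambda S, M, P: A(M, P), lambda S, M, P: I(S, M), lambda S, M, P: I(S, P)),
    # Invalid: all P are M; all S are M; therefore all S are P.
    "undistributed middle": (lambda S, M, P: A(P, M), lambda S, M, P: A(S, M), lambda S, M, P: A(S, P)),
}

def find_countermodel(mood, universe=range(3)):
    """Return sets (S, M, P) with true premises and a false conclusion, or None."""
    major, minor, conclusion = mood
    for S, M, P in product(subsets(universe), repeat=3):
        if major(S, M, P) and minor(S, M, P) and not conclusion(S, M, P):
            return S, M, P
    return None

if __name__ == "__main__":
    for name, mood in MOODS.items():
        print(name, "no countermodel found" if find_countermodel(mood) is None else "countermodel exists")
```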

The Stoics, above all Chrysippus (c. 279–206 BCE), built a second, parallel logic that history nearly lost. Where Aristotle related terms, the Stoics related whole propositions with connectives we still use daily: if…then, and, or, not. Chrysippus laid out five “indemonstrables” — basic proof rules, the first of which (“if the first then the second; but the first; therefore the second”) is precisely modus ponens. This is propositional logic, the ancestor of the logic inside every computer chip. The Stoics arguably had a truth-functional grasp of the connectives — understanding “or” by when the whole is true given its parts — two millennia before it was rediscovered. The 20th-century logician Jan Łukasiewicz startled scholars by arguing Stoic logic was not Aristotle’s poor cousin but “an achievement of equal rank.” Then it was buried for ages while Aristotle reigned — a reminder that intellectual history is not a tidy relay race.

George Boole snapped the two traditions onto a new track. In An Investigation of the Laws of Thought (1854), he did something audacious: he treated logical reasoning as calculation. Let 1 be the universe and 0 be nothing; let multiplication be “and,” addition be “or.” Suddenly the laws of valid inference looked like the laws of algebra. “We ought no longer to associate Logic and Metaphysics,” Boole declared, “but Logic and Mathematics.” His book sold modestly and puzzled contemporaries. Only decades later, when Claude Shannon noticed in 1937 that Boole’s two-valued algebra described electrical switching circuits exactly, did Boolean algebra become the literal foundation of digital logic. Every AND-gate in the device you’re reading this on is a sentence of Chrysippus, rendered in silicon.
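
Boole's move can be replayed in a few lines. The sketch below restricts attention to the two values 0 and 1 and uses the modern inclusive "or", written x + y - xy so the result stays inside {0, 1}; that formula is an assumption of this example rather than Boole's own notation, and the function names are illustrative. It checks that the arithmetic agrees with ordinary truth-functional logic and that familiar laws drop out as algebraic identities.

```python
from itertools import product

def AND(x, y): return x * y            # multiplication as "and"
def OR(x, y):  return x + y - x * y    # inclusive "or", kept inside {0, 1}
def NOT(x):    return 1 - x            # complement relative to the universe 1

def check_laws() -> bool:
    """Compare the arithmetic with Python's own Boolean operators on 0 and 1."""
    for x, y in product((0, 1), repeat=2):
        assert AND(x, y) == int(bool(x) and bool(y))
        assert OR(x, y) == int(bool(x) or bool(y))
        assert NOT(AND(x, y)) == OR(NOT(x), NOT(y))    # De Morgan
        assert NOT(OR(x, y)) == AND(NOT(x), NOT(y))    # De Morgan
    for x in (0, 1):
        assert x * x == x              # Boole's law x^2 = x
        assert AND(x, NOT(x)) == 0     # non-contradiction: x(1 - x) = 0
        assert OR(x, NOT(x)) == 1      # excluded middle
    return True

if __name__ == "__main__":
    print("all laws hold:", check_laws())
```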

Gottlob Frege delivered the largest leap since Aristotle. His slim, forbidding Begriffsschrift (“concept-script,” 1879) introduced the quantifier, the formal “for all” (∀x) and “there exists” (∃x), and with it predicate logic. Aristotle’s term logic choked on arguments like “every horse is an animal, therefore every head of a horse is the head of an animal”; Frege’s machinery handled it and vastly more, analyzing propositions as functions fed with arguments. It is often called the finest single book in the history of symbolic logic. There’s a tragic coda: Frege dreamed of reducing all of arithmetic to pure logic, and in 1902, just as the second volume of his Grundgesetze der Arithmetik was going to press, a young Bertrand Russell sent him a letter containing a paradox. Take the set of all sets that don’t contain themselves: does it contain itself or not? Either answer contradicts itself. Frege’s grand foundation cracked. But his logic survived the wreck and became the modern symbolic logic we still teach. (The ghost of that paradox, and the limits it hinted at, will haunt us on Day 28, when Gödel proves no formal system can be everything mathematicians hoped.)
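
To see what quantifiers buy, here is the horse argument written out in Lean 4, the proof assistant discussed later in this lesson. This is a sketch under stated assumptions: the predicate names `Horse`, `Animal`, and `HeadOf` are hypothetical labels chosen for the example, with `HeadOf y x` read as "y is the head of x". The relation between two individuals is the part a pure term logic has no slot for.

```lean
-- Premise h: every horse is an animal.
-- Goal: whatever is the head of some horse is the head of some animal.
example {α : Type} (Horse Animal : α → Prop) (HeadOf : α → α → Prop)
    (h : ∀ x, Horse x → Animal x) :
    ∀ y, (∃ x, Horse x ∧ HeadOf y x) → ∃ x, Animal x ∧ HeadOf y x :=
  fun y hy =>
    match hy with
    -- Unpack the witness x, reuse it, and upgrade "horse" to "animal" via h.
    | ⟨x, hx, hhead⟩ => ⟨x, h x hx, hhead⟩
```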

The debate

Is logic discovered or invented?

Here is a question that sounds like a parlor game and turns out to cut very deep. The bedrock laws of classical logic, identity (A is A), non-contradiction (not both A and not-A), and excluded middle (either A or not-A, no third option), feel utterly inescapable. But where do they live? Are they features of reality, woven into the universe whether or not minds exist? Features of thought, the unavoidable grammar of any thinker? Or human conventions, real and binding but ultimately chosen, like the rules of chess?

Logical realism · discovered

The laws are objective, mind-independent structures of the world. We don’t legislate non-contradiction any more than we legislate the prime numbers; we find it. Logic is read off reality.

Psychologism · laws of thought

The laws describe how minds must operate, which makes logic a branch of psychology. Frege and Husserl attacked this fiercely: logical truths are exact and a priori, while psychology is empirical and fuzzy.

Conventionalism · invented

The laws are stipulations we adopt because they’re useful: binding once chosen, but not handed down by the cosmos. Curiously rare as a fully worked-out position, despite its close kinship to moral anti-realism.

Revisability · empirical?

Quine and Putnam floated the radical thought that even logic might be revised for empirical reasons, so that quantum mechanics could push us toward a non-classical logic, much as relativity pushed us to non-Euclidean geometry.

That last box is the hinge of today’s frontier. For most of history “the laws of thought” seemed untouchable — to question them was to saw off the branch you sat on. But the twentieth century produced rigorous, working alternative logics, systems that quietly drop one of the sacred laws and keep functioning. Once you’ve seen those alternatives do real labor, the grand metaphysical question softens into something more practical and, frankly, more interesting: not “which logic is True?” but “which logic is the right tool for this job?” Let’s go meet the alternatives.

The frontier · 2026

Three live edges — and the hype filter

Every day in this course ends at the research frontier, with each claim tagged for how much weight it can bear. Logic’s frontier is unusually concrete: it runs on real computers, checks real proofs, and has lately collided with artificial intelligence in ways that demand a careful eye.

Edge 01 · Non-classical

The logics that break the rules — on purpose

“Classical” logic is not the only consistent option; it’s one settled point in a landscape of alternatives, each built by surrendering a law most people thought non-negotiable.

Intuitionistic logic drops the law of excluded middle. Pioneered by L.E.J. Brouwer and formalized by Arend Heyting in the 1920s–30s, it insists a statement counts as true only if you can construct a proof of it. You may not assert “A or not-A” for free — you must prove one side. The motivating example is sharp: excluded middle would let you cheerfully assert, for any computer program, “it halts or it doesn’t” — yet (as we’ll see on Day 27) no general method decides halting, so there’s no construction backing the claim. Intuitionism says: then don’t assert it. This sounds like philosophical fastidiousness until you learn where it leads — straight into the heart of computer science, via a correspondence so beautiful it gets its own box below.
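
The difference shows up directly in a proof assistant. The Lean 4 sketch below is illustrative only. Its first example is proved constructively: nobody can ever refute excluded middle. Its second obtains the law itself only by appealing to `Classical.em`, which Lean's core library derives from classical axioms rather than from a construction.

```lean
-- Constructive: assuming "A or not-A" is false leads to a contradiction.
example (A : Prop) : ¬¬(A ∨ ¬A) :=
  fun h => h (Or.inr (fun a => h (Or.inl a)))

-- Classical: the law itself, taken on the authority of a classical principle.
example (A : Prop) : A ∨ ¬A := Classical.em A
```

The gap between those two statements is exactly the gap an intuitionist cares about: the first comes with a construction, the second with a promise.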

Paraconsistent logic drops explosion. In classical logic a single contradiction is apocalyptic: from “P and not-P” you can derive literally anything (the principle ex contradictione quodlibet) — one inconsistency and the whole system goes up in flames. Paraconsistent logics refuse this, letting you reason sensibly even when some contradiction has crept in — useful for large databases, legal codes, or any messy body of information that’s locally inconsistent but not therefore worthless. The stronger philosophical cousin, dialetheism — Graham Priest’s view that some contradictions are actually true, like the Liar sentence “this sentence is false” — is far more controversial. Keep them separate: you can adopt a paraconsistent logic (a technical choice about explosion) without being a dialetheist (a metaphysical claim about true contradictions). The first is a tool; the second is a worldview.
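
One concrete paraconsistent system is Priest's Logic of Paradox (LP), which adds a third value read as "both true and false". The sketch below assumes the usual numeric encoding (0, 0.5, 1, with 0.5 and 1 counting as acceptable) and min and 1 - x for "and" and "not"; the helper names are illustrative. It checks that explosion holds with two values and fails with three, while an ordinary inference such as "P and Q, therefore P" survives.

```python
from itertools import product

CLASSICAL = (0, 1)
LP = (0, 0.5, 1)            # 0.5 encodes "both true and false"
DESIGNATED = {0.5, 1}       # values that count as at least true

def neg(a): return 1 - a
def conj(a, b): return min(a, b)

def entails(premise, conclusion, values) -> bool:
    """True if every valuation that designates the premise designates the conclusion."""
    for p, q in product(values, repeat=2):
        if premise(p, q) in DESIGNATED and conclusion(p, q) not in DESIGNATED:
            return False
    return True

def contradiction(p, q): return conj(p, neg(p))   # "P and not-P"
def anything(p, q): return q                      # an unrelated claim Q

if __name__ == "__main__":
    print("explosion, classical:", entails(contradiction, anything, CLASSICAL))
    print("explosion, LP:", entails(contradiction, anything, LP))
    print("and-elimination, LP:", entails(conj, lambda p, q: p, LP))
```

The LP countermodel is P = 0.5 and Q = 0: the contradiction is tolerated, and Q stays false.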

Fuzzy logic drops the two-value restriction entirely. Lotfi Zadeh (1965) let truth slide along the whole interval from 0 to 1 to handle vagueness — “the water is warm” is 0.7 true — building on the many-valued logics of Łukasiewicz from the 1920s. It runs in control systems and appliances. And modal logic — the logic of necessity and possibility (□ and ◇) — together with carefully chosen temporal logics underpins the formal verification of hardware and software: specific fragments are expressive enough to say useful things while remaining decidable enough for model checking. These aren’t museum pieces. They’re the working logics of the modern technical world.
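
A minimal sketch of fuzzy connectives, assuming the min, max, and 1 - x operators associated with Zadeh's fuzzy sets (other families of operators exist, and the function names here are illustrative). It takes the 0.7 from the example above and shows that neither excluded middle nor non-contradiction comes out at its classical value, while the crisp values 0 and 1 still behave like ordinary logic.

```python
def f_not(a: float) -> float:
    return 1.0 - a

def f_and(a: float, b: float) -> float:
    return min(a, b)

def f_or(a: float, b: float) -> float:
    return max(a, b)

warm = 0.7  # "the water is warm" is 0.7 true

excluded_middle = f_or(warm, f_not(warm))    # 0.7 rather than 1
contradiction = f_and(warm, f_not(warm))     # about 0.3 rather than 0

def agrees_with_boolean() -> bool:
    """On the crisp values 0 and 1 the fuzzy connectives collapse to ordinary logic."""
    return all(
        f_and(a, b) == float(bool(a) and bool(b))
        and f_or(a, b) == float(bool(a) or bool(b))
        and f_not(a) == float(not a)
        for a in (0.0, 1.0)
        for b in (0.0, 1.0)
    )

if __name__ == "__main__":
    print(excluded_middle, round(contradiction, 3), agrees_with_boolean())
```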

The bridge · propositions as types

The deepest reason intuitionistic logic matters is the Curry–Howard correspondence: in suitable formal systems, propositions correspond to types and proofs correspond to programs. Proving a theorem can be treated as constructing the program-like object that inhabits its type — and vice versa.

This is why several proof assistants below are built on type-theoretic foundations — and why logic and computation, one of our five threads, are not neighbors but the same country seen from two sides. (Picked up on Days 27–29.)
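
A small Lean 4 sketch makes the correspondence visible by putting each proof next to the program of the same shape. The names (`mpProof`, `mpProgram`, `swapProof`, `swapProgram`) are made up for this illustration.

```lean
-- Modus ponens as a proof: from P → Q and P, conclude Q.
theorem mpProof {P Q : Prop} (hpq : P → Q) (hp : P) : Q := hpq hp

-- The same shape as a program: apply a function to an argument.
def mpProgram {α β : Type} (f : α → β) (a : α) : β := f a

-- "P and Q, therefore Q and P" as a proof.
theorem swapProof {P Q : Prop} (h : P ∧ Q) : Q ∧ P := ⟨h.2, h.1⟩

-- The same shape as a program: swap the components of a pair.
def swapProgram {α β : Type} (p : α × β) : β × α := (p.2, p.1)
```

In each pair the two bodies are essentially the same term; only the reading changes, from "evidence for a proposition" to "value of a type".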

Edge 02 · Proof assistants

Proof at zero tolerance: the rise of the proof assistant

Aristotle’s dream was a chain of reasoning so tight that no one could doubt it. Twenty-three centuries later, that dream has a software implementation. A proof assistant is a program in which every logical step must pass a mechanical check; nothing is accepted on authority, intuition, or “clearly.” The leading systems include Lean (now Lean 4), Rocq (the proof assistant formerly named Coq, renamed in 2025), Agda, and Isabelle/HOL. Lean, Rocq, and Agda live in the type-theoretic family; Isabelle/HOL is based on classical higher-order logic. Same ambition, different foundations.

Lean’s community-built library, mathlib, is one of the largest unified formalizations of mathematics ever assembled: more than 278,000 theorems and 132,000 definitions when checked in June 2026, growing continuously, and covering 84 of the 100 theorems on Freek Wiedijk’s well-known “Formalizing 100 Theorems” challenge list. This is not a toy. Consider what it has already verified:

2022 · completed · The Liquid Tensor Experiment. In December 2020, Fields Medalist Peter Scholze challenged the world to verify a theorem from his “condensed mathematics” that he himself wasn’t fully sure of. A team led by Johan Commelin and Adam Topaz did it in Lean, finishing on 14 July 2022. A working mathematician used a machine to gain confidence in a proof too intricate for comfortable human refereeing, which is exactly the point.
2023 · completed in 3 weeks · The Polynomial Freiman–Ruzsa conjecture. Days after Tim Gowers, Ben Green, Freddie Manners, and Terence Tao posted a proof of this additive-combinatorics result, Tao launched a Lean project to formalize it and, three weeks later, announced the dependency graph “completely covered in a lovely shade of green.” Formalization keeping pace with research, nearly in real time.
2024–25 · completed · The Equational Theories Project. Tao’s collaborative experiment (launched September 2024) to settle the implication relation among 4,694 algebraic laws: 22,033,636 ordered pairs if you include each law’s trivial implication of itself, or 22,028,942 nontrivial graph edges. It combined human proofs, automated theorem provers, AI, and Lean verification across 50+ contributors, and finished in just over 200 days: a new model of massively collaborative, machine-checked mathematics.
2024–2029 · in progress · Fermat’s Last Theorem. Kevin Buzzard’s EPSRC-funded project (launched April 2024, Imperial College London) to formalize FLT, following a modern route rather than the original Wiles proof. Buzzard is “quietly confident” of reducing it to 1980s-known results, but frank that the whole thing is “at least a 5 year project.” Not yet done: its status is work in progress, the last entry on that 100-item challenge list still open.
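
The two counts quoted for the Equational Theories Project follow from the number of laws alone, which a few lines of arithmetic confirm. The variable names below are illustrative.

```python
LAWS = 4694  # algebraic laws studied by the project, as quoted above

# Every ordered pair (law_a, law_b) poses the question "does law_a imply law_b?"
ordered_pairs = LAWS ** 2

# Dropping the trivial pairs (a law implies itself) leaves the nontrivial edges.
nontrivial_edges = ordered_pairs - LAWS

if __name__ == "__main__":
    print(f"{ordered_pairs:,} ordered pairs, {nontrivial_edges:,} nontrivial")
```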

And the certainty reaches beyond pure mathematics into systems lives depend on. CompCert is a C compiler proved correct in Rocq; a celebrated bug-hunting study spent roughly six CPU-years trying to make it emit wrong code and failed — “the only compiler we have tested for which Csmith cannot find wrong-code errors” — while finding the usual swarm of bugs in GCC and LLVM. seL4 is the first operating-system microkernel with a full machine-checked proof of functional correctness (in Isabelle/HOL): under its stated assumptions, the C implementation refines the formal specification, so whole classes of crashes and unsafe behaviors are ruled out by theorem rather than hope. These are not ordinary promises; they are conditional theorems about software. This is what logic, mechanized, can do — and it is solidly established.

Edge 03 · Medal AI · Solved math?

When AI met the proof checker

The newest and noisiest edge is the collision of machine learning with formal proof — and it is exactly where the hype filter earns its keep, because headlines routinely overreach.

The genuine milestone first. In July 2024, DeepMind’s AlphaProof, paired with AlphaGeometry 2, solved 4 of 6 problems at the International Mathematical Olympiad, scoring 28 points: the top end of the silver-medal category, one point below the gold threshold of 29. It even cracked the fearsome Problem 6, which only 5 of roughly 600 human contestants fully solved. The methodology was published online in Nature on 12 November 2025, with the version of record appearing in 2026. Here’s the design fact that separates it from chatbot bluster: AlphaProof works inside Lean. It auto-formalized about a million natural-language problems into ~80 million formal Lean statements, then trained itself in an AlphaZero-style loop where Lean checks every step. As DeepMind put it, there are “no hallucinations to worry about,” because a hallucinated step simply fails to compile. The neural net supplies creative search; the proof assistant supplies ground truth. That marriage is real and important.

In July 2025 the bar rose again: both DeepMind (a Gemini “Deep Think” model) and OpenAI reported gold-medal scores — 5 of 6 problems, 35 points — and, strikingly, did it working end-to-end in natural language within the time limit, not in Lean. DeepMind’s result was officially certified by the IMO; OpenAI’s was graded internally. Genuinely impressive. But here is where you deploy the calibration instinct from Day 1:

“Gold medal” is a score, not a coronation. These are competition problems — a narrow, time-boxed slice of mathematics with known-to-exist short answers. They are not unsolved research problems, and per the official 2025 results, 26 human contestants still outscored both AI systems.
Dropping Lean is a trade, not a free upgrade. The 2024 silver was formally verified — guaranteed correct by machine. The 2025 natural-language gold was human-graded, which means we’re back to trusting prose that could harbor a subtle gap. More general, less certain. Don’t let “gold beats silver” hide that the epistemic ground shifted.
It is expensive and narrow. Each hard 2024 problem took two to three days of computation, and problems were hand-translated into Lean for the competition. This is not a general mathematical mind.

And the claim to retire most firmly: AI has not “solved mathematics” or made mathematicians obsolete. Until recently, AI systems could solve contest problems and assist formal proof, but had not produced an accepted landmark mathematical result on their own. That changed in an important but limited way in May 2026, when an internal OpenAI model generated a counterexample to the longstanding Erdős unit-distance conjecture; OpenAI reported that external mathematicians checked the proof, and Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, and Matchett Wood published a human-digested version on arXiv. That is a real milestone. It does not mean AI has “solved mathematics”: the result still needed human verification, refinement, exposition, and interpretation, and general autonomous mathematical discovery remains uneven. The real revolution is quieter and more durable than the headlines: a 2,300-year-old standard, that a proof is a chain no one can doubt, is being handed to machines that can search, construct, and check with growing force. (A theme we’ll chase properly across Days 138–145.)

How to read this edge

The results above carry different kinds of warrant. A peer-reviewed paper supports a published method and result; an official contest score supports a benchmark performance; a company report or preprint can be important without settling what the system can do in general. Keep those categories separate. Formal verification establishes the checked statement under its formalization and axioms; it does not by itself establish that a machine understands the mathematics. The useful habit is simple: ask what was measured, who verified it, and which stronger claim still remains open.

Open questions

What’s genuinely unsettled

Twenty-three centuries in, the study of valid inference still leaves real questions wide open:

Is there one true logic, or many? Once intuitionistic, paraconsistent, and fuzzy logics all do useful work, “the correct logic” starts to look less like a fact about the universe and more like a choice of tool — but pluralists and monists are still genuinely at odds.
Discovered or invented? Are the laws of logic read off reality, baked into any possible mind, or adopted by convention? And could empirical physics ever force a revision, as Putnam suspected?
What is abduction, exactly? Is “inference to the best explanation” a real third mode, or dressed-up induction? Even whether Peirce meant it as inference-to-best-explanation (versus mere hypothesis-generation) is debated among his scholars.
Can mechanized proof change what mathematics is? If a result is true but only a computer has checked the proof, has anyone understood it? Does a verified-but-opaque proof carry the same value as an illuminating human one?
And the question that will stalk the AI block: when a machine outputs a true, well-supported theorem, does it know anything — or is it the ultimate Gettier case from Day 1, right for reasons that have nothing to do with comprehension? (Days 138–145.)

The day in three sentences

Big idea: Reasoning comes in three engines with three different warranties — deduction preserves truth and exposes what follows without adding new content, induction generalizes but can be broken by the next case, and abduction leaps to the best explanation — and inside deduction, validity (good form) is a wholly separate thing from soundness (good form plus true premises).
Best analogy: Sherlock Holmes’s “deductions” are really abductions — the best explanation of the clues, not a guaranteed conclusion — and a valid-but-unsound argument is a beautifully built pipe carrying sewage.
Live controversy: Whether logic is discovered or invented (and whether there’s one true logic or a toolkit of them), now sharpened by a real frontier where proof assistants like Lean verify cutting-edge mathematics at zero tolerance, AI has reached medal-level contest performance, and one AI-generated counterexample has become a genuine research milestone, though AI emphatically has not “solved mathematics.”

Threads today › computation (Curry–Howard: proofs correspond to programs; Boolean algebra in silicon; proof assistants) · information (formalization makes a proof’s content machine-checkable) · emergence (massively collaborative proof settling about 22 million implication relations) — with deduction and induction tying back to Day 1 and Day 2, then handing the next question to Probability as Extended Logic.

Tomorrow → Day 4

Probability as Extended Logic

Deduction hands out certainty, but almost nothing in life qualifies for it. Tomorrow probability takes over as logic’s extension into the uncertain: Bayes’ theorem as belief revision, Monty Hall as a trap for trained intuitions, and e-values recasting statistical tests as bets.

Sources

Sources & further reading

“Validity and Soundness.” Internet Encyclopedia of Philosophy (accessed 2026). iep.utm.edu/val-snd — the form-based definition of validity and the validity-vs-soundness distinction.
“Deductive and Inductive Arguments.” Internet Encyclopedia of Philosophy. iep.utm.edu/ded-ind — truth-preserving vs ampliative inference.
Douven, I. “Abduction.” Stanford Encyclopedia of Philosophy (rev. 2021). plato.stanford.edu/entries/abduction — Peirce, inference to the best explanation, and the scholarly debate over what abduction is.
“Aristotle’s Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/aristotle-logic — the syllogistic, Prior Analytics, and term logic.
Bobzien, S. “Ancient Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/logic-ancient — Chrysippus, the Stoic indemonstrables, and propositional logic; Łukasiewicz’s reassessment.
Boole, G. (1854). An Investigation of the Laws of Thought. London: Walton & Maberly. See “George Boole, The Laws of Thought,” PhilPapers. philpapers.org/rec/BOOTLO-4 — logic as algebra; “Logic and Mathematics.”
“Origins of Boolean Algebra in the Logic of Classes.” Mathematical Association of America (Convergence). old.maa.org — Boole, Venn, Peirce, and the path to digital logic via Shannon (1937).
“Frege’s Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/frege-logic — the Begriffsschrift (1879), quantifiers, predicate logic, and Russell’s paradox.
“Intuitionistic Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/logic-intuitionistic — Brouwer, Heyting, the rejection of excluded middle, the BHK interpretation.
Priest, G., Berto, F. & Weber, Z. “Dialetheism” and “Paraconsistent Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/dialetheism — explosion, paraconsistency vs dialetheism, the Logic of Paradox.
Zadeh, L. A. (1965). “Fuzzy sets.” Information and Control 8(3): 338–353. doi:10.1016/S0019-9958(65)90241-X. doi.org/10.1016/S0019-9958(65)90241-X — original fuzzy-set source: degrees of membership in [0,1], the basis for fuzzy logic.
Garson, J. “Modal Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/logic-modal — necessity/possibility and applications to computer science and verification.
Wadler, P. (2015). “Propositions as Types.” Communications of the ACM 58(12): 75–84. doi:10.1145/2699407. doi.org/10.1145/2699407 See also Sørensen, M. H. & Urzyczyn, P. (2006). Lectures on the Curry-Howard Isomorphism. Elsevier. — propositions-as-types / proofs-as-programs in type theory.
de Moura, L., Kong, S., Avigad, J., van Doorn, F. & von Raumer, J. (2015). “The Lean Theorem Prover (system description).” CADE-25. lean-lang.org/theorem_proving_in_lean4 — Lean’s small kernel, dependent type theory, axioms, and fully specified formal proofs.
“Mathlib statistics.” Lean community (accessed 14 Jun 2026). leanprover-community.github.io/mathlib_stats.html — current theorem and definition counts; fast-moving project page.
“100 theorems in Lean.” Lean community (accessed 14 Jun 2026). leanprover-community.github.io/100.html — 84 of Wiedijk’s 100 theorem benchmarks formalized in Lean; fast-moving project page.
Commelin, J. & Topaz, A. et al. “Liquid Tensor Experiment.” Lean community blog (completion 14 July 2022); Scholze’s original challenge (Dec 2020). leanprover-community.github.io — machine-checking a Fields Medalist’s uncertain proof.
Tao, T. “Formalizing the proof of PFR in Lean4.” terrytao.wordpress.com (Nov 2023). Gowers, Green, Manners & Tao, “On a conjecture of Marton,” Annals of Mathematics (2025). doi:10.4007/annals.2025.201.2.5. doi.org/10.4007/annals.2025.201.2.5
Tao, T. et al. “The Equational Theories Project.” Project announced Sept 2024; retrospective paper Dec 2025 (arXiv:2512.07087). teorth.github.io/equational_theories — 22,033,636 ordered pairs including self-implications; 22,028,942 nontrivial graph edges; 50+ contributors, Lean-verified.
Buzzard, K. “Fermat’s Last Theorem project.” Lean community blog (launch 30 April 2024); EPSRC grant EP/Y022904/1 (2024–2029), Imperial College London. leanprover-community.github.io — in progress; “at least a 5 year project.”
Leroy, X. et al. “CompCert” — a formally verified C compiler. Yang, Chen, Eide & Regehr, “Finding and Understanding Bugs in C Compilers,” PLDI (2011). doi:10.1145/1993498.1993532. doi.org/10.1145/1993498.1993532 compcert.org — six CPU-years and no wrong-code bugs found.
Klein, G. et al. (2009). “seL4: Formal Verification of an OS Kernel.” SOSP ‘09. doi:10.1145/1629575.1629596. doi.org/10.1145/1629575.1629596 sel4.systems — first machine-checked proof of OS-kernel functional correctness (Isabelle/HOL).
“AI achieves silver-medal standard solving International Mathematical Olympiad problems.” Google DeepMind blog (25 July 2024). deepmind.google — AlphaProof + AlphaGeometry 2; 28 points; works in Lean.
Hubert, T., Mehta, R., Sartran, L. et al. (2026). “Olympiad-level formal mathematical reasoning with reinforcement learning.” Nature 651: 607–613. doi:10.1038/s41586-025-09833-y. doi.org nature.com/articles/s41586-025-09833-y — the AlphaProof method paper; published online 12 Nov 2025, version of record 13 Mar 2026; ~80 million Lean problems.
“Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the IMO.” Google DeepMind blog (July 2025). deepmind.google — 35/42 officially certified; natural-language proofs within the contest time limit.
“66th IMO 2025.” International Mathematical Olympiad. imo-official.org/editions/2025 and individual results — 630 contestants; gold cutoff 35; human score distribution.
“Our First Proof submissions.” OpenAI (2026). openai.com/index/first-proof-submissions — OpenAI’s later summary of its July 2025 IMO gold-medal-level result, 35/42 points.
“Logical Pluralism,” “The Normative Status of Logic,” and “Logical Constants.” Stanford Encyclopedia of Philosophy (accessed 2026). plato.stanford.edu/entries/logical-pluralism · plato.stanford.edu/entries/logic-normative · plato.stanford.edu/entries/logical-constants — logical pluralism, revising logic, and what counts as logical.

Deep dive appendix · The Deeper Machinery of Logic · Optional extension.

The main descent told a clean story: three engines of inference, validity versus soundness, the slow alchemy that turned logic into algebra, and a frontier where machines now check proofs without mercy. But a clean story is a curated one. Behind it sits a far stranger and richer landscape — a logic so old it predates its own discoverers’ embarrassment, paradoxes that have toppled foundations twice, a tortoise who refuses to be argued with, and a string of theorems no human will ever fully read, accepted as true anyway because a machine vouched for every step. This appendix is the director’s cut. It assumes you’ve watched the film; it adds the scenes that didn’t fit, in the same order the film would have used them, so you can read straight through.

Movement I

What the syllogism was hiding

The main lesson handed you “All men are mortal” and moved on. But Aristotle’s term logic has internal architecture worth seeing, and a fault line that took two thousand years to notice.

Every classical categorical statement comes in one of four flavours, and medieval logicians labelled them with the vowels of two Latin words — affirmo (“I affirm”) and nego (“I deny”). A: All S are P (universal affirmative). E: No S are P (universal negative). I: Some S are P (particular affirmative). O: Some S are not P (particular negative). Arrange these four at the corners of a square and you get one of the oldest diagrams in Western thought, the square of opposition, which maps how the four statements support and contradict one another. The A and O corners are contradictories — exactly one is true. So are E and I. The A and E corners are contraries — they can’t both be true, but they can both be false. It is a little machine for reasoning about quantity, and for centuries it was drilled into every educated head in Europe.

Diagram · the oldest grid in logic

The Square of Opposition

Four statement-types, four relationships. The diagonals are the strongest link: contradictories must always disagree.

The dashed diagonals are the load-bearing relations: if “All S are P” is true, “Some S are not P” must be false, and vice versa — no exceptions, ever.

Now the fault line. Look at the vertical edges, called subalternation: the classical square says that if “All S are P” is true, then “Some S are P” must be true too. From the universal, you may descend to the particular. This feels obvious — if all ravens are black, then surely some raven is black. But it smuggles in an assumption Aristotle never flagged: that there is at least one S. Watch it break. “All trespassers will be prosecuted” is something a landowner can say truthfully even if, happily, nobody ever trespasses. But subalternation would force “Some trespasser will be prosecuted” — which asserts a trespasser exists. Worse: “All unicorns are white” sounds harmlessly true, yet the classical square derives “Some unicorn is white,” conjuring a unicorn into existence by pure logic. This is the problem of existential import, and it is the precise crack that Frege’s quantifiers (from the main lesson) were built to seal.

Modern predicate logic reads “All S are P” as a careful conditional — for anything at all, if it is an S then it is a P — which is automatically true when there are no S’s to check (a “vacuously true” statement, like the promise you trivially keep by never making a relevant move). The universal no longer implies the particular. This is why “all unicorns are white” and “all unicorns are blue” can both be true at once: with zero unicorns, every universal conditional of the form “all unicorns are…” comes out vacuously true. Existential claims such as “some unicorns are white” do not. The price of Frege’s fix was surrendering some of the square’s tidy inferences — a worthwhile trade, and a beautiful case study in how formalizing an intuition reveals a hidden assumption nobody knew they were making.
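The gap between the classical square and the modern reading can be checked by brute force on a small domain. The sketch below is illustrative only: the function names and the three-element domain are assumptions made for this example, and Python's `all` and `any` stand in for the universal and existential quantifiers.

```python
from itertools import combinations

def subsets(domain):
    """Yield every subset of `domain` as a frozenset."""
    items = list(domain)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

# Modern (Fregean) readings of the four categorical forms.
def A(S, P): return all(x in P for x in S)       # All S are P
def E(S, P): return all(x not in P for x in S)   # No S are P
def I(S, P): return any(x in P for x in S)       # Some S are P
def O(S, P): return any(x not in P for x in S)   # Some S are not P

def check_square(domain):
    """Report which classical relations survive, split by whether S is empty."""
    report = {
        "contradictories": True,
        "subalternation_nonempty_S": True,
        "subalternation_empty_S": True,
        "contraries_nonempty_S": True,
        "contraries_empty_S": True,
    }
    for S in subsets(domain):
        for P in subsets(domain):
            a, e, i, o = A(S, P), E(S, P), I(S, P), O(S, P)
            if a == o or e == i:          # contradictories must always disagree
                report["contradictories"] = False
            key = "nonempty_S" if S else "empty_S"
            if a and not i:               # "All S are P" without "Some S are P"
                report["subalternation_" + key] = False
            if a and e:                   # A and E true together
                report["contraries_" + key] = False
    return report

report = check_square({"x", "y", "z"})
print(report)
```

On this reading the diagonals hold in every case, while subalternation and the contrary relation hold only when S has at least one member. With S empty, "All S are P" and "No S are P" are both vacuously true, which is the unicorn case described above.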

The logic of things that don’t exist

What about the term “the present King of France”? It refers to nobody. Standard logic insists every name denotes something, so a whole sub-discipline — free logic — was built to reason cleanly about empty names and non-existent objects without breaking. It’s the logic you’d want for discussing fictional characters, failed scientific posits (like the planet Vulcan from Day 2), or definite descriptions that turn out to be empty. A reminder that even “what does this name point to?” is a question logic had to learn to handle with care.

How proofs are actually built

The main lesson spoke of “valid form” but never showed how a logician constructs a proof step by step. There are two great styles, and the rivalry between them shaped twentieth-century logic. Natural deduction, devised by Gerhard Gentzen in the 1930s (and independently by Stanisław Jaśkowski), tries to mimic how humans actually reason: you make temporary assumptions, derive consequences, and then “discharge” the assumptions — exactly the move in “suppose for contradiction…” Its sibling, the sequent calculus, also Gentzen’s, is more symmetric and machine-friendly, tracking what follows from what as an explicit ledger of premises and conclusions.

Gentzen then proved something deep about his own system: the cut-elimination theorem (his “Hauptsatz,” or “main theorem”). Informally, it says any proof that detours through a clever intermediate lemma can be rewritten into a direct proof that never leaves the subject matter of its premises and conclusion. Every shortcut can, in principle, be unwound into a plodding straight line. This sounds technical, but it supplies an important basis for automated proof search: cut-free proofs obey a subformula discipline that restricts which formulas need consideration. It does not make proof search uniformly manageable — first-order validity is undecidable, and eliminating cuts can make proofs enormously longer. Cut elimination also connects, through Curry–Howard from the main lesson, to how programs compute by simplification. Gentzen’s ledgers are the distant ancestor of the proof assistants we’ll return to in Movement V.

Movement II

A field guide to bad arguments

The main lesson gave you the two great formal fallacies — affirming the consequent and denying the antecedent — the ones with broken skeletons. But most of the bad reasoning you’ll actually meet in the wild is informal: the form is fine, the content cheats. These have been catalogued since antiquity (Aristotle wrote a whole treatise, the Sophistical Refutations, on argumentative dirty tricks), and knowing their names is a genuine cognitive upgrade — you cannot easily un-see a fallacy once you can label it. Here is a working field guide to the most common specimens.

Ad hominem “to the person”

Attack the arguer instead of the argument. “You’d say that — you’re a banker.” Whether bankers are biased is irrelevant to whether the claim is true.

Straw man

Replace your opponent’s actual position with a flimsy caricature, then knock that down. The easiest argument to win is one your opponent never made.

Equivocation

Slide a word between two meanings mid-argument. “Nothing is better than eternal happiness; a ham sandwich is better than nothing; therefore a ham sandwich is better than eternal happiness.” “Nothing” switched jobs.

False dilemma

Offer two options as if they were the only two. “Either we ban it or we descend into chaos.” Reality usually has more than two doors.

Begging the question petitio principii

Smuggle the conclusion into the premises. “It’s reliable because it says so, and it wouldn’t say so if it weren’t reliable.” The argument assumes what it’s meant to prove — a tiny version of Day 1’s circular justification.

Post hoc post hoc ergo propter hoc

“After this, therefore because of this.” The rooster crows, the sun rises; the rooster takes the credit. Mistaking sequence for cause: the trap Day 5 is built to disarm.

Slippery slope

Claim one small step must inevitably tumble all the way down, without showing why each step forces the next. Sometimes true, usually asserted rather than argued.

Appeal to authority ad verecundiam

Not all such appeals are fallacious — trusting genuine experts is rational. It curdles into fallacy when the “authority” is irrelevant, fabricated, or outside their field.

Hasty generalization

Leap to a sweeping rule from a handful of cases. Induction (Day 2) done recklessly — the fallacy is in the haste, not the generalizing.

No true Scotsman

Defend a generalization by redefining it to dodge counterexamples. “No Scotsman puts sugar on porridge.” “But my uncle Angus—” “No true Scotsman.” The claim is made unfalsifiable by fiat.

A special pair deserves its own spotlight, because they are where logic meets the probability we’ll meet tomorrow on Day 4. The gambler’s fallacy is the conviction that a run of reds at the roulette wheel makes black “due” — as if the wheel had a memory and a conscience. It doesn’t; each spin is independent, and the ball owes you nothing. Its mirror image, the base-rate fallacy, is ignoring how rare something is when interpreting evidence: a test that’s “99% accurate” for a disease that afflicts one person in ten thousand will still flag mostly healthy people, because the rare truth is swamped by the common error. Both are failures of probabilistic intuition so systematic that they need formal machinery to override — which is precisely the argument tomorrow opens with.
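The base-rate claim can be worked through with exact arithmetic. This is a minimal sketch under one stated assumption: "99% accurate" is read as 99% sensitivity and 99% specificity, which the text does not specify. The function name is hypothetical.

```python
from fractions import Fraction

def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

prevalence = Fraction(1, 10_000)   # one person in ten thousand
accuracy = Fraction(99, 100)       # assumed: sensitivity = specificity = 99%

posterior = posterior_given_positive(prevalence, accuracy, accuracy)
print(posterior, float(posterior))
```

Under these assumptions the posterior is exactly 1/102: of every 102 people flagged, about one actually has the disease, and the rest are healthy people caught by the 1% error rate.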

Interactive · train your eye

The Fallacy Spotter

Read each argument and name the move. Five rounds, drawn from the guide above. The point isn't the score — it's that after a few rounds, the shapes start to announce themselves.


Worked exercise

The Fallacy Spotter, as an answer key

Argument shape	Better name	Why
Dismiss the climate plan by attacking the speaker's private life.	Ad hominem	The personal attack does not test whether the plan is sound.
Offer only total budget cuts or bankruptcy.	False dilemma	The argument hides partial options between the two extremes.
Blame a new traffic light for back pain because it came first.	Post hoc	Sequence alone is not causation.
Trust the book because the book says it is true.	Begging the question	The conclusion is smuggled into the premise.
Infer that a whole city is rude from two people.	Hasty generalization	The sample is too thin for the sweeping rule.

Movement III

The sentences that broke logic

The main lesson mentioned Russell’s paradox in passing, as the letter that cracked Frege’s life’s work. But paradoxes are not mere curiosities in logic — they are its stress tests, the places where the machine seizes up and forces a redesign. Twice, a single self-referential sentence has brought down a foundation. Understanding why is to understand what logic is made of.

The Liar, and why it won’t sit still

Start with the oldest. Consider the sentence: “This sentence is false.” Is it true? If it’s true, then what it says holds, so it’s false. But if it’s false, then what it says fails, so it’s true. The sentence flips endlessly, true to false to true, refusing to land. This is the Liar paradox, known to the ancient Greeks (the Cretan Epimenides declaring all Cretans liars is a cousin), and it is not a party trick. It shows that the innocent-looking notion of truth, combined with a sentence’s ability to talk about itself, produces contradiction. Alfred Tarski drew the radical lesson in the 1930s: no sufficiently expressive language governed by classical logic can consistently contain its own complete truth-predicate. To talk about the truth of sentences in such a language, you must climb to a higher metalanguage; truth is always spoken from one level up. (This is also why a recurring move in this course is to separate the thing from the talk about the thing, a ladder of levels we first felt in Day 1’s internal-versus-external split.)

Russell’s paradox, in slow motion

Now the one that actually drew blood. Frege’s foundation rested on a generous assumption: any property you can state carves out a set — the set of all things with that property. Sounds unimpeachable. Russell asked about one particular property: not being a member of itself. Most sets aren’t members of themselves (the set of all teacups is not a teacup). So consider R, the set of all sets that are not members of themselves. Is R a member of R? If R is a member of itself, then by its own definition it must not be — contradiction. If R is not a member of itself, then it satisfies the membership condition, so it is — contradiction. Either way the system detonates. The barber version makes it vivid: in a village where the barber shaves exactly those who don’t shave themselves, who shaves the barber? Russell sent this to Frege in 1902, just as the second volume of his masterwork went to press. Frege’s reply is one of the most gracious admissions of disaster in the history of thought: “a scientist can hardly meet with anything more undesirable than to have the foundation give way just as the work is finished.” The foundation gave way.
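The logical core of the paradox does not depend on sets at all: for any membership relation whatsoever, no object can contain exactly the things that do not contain themselves. The sketch below checks this exhaustively for tiny universes. The function names and the encoding of membership as a Boolean matrix are assumptions made for this illustration.

```python
from itertools import product

def has_russell_set(member, n):
    """True if some r satisfies: for every x, (x in r) iff (x not in x)."""
    return any(
        all(member[x][r] == (not member[x][x]) for x in range(n))
        for r in range(n)
    )

def search_for_russell_set(n):
    """Try every possible membership relation on n objects."""
    for bits in product([False, True], repeat=n * n):
        # member[x][y] means "x is a member of y"
        member = [list(bits[x * n:(x + 1) * n]) for x in range(n)]
        if has_russell_set(member, n):
            return member
    return None   # no relation admits a Russell set

results = {n: search_for_russell_set(n) for n in (1, 2, 3)}
print(results)
```

Every search comes back empty. The reason is the one given in the text: asking whether the candidate r belongs to itself demands a truth value equal to its own negation, so the condition fails at x = r no matter how membership is arranged.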

The third sister: Curry’s paradox

There’s a subtler relative that’s even more alarming, because it doesn’t even need the word “false.” Consider: “If this sentence is true, then [anything you like].” Run the reasoning and you can apparently prove the arbitrary consequent — pigs fly, the moon is cheese — from nothing but the sentence’s self-reference and two innocuous logical rules. Curry’s paradox shows the danger isn’t located in negation alone; it lurks in the combination of self-reference, the conditional, and the freedom to assert truth. It’s one reason logicians treat self-reference the way chemists treat an open flame near solvent.
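Both the Liar and Curry's sentence can be probed by asking which classical truth values, if any, are consistent with what the sentence says about itself. This is a small sketch, not a formal treatment; the function names are hypothetical and `implies` is the ordinary material conditional.

```python
def implies(p, q):
    """Classical material conditional."""
    return (not p) or q

# Liar: a sentence whose truth value must equal the value of "I am false".
liar_values = [v for v in (False, True) if v == (not v)]

def curry_values(consequent):
    """Truth values c consistent with: c says 'if c is true, then consequent'."""
    return [c for c in (False, True) if c == implies(c, consequent)]

print(liar_values)           # consistent values for the Liar
print(curry_values(False))   # Curry sentence with a false consequent
print(curry_values(True))    # Curry sentence with a true consequent
```

The Liar admits no consistent value. The Curry sentence admits one only when its consequent is already true, which is why granting it any classical truth value at all appears to force the arbitrary consequent, with no use of negation inside the sentence itself.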

The rescue, and its price

How do you save logic from a sentence? Russell’s own answer, built with Alfred North Whitehead in the monumental three-volume Principia Mathematica (1910–1913), was the theory of types: arrange objects in a strict hierarchy — individuals at the bottom, sets of individuals above them, sets of those above that — and forbid any set from referring to itself. The rules of grammar simply outlaw “R is a member of R” as meaningless, the way “the number seven is blue” is meaningless. Paradox averted by stratification. Principia was so rigorous it took hundreds of pages to reach the proof that 1 + 1 = 2 (with the wry remark, when it finally arrived, that “the above proposition is occasionally useful”). It is a monument and a warning: the cost of absolute rigour can be near-unreadable.

That hierarchy-of-types idea did not die in 1913. Stripped of its philosophical baggage and rebuilt by Per Martin-Löf in the 1970s, type theory became the logical foundation underneath Lean, Rocq, and Agda — the very proof assistants the main lesson celebrated. The defensive wall Russell built against paradox turned out, sixty years later, to be the ideal architecture for telling a computer what a proof is. Logic’s worst crisis seeded its most powerful modern tool.

When two truth-values weren’t enough

One more foundational tremor, and a charming one. The main lesson listed fuzzy and many-valued logics among the non-classical zoo, but skipped the story of where the first rigorous many-valued logic came from. In 1920, the Polish logician Jan Łukasiewicz was brooding on a problem Aristotle himself had raised: the sea battle. “There will be a sea battle tomorrow.” Is that statement true or false today? If it’s already true, the future seems fixed and free will evaporates; if already false, the battle is impossible. Aristotle squirmed. Łukasiewicz’s bold move was to deny that the statement must be either true or false right now — he introduced a third value, “possible” or “indeterminate,” for the open future. From that one philosophical itch grew the entire field of many-valued logic, the mathematical parent of the fuzzy logic running in your camera’s autofocus. A question about fate, asked for two thousand years, turned into a branch of engineering.
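Łukasiewicz's third value is easy to experiment with. The sketch below uses the usual textbook encoding of his three-valued connectives (0 for false, 1/2 for indeterminate, 1 for true); the constant and function names are chosen for this example only.

```python
from fractions import Fraction

FALSE, UNKNOWN, TRUE = Fraction(0), Fraction(1, 2), Fraction(1)
VALUES = (FALSE, UNKNOWN, TRUE)

def neg(x):
    return 1 - x

def conj(x, y):
    return min(x, y)

def disj(x, y):
    return max(x, y)

def implies(x, y):
    return min(Fraction(1), 1 - x + y)

# "p or not p" is the law of excluded middle; "p implies p" is identity.
excluded_middle = {x: disj(x, neg(x)) for x in VALUES}
identity = {x: implies(x, x) for x in VALUES}

print(excluded_middle[UNKNOWN], identity[UNKNOWN])
```

For the sea-battle sentence, assigned the middle value, "p or not p" itself comes out indeterminate, so excluded middle is no longer a law, while "p implies p" still takes the value 1 everywhere.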

Movement IV

The rule that needs no rule

Here is a puzzle that looks like whimsy and turns out to touch bedrock. It was published in 1895 by Lewis Carroll — yes, the Alice author, who by day was the Oxford logician Charles Dodgson — under the title “What the Tortoise Said to Achilles.” It is four pages long, written as a dialogue, and it has unsettled logicians for over a century.

Achilles has a valid argument. From two premises — call them A (“things equal to the same thing are equal to each other”) and B (“these two things are both equal to that third thing”) — the conclusion Z (“these two things are equal to each other”) plainly follows. The Tortoise, with maddening politeness, agrees that A and B are true, and agrees the argument is valid. But he declines to accept Z. Why? Because, he says, to get from “A and B are true” to “therefore Z,” you must be relying on some further rule — call it C: “if A and B are true, then Z is true.” Fine, says Achilles, grant C as well. The Tortoise cheerfully does — and still won’t accept Z. Because now, to get from A, B, and C to Z, you need yet another rule, D: “if A and B and C are true, then Z.” And so on, forever. Every time Achilles tries to nail down the rule of inference by writing it as an explicit premise, the Tortoise demands a new rule to license that premise. The list of premises grows without end and Z is never reached.

“And so on. You see, Achilles, we are now reaching a series of inferences that has no end.”

The Tortoise has discovered something profound: a rule of inference cannot be reduced to a premise. The step from premises to conclusion — the actual act of inferring — is not itself another premise in the argument; if you try to make it one, you trigger an infinite regress. Reasoning requires something beyond the explicit statements: a willingness to move, to actually apply the rule rather than merely contemplate a description of it. Logic, it turns out, cannot be purely a matter of more and more written-down content. At its heart there must be an act — a doing, not just a saying.

You have met this shape before. On Day 1, the Agrippan trilemma showed that every chain of justification must either regress forever, loop, or stop somewhere unjustified. Carroll’s Tortoise is the same skeleton wearing different clothes: every attempt to justify the act of inference itself regresses forever. The two puzzles rhyme because they’re about the same thing — the point where reasoning has to touch down on something that isn’t more reasoning. For justification, that floor is Day 1’s “basic beliefs” or “reliable processes.” For inference, it’s the brute capacity to apply a rule. Both tell us that a mind, or a machine, cannot run on explicit content alone. There has to be, at the bottom, a thing that simply does.

Why this matters for machines

The Tortoise’s lesson is quietly enormous for the proof assistants in the next movement. A proof checker like Lean does not store an infinite tower of rules-about-rules. It has a small, fixed kernel — a tiny trusted core that simply performs the basic inference steps, the way Achilles eventually must simply act. Everything else is built on that kernel’s brute willingness to apply its rules. Carroll, in 1895, put his finger on exactly the architectural fact that makes mechanized reasoning possible: somewhere, the regress has to stop in a component that does rather than describes.
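A toy version makes the architectural point concrete. The sketch below is not Lean's kernel or anything like it in scale; it is a hypothetical checker with a single built-in rule, modus ponens, and an invented tuple encoding of formulas. The Tortoise's rule C appears as an ordinary premise (written in the equivalent nested form A -> (B -> Z)), but the act of applying a rule lives in the code, not in the premise list.

```python
def check_proof(premises, steps):
    """Toy checker: accept only premises and modus ponens steps.

    Formulas are atoms (strings) or tuples ("->", antecedent, consequent).
    Each step is (formula, why), where why is "premise" or ("mp", i, j),
    meaning earlier line i is P and earlier line j is ("->", P, formula).
    """
    proved = []
    for formula, why in steps:
        if why == "premise":
            ok = formula in premises
        elif isinstance(why, tuple) and len(why) == 3 and why[0] == "mp":
            _, i, j = why
            ok = (0 <= i < len(proved) and 0 <= j < len(proved)
                  and proved[j] == ("->", proved[i], formula))
        else:
            ok = False
        if not ok:
            return False   # one bad step rejects the whole proof
        proved.append(formula)
    return True

A, B, Z = "A", "B", "Z"
C = ("->", A, ("->", B, Z))   # the Tortoise's rule, written as a premise
premises = [A, B, C]

good = [(A, "premise"), (B, "premise"), (C, "premise"),
        (("->", B, Z), ("mp", 0, 2)),   # from A and A -> (B -> Z)
        (Z, ("mp", 1, 3))]              # from B and B -> Z
bad = [(A, "premise"), (B, "premise"), (Z, ("mp", 0, 1))]

print(check_proof(premises, good), check_proof(premises, bad))
```

Adding more premises never lets the checker reach Z unless the modus ponens branch actually runs. That branch is the component that does rather than describes.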

The ceiling overhead

The main lesson’s frontier — proof assistants verifying mathematics at zero tolerance — naturally invites a dream: could we someday formalize all of mathematics, reduce every truth to a mechanical check, and never be uncertain again? It is the dream Frege and Russell chased, and it is worth knowing, even now, that the dream has a proven ceiling. In 1931, a 25-year-old Austrian named Kurt Gödel demonstrated that any consistent, effectively axiomatized formal system strong enough to express basic arithmetic is incomplete: some arithmetic statements can be neither proved nor refuted within it. His second incompleteness theorem adds that, if such a system is consistent, it cannot prove its own consistency using only its own resources. There is no consistent, effectively axiomatized, complete, self-certifying formalization of arithmetic. Not because we haven’t built it yet, but because it cannot exist.

This is the great counterweight to Movement V, and we will give it the full day it deserves on Day 28. For now, hold the tension in your mind like a chord: mechanized proof is breathtakingly powerful, verifying things no human could check by hand — and there is a hard, theorem-proven limit to how much certainty any such system can deliver about itself. The proof assistant is a magnificent tool operating inside a cage whose bars Gödel measured exactly. Keep both halves. The triumph and the limit are the same subject seen from two sides — which, you may have noticed, is the recurring shape of this entire course.

Movement V

The long road to a proof no human can read

The main lesson introduced Lean and AlphaProof as if they arrived fully formed. They didn’t. There is a sixty-year backstory of machines proving things, and it contains some of the most philosophically charged moments in modern mathematics — the points where the community had to decide whether a proof nobody could fully check by hand still counted as a proof. This is the deep history the frontier was standing on.

Two theorems that license the whole enterprise

First, a piece of foundational reassurance that makes mechanized logic trustworthy at all. For standard first-order logic, two great metatheorems hold. Soundness: anything you can prove using the rules is actually true in every model — the rules never lie. Completeness (proved by the same Gödel, in 1929, a gentler theorem than his famous one): anything that is true in every model can be proven using the rules — the rules miss nothing. Together they show that, in standard first-order logic, proof rules and model-theoretic truth can be made to line up exactly. Proof assistants such as Lean rest on a related but more specific trust story: a small kernel, a type-theoretic foundation, the axioms in use, and the faithfulness of the formalization to the original problem. The crucial advantage is mechanical checking: every step must pass the kernel, so a bad step cannot slip through as rhetoric.

The colours that started a fight

The first earthquake came in 1976. The Four Colour Theorem — the claim that any map can be coloured with just four colours so that no two bordering regions share a colour — had taunted mathematicians since 1852. Kenneth Appel and Wolfgang Haken at the University of Illinois finally cracked it, but their proof did something unprecedented and, to many, deeply controversial: it reduced the problem to nearly 2,000 specific configurations and then used a computer to grind through checking every one. No human had verified, or could feasibly verify, all those cases by hand. Was it a proof? Philosophers and mathematicians erupted. A proof was supposed to be a chain of reasoning a human mind could follow and be convinced by; this was an act of faith in silicon and the program that ran on it. The University of Illinois math department, with a touch of defiance, started stamping its postage with “FOUR COLORS SUFFICE.” The discipline had crossed a threshold it could never uncross.

The doubts lingered for decades — until the proof assistants arrived. In 2005, Georges Gonthier formalized the entire Four Colour Theorem inside the Coq proof assistant, reducing the whole argument, computer-checked cases and all, to steps verified by a tiny trusted kernel. The thing that had felt like blind faith in 1976 became, in 2005, a theorem checked at zero tolerance. Mechanization didn’t just match the contested proof; it redeemed it.

The conjecture a machine cracked alone

Between those two dates, in 1996, came a milestone the main lesson hinted at but didn’t tell: the first time an automated prover solved a famous open problem essentially by itself. The Robbins conjecture, a question posed in the 1930s about whether a certain simple set of algebraic equations is enough to define Boolean algebra, had defeated the field’s best for sixty years. Tarski himself had worked on it and failed. On October 10, 1996, a program called EQP, written by William McCune at Argonne National Laboratory, found a proof after about eight days of searching. It made the front page of The New York Times. McCune phoned the then-81-year-old Herbert Robbins to tell him his conjecture was finally settled, and by a machine.
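For a feel of what the equations look like, the key one is usually written n(n(x + y) + n(x + n(y))) = x, where + is a commutative, associative operation and n is a unary one. The sketch below checks only the easy direction, and only on the two-element Boolean algebra: that ordinary "or" and "not" satisfy the equation. It says nothing about the hard direction EQP settled, namely that every algebra satisfying these equations is Boolean. The function names are chosen for this example.

```python
from itertools import product

def plus(x, y):
    """Boolean join (logical or)."""
    return x or y

def n(x):
    """Boolean complement (logical not)."""
    return not x

def robbins_holds(x, y):
    """Check n(n(x + y) + n(x + n(y))) == x for one pair of values."""
    return n(plus(n(plus(x, y)), n(plus(x, n(y))))) == x

all_hold = all(robbins_holds(x, y)
               for x, y in product([False, True], repeat=2))
print(all_hold)
```

The four-row truth table takes microseconds. The converse took sixty years and a machine.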

But here’s the twist that makes it a perfect parable. The proof EQP produced was, in the words of one logician, “a computer-generated proof that nobody understands.” It was a terse sequence of algebraic substitutions, some so unintuitive that no human would ever have dreamt of them, and it offered no insight — no story about why the theorem is true, just an unanswerable certification that it is. This is the question that has haunted machine proof ever since, and that the AI provers of 2026 sharpen to a point: if a machine hands you an airtight proof you cannot comprehend, have you gained knowledge, or merely a verdict? Mathematics has always wanted two things from a proof — certainty and understanding. The machines can now deliver the first in cases where the second is nowhere to be found. (A direct descendant of Day 1’s worry about belief that is correct without being connected to the truth in the right way.)

The proof that took a team five years to doubt

The pattern recurred, bigger, in the story of Kepler’s conjecture — the 1611 claim, made by the astronomer who also gave us planetary orbits, that the way greengrocers stack oranges (the “face-centred cubic” packing) is the densest possible arrangement of spheres. Thomas Hales proved it in 1998, but the proof was a 300-page argument leaning on enormous computer calculations. The twelve referees of the Annals of Mathematics laboured for years and finally returned a verdict no journal had ever issued: they were “99% certain” the proof was correct, but could not fully verify the computer portions. Ninety-nine percent. For a mathematical proof, whose entire promise is certainty, that one percent was a wound.

Hales’s response was to declare war on the doubt. He launched the Flyspeck project (the name a playful contraction of “Formal Proof of Kepler”) to formalize the entire thing inside proof assistants, leaving no calculation untrusted. It took an international team over a decade. In 2014 they finished, using a combination of the HOL Light and Isabelle systems; the official account appeared as a preprint in 2015 and in a journal in 2017. The orange-stacking problem, open for 400 years and only “99% certain” in 1998, became a fully machine-verified theorem.

1929 · the licence · Gödel’s completeness theorem. Provability and truth march in step for first-order logic — the guarantee under which every proof checker operates.
1976 · the controversy · Four Colour Theorem (Appel & Haken). ~2,000 cases checked by computer; the first major proof no human could verify by hand. “Four colors suffice.”
1996 · the lone machine · Robbins conjecture (McCune’s EQP). An automated prover settles a 60-year-old open problem in 8 days — with a proof “nobody understands.”
2005 · the redemption · Four Colour Theorem, formalized (Gonthier, in Coq). The 1976 act of faith becomes a kernel-checked theorem.
2014 · the 400-year finish · Kepler conjecture, formalized (Flyspeck, HOL Light + Isabelle). Hales’s “99% certain” proof becomes fully machine-verified.
2024–26 · the AI turn · AlphaProof and after. Neural search inside a proof kernel — covered in the main lesson, deepened below.

How a machine actually searches for a proof

It’s worth lifting the hood on how these systems work, because the main lesson treated the proof assistant as a black box. There are really two distinct jobs, and conflating them is a common error. A proof checker (Lean, Rocq, Isabelle) merely verifies a proof you supply, step by step, against its kernel — it’s a sceptic, not an inventor. A proof searcher or automated theorem prover (like EQP, or modern tools) tries to find a proof from scratch. The most practically important form of search is the humble-sounding SAT solver — a program that, given a tangle of true/false constraints, hunts for an assignment that satisfies them all. This is the Boolean satisfiability problem, and (as we’ll see on Day 29) it sits at the exact centre of the P-versus-NP question, the most important open problem in computer science. Yet despite SAT being NP-complete, and so widely believed to be intractable in the worst case, modern SAT solvers routinely chew through problems with millions of variables, and they quietly power chip design, software verification, and logistics. One SAT-solver proof — of a colouring problem called the Boolean Pythagorean Triples problem, settled in 2016 — produced a certificate 200 terabytes long, at the time the largest proof ever produced. A proof you could never print, let alone read, but every line mechanically checkable. Carroll’s Tortoise would have had feelings about it.
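To make the search-versus-check split concrete, here is a minimal sketch of the classic DPLL procedure (unit propagation plus backtracking) next to a one-line checker. It is a teaching toy, not how production solvers work: modern SAT solvers add clause learning, restarts, and careful data structures. The function names and the DIMACS-style integer encoding of literals are choices made for this illustration.

```python
from typing import Dict, List, Optional

Clause = List[int]            # DIMACS-style literals: 3 means x3, -3 means "not x3"
Assignment = Dict[int, bool]


def simplify(clauses: List[Clause], lit: int) -> Optional[List[Clause]]:
    """Assume `lit` is true: drop satisfied clauses and shrink the rest.

    Returns None if some clause loses all its literals (a contradiction).
    """
    out: List[Clause] = []
    for clause in clauses:
        if lit in clause:
            continue                          # clause is already satisfied
        reduced = [l for l in clause if l != -lit]
        if not reduced:
            return None                       # every literal was falsified
        out.append(reduced)
    return out


def dpll(clauses: List[Clause],
         assignment: Optional[Assignment] = None) -> Optional[Assignment]:
    """Search side: return a satisfying assignment, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    if any(len(c) == 0 for c in clauses):
        return None                           # an empty clause can never hold
    # Unit propagation: a one-literal clause forces that literal.
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        assignment[abs(unit)] = unit > 0
        clauses = simplify(clauses, unit)
        if clauses is None:
            return None
    if not clauses:
        return assignment                     # nothing left to satisfy
    # Branch on the first literal of the first clause, trying both values.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        reduced = simplify(clauses, choice)
        if reduced is not None:
            result = dpll(reduced, {**assignment, abs(choice): choice > 0})
            if result is not None:
                return result
    return None


def satisfies(clauses: List[Clause], assignment: Assignment) -> bool:
    """Checking side: verify clause by clause (unassigned variables count as False)."""
    return all(any(assignment.get(abs(l), False) == (l > 0) for l in c)
               for c in clauses)
```

The asymmetry is the point: `dpll` may branch exponentially often in the worst case, while `satisfies` makes a single pass over the clauses. Certifying that no assignment exists is harder, which is why an unsatisfiability proof can run to terabytes.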

The AI provers, with the hype filter still on

Which brings us back to the present, and to the deeper version of the main lesson’s AI story. The genuine frontier of 2024–2026 is the marriage of two things that were always meant for each other: the neural network’s gift for guessing which step to try, and the proof kernel’s refusal to accept a step outside its formal rules. Several research threads are worth knowing by name. Autoformalization — automatically translating mathematics written in ordinary English into formal statements a machine can check — is the bottleneck everyone is racing to widen, because the world’s mathematical knowledge is almost all in prose, and a machine can only verify what’s been formalized. Tools like LeanDojo opened the proof assistant up as a training environment for AI. And the AlphaProof system the main lesson described works precisely by auto-formalizing about a million problems into Lean and training against the kernel’s verdicts, so that, in DeepMind’s phrase, there are “no hallucinations to worry about” in the checked derivation — an invalid formal step simply fails to compile.

The scorecard, sharpened from the main lesson:

What’s real (established): AI has reached medal-level competition performance; AI-assisted formalization is genuinely accelerating; and in May 2026 an internal OpenAI model generated a counterexample to the Erdős unit-distance conjecture that human mathematicians checked and digested. A kernel-checked system will not accept a derivation that violates its formal rules. Passing verification therefore establishes the theorem relative to the formal statement, axioms, and kernel; whether the formalization faithfully captures the intended informal claim remains a separate question. That scoped guarantee is the deep point — machine learning’s creativity can now be paired with mechanical proof checking.
What’s emerging (promising hint): 2025–26 preprints describe AI agents solving some previously open (if narrow and specialized) problems in Lean, while the unit-distance result shows that autonomous mathematical discovery can sometimes reach accepted research. Fresh, important, still uneven. Watch this space without swallowing the pitch.
What’s hype (contested/hype): any claim that AI has “solved mathematics” or made mathematicians obsolete. One landmark counterexample is not a general mathematical mind, and contest problems are still a narrow slice with known-to-exist short answers. Recall from the main lesson that even at the 2025 “gold” level, dozens of human teenagers still scored higher.

The thread running through all sixty years — from Appel and Haken’s contested map, through the Robbins proof nobody understands, to AlphaProof searching inside Lean’s unforgiving kernel — is a single quiet idea, the one the main lesson closed on. Aristotle wanted a chain of reasoning that could not be doubted. We have finally built a machine that enforces a formal version of that standard, line by line, relative to a specified kernel and axioms. What we are still discovering is the strange new questions that machine raises: whether a proof we cannot read is a proof we have understood, and whether formally checked certainty, delivered without insight, is the thing mathematicians were really after all along.

The appendix in three threads

Logic has hidden depths the syllogism conceals. The square of opposition hides an existence assumption; empty names need their own logic; and proofs are built in styles (natural deduction, sequent calculus) whose cut-elimination theorem supplies a disciplined basis for automated search without making first-order proof search decidable or uniformly tractable.

Self-reference is logic’s open flame. The Liar, Russell’s paradox, and Curry’s paradox each break a foundation through a single self-pointing sentence — and the type-theory wall built to contain them in 1913 became, sixty years later, the architecture inside every modern proof assistant.

Reasoning cannot be pure content, and mechanized proof has both a ceiling and a sixty-year history. Carroll’s Tortoise shows inference needs an act, not just more premises; Gödel proved that a consistent, effectively axiomatized system strong enough for arithmetic cannot certify its own consistency; and from the 1976 Four Colour controversy to the AI provers of 2026, machines have been forcing mathematics to ask what a proof is for — certainty, understanding, or both.

Appendix sources

Sources & further reading

Parsons, T. “The Traditional Square of Opposition” and Nolt, J. “Free Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/square · plato.stanford.edu/entries/logic-free — existential import, vacuous truth of universals, and reasoning with empty names.
von Plato, J. “The Development of Proof Theory.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/proof-theory-development — natural deduction, the sequent calculus, and the cut-elimination Hauptsatz (1934–35).
Dowden, B. “Fallacies.” Internet Encyclopedia of Philosophy. iep.utm.edu/fallacy — the standard catalogue of informal fallacies; Aristotle’s Sophistical Refutations.
“Gambler’s fallacy” and “Base rate fallacy.” Wikipedia (accessed 2026). en.wikipedia.org/wiki/Gambler's_fallacy — the two great failures of probabilistic intuition (setup for Day 4).
Beall, Jc, Glanzberg, M. & Ripley, D. “Liar Paradox.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/liar-paradox — self-reference, truth, and Tarski’s hierarchy of languages.
Irvine, A. & Deutsch, H. “Russell’s Paradox.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/russell-paradox — the set of all non-self-membered sets; Frege’s 1902 reply; the barber.
Shapiro, S. “Logical Consequence”; Hodges, W. “Model Theory”; and “Classical Logic.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/logical-consequence · plato.stanford.edu/entries/model-theory · plato.stanford.edu/entries/logic-classical — logical consequence, model-theoretic semantics, and soundness/completeness.
Shapiro, L. & Beall, Jc. “Curry’s Paradox.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/curry-paradox — deriving anything from self-reference plus the conditional.
Whitehead, A. N. & Russell, B. (1910–1913). Principia Mathematica. Cambridge University Press. See “Type theory,” SEP. plato.stanford.edu/entries/type-theory — the ramified theory of types; the line to Martin-Löf type theory and modern proof assistants.
“Many-valued logic” and “Jan Łukasiewicz.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/logic-manyvalued — the three-valued logic of 1920 and Aristotle’s sea-battle / future contingents.
Carroll, L. (1895). “What the Tortoise Said to Achilles.” Mind 4(14): 278–280. doi:10.1093/mind/IV.14.278. doi.org/10.1093/mind/IV.14.278 — the regress showing a rule of inference cannot be reduced to a premise.
Raatikainen, P. “Gödel’s Incompleteness Theorems.” Stanford Encyclopedia of Philosophy. plato.stanford.edu/entries/goedel-incompleteness — the 1931 ceiling on formalization (full treatment reserved for Day 28); and the 1929 completeness theorem.
Appel, K. & Haken, W. (1977/1989). “Every Planar Map is Four Colorable.” doi:10.1090/conm/098. doi.org/10.1090/conm/098 · Gonthier, G. (2008). “Formal Proof — The Four-Color Theorem,” Notices of the AMS 55(11): 1382–1393. ams.org — the 1976 ~2,000-case computer proof and its 2005 Coq formalization.
McCune, W. (1997). “Solution of the Robbins Problem.” Journal of Automated Reasoning 19(3): 263–276. doi:10.1023/A:1005843212881. doi.org/10.1023/A:1005843212881 · Project page: cs.unm.edu/~mccune/papers/robbins — EQP’s proof, found 10 Oct 1996 in ~8 days; the proof “nobody understands.”
Hales, T. et al. (2017). “A Formal Proof of the Kepler Conjecture.” Forum of Mathematics, Pi 5, e2. doi:10.1017/fmp.2017.1. doi.org/10.1017/fmp.2017.1 · arXiv:1501.02155. arxiv.org/abs/1501.02155 — the completed Flyspeck project (HOL Light + Isabelle, 2014); the original “99% certain” referee verdict on the 1998 proof.
Heule, M., Kullmann, O. & Marek, V. (2016). “Solving and Verifying the Boolean Pythagorean Triples Problem via Cube-and-Conquer.” SAT 2016. doi:10.1007/978-3-319-40970-2_15. doi.org/10.1007/978-3-319-40970-2_15 — the ~200-terabyte SAT proof; modern SAT-solver scale (links to P vs NP, Day 29).
Cook, S. A. (1971). “The complexity of theorem-proving procedures.” STOC ‘71. doi:10.1145/800157.805047. doi.org/10.1145/800157.805047 · See also Levin, L. (1973), “Universal search problems.” — SAT, NP-completeness, and worst-case computational hardness rather than undecidability.
de Moura, L., Kong, S., Avigad, J., van Doorn, F. & von Raumer, J. (2015). “The Lean Theorem Prover (system description).” CADE-25. lean-lang.org/theorem_proving_in_lean4 — Lean’s small trusted kernel, dependent type theory, axioms, and fully specified formal proofs.
Yang, K. et al. (2023). “LeanDojo: Theorem Proving with Retrieval-Augmented Language Models.” NeurIPS 2023. Hubert, T. et al. (2025), “Olympiad-level formal mathematical reasoning with reinforcement learning,” Nature, doi:10.1038/s41586-025-09833-y. doi.org — autoformalization, Lean as an AI training environment, AlphaProof’s kernel-checked design (deepening the main lesson).

Deep dive appendix · The Unsettled Frontier · Optional extension.

The first appendix was the director’s cut — settled history, scenes that didn’t fit but were never in doubt. This one is different, and you should read it with a different posture. Everything here is from 2023 to 2026; much of it is so fresh that the mathematical community has not finished arguing about what it means. Some of it will look, in five years, like the moment the ground shifted. Some will look like a press release that fooled a lot of people for a week. The hard part is that, right now, nobody can fully tell which is which — and that is precisely why it belongs in a course about valid inference. This is the hype filter not as a teaching exercise but as a survival skill, applied live, to claims still warm from the lab.

Dispatch I

The provers grew up — and went open

The main lesson left the AI-proving story at AlphaProof: a closed system, behind a corporate wall, requiring days of computation per problem. Since then the field has done something the larger AI world rarely does — it has gone open, and it has gotten startlingly good. In 2025, anyone with a decent graphics card could download a model that solves olympiad-level mathematics in formal, machine-checked Lean. That is the quiet earthquake.

To follow it you need to understand the scoreboard. Machine provers are ranked on shared benchmarks — fixed sets of problems used to compare systems. Three matter. miniF2F (introduced 2022) is a few hundred olympiad-style problems and was, until recently, the gold standard; it is now nearly saturated, with the best systems scoring near the ceiling. PutnamBench (2024) is harder — problems from the legendarily brutal Putnam undergraduate competition, where the median human score is often zero or close to it. FrontierMath (2024) spans tiers from challenging university-level work to research-level problems that may take experts days. Its scores depend on the tier, dataset version, and evaluation protocol, so no single percentage characterizes the benchmark. As one benchmark saturates, the field builds a taller one.

Diagram · the moving ceiling

The benchmark ladder

As older benchmarks approach saturation, researchers build harder ones. The ordering is only a guide: a score identifies an evaluation only when its version and test conditions travel with it.

The miniF2F and PutnamBench figures are rough, best-reported orientation. FrontierMath is intentionally shown without one headline score: version, tier, model snapshot, tools, and inference budget all affect the result.

Now the systems doing the climbing. Two open-weights provers reset expectations in 2025, and a third pushed into olympiad-gold territory.

System	What it claims	Status
DeepSeek-Prover-V2 (Apr 2025)	88.9% on miniF2F-test; 49 / 658 PutnamBench. Works in Lean 4 by recursively breaking a goal into subgoals, then learning from which ones it can close.	established · open weights, reproducible
Goedel-Prover-V2 (Aug 2025)	~90% miniF2F (with self-correction); 86 PutnamBench problems. Its 32B model beats rivals 20× larger; an 8B version matches a 671B predecessor.	established · smaller & better; open
Seed-Prover (2025)	The July 2025 preprint reports full Lean proofs for 5 of 6 IMO 2025 problems, after the contest setting; treat it as a strong but still vendor/preprint claim until artifacts and independent audits settle the details.	promising · gold claim vendor-reported

Two things deserve emphasis, one exciting and one limiting. The exciting one: a 32-billion-parameter model — small enough to run on hardware a serious hobbyist could own — now matches systems that, a year earlier, needed a data centre. Capability is collapsing in size and cost, fast. The limiting detail hides inside the phrase Pass@k. When a paper reports “88.9% on miniF2F,” it often means the model was allowed thousands of attempts per problem, and succeeded at least once. “The machine solved it” can quietly mean “one of eight thousand tries compiled.” That’s not cheating — for formal proof, a single compiling proof is a proof, no matter how many failures preceded it — but it is a very different thing from a human who gets it on the first attempt because they understood it. Hold both facts: the provers are genuinely, rapidly better, and the headline percentages flatter them.
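The arithmetic behind Pass@k is worth seeing once. The sketch below shows two things: how quickly "at least one success in k tries" approaches certainty even when each try rarely works, and the standard unbiased pass@k estimator from the code-generation evaluation literature (Chen et al., 2021). Whether a given prover paper uses exactly this estimator is an assumption here, and the one-in-a-thousand success rate is an invented number chosen only to illustrate the effect.

```python
from math import comb


def at_least_one(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds,
    if each attempt succeeds with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts of which c succeeded:
    1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb returns 0 when k > n - c, which makes the estimate exactly 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical prover: each attempt compiles one time in a thousand.
single_try = at_least_one(0.001, 1)
many_tries = at_least_one(0.001, 8192)
```

Under those invented numbers, `single_try` is about 0.1% while `many_tries` is above 99.9%. The same model earns both figures, which is why a reported percentage means little until the attempt budget travels with it.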

Why Lean changes the epistemics

There’s a reason this whole dispatch can wear an established tag while the next four get murkier. These systems output proofs in Lean, and Lean’s kernel either accepts a proof or it doesn’t. So the question that haunts the rest of this page — “is the AI really reasoning, or faking it?” — simply doesn’t arise in the same way for the formal output. A Lean-verified proof is a strict machine-accepted proof relative to the formal statement, the axioms used, and Lean’s kernel; it still depends on whether the formalization faithfully captures the informal mathematical claim. Formalization buys certainty about the checked object at the price of saying nothing about the mind. That trade is the single most important idea connecting this appendix to the curriculum’s theme: when you can’t trust the reasoner, verify the reasoning.
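A two-theorem sketch in Lean 4 makes the scope of that guarantee visible. Both statements below should be accepted by the kernel; the theorem names are made up for this illustration, and only core library lemmas are used.

```lean
-- Accepted: the kernel checks the proof term against the stated type.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Also accepted, yet nearly empty of content: no natural number is below 0,
-- so the hypothesis `h` can never hold and any conclusion follows from it.
theorem vacuous_example (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)
```

The second theorem is where the caveat about faithful formalization bites. The kernel certifies the statement exactly as written; it has no view on whether that statement is the one the author meant to claim.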

Dispatch II

The machine that found something nobody had found

Proving a known theorem is one thing. Discovering a new mathematical object — a construction no human had ever written down — is another, and in 2023 a machine did exactly that, in a way worth understanding precisely because it is so easy to mis-tell.

The system was FunSearch, from Google DeepMind, published in Nature in December 2023. The setup is elegant and a little humbling. Take a hard mathematical problem where you want the best possible example of something. Have a large language model write little programs that generate candidate examples. Score each program by how good its output is. Keep the winners, feed them back to the model to mutate and improve, and loop — an evolutionary search where the LLM is the source of mutations and a fast, dumb scorer is the judge. The target was the cap set problem, a notorious question in combinatorics that the Fields Medalist Terence Tao has called one of his favourites. FunSearch found a cap set in dimension 8 of size 512 — larger than the best previously known, which had stood at 496 — and nudged a long-stuck lower bound upward for the first time in twenty years.
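The shape of that loop fits in a few dozen lines. The sketch below is a toy stand-in, not FunSearch itself: the real system has a language model rewrite program source code, whereas here a "program" is just a small table of weights and the "mutation" is a blind random nudge. What it keeps is the structure: a generator with no idea what problem it is working on, and a fast scorer that can tell better from worse. All names and parameters are invented for the illustration.

```python
import itertools
import random
from typing import List, Set, Tuple

Point = Tuple[int, ...]
Weights = List[List[float]]   # weights[i][v] scores value v in coordinate i


def third_on_line(a: Point, b: Point) -> Point:
    """In Z_3^n, the unique c with a + b + c = 0 (mod 3) completes the line through a and b."""
    return tuple((-(x + y)) % 3 for x, y in zip(a, b))


def is_cap_set(points: Set[Point]) -> bool:
    """The judge: a cap set contains no three distinct points on a line."""
    return all(third_on_line(a, b) not in points
               for a, b in itertools.combinations(points, 2))


def greedy_cap_set(weights: Weights, n: int) -> Set[Point]:
    """Run one candidate 'program': visit points by priority, keep each legal one."""
    order = sorted(itertools.product(range(3), repeat=n),
                   key=lambda p: -sum(weights[i][v] for i, v in enumerate(p)))
    chosen: Set[Point] = set()
    for p in order:
        if all(third_on_line(a, p) not in chosen for a in chosen):
            chosen.add(p)
    return chosen


def evolve(n: int = 4, rounds: int = 200, seed: int = 0) -> Tuple[Set[Point], List[int]]:
    """Mutate, score, keep the winner, repeat."""
    rng = random.Random(seed)
    best_w: Weights = [[0.0, 0.0, 0.0] for _ in range(n)]
    best = greedy_cap_set(best_w, n)
    history = [len(best)]
    for _ in range(rounds):
        # Stand-in for the language model: a random edit to the current best program.
        cand = [row[:] for row in best_w]
        cand[rng.randrange(n)][rng.randrange(3)] += rng.uniform(-1.0, 1.0)
        result = greedy_cap_set(cand, n)
        if len(result) >= len(best):      # ties are kept so the search can drift
            best_w, best = cand, result
        history.append(len(best))
    return best, history
```

Calling `evolve(n=4)` returns a valid cap set in dimension 4 along with the best size seen after each round. Nothing in the mutation step knows what a cap set is; any improvement comes from pairing it with the scorer.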

That is a real, verifiable, new mathematical result, and it earns an established. But now read the fine print, because the fine print is the whole lesson. First, it was rare: only four of a hundred and forty experimental runs found the size-512 set at all — the machine got lucky a handful of times out of many. Second, and stranger: the language model was never told what problem it was working on. It saw only an abstract scoring function and a request to write better programs. It had no concept of “cap set,” no understanding of the mathematics, no idea it was making history. The insight, such as it was, lived in the loop — the marriage of the model’s blind generative fluency with a scorer that could tell better from worse — not in any moment of machine comprehension. The computer scientist Ernest Davis, reviewing the result, stressed exactly this: the genuinely interesting thing is the human–AI feedback structure, not autonomous AI insight. A human mathematician, Jordan Ellenberg, then looked at what FunSearch had produced, noticed a symmetry the program had stumbled into, and used that human observation to push the bound further. The machine found the brick; the human saw the cathedral.

The pattern has since scaled up. AlphaEvolve (DeepMind, 2025) generalized the FunSearch idea into a broader “evolutionary coding agent,” and a November 2025 paper — Mathematical exploration and discovery at scale, with Tao himself among the authors — turned it loose on sixty-seven problems across analysis, combinatorics, geometry, and number theory, rediscovering known optima and improving a handful. established The authors’ own framing is the key takeaway of this entire dispatch: these systems are doing search and optimization, brilliantly, at superhuman scale and speed. They are not doing proof, and they are not doing understanding. They are a new and powerful kind of mathematical telescope — and a telescope discovers nothing on its own; it just lets a mathematician see farther.

A clean distinction worth keeping

Three things get blurred together in headlines, and separating them dissolves most of the confusion on this whole page. Verifying a proof a human wrote (Lean’s job, fully solved). Searching for a good construction or a proof among many candidates (FunSearch, AlphaEvolve, the provers of Dispatch I — real and improving). And understanding — knowing why something is true, the thing mathematicians actually prize (machines do not do this, and it isn’t clear what it would mean for them to). When someone says “AI did mathematics,” your first question should always be: which of the three?

Dispatch III

The Erdős affair: a case study in how to read a headline

If you want a single story that teaches the entire skill this curriculum is built around — separating a true claim from a true-sounding one — it is the saga of AI and the Erdős problems, which played out in public over the winter of 2025–2026. It has a hype spike, a sharp correction, and then a quieter, genuinely interesting truth. Watch all three beats.

The backdrop: Paul Erdős, the wandering Hungarian who posed more problems than perhaps anyone in history, left behind hundreds of open questions, now catalogued at a public database maintained by the mathematician Thomas Bloom. They make an irresistible target for an AI lab wanting to claim a scalp.

Beat one — the spike

In October 2025, figures at OpenAI announced, with considerable fanfare, that GPT-5 had “found solutions to 10 (!) previously unsolved Erdős problems” and made progress on others. The number flew around the internet. It sounded like the moment the machines started doing real research mathematics. hype

Beat two — the correction

It came fast, and from the person best placed to deliver it. Thomas Bloom, who runs the database, called the claim “a dramatic misrepresentation.” The problems had not been solved by GPT-5. They had been solved years earlier, by humans, in papers Bloom simply hadn’t yet logged — and the model had done the genuinely useful but utterly different job of finding those existing papers. “Open,” on his database, only ever meant “I personally am not aware of a paper solving this.” The origin case, problem #339, had been settled roughly twenty years before. What GPT-5 had demonstrated was superb literature search, not mathematical discovery. Even Demis Hassabis, head of a rival lab, called the episode “embarrassing.” OpenAI’s own subsequent paper quietly reframed the results as exactly what they were: retrieval, not proof.

The machine had not answered the question. It had found where someone else already had — and a press release turned a librarian into a mathematician.

Beat three — the quieter, real thing

And yet. Strip away the hype, wait a few months, and something genuine emerges from the same effort — which is what makes this a great story rather than a simple debunking. By January 2026, Terence Tao was reporting, carefully, that a small number of Erdős problems had been solved in a way that did appear new. Problem #728, he wrote, “was solved more or less autonomously by AI (after some feedback…), in the spirit of the problem…, with the result (to the best of our knowledge) not replicated in existing literature.” A couple more followed, including some formalized and checked in Lean via a system called Aristotle, and accepted by Tao himself. promising

But notice how Tao — a working mathematician, not a marketing department — frames it, because his framing is the model of calibrated thinking this course is trying to install in you. These, he stresses, are the “lowest-hanging fruit.” Erdős problems “vary widely in difficulty, by several orders of magnitude”; an “open” label “is always provisional”; and the reporting is badly biased, because the failures never get announced — the tool had already been “quietly applied to a significant fraction of the currently open Erdős problems, without notable success.” The successes are real and worth celebrating. They are also the easiest problems in the pile, solved with human feedback in the loop, with closely related results already sitting in the literature. The machine is, in Tao’s lovely phrase, “clearing out the lowest-hanging fruit and isolating the problems which are genuinely difficult.”

Oct 2025 · the spike · “GPT-5 solved 10 unsolved Erdős problems.” Announced by OpenAI figures; widely shared as a research breakthrough. hype
Oct 2025 · the correction · “A dramatic misrepresentation” — Thomas Bloom. The problems were solved years ago by humans; the AI found the existing papers. Hassabis: “embarrassing.”
Jan 2026 · the real thing · Erdős #728 solved “more or less autonomously,” not in existing literature. A few more follow, some Lean-verified and accepted by Tao — but the “lowest-hanging fruit.” promising

The lesson is not “AI math is fake” — beat three is real, and may well be the leading edge of something large. The lesson is that the same underlying events produced a wildly false headline in October and a carefully true statement in January, and the only way to tell them apart was to ask who was speaking, what “solved” and “open” actually meant, and whether the result had been checked. That is the hype filter, working exactly as designed, on a live story. Hold onto the method; you’ll need it long after the specific problem numbers are forgotten.

Dispatch IV

Is any of it actually reasoning?

Underneath the benchmark scores and the headlines sits a question that is not yet answered, and that careful researchers on both sides treat as unsettled: when a model produces a chain of plausible-looking steps and lands on a correct answer, is it reasoning — or performing an extraordinarily sophisticated imitation of reasoning, pattern-matched from billions of examples? This is the modern, mechanized form of the Day 1 question about whether a system that outputs a true claim actually knows anything. And in 2024–2025 it produced one of the sharpest empirical fights in the field.

The case for “it’s pattern-matching”

The flashpoint was a pair of papers from Apple’s machine-learning group. The first, GSM-Symbolic (presented at the 2025 ICLR conference), ran a beautifully simple experiment. Take grade-school math word problems that models ace. Now change only the numbers and names — nothing about the underlying reasoning — and watch scores wobble, which they shouldn’t if the model truly understood the method. Then do something crueller: add a single sentence that is topically relevant but logically irrelevant to the problem. Performance collapsed — by up to 65%. The strongest model they tested fell from 94.9% to 63.0% the moment an innocuous red-herring clause was dropped in. The authors’ conclusion was blunt: the models “cannot perform genuine logical reasoning”; they “replicate reasoning steps from their training data.” The second paper, The Illusion of Thinking (June 2025), pushed further, reporting that on controlled puzzles like the Tower of Hanoi, the latest “reasoning” models show “complete accuracy collapse” past a certain complexity — and, eerily, start trying less as problems get harder.
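The experimental design is simple enough to sketch. The code below builds word-problem variants from a template (new names and numbers, same underlying method) and can add a clause that is on-topic but logically irrelevant. The template, the names, and above all `brittle_solver` are invented for this illustration: the solver is a deliberately naive stand-in that treats every number it sees as relevant, not a model of how any real system behaves.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Variant:
    question: str
    answer: int


TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{distractor}How many apples does {name} have in total?")
NAMES = ["Ava", "Ben", "Chen", "Dina"]
# On-topic but irrelevant: the size of the apples does not change the count.
DISTRACTOR = "5 of the apples are a bit smaller than average. "


def make_variant(rng: random.Random, with_distractor: bool = False) -> Variant:
    """Same reasoning every time; only the surface details change."""
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 40), rng.randint(2, 40)
    text = TEMPLATE.format(name=name, a=a, b=b,
                           distractor=DISTRACTOR if with_distractor else "")
    return Variant(text, a + b)


def brittle_solver(question: str) -> int:
    """Toy pattern-matcher: add the first two numbers, subtract any others."""
    nums = [int(x) for x in re.findall(r"\d+", question)]
    return nums[0] + nums[1] - sum(nums[2:])


def accuracy(solver: Callable[[str], int], variants: List[Variant]) -> float:
    return sum(solver(v.question) == v.answer for v in variants) / len(variants)


rng = random.Random(0)
clean = [make_variant(rng) for _ in range(20)]
noisy = [make_variant(rng, with_distractor=True) for _ in range(20)]
```

On the clean variants the toy solver is always right; on the noisy ones it is always wrong, because it subtracts the irrelevant 5. A solver that actually modelled the situation would score the same on both sets, and that invariance is what the experiment probes.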

The case for “not so fast”

The rebuttals were swift and, in part, fair. A widely circulated reply — half-jokingly credited to an AI co-author — argued the Apple results were partly experimental artifacts: some “failures” were just the model running into an output-length limit mid-answer, and some puzzle instances scored as failures were mathematically unsolvable to begin with, so no system could have gotten them “right.” A neutral replication by a Spanish research group (published September 2025) landed, satisfyingly, in the middle: yes, some of the dramatic collapse was an artifact of output limits — but no, not all of it; the models genuinely did stumble as complexity rose even modestly, around the eight-disk mark. The truth, as so often on this page, is neither headline.
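The output-limit argument rests on simple arithmetic: the shortest Tower of Hanoi solution for n disks has 2^n − 1 moves, so the answer a model must write out doubles with every added disk. The sketch below generates the optimal move list and checks it against a budget. The tokens-per-move figure and the budget are hypothetical placeholders, not numbers taken from any of the papers.

```python
from typing import Iterator, Tuple


def hanoi_moves(n: int, src: str = "A", dst: str = "C",
                via: str = "B") -> Iterator[Tuple[str, str]]:
    """Yield the optimal sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, via, dst)   # clear the top n-1 disks
    yield (src, dst)                                # move the largest disk
    yield from hanoi_moves(n - 1, via, dst, src)   # restack the n-1 disks


def min_moves(n: int) -> int:
    """Length of the shortest solution: 2^n - 1."""
    return 2 ** n - 1


def fits_in_budget(n: int, tokens_per_move: int, budget: int) -> bool:
    """Could the full move list be written out at all? (Both numbers are assumptions.)"""
    return min_moves(n) * tokens_per_move <= budget
```

Eight disks need 255 moves and fifteen need 32,767. Under an assumed 10 tokens per move and a 64,000-token limit, the first list fits comfortably and the second cannot. A failure at fifteen disks could therefore be a length artifact, while a stumble near eight disks is harder to explain that way.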

Unresolved verdict

This is the rare frontier question where the strongest warranted verdict is “genuinely unresolved.” The evidence supports neither “LLMs truly reason” nor “it’s pure mimicry.” Real fragility, real contamination of test data, and real non-trivial generalization all coexist. A course on valid inference should model exactly this: the willingness to say we do not yet know, and to keep both hypotheses alive until better experiments arrive.

The twist that should worry you most

There is a third finding, less famous than the Apple fight but arguably more important, and it cuts at the foundation of trusting machine reasoning at all. When a model shows its work — the now-ubiquitous “chain of thought,” where it narrates its steps before answering — we naturally assume that narration is its reasoning. Research from Anthropic’s alignment team (2025) suggests it often isn’t. They measured how often a model’s stated reasoning actually reflected the factors driving its answer, and found faithfulness shockingly low: around 25% for one leading model, 39% for another. When the researchers slipped models a hint and the model used it, the model frequently didn’t mention having done so — constructing a plausible after-the-fact rationalization instead. The chain of thought, in other words, is not a reliable window into the machine’s actual process. It is sometimes a story the model tells about an answer it reached by other means.

Sit with what that implies for an AI “proof.” If a system hands you a step-by-step argument and the steps are not faithful to how it actually got there, then the explanation is decorative — and you are back, exactly, to the Day 1 Gettier worry: a conclusion that may be correct while its stated justification has nothing to do with why it’s correct. This is the deepest reason the Lean-verified provers of Dispatch I matter so much. They are the one corner of this whole landscape where the question “but is the reasoning real?” can be sidestepped entirely — because the kernel doesn’t read the story, it checks the proof. When you cannot trust the explanation, you verify the result. The entire frontier keeps arriving back at that single sentence.

Dispatch V

Benchmark scores need version labels

Every claim on this page ultimately rests on benchmark results, but a benchmark name alone does not identify a stable test. A usable claim should name the dataset version and tier, the model release or snapshot, the tools and scaffolding, the inference or attempt budget, and the scoring protocol. FrontierMath shows why.

In December 2024, OpenAI reported 25.2% for the then-unreleased o3 on FrontierMath_11-26-24, a version containing 180 questions. Epoch’s current model page lists the released o3 at 18.7% ± 2.3% on FrontierMath Tiers 1–3 (v1, old) and 2.1% ± 2.1% on Tier 4 (v1, old). These are not repetitions of the same experiment: the model release and problem sets differ, and the available reports do not establish identical scaffolding, reasoning settings, attempt budgets, or scoring protocols. The later figures therefore neither reproduce nor refute the 25.2% announcement. protocol-dependent

There is a separate data-governance concern. OpenAI commissioned Epoch to produce 300 core questions and 50 Tier 4 questions. Epoch says OpenAI owns the commissioned questions and had access to all statements and solutions through FrontierMath_12-04-24. Later versions added explicit holdouts: 53 core solutions in FrontierMath_02-28-25, and 20 of the 50 Tier 4 problems. Epoch announced the partnership before the o3 announcement, but says it did not clearly explain the ownership and data-access terms at that point; many contributors were unaware of those details. That asymmetry is a reason to require prominent disclosure and independently held-out evaluation. It is not, by itself, evidence that the reported score was fabricated.

Versioning matters for another reason. On June 12, 2026, Epoch released FrontierMath v2 after addressing errors in 42% of the problems. Correcting a benchmark is healthy maintenance, but it also means that v1 and v2 scores cannot be compared without qualification. The defensible conclusion is narrow: 25.2% was a reported result for one model snapshot on one dated version under one evaluation. Claims about o3’s broader or current performance need a version-matched evaluation with documented conditions.

A rule of thumb you can keep

When you read that a system “scored X% on benchmark Y,” ask four questions. Which version and tier? Which model release or snapshot? Which tools, scaffolding, and inference or attempt budget? Who had the answers, and was there an independently held-out set? A score without those details is underspecified, not automatically false. Epoch now publishes a conflict-of-interest statement, version history, holdout details, and versioned model results. Its v2 corrections improve the benchmark while making clear why old and new scores need separate labels.
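The four questions can be turned into a small checklist. The sketch below is one hypothetical way to encode them: the class name, field names, and helper functions are invented for illustration, and a field is left as `None` wherever this page does not state the detail.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class BenchmarkClaim:
    score_pct: float
    benchmark: str
    version_and_tier: Optional[str] = None   # question 1
    model_snapshot: Optional[str] = None     # question 2
    tools_and_budget: Optional[str] = None   # question 3
    answer_access: Optional[str] = None      # question 4

CONDITIONS = ("version_and_tier", "model_snapshot",
              "tools_and_budget", "answer_access")

def unspecified(claim: BenchmarkClaim) -> List[str]:
    """Names of the conditions the claim leaves unstated."""
    return [name for name in CONDITIONS if getattr(claim, name) is None]

def comparable(a: BenchmarkClaim, b: BenchmarkClaim) -> bool:
    """Two scores are comparable only if every condition is stated and equal."""
    if a.benchmark != b.benchmark or unspecified(a) or unspecified(b):
        return False
    return all(getattr(a, n) == getattr(b, n) for n in CONDITIONS)

# The two o3 figures discussed above; None marks details not given on this page.
announced = BenchmarkClaim(
    score_pct=25.2,
    benchmark="FrontierMath",
    version_and_tier="FrontierMath_11-26-24 (180 questions)",
    model_snapshot="o3, pre-release (Dec 2024)",
    answer_access="developer had statements and solutions",
)
later = BenchmarkClaim(
    score_pct=18.7,
    benchmark="FrontierMath",
    version_and_tier="Tiers 1-3 (v1, old)",
    model_snapshot="o3, released",
)

print(unspecified(announced))        # conditions still missing
print(comparable(announced, later))  # False: underspecified and mismatched
```

Note that `comparable` returning `False` mirrors the rule of thumb: it flags the pair as not comparable, not either score as wrong.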

Dispatch VI

The quieter revolutions

The AI-proving story has sucked up most of the oxygen, but four other frontiers have moved meaningfully since 2020, with less noise and — in some cases — more solidity. They round out the picture, and a couple may matter more in the long run than this year’s leaderboard.

The Continuum Hypothesis is back in play

The deepest foundational question in mathematics is Cantor’s Continuum Hypothesis: whether there is any size of infinity strictly between that of the integers and that of the real numbers. Gödel and Cohen showed it to be independent of the standard axioms: you can neither prove it nor disprove it from them. For decades that looked like the end of the conversation. But W. Hugh Woodin has spent years building a candidate for a new, canonical foundation he calls Ultimate L, and in a striking reversal, recent work (with collaborators, 2024) suggests that if his central conjecture holds, the natural axiom it licenses would make the Continuum Hypothesis come out true, overturning his own earlier lean toward it being false. Tag: established as serious mathematics, but the key conjectures remain open, and others (notably Joel David Hamkins) argue the question has no single answer at all, because there is a multiverse of equally legitimate set theories. A question about infinity that Cantor posed in 1878 is, astonishingly, still live.
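For readers who prefer symbols, the hypothesis and its independence can be written compactly. This is a sketch in standard textbook notation (none of these symbols appear elsewhere on this page), with ZFC standing for the "standard axioms" mentioned above.

```latex
% Continuum Hypothesis: no set S has a cardinality strictly between
% that of the integers (\aleph_0) and that of the reals (2^{\aleph_0}).
\[
  \mathrm{CH}:\quad \neg\,\exists S\;\bigl(\aleph_0 < |S| < 2^{\aleph_0}\bigr)
\]
% Independence, assuming ZFC itself is consistent:
% Cohen showed CH cannot be proved; Goedel showed it cannot be refuted.
\[
  \mathrm{ZFC} \nvdash \mathrm{CH}
  \qquad\text{and}\qquad
  \mathrm{ZFC} \nvdash \neg\,\mathrm{CH}
\]
```

Read this way, the debate between Woodin and Hamkins is about whether some further axiom deserves to be added to ZFC to settle the first line, or whether no such axiom is privileged.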

A foundation where proofs compute themselves

An alternative to building mathematics on sets is to build it on types and spaces. This is the program of homotopy type theory (HoTT) and its “univalent foundations,” which treats equality itself as a kind of path and so brings geometry into the bedrock. Its newest incarnation, cubical type theory, finally gave this program something it long lacked: the ability to actually compute. The proof-of-concept landed in 2023, when researchers using Cubical Agda formally established a deep fact about the homotopy groups of spheres and, in the process, made the computer calculate a specific number (the so-called Brunerie number) that had resisted machine evaluation for years. Tag: established. Still a niche relative to the Lean mainstream, but a genuine glimpse of a different, more computational way to ground all of mathematics.
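Two formulas make the paragraph above concrete. The first is the usual informal statement of the univalence axiom; the second is the shape of the result named in the title of the Ljungström and Mörtberg paper listed in the sources. The symbols (a universe U of types, and β for the Brunerie number) are conventional choices assumed here, and the statements are simplified.

```latex
% Univalence: for types A and B in a universe U, identifying A with B
% is the same thing as giving an equivalence between them.
\[
  (A =_{\mathcal{U}} B) \;\simeq\; (A \simeq B)
\]
% The homotopy-group fact: the fourth homotopy group of the 3-sphere is
% cyclic of order |beta|, and evaluating beta gives order 2.
\[
  \pi_4(S^3) \;\cong\; \mathbb{Z}/\beta\mathbb{Z},
  \qquad |\beta| = 2
  \quad\Longrightarrow\quad
  \pi_4(S^3) \cong \mathbb{Z}/2\mathbb{Z}
\]
```

The point of the computation is that β is defined by a construction the proof assistant can, in principle, simply run; getting that run to finish is what took years.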

Turning logic back on the AI

If neural networks are going to fly planes and read scans, we would like proofs about how they behave, and a whole subfield, neural-network verification, now exists to provide them. Using techniques descended from the very formal methods that verified seL4 and CompCert (from the main lesson), tools with names like α,β-CROWN can prove mathematical guarantees about a trained network: for instance, that small perturbations to an input cannot flip its output. There is an annual competition to push these tools, and they are maturing fast. Tag: established. It is a neat inversion of this whole appendix: having spent six dispatches asking whether AI can do logic, here logic is deployed to police the AI. And the dominant successful paradigm everywhere on this page (neural creativity bolted to a symbolic checker) has a name, neurosymbolic AI, and is increasingly seen as the route to systems you can actually trust.
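To give a feel for what "proving that small perturbations cannot flip the output" means, here is a minimal sketch of interval bound propagation on a toy two-layer ReLU network. This is a much simpler and looser relative of the bound-propagation methods that tools like α,β-CROWN build on, not their actual algorithm, and the weights, input, and function names are all made up for illustration.

```python
# Toy network (hypothetical weights): 2 inputs -> 2 ReLU units -> 2 logits.
W1, B1 = [[1.0, -1.0], [0.5, 1.0]], [0.0, 0.0]
W2, B2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]

def affine_bounds(W, b, lo, hi):
    """Interval bounds on W x + b when lo <= x <= hi elementwise."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = h = bias
        for w, x_lo, x_hi in zip(row, lo, hi):
            # A positive weight maps low to low; a negative one swaps ends.
            l += w * (x_lo if w >= 0 else x_hi)
            h += w * (x_hi if w >= 0 else x_lo)
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def forward(x):
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, B1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, B2)]

def certify(x, eps):
    """True if the predicted class provably cannot change within +/- eps."""
    logits = forward(x)
    k = max(range(len(logits)), key=logits.__getitem__)
    lo, hi = [xi - eps for xi in x], [xi + eps for xi in x]
    h_lo, h_hi = relu_bounds(*affine_bounds(W1, B1, lo, hi))
    for j in range(len(logits)):
        if j == k:
            continue
        # Lower-bound the margin logit_k - logit_j over the whole box.
        diff = [wk - wj for wk, wj in zip(W2[k], W2[j])]
        margin_lo, _ = affine_bounds([diff], [B2[k] - B2[j]], h_lo, h_hi)
        if margin_lo[0] <= 0:
            return False  # not certified (which is not the same as unsafe)
    return True

print(certify([2.0, 0.0], 0.1))  # True: the bounds prove robustness
print(certify([2.0, 0.0], 0.3))  # False: bounds too loose to decide
```

A `True` here is a guarantee over every input in the box, up to floating-point rounding, which real verifiers have to treat with more care. A `False` only means this loose method could not prove it: for this toy network the prediction at radius 0.3 is in fact stable, and tighter bounds plus case-splitting are what production verifiers add to close that gap.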

The question that outlasts all the others

Finally, a frontier that is philosophical rather than technical, and that this whole page has been circling. A small but serious literature, including Jeremy Avigad’s “Mathematics and the formal turn” (2024) and work by Eamon Duede and others on knowledge “in an era of computational opacity,” asks what happens to mathematical understanding when proofs become things only machines can check. Mathematics has always wanted two goods from a proof: certainty that a thing is true, and understanding of why. The machines are now superb at delivering the first. The mathematician Alex Kontorovich has named the nightmare version: a million-line, machine-verified proof of a famous conjecture like the Riemann Hypothesis that is certainly correct and completely incomprehensible. That would be true knowledge that explains nothing, the ultimate descendant of Day 1’s belief that is right without being understood. Whether that is a triumph or a hollowing-out of the subject is not a question any benchmark can settle. Tag: open question. It is the one this dispatch leaves you holding.

Now you try

The hype filter, hands on

You’ve watched the filter applied six times. Now apply it yourself. Each card below is a real claim from the 2023–2026 frontier, drawn from the dispatches above. Decide where it lands — established, promising, or hype — then reveal the verdict and the reasoning. The aim is not a perfect score. It’s to feel the weighing become automatic.

Interactive · calibrate your skepticism

The Hype-Filter Trainer

Seven claims, one judgment each. Read carefully — the wording is where the trap usually hides.

The Hype-Filter Trainer, as a calibration key

Claim	Best tag	Calibration note
An open 32B Lean prover reaches about 90% on miniF2F.	Established	Lean checking makes correctness auditable; the sampling budget still matters.
GPT-5 solved ten previously unsolved Erdős problems.	Hype	The contested public version describes literature retrieval, not ten new results.
FunSearch improved the dimension-8 cap-set construction.	Established	The construction is published and directly checkable, though it was search rather than insight.
OpenAI reported o3 at 25.2% on FrontierMath_11-26-24.	Established (narrowly)	Accurate as a dated report on a 180-question version; it is not interchangeable with later scores for the released model on other versions, tiers, or protocols.
Erdős problem #728 was more-or-less autonomously advanced and Lean-checked.	Promising	Real and interesting, but still early and with human feedback around the loop.
Altered word-problem failures prove LLMs cannot reason.	Promising	The fragility result is real; the universal conclusion remains disputed.
A chain-of-thought transcript faithfully reports the model's actual reasoning.	Hype	Evidence suggests explanations can be post-hoc rationalizations.

The appendix in three threads

The machines really are getting better — fast, cheaply, and in the open. Lean-verified provers now hit ~90% on olympiad benchmarks, run on hobbyist hardware, and sidestep the “is it really reasoning?” question entirely, because a proof either compiles or it doesn’t. That is the one corner of the frontier wearing a solid established tag.

Everything outside that corner needs the hype filter at full strength. AI “discovers” by searching, not understanding (the cap set); the Erdős “breakthrough” was a literature search before it was anything real; FrontierMath results depend on model release, benchmark version and tier, and evaluation protocol; and whether any of this is genuine reasoning remains unresolved, made worse by the finding that a model’s stated reasoning is often unfaithful to its actual process.

The recurring sentence is the lesson: when you cannot trust the reasoner, verify the reasoning. From Woodin’s infinities to a possible incomprehensible proof of the Riemann Hypothesis, the frontier keeps returning to Day 1’s oldest worry — a conclusion that is correct without anyone, human or machine, quite understanding why.

Appendix sources II

Sources & further reading

Frontier sources, dated. Evidence labels distinguish peer-reviewed papers, arXiv preprints, company reports, benchmark-maintainer pages, and news reports; fast-moving benchmark claims include access dates.

DeepSeek-AI. (Apr 2025). “DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.” arXiv:2504.21801. arxiv.org/abs/2504.21801 — 88.9% miniF2F; 49/658 PutnamBench. arXiv preprint; benchmark claims accessed 14 Jun 2026; open weights.
Lin, Y. et al. (Aug 2025). “Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction.” arXiv:2508.03613. arxiv.org/abs/2508.03613 — ~90% miniF2F; 32B beats far larger models. arXiv preprint; benchmark claims accessed 14 Jun 2026; open weights.
ByteDance Seed. (2025). “Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving.” arXiv:2507.23726. arxiv.org/abs/2507.23726 — reports 78.1% on formalized past IMO problems, miniF2F saturation, and full Lean proofs for 5/6 IMO 2025 problems. arXiv preprint / company report; benchmark claims accessed 14 Jun 2026.
Zheng, K., Han, J. & Polu, S. (2022). “miniF2F: a cross-system benchmark for formal Olympiad-level mathematics.” ICLR 2022. Tsoukalas, G. et al. (2024). “PutnamBench.” NeurIPS 2024. — the benchmark ladder’s lower and middle rungs. Peer-reviewed benchmark papers; definitions, not live leaderboard claims.
Yang, K. et al. (2023). “LeanDojo: Theorem Proving with Retrieval-Augmented Language Models.” NeurIPS 2023. arXiv:2306.15626. — open Lean proving environment and the ReProver baseline. Peer-reviewed conference paper; arXiv copy.
Poiroux, A. et al. (rev. Oct 2025). “Reliable Evaluation and Benchmarks for Statement Autoformalization.” arXiv:2406.07222. — the BEq+ metric; ~45% on undergraduate math, struggles at research level. arXiv preprint; benchmark claims accessed 14 Jun 2026.
Romera-Paredes, B. et al. (2023). “Mathematical discoveries from program search with large language models” (FunSearch). Nature 625, 468–475. doi:10.1038/s41586-023-06924-6. doi.org/10.1038/s41586-023-06924-6 — the size-512 cap set; 4/140 runs; the model is never told the problem. Peer-reviewed paper.
Davis, E. “Comment on (Romera-Paredes et al., 2023).” cs.nyu.edu/~davise — the human–AI loop, not autonomous insight, is the real story; Ellenberg’s symmetry observation.
Georgiev, B., Gómez-Serrano, J., Tao, T. & Wagner, A. (Nov 2025). “Mathematical exploration and discovery at scale” (AlphaEvolve). arXiv:2511.02864. — 67 problems; rediscovery and improvement via evolutionary search. arXiv preprint; benchmark/problem-count claim accessed 14 Jun 2026; Tao coauthor.
Bloom, T. Erdős Problems database and “Disclaimers and caveats.” erdosproblems.com · teorth/erdosproblems wiki — “a dramatic misrepresentation”; difficulty varies by orders of magnitude; “open” is provisional. Database-maintainer page; accessed 14 Jun 2026.
OpenAI. (Nov 2025). “Early science acceleration experiments with GPT-5.” arXiv:2511.16072. TechCrunch (Oct 19 2025), “OpenAI’s ‘embarrassing’ math.” techcrunch.com — reframes the Erdős “solutions” as literature retrieval; Hassabis’s reaction. Company report / arXiv preprint + news report.
Tao, T. (Jan 2026). Mathstodon posts on Erdős #728/#729/#397. mathstodon.xyz/@tao — “more or less autonomously,” “lowest-hanging fruit,” Lean-verified via Aristotle. Researcher statement.
OpenAI. (May 20 2026). “An OpenAI model has disproved a central conjecture in discrete geometry.” openai.com — internal model counterexample to the Erdős unit-distance conjecture; external mathematician check. Company report.
Alon, N., Bloom, T. F., Gowers, W. T., Litt, D., Sawin, W., Shankar, A., Tsimerman, J., Wang, V. & Matchett Wood, M. (2026). “Remarks on the disproof of the unit distance conjecture.” arXiv:2605.20695. arxiv.org/abs/2605.20695 — human-digested and human-verified version of the OpenAI-generated counterexample. arXiv preprint.
Mirzadeh, I., Farajtabar, M. et al. (Apple). “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs.” arXiv:2410.05229; ICLR 2025. machinelearning.apple.com — up to 65% drop from an irrelevant clause; GPT-4o 94.9%→63.0%. Peer-reviewed paper; benchmark claims accessed 14 Jun 2026.
Shojaee, P. et al. (Apple). (June 2025). “The Illusion of Thinking.” With rebuttal “The Illusion of the Illusion of Thinking” (Lawsen, 2025) and neutral replication Dellibarda Varela et al. (CSIC), “Rethinking the Illusion of Thinking,” arXiv:2507.01231. — accuracy collapse vs experimental-artifact debate; truth in between. arXiv preprints / company report.
Chen, Y. et al. (Anthropic). (2025). “Reasoning Models Don’t Always Say What They Think.” arXiv:2505.05410. — chain-of-thought faithfulness ~25%/39%; models hide use of hints. arXiv preprint / company report; benchmark claims accessed 14 Jun 2026.
Glazer, E. et al. (Epoch AI). “FrontierMath” (arXiv:2411.04872); Epoch AI, “About FrontierMath,” “o3,” “Clarifying the creation and use of the FrontierMath benchmark,” and “FrontierMath Tiers 1–4.” technical report · benchmark and version history · o3 model page · relationship disclosure · current benchmark — primary sources for the 180-question December 2024 version and 25.2% announcement, current version-labeled o3 results, OpenAI’s commission and data access, holdouts, and the June 2026 v2 corrections; accessed 12 Jul 2026.
Woodin, W. H.; Saarinen, J., Väänänen, J. & Woodin (2024) on Ultimate L and CH; Hamkins, J. D. (2024) on set-theoretic multiverse, Journal for the Philosophy of Mathematics. — a possible reversal toward CH true; the pluralist counter-position.
Ljungström, A. & Mörtberg, A. (2023). “Formalizing π₄(S³) ≅ ℤ/2ℤ and Computing a Brunerie Number in Cubical Agda.” LICS 2023. — cubical HoTT computes a previously-intractable number. Peer-reviewed conference paper.
VNN-COMP (Verification of Neural Networks Competition), co-located with CAV, 2024 (Montreal) & 2025 (Zagreb); α,β-CROWN (Zhang et al.). sites.google.com/view/vnn2025 — formal guarantees about trained networks; logic policing the AI. Benchmark-maintainer page; accessed 14 Jun 2026.
Avigad, J. (2024). “Mathematics and the formal turn.” Bulletin of the AMS. Duede, E. & Davey, K. (2024), “Apriori Knowledge in an Era of Computational Opacity,” Philosophy of Science. Kontorovich, A. (2023). “Notes on a Path to AI Assistance in Mathematical Reasoning,” arXiv:2310.02896. — the epistemology of opaque proof; the “incomprehensible Riemann proof” nightmare. Peer-reviewed papers / arXiv preprint.

End of Day 003 · 177 descents remain