
Block I · Foundations of Knowledge & Reasoning · Day 007 / 180

Information Theory

How many yes/no questions is a fact worth? And what does it cost, in heat, to forget one?

Four good questions pin down one of sixteen things.

$\log_{2} 16 = 4$ bits.

Pick a whole number between 1 and 16 and hold it in your head. Assume each number is equally likely. I’ll find it with four balanced yes/no questions: “Is it 9 or higher?” — that splits the field in half. “Within that half, is it in the upper part?” — half again. Two more cuts and I have your number, every time. Notice what just happened: a fact that felt fuzzy and personal turned out to have an exact size. Four questions are sufficient — and in the worst case necessary — because $2^{4} = 16$ . Four bits.
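The halving game above can be sketched in a few lines. This is a minimal illustration, not part of the original lesson: the function name `guess_number` and the exact wording of each question ("is it higher than the midpoint?") are assumptions chosen for the demo.

```python
def guess_number(secret: int, low: int = 1, high: int = 16) -> int:
    """Find `secret` in [low, high] by repeated halving.

    Returns how many yes/no questions were needed.
    """
    questions = 0
    while low < high:
        mid = (low + high) // 2   # split the remaining candidates in half
        questions += 1            # ask: "Is it higher than mid?"
        if secret > mid:
            low = mid + 1
        else:
            high = mid
    assert low == secret          # the search always lands on the secret
    return questions


# Try every possible secret from 1 to 16.
counts = [guess_number(n) for n in range(1, 17)]
print(counts)  # every entry is 4
```

The first question asked is "is it higher than 8?", which is the same cut as "is it 9 or higher?" in the text.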

Shannon’s 1948 synthesis built on earlier work by Nyquist and Hartley, then added the decisive probabilistic move. And a separate physical sting took until the 1960s to feel and until 2012 to measure in a lab: if information is represented in matter, then erasing it is a physical act with a non-negotiable price, paid in heat. Today we earn both halves of that story.

Where we are

We’ve spent six days building an epistemic toolkit; today we discover it has a currency. On Day 4 we met surprise — Monty Hall’s host opening a goat door was information, not noise — and the e-value as evidence you could literally bank. Today that intuition gets its unit. On Day 3, Boole’s logic became physical switches (Shannon’s other famous paper, 1938); today the same man makes information physical too. And the graded belief of Day 1 connects to Friston’s framework, where organisms are modeled as minimizing variational free energy — a tractable upper bound on surprisal under a generative model. The information thread, traced quietly since Day 1, finally gets its hard number. Keep it ready: it reappears as the arrow of time on Day 33 and as the thing life itself seems to defy on Days 83–85.

The model

A bit is a halving

Before Shannon, “information” was a word for newspapers and telegrams — content, meaning, gossip. Shannon’s radical move was to throw the meaning away. For an engineer trying to push messages down a noisy wire, what matters is not what a message says but how much it could have said: how much uncertainty it removes. Strip out meaning and you’re left with something you can count.

The unit is the bit — short for “binary digit,” a contraction coined by Shannon’s colleague John Tukey in a 1947 Bell Labs memo and credited by name in Shannon’s paper. One bit is the amount of information in a single fair yes/no answer: the resolution of one perfectly balanced uncertainty. Two equally likely possibilities, one bit. Sixteen, four bits, because $2^{4} = 16$ . The bit is a halving.

But real choices aren’t always balanced. The genius of the theory is how it handles loaded dice. Shannon defined the information content — the surprise — of an outcome with probability p as:

$\text{surprise}(p) = -\log_{2} p$ bits

A sure thing ( $p = 1$ ) carries zero surprise. A one-in-a-million event carries about twenty bits.

It clicks the moment you test it. A coin you know is two-headed: calling “heads” right tells you nothing, because $p = 1$ gives surprise $= -\log_{2} 1 = 0$ bits. A fair coin: $p = \frac{1}{2}$, so surprise $= -\log_{2} \frac{1}{2} = 1$ bit, the textbook halving. If rain was already nearly certain, observing rain has little surprisal; if rain was rare, observing rain has a lot. A forecast is a different object: it is informative when it changes your probability model, whether or not the eventual weather is surprising. The logarithm is what makes surprises add up the way intuition demands: learn two independent facts and the surprises sum, just as the possibilities multiply.
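A quick numerical check of these cases, as a sketch. The helper name `surprisal_bits` is an assumption for illustration; the formula is the one defined above.

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprise of an outcome with probability p, in bits: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)   # same value as -log2(p), avoids printing -0.0


print(surprisal_bits(1.0))      # two-headed coin: 0.0 bits
print(surprisal_bits(0.5))      # fair coin: 1.0 bit
print(surprisal_bits(1e-6))     # one in a million: just under 20 bits

# Independent facts: probabilities multiply, surprises add.
joint = surprisal_bits(0.5 * 0.25)
separate = surprisal_bits(0.5) + surprisal_bits(0.25)
print(joint, separate)          # 3.0 3.0
```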

Entropy: the average surprise

Now zoom out from one outcome to a whole source — a language, a die, a stream of symbols. How surprising is it on average? That average is the crown jewel of the theory, the Shannon entropy:

$H = -\sum_{i} p_{i} \log_{2} p_{i}$ bits per symbol

Each outcome’s surprise, weighted by how often it happens. The expected number of yes/no questions per symbol.

Entropy is the irreducible core of a source model. For one-symbol binary prefix codes, the best average length generally satisfies $H \leq L < H + 1$ ; over long blocks, lossless coding can approach H bits per symbol asymptotically. A fair coin has entropy 1; a fair eight-sided die, 3; the binary event “u follows q?” is much lower because one answer is strongly expected. It is, in a precise sense, the amount of genuine choice a source exercises, or equivalently the amount of your uncertainty it resolves. The same quantity, read from two ends.
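To make the bound $H \leq L < H + 1$ concrete, here is a hedged sketch that builds a binary Huffman code (one standard way to get an optimal one-symbol prefix code) and compares its average length with the entropy. The two example distributions and all function names are assumptions chosen for illustration.

```python
import heapq
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Codeword length of each symbol in a binary Huffman code."""
    # Heap items: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least likely subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1               # merging adds one bit to each symbol below
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths


def average_length(probs):
    """Expected codeword length, in bits per symbol."""
    return sum(p * n for p, n in zip(probs, huffman_lengths(probs)))


for probs in ([0.5, 0.25, 0.125, 0.125], [0.6, 0.3, 0.1]):
    H, L = entropy_bits(probs), average_length(probs)
    print(f"H = {H:.4f}  L = {L:.4f}  bound holds: {H <= L + 1e-12 < H + 1}")

print(entropy_bits([1 / 8] * 8))  # fair eight-sided die: 3.0
```

When every probability is a power of one half, the code meets the entropy exactly; otherwise it overshoots by less than one bit per symbol.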

The most useful word he never quite chose

Shannon’s H has the exact algebraic shape — a sum of p log p — of a quantity physicists had used since the 1870s to measure disorder: entropy. The story, told decades later by Myron Tribus, is that Shannon was unsure what to call his new measure, and John von Neumann told him to call it entropy for two reasons — the formula already had that name in statistical mechanics, and “nobody knows what entropy really is, so in any debate you’ll have the advantage.” It’s probably too good to be literally true (it surfaces in print only in 1971). But the coincidence it points at is real, deep, and still argued over — and it’s the hinge the entire back half of this course will swing on. Hold that thought.

Turn the live dial from a fair coin toward certainty and watch its entropy collapse.

The entropy cases show the same pattern: maximum uncertainty carries the most information, while near-certainty carries almost none.

Interactive · weigh a coin

The Entropy Dial

Bend a coin from fair to fixed and watch its information content collapse. The curve is H for a two-outcome source; the bars are the surprise of each face. Maximum information lives at maximum doubt — the fair coin, peak 1 bit. Certainty carries nothing.

At the starting position, P(heads) = 0.50: heads and tails each carry a surprise of 1.00 bit, so $H = 0.50 \times 1.00 + 0.50 \times 1.00 = 1.00$ bits/flip.

A fair coin: every flip is a genuine question with no shortcut. To send 1,000 flips you need 1,000 bits; there's nothing to compress.

A two-outcome source reaches its maximum entropy when both outcomes are equally likely.

Source	P(heads)	Entropy	Reading
Fair coin	0.50	1.00 bit/flip	Every flip answers a full yes/no question.
Loaded coin	0.88	0.53 bits/flip	The common face is cheap to encode; the rare face is expensive.
Near-certain source	0.99	0.08 bits/flip	The outcome is almost known in advance, so little information arrives.
"u follows q?"	0.95	0.29 bits/event	This is the entropy of a binary event, not the full next-character distribution.

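The entropy column of the table can be reproduced directly from the two-outcome formula. A minimal sketch, with `binary_entropy` as an assumed helper name:

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome source with P(first outcome) = p."""
    if p in (0.0, 1.0):
        return 0.0                      # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


sources = [
    ("Fair coin", 0.50),
    ("Loaded coin", 0.88),
    ("Near-certain source", 0.99),
    ('"u follows q?"', 0.95),
]
for name, p in sources:
    print(f"{name:<20} P = {p:.2f}  H = {binary_entropy(p):.2f} bits")
```

The function is symmetric around 0.5, so a coin with P(heads) = 0.12 has the same entropy as the loaded coin in the table.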
Why it mattered

The theorem that built the modern world

Defining information would have been a nice piece of bookkeeping. What made Shannon’s 1948 paper — modestly titled “A Mathematical Theory of Communication” and later dubbed the “Magna Carta of the Information Age” — a foundation stone was a single staggering result about noise.

Every real channel corrupts its messages: static on a line, scratches on a disc, cosmic rays flipping bits in deep space. Shannon’s noisy-channel theorem applies once you specify the channel model and constraints — bandwidth, power, alphabet, noise law, and allowed inputs. Under those assumptions the channel has a capacity C. At rates below C, sufficiently long codes can drive the decoding-error probability arbitrarily close to zero. Above C, vanishing error is impossible for that same channel model. Lowering the rate is part of the story; clever coding is what keeps the reliable rate positive instead of collapsing toward zero.

Capacity is an asymptotic boundary for a specified channel. Below it, vanishing error is possible with sufficiently long codes; finite messages still trade rate, latency, and error probability.
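One way to feel the difference between lowering the rate and clever coding is a toy comparison. The sketch below assumes a binary symmetric channel that flips each bit with probability 0.1; that channel and the naive repetition code are illustrative choices, not something the lesson specifies. Repetition with majority voting does push the error down, but only by dragging the rate toward zero, while the capacity of this channel says a much higher rate is achievable in principle.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome source."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def bsc_capacity(flip_prob: float) -> float:
    """Capacity (bits per channel use) of a binary symmetric channel."""
    return 1.0 - binary_entropy(flip_prob)


def repetition_error(n: int, flip_prob: float) -> float:
    """Exact chance that majority vote over n noisy copies (n odd) is wrong."""
    return sum(
        math.comb(n, k) * flip_prob**k * (1 - flip_prob) ** (n - k)
        for k in range(n // 2 + 1, n + 1)   # more than half the copies flipped
    )


flip = 0.1  # assumed flip probability for the demo
for n in (1, 3, 5, 9):
    print(f"repeat x{n}: rate = {1 / n:.3f}, error = {repetition_error(n, flip):.5f}")
print(f"capacity = {bsc_capacity(flip):.3f} bits per use")
```

For this channel the capacity is about 0.53 bits per use, far above the rate of 1/9 that repetition needs to get the error below one in a thousand.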

Here’s the kicker: Shannon proved the good codes exist without saying how to build them. He left engineers a treasure map with an X but no path. Chasing that X became one of the great quests of applied mathematics — Reed–Solomon codes (which armor your CDs, QR codes, and the data beamed back from Mars), then turbo codes (1993), then the low-density parity-check codes now humming inside Wi-Fi, 5G data channels, and storage. Each crept closer to Shannon’s wall under real constraints. Every time a movie streams smoothly over a flaky connection, you are watching this theorem work together with buffering, retransmission, congestion control, source coding, and error concealment. The shorter codeword for the commoner symbol — Morse’s single dot for “E,” the logic behind Huffman coding (1952) — is the same principle running underneath: spend your bits where the surprise is.

The debate

Is information physical?

So far, information sounds like mathematics — abstract, dimensionless, the stuff of probability and logarithms. A bit itself has no fixed mass, size, or temperature. But stored and processed bits always have physical representations, and a paradox more than a century old forced the issue: forgetting information in matter can warm the room.

A demon at the trapdoor

In 1867 James Clerk Maxwell dreamed up a troublemaker. Picture a box of gas split by a wall, with a tiny trapdoor and a tiny intelligent being — later christened Maxwell’s demon — guarding it. The demon watches the molecules. When a fast one approaches from the right, it opens the door and lets it through to the left; when a slow one approaches from the left, it lets that through to the right. It never does any work on the molecules — just opens and shuts a frictionless door at the right moments. Slowly, patiently, it sorts hot from cold, building a temperature difference out of a uniform gas.

Schematic of Maxwell's demon opening a trapdoor between two gas chambers to separate slow blue particles from fast red particles. — Maxwell’s demon as a sorting thought experiment: a trapdoor keeper separates slower and faster molecules, apparently creating a temperature difference from information alone. Image: Htkym, Wikimedia Commons, CC BY 2.5; colors adapted locally.

This should be impossible. Building order from equilibrium, for free, is exactly what the second law of thermodynamics forbids; it’s the statistical law behind coffee cooling and eggs not unscrambling. The demon seems to break the deepest bookkeeping rule in physics using nothing but information about which molecules are which. For a hundred years it haunted the field. Leó Szilárd sharpened it in 1929 down to a single molecule in a box, and showed the demon could extract a tidy packet of work, up to $k_{B} T \ln 2$ in the ideal limit, from one bit of “which side is it on?” knowledge. The arrow pointed somewhere uncomfortable: information could apparently be used to extract work.
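Where the $k_{B} T \ln 2$ comes from can be sketched in one line, under the usual textbook idealization (an assumption here, not spelled out in the lesson): the single molecule acts as a one-particle ideal gas with $pV = k_{B} T$, and once the demon knows which half it occupies, the molecule pushes a piston quasistatically and isothermally from volume $V/2$ out to the full volume $V$.

```latex
W = \int_{V/2}^{V} p \, dV'
  = \int_{V/2}^{V} \frac{k_{B} T}{V'} \, dV'
  = k_{B} T \ln \frac{V}{V/2}
  = k_{B} T \ln 2
```

The factor of 2 inside the logarithm is the one bit: the demon's knowledge halved the volume the molecule could be in.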

The twist: it’s not knowing, it’s forgetting

The resolution is one of the most beautiful pieces of reasoning in twentieth-century physics, and it came from the people who built computers. In 1961, IBM’s Rolf Landauer asked a question nobody had thought to ask: is computation necessarily dissipative? Must shuffling bits around always cost energy? His surprising answer: no — logically reversible steps can in principle be done with arbitrarily little dissipation, run as slowly and gently as you like. The unavoidable lower bound attaches to logically irreversible operations. The canonical example is erasure.

Erase one bit ⟹ dissipate at least $k_{B} T \ln 2$

For an initially unknown, equiprobable bit reset in an isothermal cycle with no usable side information: ≈ $2.8 \times 10^{-21}$ joules (2.8 zeptojoules ≈ 0.018 eV) at room temperature. Landauer’s principle, 1961.

Why erasure specifically? Because erasure is not reversible. If I tell you a bit is now “0,” you cannot recover whether it was “0” or “1” a moment ago: that history is gone, two possible pasts crushed into one present. Logically irreversible operations destroy distinctions, and in a physical device, distinctions live in physical states. For biased bits, correlated memories, or usable side information, the lower bound is governed by the entropy actually discarded and by changes in the memory’s physical free energy. For the standard unknown fair bit, the cost is at least $k_{B} T \ln 2$. Landauer’s slogan became a rallying cry: “Information is physical.”
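The numbers are easy to check. The sketch below computes the fair-bit bound at 300 K and, as a loudly labeled simplification, scales it by the binary entropy for a biased bit. That scaling assumes a symmetric memory with no change in its own free energy and no side information; the function names are hypothetical.

```python
import math

K_B = 1.380649e-23        # Boltzmann constant in J/K (exact SI value)
EV = 1.602176634e-19      # joules per electronvolt (exact SI value)


def binary_entropy_bits(p: float) -> float:
    """Entropy in bits of a bit that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def landauer_bound_joules(temperature_k: float, p_one: float = 0.5) -> float:
    """Minimum average heat to reset a bit, in joules.

    Assumption: symmetric memory, no side information, so the bound is
    k_B * T * ln(2) times the entropy (in bits) that the reset discards.
    """
    return K_B * temperature_k * math.log(2) * binary_entropy_bits(p_one)


fair_bit = landauer_bound_joules(300.0)
print(f"fair bit at 300 K: {fair_bit / 1e-21:.2f} zJ = {fair_bit / EV:.4f} eV")
print(f"99%-predictable bit: {landauer_bound_joules(300.0, 0.99) / 1e-21:.2f} zJ")
```

At exactly 300 K the fair-bit figure comes out near 2.87 zJ; the rounder 2.8 zJ corresponds to a slightly cooler room, around 293 K.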

In 1982 his IBM colleague Charles Bennett closed the trap on Maxwell’s demon with this insight. The demon’s mistake was never simply measuring the molecules — measurement has no Landauer minimum in principle when implemented reversibly with a suitably prepared memory. The demon’s mistake is that it has a memory, and that memory fills up. To keep sorting forever as a cyclic device, it must eventually restore that memory. Each logically irreversible reset pays back, as heat, the entropy the demon thought it was removing from the gas. The books balance. The cost of the demon’s cleverness isn’t thinking — it’s forgetting.

Step through the irreversible reset and watch the heat meter climb as two possible pasts become one present.

The worked example traces the erasure cycle: protect a bit, remove its distinction, force it to 0, and pay the heat cost.

Interactive · pay the toll

The Landauer Erasure Machine

A single unknown, equiprobable bit, stored as a ball in a double-welled landscape: left = 0, right = 1. To reset it to 0 no matter where it started, you lower the wall, bias the landscape toward 0, and trap it left again — squeezing two possible pasts into one. In the ideal quasistatic limit, the work put into this irreversible reset is dissipated as heat; releasing the bias does not give the missing distinction back.

Stage 0, a bit at rest: the ball sits in the right well, so this memory holds a 1 and no heat has been dissipated yet (0.0 zJ). Two wells, two possible values, with the wall between them keeping the bit stable.

Worked example

Step	Logical state	Physical move	Thermodynamic reading
0	Bit can be 0 or 1	A barrier separates two stable wells.	No heat cost is forced by the logic.
1	Old value becomes unprotected	Lower the barrier between wells.	This can be done reversibly in principle.
2	Drive toward 0	Bias the landscape so both possible starts end left.	Work is dissipated as heat because the reset maps two possible inputs to one output.
3	Bit reads 0	Raise the barrier again.	Two possible pasts have become one present.
4	One fair unknown bit erased	Reset complete.	In the ideal standard model, at least $k_{B} T \ln 2$ of work is required and the corresponding entropy is transferred to the environment.

The frontier · 2026

From a thought experiment to a zeptojoule on a bench

For fifty years Landauer’s principle was a piece of theory so clean it felt almost philosophical. The numbers were absurd — a few zeptojoules, billionths of a billionth of a joule, lost in the thermal roar of any real device. Measuring it seemed hopeless. Then, beginning in 2012, a string of breathtaking experiments dragged the principle out of the realm of argument and onto the lab bench, across system after system. Here is where the hype filter earns its keep — and here, unusually, the boldest-sounding claims are the ones that survived.

Edge 01 · Landauer

The bound, measured five ways

The first direct confirmation came from Bérut and colleagues (Nature, 8 March 2012): a single micron-sized glass bead held in a double-well optical trap, a physical one-bit memory, was erased over and over, and the average heat released settled onto the $k_{B} T \ln 2$ floor as the erasure was done more gently. Two years later Jun, Gavrilov & Bechhoefer (Physical Review Letters, 4 Nov 2014) pushed precision higher in a feedback trap, confirming that halving the number of accessible states costs at least $k_{B} T \ln 2$ on average. Individual cycles dipped below the bound, exactly as the fluctuation theorems (a Day-85 preview) predict, while the average held firm.

What makes it count as established is the diversity of confirmations under carefully specified protocols. Cross-platform tests followed: a single electron in a box run as a Szilárd engine (Koski et al., PNAS, 2014); an array of nanoscale magnets, the closest thing yet to a real digital memory bit, which measured (4.2 ± 0.9) zJ per erasure, close to the Landauer value quoted as “2.8 zJ at 300 K” (Hong, Lambson, Dhuey & Bokor, Science Advances, 11 March 2016); and a single trapped calcium ion, which extended the principle into the fully quantum regime (Yan et al., Physical Review Letters, 21 May 2018). Glass, electrons, magnets, atoms: wildly different platforms, one consistent lower bound. That’s what a real law looks like.

Edge 02 · Info engines · Fastest claim

Running the demon in reverse: information as fuel

If erasing a bit costs energy, can measured information help extract work? Szilárd said yes on paper; the lab now says yes under full measurement-feedback accounting. Toyabe and colleagues (Nature Physics, 2010) built the first real information engine: a Brownian particle climbing a staircase, lifted against gravity using well-timed measurements of its position and a feedback ratchet — converting information in the controller into mechanical work, and validating a feedback-generalized Jarzynski equality in the process. Koski’s single-electron Szilárd engine (2014–2015) did it with one electron and even built an “information-powered refrigerator.” More recently, the Bechhoefer lab (Saha et al., PNAS, 18 May 2021) optimized a colloidal information ratchet whose output power, they report, rivals the molecular machinery inside living cells — a genuinely striking result, though the “world’s fastest information engine” billing comes from the press release, not the peer-reviewed claim, so treat that phrase as marketing, the underlying physics as solid.

Edge 03 · Physical floor · Landauer wall

The new physics of information — and what it says about your GPU

All of this now lives inside a mature framework called stochastic thermodynamics, which extends the old laws of heat to tiny, jittering systems where fluctuations dominate. Its engine room includes exact fluctuation relations such as the Jarzynski equality (1997) and the Crooks fluctuation theorem (1999). Mutual information terms enter their feedback-controlled generalizations, the Sagawa-Ueda lineage used in information-engine experiments. The authoritative synthesis is Parrondo, Horowitz & Sagawa’s review, bluntly titled “Thermodynamics of information” (Nature Physics, 2015): information belongs in the ledger alongside work, heat, measurement, feedback, controller memory, and reset. In 2024, David H. Wolpert and collaborators at the Santa Fe Institute extended finite-time stochastic-computation theory, adding to the older mismatch-cost story rather than inventing it from scratch.

Which raises the question lurking behind every data center. How far above the floor are our machines? Very. A single transistor switch dissipates on the order of $10^{-18}$ joules, hundreds to thousands of times the Landauer bound at the device level, and far more once you count memory traffic, cooling, and power conversion. We are nowhere near the wall. The useful framing is this: all data centers drew about 1.5% of global electricity in 2024, and AI is a rapidly growing workload expected to drive a substantial share of the increase through 2030 (IEA, Energy and AI, April 2025). For now this is an engineering and economic problem, not a fundamental-physics one. The “AI is about to hit the Landauer limit” headline is hype, but the floor is real, and it is one reason reversible, thermodynamic, and neuromorphic computing remain interesting.
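The size of that gap is a one-line calculation. This sketch takes the order-of-magnitude switching energy quoted above at face value and assumes a temperature of 300 K; neither number is a measurement of any particular chip.

```python
import math

K_B = 1.380649e-23                         # Boltzmann constant in J/K (exact SI value)

landauer_300k = K_B * 300.0 * math.log(2)  # fair-bit erasure floor at an assumed 300 K
switch_energy = 1e-18                      # order-of-magnitude figure from the text, in J

ratio = switch_energy / landauer_300k
print(f"one switch is about {ratio:.0f} times the Landauer floor")
```

The result is a few hundred, which is the low end of the "hundreds to thousands" range; a switching energy of $10^{-17}$ joules would put it in the thousands.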

Diagram · orders of magnitude

Illustrative energy scales — not directly comparable

Logarithmic energy scale for different operations and system boundaries. The rungs compare an ideal 300 K fair-bit reset, quasistatic single-bit erasure experiments, approximate transistor-switching energy, and full logic or system-level computation. They are orientation markers, not one clean historical efficiency curve.

The lab experiments are not better general computers than your laptop: they isolate one erasure protocol and run it gently enough to approach the thermodynamic bound. Practical chips spend energy on speed, reliability, memory movement, control, cooling, and power delivery. The comparison is useful only if the boundary around each rung is kept visible.

Edge 04 · Bit metaphysics

Where the idea overreaches

“Information is physical” is established. The temptation is to sand off the qualifier and declare that information is fundamental — that reality is, at bottom, made of bits. The physicist John Archibald Wheeler gave this its slogan in 1989: “it from bit,” the conjecture that every particle and field derives its very existence from yes/no answers, from information. It is a gorgeous, generative idea — and it is metaphysics, not a tested result. Critics note the obvious circularity: a bit has to be encoded in something, so information can’t be the bottom turtle. Keep it in the “stimulating speculation” drawer, clearly labeled.

Further out lies genuine fringe. Melvin Vopson’s “mass–energy–information equivalence” claims information has rest mass; his “second law of infodynamics” claims information entropy must decrease over time; his 2025 paper deriving gravity from information drew a flat verdict from physicist Sabine Hossenfelder that it “makes no sense.” When a frontier is hot, it grows a fringe — and learning to tell the two apart, using exactly the hype filter we’ve run all course, is the skill this whole project is really about.

Open questions

What’s genuinely unsettled

Is Shannon entropy the same thing as thermodynamic entropy, or just shaped like it? The formula is identical; whether that’s a deep physical identity (the maximum-entropy view of Jaynes) or a profound analogy is still argued. This is the question Day 33 and Days 83–85 will reopen with the stakes raised to “what is life?”
Can the demon ever truly be beaten? A minority of critics, notably the philosophers of physics Earman & Norton, argue the standard exorcism is subtly circular: it uses the second law to derive the erasure cost, then uses that cost to defend the second law. A contested 2016 result even claimed a logically irreversible gate run below $k_{B} T \ln 2$. The mainstream, backed by the experiments, holds that the bound and the exorcism stand, but the foundations aren’t fully closed.
How low can real computation actually go? Reversible computing promises to dodge erasure costs almost entirely. Nobody has built a useful machine that does. Is the Landauer floor a practical target or a permanent curiosity?
And the question waiting in the AI block: when a model “knows” something, is that knowledge ultimately just bits arranged to reduce a loss — and does the thermodynamic cost of those bits tell us anything about what it would take to think? (Days 138–145.)

The day in three sentences

Big idea: Information is measurable: a bit is the logarithmic unit, surprisal is $i(x) = -\log_{2} p(x)$, and entropy $H = -\sum_{i} p_{i} \log_{2} p_{i}$ is expected surprisal measured in bits. When information is physically represented, resetting an initially unknown fair bit in the standard isothermal model requires at least $k_{B} T \ln 2$ of work (Landauer), which is how Maxwell’s cyclic demon pays its thermodynamic bill.
Best analogy: A fact’s size is the number of yes/no questions it’s worth; and a memory is a ball in a double well, so resetting it to a standard state — crushing two possible pasts into one present — is the act that leaks heat into the room.
Live controversy: Whether Shannon’s entropy and thermodynamic entropy are one thing or two; whether the demon is truly beaten; and how far the grand claim “information is fundamental” (Wheeler’s “it from bit”) overreaches into hype.

Threads today › information (gets its hard unit at last: the bit, surprise, entropy) · energy (Landauer’s $k_{B} T \ln 2$ ties bits to heat) · computation (the thermodynamic cost of erasing, the demon as a memory device), with first hints of emergence (a law that holds across glass, electrons, magnets, atoms) and a setup for evolution & life on Days 83–85.

Tomorrow → Day 8

Complexity & Emergence

Information theory measured order one bit at a time. Tomorrow order starts assembling itself: starling murmurations, slime molds that redraw railway maps, cellular automata — simple local rules generating global structure, and the unsettled fight over weak versus strong emergence.

Sources

Sources & further reading

Shannon, C. E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal 27: 379–423 (July) & 623–656 (October). The founding paper: bit, entropy, channel capacity, noisy-channel coding theorem. doi.org/10.1002/j.1538-7305.1948.tb01338.x
Shannon, C. E. (1938). “A Symbolic Analysis of Relay and Switching Circuits.” Trans. AIEE 57(12): 713–723 (MIT master’s thesis, 1937). Boolean logic in physical switches; the Day-3 bridge.
Soni, J. & Goodman, R. (2017). A Mind at Play: How Claude Shannon Invented the Information Age. Simon & Schuster. Biography; the “Magna Carta” framing and the Tukey “bit” coinage.
Landauer, R. (1961). “Irreversibility and Heat Generation in the Computing Process.” IBM Journal of Research and Development 5(3): 183–191. The $k_{B} T \ln 2$ erasure bound. doi.org/10.1147/rd.53.0183
Bennett, C. H. (1982). “The Thermodynamics of Computation — a Review.” International Journal of Theoretical Physics 21(12): 905–940. Erasure, not measurement, exorcises Maxwell’s demon. doi.org/10.1007/BF02084158
Szilárd, L. (1929). “Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen.” Zeitschrift für Physik 53: 840–856. The single-molecule engine; one bit ↔ $k_{B} T \ln 2$ of work. doi.org/10.1007/BF01341281
Bérut, A., Arakelyan, A., Petrosyan, A., Ciliberto, S., Dillenschneider, R. & Lutz, E. (2012). “Experimental verification of Landauer’s principle linking information and thermodynamics.” Nature 483: 187–189 (8 March 2012). doi.org/10.1038/nature10872
Jun, Y., Gavrilov, M. & Bechhoefer, J. (2014). “High-Precision Test of Landauer’s Principle in a Feedback Trap.” Physical Review Letters 113: 190601 (4 Nov 2014). doi.org/10.1103/PhysRevLett.113.190601
Hong, J., Lambson, B., Dhuey, S. & Bokor, J. (2016). “Experimental test of Landauer’s principle in single-bit operations on nanomagnetic memory bits.” Science Advances 2(3): e1501492 (11 March 2016). Measured (4.2 ± 0.9) zJ; “2.8 zJ at 300 K.” doi.org/10.1126/sciadv.1501492
Yan, L. L. et al. (2018). “Single-Atom Demonstration of the Quantum Landauer Principle.” Physical Review Letters 120: 210601 (21 May 2018). A trapped ⁴⁰Ca⁺ ion; the quantum regime. link.aps.org/doi/10.1103/PhysRevLett.120.210601
Toyabe, S., Sagawa, T., Ueda, M., Muneyuki, E. & Sano, M. (2010). “Experimental demonstration of information-to-energy conversion and validation of the generalized Jarzynski equality.” Nature Physics 6: 988–992. doi.org/10.1038/nphys1821
Koski, J. V., Maisi, V. F., Pekola, J. P. & Averin, D. V. (2014). “Experimental realization of a Szilard engine with a single electron.” PNAS 111(38): 13786–13789. See also Koski et al., PRL 113: 030601 (2014) and PRL 115: 260602 (2015). doi.org/10.1073/pnas.1406966111
Saha, T. K., Lucero, J. N. E., Ehrich, J., Sivak, D. A. & Bechhoefer, J. (2021). “Maximizing power and velocity of an information engine.” PNAS 118(20): e2023356118 (18 May 2021). Optimized colloidal information ratchet (“fastest” is press framing). doi.org/10.1073/pnas.2023356118
Jarzynski, C. (1997). “Nonequilibrium Equality for Free Energy Differences.” Physical Review Letters 78: 2690. doi.org/10.1103/PhysRevLett.78.2690
Crooks, G. E. (1999). “Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences.” Physical Review E 60: 2721. doi.org/10.1103/PhysRevE.60.2721
Parrondo, J. M. R., Horowitz, J. M. & Sagawa, T. (2015). “Thermodynamics of information.” Nature Physics 11: 131–139. doi.org/10.1038/nphys3230
Manzano, G., Kardeş, G., Roldán, É. & Wolpert, D. H. (2024). “Thermodynamics of Computations with Absolute Irreversibility, Unidirectional Transitions, and Stochastic Computation Times.” Physical Review X 14: 021026 (13 May 2024). The “mismatch cost” above the Landauer floor. doi.org/10.1103/PhysRevX.14.021026
International Energy Agency (2025). Energy and AI (April 2025). Data-center electricity ≈1.5% of global demand in 2024 (~415 TWh), projected ~945 TWh by 2030. iea.org/reports/energy-and-ai
Wheeler, J. A. (1990). “Information, Physics, Quantum: The Search for Links.” In Complexity, Entropy, and the Physics of Information. “It from bit” (CONTESTED metaphysics).
Vopson, M. M. (2023). “The second law of infodynamics and its implications for the simulated universe hypothesis.” AIP Advances 13: 105308. CONTESTED/HYPE. doi.org/10.1063/5.0173278 See also Vopson (2025), “Is gravity evidence of a computational universe?”, AIP Advances 15: 045035, doi.org/10.1063/5.0264945, and Sabine Hossenfelder’s 2025 public critique.

Deep dive appendix · The Deeper Currents · Optional extension.

We left the main lesson having done two remarkable things: we weighed a fact in bits, and we paid — in heat, in zeptojoules — to forget one. That is the spine of the subject. But a spine is not a body. Between Shannon’s definition of the bit and Landauer’s price for erasing it lies a century of ideas we sprinted past: a second quantity, just as fundamental as entropy, that measures how much one thing tells you about another; the deep reason your phone can shrink a photo by 95% and stream it through static without a flaw; a maze-solving mechanical mouse; and a chain of physical limits that ends, quite literally, at the computational capacity of the universe. This appendix is the unhurried tour.

01 · the missing quantity

Mutual information, and the tax for a wrong belief

Entropy measures the uncertainty inside one source. But the whole point of communication — and of evidence, and of perception — is that one thing tells you about another. The thunder tells you about the lightning; the test result tells you about the disease; the open goat door on Day 4 told you about the car. The quantity that captures this is the true workhorse of the theory: mutual information.

Start with two simpler pieces. Joint entropy $H(X, Y)$ is the total uncertainty in the pair. Conditional entropy $H(Y \mid X)$ is the uncertainty left in Y once you already know X: the surprise that survives. Mutual information is the gap between Y's uncertainty on its own, $H(Y)$, and that remainder: how much knowing X shrinks your uncertainty about Y.

$I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$

The uncertainty about Y, minus what’s left after you learn X. Symmetric: X tells you exactly as much about Y as Y tells you about X.

For two variables, the overlap of two circles is a useful accounting picture. Do not push it too far: with three or more variables, information diagrams can contain signed interaction regions. Mutual information measures statistical dependence, not causation. A causal graph (Day 5) encodes structural assumptions and conditional independences; two variables can share information because one causes the other, because they share a cause, or because of selection. In Bayesian experimental design (Day 2), expected information gain about a hypothesis is one useful objective and can be written as a mutual information. A realized observation is evidence through likelihood ratios, Bayes factors, log scores, or posterior updates.

Mutual information measures statistical dependence. Channel capacity is the maximum mutual information rate achievable under a specified channel model and constraints. Causation and realized evidence require additional structure beyond mutual information alone.

Interactive · the overlap of two uncertainties

Mutual Information, drawn

Two fair coins, X and Y, each carrying 1 bit. Slide them from independent (knowing one tells you nothing about the other) to identical (knowing one tells you everything). The teal overlap is $I (X; Y)$ ; the outer crescents are the surprise that survives — the conditional entropies. Watch the channel "open."

Initial readout at agreement $P (X = Y) = 0.50$ : $H (X) = 1.00$ bit, $H (Y) = 1.00$ bit, $H (Y ∣ X) = 1.00$ bit (surprise left), $I (X; Y) = 0.00$ bits (shared).

Independent. The coins ignore each other; learning X leaves all of Y's uncertainty intact. The channel carries nothing.

Mutual information, as agreement changes

Agreement $P (X = Y)$	$H (Y ∣ X)$	$I (X; Y)$	Interpretation
0.50	1.00 bit	0.00 bits	Independent coins; X tells you nothing about Y.
0.75	0.81 bits	0.19 bits	A noisy channel; X removes a little uncertainty about Y.
0.90	0.47 bits	0.53 bits	Strong correlation; knowing X removes most of Y’s uncertainty.
1.00	0.00 bits	1.00 bit	Identical variables; knowing X tells you Y completely.
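The table's numbers can be checked with a few lines of Python. This is a minimal sketch under one stated assumption: X is a fair coin and Y copies X with probability `agree`, independently of X's value, so Y is fair too. The function names are illustrative, not taken from the interactive.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def conditional_entropy(agree: float) -> float:
    """H(Y|X) in bits: once X is known, Y is a coin with bias `agree`."""
    return binary_entropy(agree)

def mutual_information(agree: float) -> float:
    """I(X;Y) = H(Y) - H(Y|X), with H(Y) = 1 bit for a fair coin."""
    return 1.0 - conditional_entropy(agree)

for agree in (0.50, 0.75, 0.90, 1.00):
    print(f"P(X=Y)={agree:.2f}  "
          f"H(Y|X)={conditional_entropy(agree):.2f}  "
          f"I(X;Y)={mutual_information(agree):.2f}")
```

Because the binary entropy is symmetric around 0.5, an agreement of 0.25 carries exactly as much mutual information as 0.75: a coin that reliably disagrees is as informative as one that reliably agrees.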

The surprise tax: relative entropy

There’s a cousin of entropy that turns out to be one of the most useful quantities in all of science, and it answers a sharp question: what does it cost to believe the wrong thing? Suppose reality draws outcomes from a distribution $p$ , but you’ve modeled the world as $q$ . Every time you’re surprised, you’re surprised by the wrong amount: you budgeted $- \log_{2} q_{i}$ bits for outcome $i$ , but outcomes arrive with the frequencies of $p$ , not $q$ . The average overpayment is the relative entropy, or Kullback–Leibler divergence:

D (p ∥ q) = \sum_{i} p_{i} \log_{2} \frac{p_{i}}{q_{i}} \geq 0

The extra bits per symbol you waste by coding for q when the truth is p. Zero only when your model is exactly right. Never negative.

This single inequality is doing astonishing amounts of work across the whole curriculum. It is the penalty for a bad prior — which is why it sits at the heart of Bayesian updating (Day 1, Day 4). KL is a divergence, not a true distance: it is asymmetric and lacks the triangle inequality. Cross-entropy is related but not identical: $H (p, q) = H (p) + D_{KL} (p ∥ q)$ , so minimizing cross-entropy is equivalent to minimizing KL only when the true distribution p is fixed. Language-model training can be idealized as minimizing an empirical estimate of expected cross-entropy, not as literally accessing $D (reality ∥ model)$ . Friston’s variational free energy is also subtler: it upper-bounds surprisal under a generative model and contains a KL term between an approximate posterior and the model posterior. Mutual information itself can be written as a KL divergence between the joint distribution P(X,Y) and the product P(X)P(Y).
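The identities in this section can be verified numerically. The sketch below is illustrative only: the distributions `p` and `q` and the joint table are made-up examples, and the functions assume `q` is nonzero wherever `p` is.

```python
import math

def entropy(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: average code length when coding p with q's code."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) in bits: the extra bits per symbol for the wrong model."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) as the KL divergence from P(X)P(Y) to the joint P(X,Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [v for row in joint for v in row]
    flat_product = [a * b for a in px for b in py]
    return kl_divergence(flat_joint, flat_product)

p = [0.5, 0.25, 0.25]      # hypothetical "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]  # hypothetical model

print("H(p)      =", entropy(p))
print("H(p, q)   =", cross_entropy(p, q))
print("D(p || q) =", kl_divergence(p, q))
print("D(q || p) =", kl_divergence(q, p))  # differs: KL is asymmetric

# Two fair coins that agree 75% of the time.
joint = [[0.375, 0.125], [0.125, 0.375]]
print("I(X;Y)    =", mutual_information(joint))
```

The printed values satisfy $H (p, q) = H (p) + D (p ∥ q)$ up to rounding, and the two KL values differ, which is the asymmetry noted above.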

02 · compression

Squeezing out the air

Entropy isn’t just an abstraction; it’s an asymptotic floor you can approach. Shannon’s source coding theorem — the quieter twin of the noisy-channel theorem — says that long blocks from a specified source can be compressed to an average rate arbitrarily close to H bits per symbol, and not below H without losing information or exploiting structure outside the source model. Entropy is the irreducible residue relative to that model: the message with the air squeezed out. Everything above that floor is redundancy, and redundancy is exactly what compression hunts.

How much air is in ordinary text? A great deal. With 27 symbols (26 letters and a space), a perfectly random stream would carry $\log_{2} 27 \approx 4.75$ bits per letter. But English is wildly predictable: q often drags u behind it, th and he recur, vowels are partly constrained. In a famous 1951 experiment, Shannon had people guess the next letter of a hidden text and measured how often they were right, arriving at an entropy of roughly one bit per letter (his estimates ranged about 0.6 to 1.3). The implication is startling: English is something like three-quarters redundant. Yu cn prbbly rd ths sntnc wth th vwls rmvd, because context already supplied much of the information needed to recover the vowels. Crossword puzzles, autocomplete, and the ability to hear someone at a noisy party all run on the same surplus.
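The redundancy figure is simple arithmetic: one minus the ratio of the measured entropy to the maximum possible for the alphabet. A small sketch, taking the 27-symbol alphabet and Shannon's range of estimates from the paragraph above (the function name is illustrative):

```python
import math

ALPHABET_SIZE = 27                     # 26 letters plus space
MAX_BITS = math.log2(ALPHABET_SIZE)    # bits per letter if all were equally likely

def redundancy(entropy_bits_per_letter: float) -> float:
    """Fraction of each letter's capacity that carries no new information."""
    return 1.0 - entropy_bits_per_letter / MAX_BITS

for h in (0.6, 1.0, 1.3):
    print(f"H = {h:.1f} bits/letter -> redundancy = {redundancy(h):.0%}")
```

Across the quoted range of estimates the redundancy stays above 70%, which is why "three-quarters" is a fair summary.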

The compression ladder

Huffman coding (1952) — short codewords for common symbols — was optimal for coding symbols one at a time, but it can’t capture patterns across symbols. Arithmetic coding (1970s) does better by encoding a whole message as a single number in [0,1), squeezing right up against the entropy floor. And Lempel–Ziv (1977–78) is adaptive: it builds a dictionary of repeated substrings on the fly, so the second time a pattern appears it costs almost nothing. LZ methods run inside ZIP, gzip, PNG, and many everyday lossless formats, though audio, video, and modern image codecs also use transform, predictive, and entropy-coding machinery. None can beat Shannon’s floor for the modeled source; good ones approach it under their assumptions.
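To see the first rung of the ladder at work, here is a minimal Huffman sketch that computes codeword lengths for a short string and compares the average length with the entropy of the empirical symbol frequencies. The sample text and function name are illustrative choices, and the frequencies are treated as the source model.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length in bits} for a Huffman code."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap items: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "abracadabra"
freqs = Counter(text)
lengths = huffman_code_lengths(freqs)
n = len(text)

avg_len = sum(freqs[s] * lengths[s] for s in freqs) / n
entropy = -sum((c / n) * math.log2(c / n) for c in freqs.values())

print("codeword lengths:", lengths)
print(f"average length : {avg_len:.3f} bits/symbol")
print(f"entropy        : {entropy:.3f} bits/symbol")
```

The average length lands between the entropy and the entropy plus one bit, which is the standard guarantee for symbol-by-symbol Huffman coding. Closing the remaining gap is what arithmetic coding is for.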

All of that is lossless: unzip and you get back every bit. But your eyes and ears are forgiving instruments, and that forgiveness is worth money. Rate–distortion theory (Shannon, 1959) asks the harder question: if you’re allowed to lose a little, how few bits can you spend for a given level of acceptable distortion? This is the governing math of lossy compression — JPEG throwing away the fine color detail your retina can’t resolve, MP3 discarding the quiet tones a louder one is already masking. The art is to spend your bits only where a human will notice, and to let the rest evaporate. Streaming, video calls, the entire visual texture of the modern internet — all of it is rate–distortion theory cashed in.
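Rate–distortion theory has a closed form in the simplest textbook case, which makes the trade-off concrete. The sketch below assumes a Bernoulli(p) binary source with Hamming distortion (the fraction of bits allowed to be wrong), where the standard result is $R (D) = H_{b} (p) - H_{b} (D)$ for $D < \min (p, 1 - p)$ and zero otherwise. This toy source is not a model of JPEG or MP3.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin with bias p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_bernoulli(p: float, d: float) -> float:
    """Minimum bits per symbol for a Bernoulli(p) source when a
    fraction d of the reconstructed bits may be wrong."""
    if d >= min(p, 1 - p):
        return 0.0  # guessing the likelier symbol is already good enough
    return binary_entropy(p) - binary_entropy(d)

for d in (0.0, 0.01, 0.05, 0.10, 0.25, 0.50):
    rate = rate_distortion_bernoulli(0.5, d)
    print(f"allowed error {d:.2f} -> {rate:.3f} bits/symbol")
```

For a fair-coin source, tolerating 10% bit errors cuts the required rate by nearly half, which is the kind of bargain lossy codecs exploit with far more elaborate models.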

03 · the ultimate compression

Kolmogorov complexity, where information meets computation

Shannon’s entropy is a property of a source: a probability distribution. But here’s a puzzle it can’t touch. Consider two strings of a million digits: one is a million random coin-derived digits, the other is the first million digits of $π$ . Scored against a uniform digit source, the two look equally dense with information. Yet $π$ can be regenerated from a comparatively short program, while a typical random string cannot be compressed much at all. The random string has the higher description complexity.

That description complexity is the string’s Kolmogorov complexity (Solomonoff, Kolmogorov, and Chaitin, independently, in the 1960s): the length of the shortest computer program that outputs the string. It is the compression limit measured not against a probability model but against computation itself, and it is machine-independent only up to an additive constant determined by the chosen universal machine. $π$ is profoundly compressible (short program, infinite output); a typical long random string is incompressible in a precise counting sense: most strings have no description much shorter than themselves. In fact this becomes a definition of algorithmic randomness.

And here the subject grows a sharp, beautiful tooth. Kolmogorov complexity is provably uncomputable. There is no algorithm that, given any string, returns the length of its shortest program. The result rhymes with the Gödel and Turing limits waiting on Days 27–28, and with Chaitin’s eerie constant Ω, a specific, perfectly well-defined number. A suitable consistent formal system can determine some finite facts about Ω, but not arbitrarily many of its digits. We will meet related ideas again tomorrow under the heading of complexity, but with an important warning: high description complexity alone does not measure meaningful or organized complexity. Typical random strings maximize it too.
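Since Kolmogorov complexity cannot be computed, any practical demonstration needs a stand-in. The sketch below uses `zlib` from Python's standard library as a crude, computable upper bound on description length: the compressed size, plus a fixed-size decompressor, is one valid description of the data. That proxy is an assumption of this example, not a measurement of true Kolmogorov complexity.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Bytes needed by zlib at its highest compression level."""
    return len(zlib.compress(data, 9))

n = 100_000
patterned = b"0123456789" * (n // 10)   # highly regular, n bytes long
random_bytes = os.urandom(n)            # typical incompressible data

print("patterned :", n, "->", compressed_size(patterned), "bytes")
print("random    :", n, "->", compressed_size(random_bytes), "bytes")
```

The proxy has a telling failure mode. A general-purpose compressor would treat the digits of $π$ much like the random bytes, because it looks for repeated substrings and knows nothing about the short program that generates them. The gap between what a compressor finds and what the shortest program achieves is exactly where uncomputability bites.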

04 · codes against the dark

How information survives a hostile world

The main lesson promised that, below a specified channel’s capacity, vanishing error is asymptotically possible if you can find sufficiently long codes. That treasure hunt deserves its own telling, and it starts with a frustrated man at Bell Labs on a Friday night.

In 1947, Richard Hamming had weekend access to a relay computer, and it kept letting him down: whenever it detected an error in his input, it simply halted and waited for a human who wouldn’t arrive until Monday. “If the machine can detect an error,” Hamming fumed, “why can’t it locate and correct it?” Out of that irritation came the Hamming code (published 1950), the first true error-correcting code, and with it a geometric way of seeing the whole field.

The idea is Hamming distance: the number of bit-flips separating two codewords. If your only valid messages are 000 and 111, they sit distance 3 apart. A single error turns 000 into something like 010 — but 010 is still closer to 000 than to 111, so the receiver can quietly correct it by majority vote. Spread your valid codewords far enough apart in this space of bit-strings, and you carve out a little protective “sphere” of correctable errors around each one. Redundancy, deployed with geometric cunning.

Interactive · the geometry of a code

The Hamming Cube

Every 3-bit word is a corner of a cube; an edge is one bit-flip. Our code uses just two valid words — 000 and 111, the far corners (teal). Click any corner to receive it through a noisy channel: the decoder snaps it to the nearest valid word by majority vote. One error is always corrected; the cube shows you why.

Legend: codeword · decodes → 000 · decodes → 111

Click a corner. The two codewords sit at opposite ends of the cube, Hamming distance 3 apart. That gap is the code's armor: any single bit-flip lands you one step from where you started, still firmly inside one codeword's territory.

The 3-bit repetition code, as nearest-neighbor decoding

Received word	Distance to 000	Distance to 111	Decodes as
`000`	0	3	`000`
`001`, `010`, `100`	1	2	`000`
`011`, `101`, `110`	2	1	`111`
`111`	3	0	`111`

The codewords are Hamming distance 3 apart, so any one-bit error remains closer to the original codeword than to the other one.
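The decoding table can be reproduced with a nearest-neighbor decoder in a few lines. This is a minimal sketch for the 3-bit repetition code only, with illustrative function names. It enumerates all eight corners of the cube and checks that every single-bit error is corrected.

```python
from itertools import product

CODEWORDS = ("000", "111")

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(received: str) -> str:
    """Snap a received word to the nearest valid codeword."""
    return min(CODEWORDS, key=lambda c: hamming_distance(received, c))

def flip(word: str, i: int) -> str:
    """Return `word` with bit i inverted (a single channel error)."""
    return word[:i] + ("1" if word[i] == "0" else "0") + word[i + 1:]

for bits in product("01", repeat=3):
    word = "".join(bits)
    print(word,
          hamming_distance(word, "000"),
          hamming_distance(word, "111"),
          "->", decode(word))

# Every single-bit error on either codeword is corrected.
single_errors_fixed = all(
    decode(flip(c, i)) == c for c in CODEWORDS for i in range(3)
)
print("all single errors corrected:", single_errors_fixed)
```

With odd length 3 there are no ties, so nearest-neighbor decoding coincides with majority vote. Two errors, however, land closer to the wrong codeword and are silently miscorrected.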

From Mars to your pocket

Hamming’s geometry scaled into an engineering miracle that you rely on hourly without noticing. Reed–Solomon codes (1960) treat data as points on polynomials and can repair whole bursts of damage — which is why a scratched CD still plays, why a torn QR code still scans, and why the Voyager probes could whisper pictures of the outer planets across billions of kilometers on a transmitter weaker than a refrigerator bulb. Deep-space missions stacked codes inside codes (concatenation) to claw their signal out of almost pure noise.

But for decades a gap remained between what Shannon promised and what engineers could build. That gap closed in a rush. Turbo codes (1993) stunned the field by getting within a fraction of a decibel of capacity using two simple coders trading guesses back and forth. Low-density parity-check codes, invented by Robert Gallager in 1960, then forgotten for thirty years as too demanding for the hardware of the day, and rediscovered in the late 1990s, now armor Wi-Fi, 5G data channels, and storage. And in 2009, Erdal Arıkan’s polar codes became the first explicitly constructed family proved to achieve the symmetric capacity of binary-input memoryless channels asymptotically, with efficient encoding and decoding. In 2016 the global 5G standard adopted polar coding for control-related channels, while LDPC carries much of the data-channel burden. Sixty-eight years after Shannon drew the X on the map, engineers had practical paths toward it.

05 · the shoulders & the showman

The people behind the page

Shannon didn’t arrive from nowhere. Two Bell Labs predecessors had circled the idea. Harry Nyquist (1924) worked out how fast a telegraph line could signal; Ralph Hartley (1928), in a paper actually titled “Transmission of Information,” proposed measuring information as the logarithm of the number of possible messages — the seed of the bit, missing only Shannon’s crucial leap of bringing in probability and throwing out meaning. Shannon honored the debt: the unit of information in some conventions is still the “hartley.”

But the man himself is one of science’s great delights, and worth a paragraph of pure pleasure. Claude Shannon rode a unicycle down the Bell Labs corridors while juggling. He built a mechanical mouse named Theseus (1950) that could learn its way through a maze — one of the first demonstrations of machine learning, sitting on his desk. He built, at Marvin Minsky’s suggestion, the “Ultimate Machine”: a box with a single switch that, when flipped on, opened, extended a mechanical hand, switched itself off, and withdrew. With the mathematician Ed Thorp he built what is often called the first wearable computer — a cigarette-pack-sized device to beat roulette, smuggled into Las Vegas in 1961. His parlor trick of having guests predict the next letter of a sentence wasn’t a party game; it was him measuring the entropy of English in real time. The inventor of the most disembodied, abstract quantity in all of engineering was, in person, irrepressibly, gloriously physical. It is fitting that the punchline of his theory — information is physical — was waiting at the end of the road he started down.

06 · when information goes quantum

The bit’s stranger sibling

Everything so far assumed a bit is a definite 0 or 1. Quantum mechanics offers a richer carrier — the qubit — and the information theory built on it is so different it deserves its own block (Day 47). Here is just enough to feel the strangeness.

A qubit can hold a superposition of 0 and 1, a point on a whole sphere of possibilities rather than two poles. You might think it therefore carries infinitely more information. It doesn’t — and the reason is one of the deepest facts in the subject. Holevo’s bound (1973) limits the accessible classical information from a quantum ensemble; in the ordinary unassisted setting, one transmitted qubit cannot reveal an arbitrary continuum of classical data. Entanglement-assisted protocols such as superdense coding change the resources being counted. Quantum measurement can also have more than two outcomes, so “yes/no” is a simplification. The continuum is real but coy: it shapes how the qubit evolves and interferes, yet cannot be fully read out as ordinary classical information.

Two more jewels, both forward-pointers. The no-cloning theorem (Wootters & Zurek; Dieks, 1982), an established result, says an unknown quantum state cannot be copied (there is no quantum photocopier), which helps explain why eavesdropping can disturb quantum-cryptographic protocols, though security requires a full protocol proof, not no-cloning alone. And the quantum analogue of Shannon entropy, the von Neumann entropy $S = - \operatorname{Tr} (ρ \log ρ)$ , measures uncertainty in a quantum state and equals entanglement entropy for a pure bipartite state. Mixed-state entanglement needs other measures. The bit, it turns out, was only the first draft.
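For a single qubit the von Neumann entropy can be computed without any linear-algebra library. The sketch below relies on a standard fact not spelled out above: a qubit density matrix with Bloch-vector length $r$ has eigenvalues $(1 \pm r) / 2$ , so $S$ is the Shannon entropy of those two numbers. Base-2 logarithms are used so the answer comes out in bits, and the function name is illustrative.

```python
import math

def qubit_von_neumann_entropy(bloch_length: float) -> float:
    """S(rho) in bits for a qubit whose Bloch vector has length r in [0, 1]."""
    if not 0.0 <= bloch_length <= 1.0:
        raise ValueError("Bloch vector length must lie in [0, 1]")
    eigenvalues = ((1 + bloch_length) / 2, (1 - bloch_length) / 2)
    return -sum(lam * math.log2(lam) for lam in eigenvalues if lam > 0)

# r = 1: a pure state, no uncertainty.
# r = 0: the maximally mixed state, e.g. one half of a Bell pair.
for r in (1.0, 0.8, 0.5, 0.0):
    print(f"r = {r:.1f} -> S = {qubit_von_neumann_entropy(r):.3f} bits")
```

The $r = 0$ case is the entanglement-entropy statement in miniature: each half of a maximally entangled pair looks, on its own, like a fair coin.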

07 · cheating the toll

Beating Landauer with reversible computing

The main lesson delivered Landauer’s standard verdict: resetting an initially unknown fair bit in an isothermal cycle costs at least $k_{B} T \ln 2$ because the operation crushes two possible pasts into one. But read the fine print and a loophole gleams. The fundamental cost is charged on logically irreversible steps, the ones you can’t run backward. What if you simply never erased anything?

This is reversible computing, and it is theoretically sound, established in principle. Charles Bennett showed in 1973 that any computation can be rebuilt as a sequence of reversible steps that never destroy information: instead of overwriting intermediate results, you keep them, run your computation, copy out the answer, then run the whole thing backwards to cleanly “uncompute” the garbage, returning every bit to its start with no net erasure. Special reversible logic gates make this concrete: the Toffoli gate (a controlled-controlled-NOT) and the Fredkin gate (a controlled swap) are universal (you can build any circuit from them), yet each is perfectly invertible, with the same number of outputs as inputs and nothing thrown away.

In principle, then, reversible computing can avoid the fundamental erasure toll until information is intentionally discarded. Physical implementations can reduce dissipation by using gentler, more nearly quasistatic protocols, but the speed-energy trade-off depends on the device model; logical reversibility by itself does not mean a computer must run vanishingly slowly. Reversible algorithms also involve time-space trade-offs rather than simply storing an unlimited history forever. No one has built a useful general-purpose reversible computer; it remains an exquisite idea more than a product. But it is exactly why the field is suddenly interesting again: as conventional chips press toward physical limits, not forgetting starts to look like an engineering strategy rather than a curiosity. The thread running into Day 47: ideal closed-system quantum gates are unitary and reversible, while initialization, measurement, error correction, control, and reset require their own thermodynamic accounting.
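The reversibility of the Toffoli and Fredkin gates can be checked exhaustively, since each acts on only three bits. A minimal sketch with illustrative function names: it confirms that each gate permutes the eight possible inputs, that each is its own inverse, and that a Toffoli gate with a zeroed target computes AND while keeping its inputs.

```python
from itertools import product

def toffoli(a: int, b: int, c: int) -> tuple:
    """Controlled-controlled-NOT: flip c only when a and b are both 1."""
    return a, b, c ^ (a & b)

def fredkin(c: int, x: int, y: int) -> tuple:
    """Controlled swap: exchange x and y only when c is 1."""
    return (c, y, x) if c else (c, x, y)

inputs = list(product((0, 1), repeat=3))

for name, gate in (("Toffoli", toffoli), ("Fredkin", fredkin)):
    outputs = [gate(*bits) for bits in inputs]
    is_bijection = sorted(outputs) == sorted(inputs)
    is_self_inverse = all(gate(*gate(*bits)) == bits for bits in inputs)
    print(f"{name}: bijection={is_bijection}, self-inverse={is_self_inverse}")

# AND without erasure: the inputs a and b survive alongside the result.
for a, b in product((0, 1), repeat=2):
    print(f"a={a} b={b} -> toffoli(a, b, 0) = {toffoli(a, b, 0)}")
```

An ordinary AND gate maps four input pairs onto two outputs and so must forget something. The Toffoli version pays for reversibility with an extra wire, which is the space side of the time-space trade-off mentioned above.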

08 · the last wall

The ultimate physical limits of computation

Landauer bounds the energy of forgetting. But there are other walls — on speed, on density, on how much computing a lump of matter can ever do — and tracing them out leads somewhere genuinely vertiginous. In 2000, MIT’s Seth Lloyd estimated an “ultimate laptop”: one kilogram of matter, one liter of volume, organized as efficiently as the laws of physics permit (Nature, 31 August 2000). These are model-dependent upper bounds, not a practical device specification.

The speed limit comes from quantum mechanics itself. The Margolus–Levitin theorem (1998) bounds how fast a system can pass through distinguishable states in terms of energy above its ground state, with the exact interpretation depending on what is counted as an operation. Pour all of one kilogram’s rest-mass energy ( $E = m c^{2}$ , about $10^{17}$ joules) into Lloyd’s idealized computation, and you get a ceiling of roughly $10^{51}$ operations per second. The memory limit comes from counting the states such a system can occupy: about $10^{31}$ bits. These numbers are not engineering targets; they are made of $c$ , $ℏ$ , and $k_{B}$ , the bedrock constants. Lloyd notes the obvious snag: a device running at this limit has turned its kilogram into a roughly billion-degree plasma, “a packaging problem,” he dryly concedes.
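The speed ceiling is a one-line calculation once the bound is written as $2 E / (π ℏ)$ operations per second, the form quoted in the sources for this appendix. The sketch below plugs in the rest-mass energy of one kilogram. The constant values are standard SI figures supplied here, and the function name is illustrative.

```python
import math

C = 299_792_458.0        # speed of light, m/s
HBAR = 1.054_571_817e-34 # reduced Planck constant, J*s

def max_ops_per_second(mass_kg: float) -> float:
    """Margolus-Levitin ceiling, 2E / (pi * hbar), with E = m c^2."""
    energy = mass_kg * C ** 2
    return 2.0 * energy / (math.pi * HBAR)

energy_1kg = 1.0 * C ** 2
print(f"E for 1 kg : {energy_1kg:.2e} J")
print(f"rate bound : {max_ops_per_second(1.0):.2e} ops/s")
```

The result depends only on the mass and on two constants of nature, which is why the text calls these figures bedrock rather than engineering.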

Information lives on surfaces

The deepest hint that information is woven into physics at the ground floor comes from gravity. The Bekenstein bound (1981) is fundamentally an energy-radius bound on entropy, $S \leq 2 π k_{B} E R / (ℏ c)$ . Area scaling enters when gravitational collapse and black-hole entropy are brought into the story: black holes saturate the relevant bound, with entropy proportional to event-horizon area. That oddity, that the maximum information associated with a gravitational region can be controlled by its boundary, is the seed of the holographic principle (’t Hooft, Susskind, 1990s), a leading theoretical idea that has not been confirmed. In theories where holographic duality applies, a gravitational bulk can have a lower-dimensional boundary description. That is subtler than saying our universe is simply encoded on a distant 2D surface. It also sets up the black-hole information paradox, which we’ll meet head-on on Day 40.
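To get a feel for the size of the Bekenstein bound, divide the entropy ceiling by $k_{B} \ln 2$ to express it in bits: $2 π E R / (ℏ c \ln 2)$ . The sketch below evaluates it for one kilogram packed into a sphere of one liter. That geometry is an assumption of this example, chosen to echo the "ultimate laptop", and the constants are standard SI values supplied here.

```python
import math

C = 299_792_458.0        # speed of light, m/s
HBAR = 1.054_571_817e-34 # reduced Planck constant, J*s

def bekenstein_bound_bits(mass_kg: float, radius_m: float) -> float:
    """Upper bound on information, in bits, for energy E = m c^2
    enclosed in a sphere of the given radius."""
    energy = mass_kg * C ** 2
    return 2.0 * math.pi * energy * radius_m / (HBAR * C * math.log(2))

# Radius of a sphere whose volume is one liter (1e-3 cubic meters).
radius_one_liter = (3.0 * 1e-3 / (4.0 * math.pi)) ** (1.0 / 3.0)

print(f"radius : {radius_one_liter * 100:.1f} cm")
print(f"bound  : {bekenstein_bound_bits(1.0, radius_one_liter):.2e} bits")
```

The figure comes out many orders of magnitude above Lloyd's roughly $10^{31}$ bits. The two are not in conflict: the Bekenstein bound is a looser, more general ceiling, while Lloyd's number comes from counting states in a specific physical model.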

Push the same accounting to the largest scale and you can ask: how much has the entire universe computed since the Big Bang? Lloyd’s 2002 estimate is about $10^{120}$ elementary operations on $10^{90}$ ordinary-matter bits, with a much larger information budget if gravitational degrees of freedom are included. Enormous, yes. But finite. There is a hard, countable ceiling on how much computation reality itself has performed. Whether that means the universe quite literally is a computer is a live and contested interpretation, not a result, and a close relative of the “it from bit” overreach the main lesson flagged. The numerical bounds are physics; the metaphysics is optional.

Diagram · rates and totals

The ladder of computational capacity

Two separate logarithmic scales: device rates in operations per second, then Lloyd’s cumulative universe estimate in total operations. They are adjacent because both are physical bounds on computation, but they are not the same dimension.

The jump from your laptop to the ultimate laptop is ~10³⁹ in rate. The universe marker is a separate cumulative estimate, not an operations-per-second benchmark. Both scales say the same constraint: computation, like everything else, is rationed by physics.

09 · the thread, everywhere

Once you have the unit, you see it everywhere

The quiet power of Shannon’s idea is that the moment information became a number, it stopped belonging to telegraphy and started turning up in every field that deals with uncertainty, variety, or surprise. A quick tour of the thread’s reach, each a doorway to a later day:

In ecology, the Shannon diversity index — the very same H = −Σ p log p — measures biodiversity: a rainforest has high entropy (many species, evenly spread), a monoculture field has almost none. Surprise, applied to “which species is this?”
In machine learning, autoregressive language-model pretraining commonly minimizes empirical token-level cross-entropy. Perplexity is the exponential of that average loss and measures predictive fit on a specified dataset, but it is not a complete measure of model quality. Instruction tuning, preference optimization, reinforcement learning, distillation, and multimodal training can use other objectives.
In biology, DNA is a four-letter code carrying ~2 bits per base — heredity as literal digital information storage, a framing that runs straight into the central dogma on Day 77 and the origin of life on Days 87–90.
In physics, the entropy we defined today and the thermodynamic entropy of heat engines turn out to be — perhaps — the same thing, the question the main lesson flagged and that Day 33 will reopen as the arrow of time.
In life itself, Schrödinger’s idea that organisms survive by feeding on “negative entropy” (Brillouin later called it negentropy) is information thermodynamics in embryo: the puzzle that Days 83–85 are built around, where today’s $k_{B} T \ln 2$ returns to govern the living cell.

That is the real reason this day sits where it does in the course, only seventh of a hundred and eighty. Information is one of the five threads — alongside energy, evolution, emergence, and computation — precisely because, once weighed, it refuses to stay in its lane. It is the connective tissue of the whole map.
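Several doorways in the list above use the same function with different labels on the outcomes. The sketch below applies one entropy routine to a species distribution, to DNA bases, and to perplexity. All the probability vectors are made-up illustrations, and the ecology literature often uses natural logarithms where this example uses bits.

```python
import math

def shannon_entropy_bits(probs) -> float:
    """H = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(entropy_bits: float) -> float:
    """Effective number of equally likely choices: 2 ** H."""
    return 2.0 ** entropy_bits

rainforest = [0.1] * 10                  # ten species, evenly spread
monoculture = [0.97, 0.01, 0.01, 0.01]   # one crop plus a few weeds
dna_uniform = [0.25] * 4                 # A, C, G, T equally likely

for name, dist in (("rainforest", rainforest),
                   ("monoculture", monoculture),
                   ("DNA base", dna_uniform)):
    h = shannon_entropy_bits(dist)
    print(f"{name:12s} H = {h:.3f} bits, perplexity = {perplexity(h):.2f}")
```

The 2 bits per base is a ceiling that holds only when the four bases are equally likely and independent. Real genomes, like real English, carry less than their alphabet allows.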

The appendix in three sentences

What we added: Beneath the bit lies a fuller algebra — mutual information (statistical dependence and the core of channel capacity under a specified model) and relative entropy (the excess code length for a wrong model, related to but not identical with cross-entropy and variational free energy) — plus asymptotic compression limits, practical error-correcting codes, and reversible computing as a way to avoid unnecessary erasure.
The widest view: Information’s physicality doesn’t stop at the warm room of erasure: it scales up to hard limits on the speed and density of computation, black-hole entropy, holographic boundary descriptions in theories where they apply, and a finite, countable estimate of the universe’s cumulative operations.
Why it matters: The instant Shannon made information a number, it escaped engineering — and now the same H shows up in biodiversity, neural-network training, DNA, the arrow of time, and the thermodynamics of life, which is exactly why it is one of the five threads this whole course is built to follow.

Threads deepened › information (mutual information, KL divergence, Kolmogorov complexity) · energy (reversible computing, Margolus–Levitin, the ultimate laptop) · computation (codes, quantum information, limits of all computation) · emergence (why compression is not the whole story → Day 8) · evolution & life (DNA as bits, negentropy → Days 77, 83–85).

Sources · appendix

Sources & further reading

(New to this appendix; the core Shannon/Landauer citations are on the main Day 007 page.)

Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. — the standard text: mutual information, relative entropy (KL), source & channel coding, rate–distortion.
Shannon, C. E. (1951). “Prediction and Entropy of Printed English.” Bell System Technical Journal 30: 50–64. doi:10.1002/j.1538-7305.1951.tb01366.x — entropy of English $\approx 1$ bit/letter; $\sim 75\%$ redundancy.
Shannon, C. E. (1959). “Coding Theorems for a Discrete Source with a Fidelity Criterion.” IRE Nat. Conv. Rec. — the foundation of rate–distortion (lossy compression) theory.
Hamming, R. W. (1950). “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29(2): 147–160. — Hamming distance and the first error-correcting code.
Nyquist, H. (1924). “Certain Factors Affecting Telegraph Speed.” · Hartley, R. V. L. (1928). “Transmission of Information.” Bell System Technical Journal. — Shannon’s direct precursors.
Ziv, J. & Lempel, A. (1977, 1978). “A Universal Algorithm for Sequential Data Compression” & “Compression of Individual Sequences…” IEEE Trans. Information Theory. — LZ77/LZ78, the basis of ZIP, gzip, PNG.
Kolmogorov, A. N. (1965); Solomonoff, R. (1964); Chaitin, G. (1966). — algorithmic information / Kolmogorov complexity; uncomputability; Chaitin’s Ω. See Li & Vitányi, An Introduction to Kolmogorov Complexity (Springer).
Berrou, C., Glavieux, A. & Thitimajshima, P. (1993). “Near Shannon Limit Error-Correcting Coding: Turbo Codes.” IEEE ICC. · Gallager, R. G. (1962). “Low-Density Parity-Check Codes.” IRE Trans. Inf. Theory.
Arıkan, E. (2009). “Channel Polarization: A Method for Constructing Capacity-Achieving Codes…” IEEE Trans. Information Theory 55(7): 3051–3073 (July 2009). — polar codes; adopted by 3GPP for 5G NR control channels, 2016.
Wootters, W. K. & Zurek, W. H. (1982). “A single quantum cannot be cloned.” Nature 299: 802–803. doi:10.1038/299802a0 — the no-cloning theorem (also Dieks, 1982).
Holevo, A. S. (1973). “Bounds for the quantity of information transmitted by a quantum communication channel.” Problems of Information Transmission 9: 177–183. — accessible classical information from quantum ensembles; ordinary unassisted qubit communication is resource-limited.
Bennett, C. H. (1973). “Logical Reversibility of Computation.” IBM Journal of Research and Development 17(6): 525–532. — reversible computation; with Fredkin & Toffoli (1982) reversible gates.
Margolus, N. & Levitin, L. B. (1998). “The maximum speed of dynamical evolution.” Physica D 120: 188–195. — the quantum speed limit, $\sim 2 E / (π ℏ)$ .
Lloyd, S. (2000). “Ultimate physical limits to computation.” Nature 406: 1047–1054 (31 Aug 2000). doi:10.1038/35023282 — the “ultimate laptop”: ~ $10^{51}$ ops/s, ~ $10^{31}$ bits. · Lloyd, S. (2002). “Computational capacity of the universe.” Physical Review Letters 88: 237901. — ~ $10^{120}$ ops on ~ $10^{90}$ ordinary-matter bits.
Bekenstein, J. D. (1981). “Universal upper bound on the entropy-to-energy ratio…” Physical Review D 23: 287. — the energy-radius entropy bound; with Hawking’s black-hole entropy, one root of later holographic arguments. [Holographic principle: leading theoretical conjecture, not confirmed.]
Brillouin, L. (1956). Science and Information Theory. Academic Press. — negentropy; the information cost of measurement (later refined by Landauer–Bennett).
Soni, J. & Goodman, R. (2017). A Mind at Play. Simon & Schuster. — Shannon’s life: Theseus, the Ultimate Machine, the roulette computer with Ed Thorp, the juggling unicyclist.

Deep dive appendix · The Bleeding Edge · Optional extension.

The main lesson and the first appendix gave you the foundations: Shannon’s bit, Landauer’s toll, the full architecture. All of it rests on work decades old and thoroughly settled. This appendix is different: recent frontier work from roughly 2019–2026, much of it in the last two years, and some of it still warm from the press. These are the results the people building the field consider most consequential: experiments, frameworks, and engineering gambles most likely to reshape what we think we know. Some will hold. Some won’t. The hype filter earns its keep today more than ever.

Diagram · the frontier landscape

Seven frontier threads, one set of questions

The recent frontier of information physics branches into seven active programmes. Color indicates evidence status: teal for established, amber for promising; a dashed border marks work that remains theory-only, with no experimental test yet.

01 · the headline result

Measuring irreversibility inside a living cell

If we had to name one recent result that could be remembered in twenty years, it is this: in 2024, a team led by Felix Ritort at the University of Barcelona published a way to infer the rate at which a living cell produces entropy — the quantitative arrow of time inside a single red blood cell — from microscopy plus force calibration and model assumptions.

The insight is a new mathematical identity called the Variance Sum Rule. It connects fluctuation statistics of a probe with the restoring forces acting on it, allowing entropy production rate $σ$ to be estimated under the model’s conditions. No calorimeter, no complex perturbation protocol — but not just a bare microscope either.

σ \propto var (displacement) + var (force)

Schematic only — not a generally usable identity. Di Terlizzi et al., Science, 2024.

They pointed this tool at human red blood cells and mapped the entropy production rate across the membrane. What they found was striking: $σ$ is spatially heterogeneous, varying from spot to spot across the cell, with a finite correlation length of about 0.6 micrometres. The average agreed with independent calorimetry measurements. The authors describe this as the first heat map of entropy production in a living system, a major step from qualitative “far from equilibrium” language toward spatially resolved thermodynamic accounting. [Status: established under model assumptions.]

Why it matters for this course

On Days 83–85 we’ll ask: how does life, which looks so orderly, square with the second law? The standard answer — organisms are open systems exporting entropy — is qualitatively right but quantitatively vague. The Variance Sum Rule hands that vague answer a ruler. If you can map entropy production inside cells with the resolution of a light microscope, you can start asking: which metabolic pathways generate the most irreversibility? Where is the thermodynamic cost concentrated? The technique turns stochastic thermodynamics from a framework into a microscopy modality.

Two companion results deserve note. Skinner & Dunkel (PNAS, 2021) developed a rigorous optimization framework for bounding entropy production from partial observations and applied it to bacterial flagellar motors and growing microtubules. And a 2024 PNAS paper used deep-learning probability flows (score-based generative models) to estimate entropy production in high-dimensional active-matter systems, offering a computational path where the analytic rule runs out of dimensions. [Status: both established.]

02 · the new rules

Universal trade-offs: uncertainty, speed, and cost

The deepest conceptual advance in stochastic thermodynamics since 2020 is a family of results that say: precision often costs entropy, under stated dynamical assumptions. If a molecular motor turns with low noise, if a clock ticks with reliable rhythm, if a sensor measures a concentration accurately — something may have to pay, and the currency is dissipation. These thermodynamic uncertainty relations (TURs) are powerful constraints for broad classes of nonequilibrium systems, but they are not assumption-free laws of nature.

2022–23

The standard TUR is broad, not universal

The standard steady-state TUR applies to Markov jump processes and overdamped Langevin systems. Pietzonka showed a classical pendulum clock can violate that simple entropy-production-only form (PRL 128, 130606, 2022), while Dieball & Godec clarified its Langevin route and saturation conditions (PRL 130, 087101, 2023). Underdamped and non-Markovian systems may require modified bounds involving additional dynamical quantities.

Pietzonka, PRL 2022 · Dieball & Godec, PRL 2023

2023

Speed limits, dissipation, and optimal transport unified

Van Vu & Saito showed that TURs and minimum-dissipation protocols fit a single structure built on Wasserstein-distance optimal transport; their account also connects to thermodynamic speed limits (PRX 13, 011013, 2023). Separately, Lee, Lee, Kwon & Park derived a tight finite-time Landauer bound: the minimum dissipation to erase a bit in a fixed time, trading speed for heat (PRL 129, 120603, 2022).

Van Vu & Saito, PRX 2023 · Lee et al., PRL 2022

2025

The thermodynamic cost of communication

Yadav & Wolpert proved that transmitting information between computational subsystems has an unavoidable dissipative cost — an overlooked component of a computer’s heat bill that grows with communication bandwidth. In their framing: “talk isn’t cheap.” (Phys. Rev. Research, 2025.)

Yadav & Wolpert, Phys. Rev. Research 2025

Taken together, these results are building something like a thermodynamic code — a body of constraints on specified nonequilibrium processes, just as Shannon’s channel capacity constrains a specified communication model. The analogy is deliberate: the people doing this work view Shannon-era information theory and 21st-century stochastic thermodynamics as two chapters of the same subject.

03 · the cost of a tick

Clocks run on entropy

Here is a fact that, once you hear it, feels obvious: in important clock models and experiments, higher accuracy comes with higher dissipation.

In a 2021 Physical Review X experiment, a team (Pearson, Guryanova, Erker, Laird, Briggs, Huber & Ares, spanning Oxford, Vienna, and Lancaster) built a deliberately simple clock: a nanometre-thick mechanical membrane vibrating in a cryogenic cavity, its oscillations read out by a radio-frequency circuit. By varying the drive power, they tuned the clock from sloppy to sharp and measured both accuracy and entropy production. They found a linear relationship: double the accuracy, double the dissipation. Their clock operated within an order of magnitude of the theoretical minimum. established

$\text{accuracy} \propto \text{entropy production}$

More precise timekeeping demands more dissipation. Pearson et al., Physical Review X, 2021.

Follow-up work has deepened this. Meier, Schwarzhans, Erker & Huber (PRL 131, 220201, 2023) showed the cost splits into accuracy (regularity) and resolution (tick rate), both requiring entropy in their framework but in different ways. And a 2025 PRL paper on the entropic costs of extracting classical ticks from a quantum clock argued that the dominant cost can be not the clockwork but the measurement — the act of amplifying a quantum oscillation into a classical signal. The demon’s cousin again: the cost is not in the knowing, but in the making-classically-known.

04 · the highest-variance bet

Thermodynamic computing: noise as a feature

For seventy years, engineering’s response to thermal noise has been: suppress it. Every transistor is designed to overpower the random jitter of its atoms with a clean voltage swing many times larger. That swing is where most of the energy goes and most of the heat comes from. The thermodynamic-computing programme asks: what if you stopped fighting the noise and started computing with it?

The idea is surprisingly direct. A system of coupled physical oscillators — RLC circuits, resistor networks — will, if left alone, thermally fluctuate. Those fluctuations sample from the Boltzmann distribution of the system’s energy landscape. Design the landscape carefully, and those samples solve problems: linear systems, Bayesian inference, energy-based generative models. The computation is not imposed against the physics; it is the physics.
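A minimal digital sketch of that idea, under stated assumptions: a noisy particle sits in the quadratic energy landscape $U(x) = \tfrac{1}{2} x^{T} A x - b^{T} x$ at $k_{B} T = 1$, and its long-run average position estimates the solution of $A x = b$. This simulates the principle in software with an Euler-Maruyama step. It is not the hardware described below, nor any published algorithm as such, and the function name `thermal_solve`, the step size, and the example matrix are all illustrative choices.

```python
import math
import random

def thermal_solve(A, b, steps=300_000, burn_in=30_000, dt=0.01, seed=0):
    """Estimate x = A^{-1} b as the time-averaged position of a noisy particle.

    The particle follows overdamped Langevin dynamics in the potential
    U(x) = 0.5 * x^T A x - b^T x at k_B T = 1 with unit mobility.
    A must be symmetric positive definite so the landscape has a minimum.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    total = [0.0] * n
    noise_scale = math.sqrt(2.0 * dt)  # size of each thermal kick
    for step in range(steps):
        # Force is minus the gradient of U: b - A x.
        force = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + force[i] * dt + noise_scale * rng.gauss(0.0, 1.0)
             for i in range(n)]
        if step >= burn_in:  # discard the initial relaxation
            for i in range(n):
                total[i] += x[i]
    samples = steps - burn_in
    return [value / samples for value in total]

# Hypothetical 2x2 system; the exact solution is [0.2, 0.6].
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
estimate = thermal_solve(A, b)
print(estimate)  # fluctuates around the exact solution
```

The estimate is statistical: its error shrinks only as the averaging time grows. That is the software analogue of waiting for a physical system to thermalize and then sampling it.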

Hardware 01 · peer-reviewed proof of concept

Normal Computing’s Stochastic Processing Unit

Melanson, Abu Khater, Aifer, Donatella, Gordon and colleagues published “Thermodynamic Computing System for AI Applications” in Nature Communications (16, 3757, April 2025). It describes a prototype stochastic processing unit (SPU): eight fully coupled analog RLC oscillator nodes on a printed circuit board, performing Gaussian sampling and matrix inversion by letting the coupled system thermalize. A companion theory paper (npj Unconventional Computing, 2024) maps linear-algebra primitives onto thermodynamic equilibrium sampling, claiming asymptotic speedups scaling linearly in matrix dimension.

The caveat: this is an eight-node prototype. An end-to-end wall-clock or energy advantage over a modern GPU on a real workload has not been demonstrated. The theory is elegant and peer-reviewed; the hardware exists; the performance claims await validation at scale. promising hint

Hardware 02 · company-reported · unvalidated

Extropic’s thermodynamic sampling units

Extropic, founded by Guillaume Verdon, builds chips that natively sample from energy-based models at physical speeds. Its October 2025 public material presented XTR-0 hardware and company-reported simulations or small benchmarks suggesting roughly 10,000× energy savings against GPU/VAE baselines; it did not independently demonstrate GPU-parity hardware on production workloads. A larger follow-on chip has been reported as planned. These figures are company-reported, based on internal benchmarks, and not independently peer-reviewed. If confirmed, they would be transformative. Until then, the project lives in the same epistemic category as any startup claim — interesting, unvalidated, carrying the full standard deviation of venture capital. Apply Day 2’s lesson: extraordinary claims await extraordinary replication.

What would change the labels

Upgrading thermodynamic computing from promising to established requires one clear thing: an independent, peer-reviewed benchmark demonstrating an end-to-end advantage — wall-clock or energy — over a modern GPU on a real ML workload, not just an isolated primitive. A preprint (arXiv:2503.09980, 2025) argues that analog quasi-static inference could in principle be performed reversibly (no Landauer floor), while training retains a fundamental lower bound. The physics is sound; the engineering is early.

05 · demons, upgraded

Information engines go many-body, quantum, and autonomous

The single-particle engines of 2010–2021 were stunning proofs of concept. The recent push is in three directions: scale them up, make them quantum, and embody more of the sensing and feedback inside the device.

Many-body engines

Chor, Sohachi, Rosen, Rahav & Roichman (Phys. Rev. Research 5, 043193, December 2023) extended the Szilárd engine from a single particle to a many-body colloidal suspension, extracting work from collective number fluctuations. A follow-up (arXiv:2512.01942, December 2025, preprint) describes a “piston-like information engine” that harvests work from an equilibrium bath purely by conditional measurement: Maxwell’s demon writ in a crowd. Phys. Rev. Research paper: established · piston preprint: pending

A quantum engine charging a quantum battery

Zhang et al. (Kim group, Tsinghua; Phys. Rev. Lett. 135, 140403, 2025) built a cyclic quantum information engine from a single ytterbium-171 ion, using rapid mid-circuit measurement to suppress measurement disturbance. The energy extracted per cycle was used to charge a quantum battery. They report an information-to-ergotropy conversion efficiency of 67% of the theoretical maximum and information-to-work efficiency of 70%. established

Autonomous demons

Autonomous Maxwell demons have been demonstrated experimentally since at least the 2015 single-electron work of Koski and colleagues. In such systems, measurement and feedback are implemented internally rather than by an externally timed controller. That does not mean the device uses no nonequilibrium resources or produces no entropy: the energetic and entropic resources still have to be included in the full accounting. Newer quantum-dot rectifiers (PRR 2019), stochastic-resetting demons (PRR 2023), and transistor-based demons (PRB 2025) extend that design pattern for nanoscale energy harvesting. promising — individual realizations established

A real molecular motor as a demon

Amano, Esposito, Kreidt, Leigh, Penocchio & Roberts (Nature Chemistry 14, 530, 2022) performed a rigorous information-thermodynamic analysis of a real chemically-driven synthetic rotary molecular motor from the synthetic molecular-machines tradition recognized by the 2016 Nobel Prize in Chemistry, showing its operation decomposes into an information-processing cycle. The demon, it turns out, was always just a particularly well-designed molecule. established

06 · erasure goes quantum

Cooperative erasure and beyond-Landauer protocols

Buffoni & Campisi (Quantum 7, 961, 2023) erased 256 qubits simultaneously on a D-Wave Advantage quantum annealer by exploiting spontaneous symmetry breaking amplified by quantum tunnelling. The collective flip achieved a per-bit erasure cost approaching $k_{B} T \ln 2$ with ~99.9% success rate, while the per-bit action (energy × time) reached $10^{-22}$ erg·s, which is extraordinarily fast and efficient. The key: cooperative effects let the system cross the energy barrier collectively rather than bit by bit. established
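To get a feel for the two quantities in that paragraph, here is a small sketch that evaluates the Landauer floor $k_{B} T \ln 2$ at a chosen temperature and converts an action from erg·s into multiples of $\hbar$. The 300 K input is an illustrative assumption only (it is not the annealer's operating temperature, which this section does not state), and the function names are hypothetical.

```python
import math

K_B = 1.380649e-23       # Boltzmann constant in J/K (exact SI value)
HBAR = 1.054571817e-34   # reduced Planck constant in J*s
ERG_TO_JOULE = 1e-7      # 1 erg = 1e-7 J

def landauer_floor_joules(temperature_kelvin: float) -> float:
    """Minimum heat for quasi-static erasure of one bit: k_B * T * ln 2."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    return K_B * temperature_kelvin * math.log(2)

def action_in_hbar(action_erg_seconds: float) -> float:
    """Convert an action given in erg*s to multiples of hbar."""
    return action_erg_seconds * ERG_TO_JOULE / HBAR

print(landauer_floor_joules(300.0))  # room temperature: about 2.87e-21 J
print(action_in_hbar(1e-22))         # the quoted per-bit action: about 9.5e4 hbar
```

Because the floor scales linearly with temperature, the same erasure costs proportionally less heat in a colder device.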

A separate theoretical programme explores ancilla-assisted erasure (arXiv:2402.15812, 2024), in which an auxiliary quantum system shuttles entropy away from the memory, releasing less than $k_{B} T \ln 2$ of heat to the local bath in specific operating regimes. The authors stress the second law is not violated (the total entropy, including the ancilla, still increases), but the local heat can be reduced below the textbook floor. This is a pointer toward ultra-low-dissipation quantum computing, not a free lunch. promising · regime-specific, preprint

07 · information at the edge of spacetime

The island formula and the Page curve

The longest-running open problem in quantum gravity is the black-hole information paradox: does the information a black hole swallows survive its evaporation? Hawking’s 1975 calculation said it is destroyed; quantum mechanics says nothing is permanently lost. For some forty-five years, nobody had a satisfying resolution.

In 2019–2020, a series of papers changed the conversation. Penington (JHEP, 2020), Almheiri, Engelhardt, Marolf & Maxfield (JHEP, 2019), and Almheiri, Hartman, Maldacena, Shaghoulian & Tajdini (JHEP, 2020) found that if you compute the entropy of Hawking radiation using a gravitational path-integral method — replica wormholes — the answer includes contributions from disconnected regions of spacetime called “islands” inside the black hole’s interior. When these kick in, the radiation’s entropy turns over and begins to decrease, tracing the Page curve — exactly the shape a unitary (information-preserving) theory predicts in those controlled models.
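The shape of the Page curve can be reproduced in a toy model, sketched here under loud assumptions: the black hole plus its radiation is treated as a random pure state of `N` qubits, with `k` of them already radiated. Page's formula then gives the average entanglement entropy of the radiation. This is a stand-in for unitary evaporation, not the island or replica-wormhole calculation itself, and the "Hawking-like" line is only a caricature of an entropy that grows by one bit per emitted qubit. The names `page_entropy_nats`, `hawking_like`, and `page_curve` are hypothetical.

```python
import math

def page_entropy_nats(k: int, n_qubits: int) -> float:
    """Average entanglement entropy (nats) of k qubits in a random pure state.

    Uses Page's formula S = sum_{j=n+1}^{m*n} 1/j - (m - 1) / (2 n),
    where m <= n are the dimensions of the smaller and larger subsystems.
    """
    if not 0 <= k <= n_qubits:
        raise ValueError("k must lie between 0 and n_qubits")
    m = 2 ** min(k, n_qubits - k)
    n = 2 ** max(k, n_qubits - k)
    harmonic_part = sum(1.0 / j for j in range(n + 1, m * n + 1))
    return harmonic_part - (m - 1) / (2 * n)

N = 10  # total qubits in the toy model (illustrative choice)
hawking_like = [k * math.log(2) for k in range(N + 1)]  # never turns over
page_curve = [page_entropy_nats(k, N) for k in range(N + 1)]

for k in range(N + 1):
    print(k, round(hawking_like[k], 3), round(page_curve[k], 3))
```

The two columns agree early on, then part ways near the halfway point: the Hawking-like entropy keeps climbing while the Page curve turns over and returns to zero once every qubit has been radiated.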

Island calculations are a leading theoretical resolution in controlled semiclassical-gravity models. They strongly support unitary evaporation there; they are not yet a complete microscopic account of realistic black holes.

The authoritative review is Almheiri, Hartman, Maldacena, Shaghoulian & Tajdini, “The entropy of Hawking radiation,” Reviews of Modern Physics 93, 035002 (2021). established as a theoretical result in model settings

The caveat: there is no experimental or observational test. The calculation lives inside a specific theoretical framework, and it is not clear it applies to the real, four-dimensional, non-AdS universe. The result is as mathematically secure as anything in the field — but the field itself floats above experiment. no experimental test

08 · the big picture

A unified physics of computation?

The most ambitious attempt to pull this together is a 2024 PNAS perspective by Wolpert, Korbel, Lynn, Tasnim, Grochow, Kardeş and seventeen co-authors: “Is stochastic thermodynamics the key to understanding the energy costs of computation?” Their argument: the thermodynamics of computation is not just about Landauer’s limit — that’s merely the floor. Above it lies a rich structure of mismatch costs, communication costs, and nonequilibrium constraints that form a comprehensive physics of computation as rigorous as Shannon’s theory of communication.

One comparison from their review stops you in your tracks: biological systems — the molecular machinery of a cell — can appear roughly 10⁵ times more energy-efficient per operation than some human-built computers, depending on what is counted as an operation and a boundary. Treat that as an illustrative, definition-dependent gap, not a universal constant. The main lesson’s Landauer-ladder diagram showed chips sitting far above the thermodynamic floor; the deeper point is that biology often computes closer to physical noise and dissipation limits than conventional digital engineering, and understanding how is a foundational open problem. established framework, illustrative ratio

The bridge to Days 83–85

This is the direct setup for the physics-of-life deep dive. When we return to Schrödinger’s “what is life?” question, we’ll have two new tools that didn’t exist five years ago: the Variance Sum Rule for measuring entropy production inside cells, and Wolpert’s framework for bounding the energy cost of any computation. Together, they begin to close the gap between “life feeds on negentropy” (a slogan) and “here is the thermodynamic ledger of a specific biochemical pathway” (a number).

The frontier in three sentences

What's changed: Since roughly 2019, information thermodynamics and quantum-gravity-adjacent information theory have moved from confirming old principles to building new ones: scoped trade-off laws (TURs, speed limits, cost of timekeeping and communication), new entropy-production measurements in living cells, quantum-cooperative erasure near the Landauer floor, and island-formula progress on the black-hole information paradox in controlled models.
What's betting big: Thermodynamic computing — turning thermal noise from enemy into computational substrate — is the field’s highest-variance wager: elegant physics, early hardware, headline efficiency claims that remain unvalidated by independent benchmarks.
Evidence status: This frontier mixes peer-reviewed experiments, peer-reviewed theory, perspectives, preprints, and company-reported claims. The difference between a breakthrough and a footnote is independent replication, scale, and time — which is exactly what makes a frontier a frontier.

Threads forward › thermodynamic speed limits → Day 33 (thermodynamics) · entropy production in cells → Days 83–85 (physics of life) · thermodynamic computing + Wolpert’s 10⁵ gap → Days 178–179 (energy & AI economy) · island formula → Day 40 (black holes) · TURs → Day 13 (measurement) · quantum erasure → Day 47 (quantum computing).

Sources · frontier appendix

Sources — recent frontier work

Di Terlizzi, I. et al. (2024). “Variance Sum Rule for Entropy Production.” Science 383: 971. doi:10.1126/science.adh1823.
Skinner, D. J. & Dunkel, J. (2021). “Improved bounds on entropy production in living systems.” PNAS 118: e2024300118.
Pearson, A. N. et al. (2021). “Measuring the Thermodynamic Cost of Timekeeping.” Phys. Rev. X 11: 021029.
Meier, F. et al. (2023). “Fundamental accuracy–resolution trade-off for timekeeping devices.” PRL 131: 220201.
Wadhia, V. et al. (2025). “Entropic Costs of Extracting Classical Ticks from a Quantum Clock.” PRL 135: 200407. doi:10.1103/5rtj-djfk.
Van Vu, T. & Saito, K. (2023). “Thermodynamic Unification of Optimal Transport…” Phys. Rev. X 13: 011013.
Lee, J. S. et al. (2022). “Speed Limit for a Highly Irreversible Process and Tight Finite-Time Landauer’s Bound.” PRL 129: 120603.
Pietzonka, P. (2022). “Classical Pendulum Clocks Break the Thermodynamic Uncertainty Relation.” PRL 128: 130606.
Dieball, C. & Godec, A. (2023). “Direct Route to TURs and Their Saturation.” PRL 130: 087101.
Yadav, A. C. & Wolpert, D. H. (2025). “Minimal thermodynamic cost of communication.” Phys. Rev. Research 7: 043324. doi:10.1103/qvc2-32xr.
Melanson, D. et al. (2025). “Thermodynamic Computing System for AI Applications.” Nature Communications 16: 3757. doi:10.1038/s41467-025-59011-x.
Aifer, M. et al. (2024). “Thermodynamic Linear Algebra.” npj Unconventional Computing 1: 13.
Extropic (2025). XTR-0 announcement and thermodynamic-computing posts (extropic.ai, Oct 2025). Company-reported simulations/benchmarks; NOT independently validated.
Chor, R. et al. (2023). “Many-body Szilard engine…” Phys. Rev. Research 5: 043193.
Goerlich, R. et al. (2025). “Piston-Like Information Engine I: Universal Features in Equilibrium.” arXiv:2512.01942. Preprint.
Zhang, Z. et al. (2025). “Single-Ion Information Engine for Charging Quantum Battery.” PRL 135: 140403. doi:10.1103/g45c-ssfx.
Koski, J. V. et al. (2015). “On-Chip Maxwell’s Demon as an Information-Powered Refrigerator.” PRL 115: 260602. Autonomous single-electron demon.
Amano, S. et al. (2022). “…information thermodynamics analysis of a synthetic molecular motor.” Nature Chemistry 14: 530. doi:10.1038/s41557-022-00899-z.
Buffoni, L. & Campisi, M. (2023). “Cooperative quantum information erasure.” Quantum 7: 961. doi:10.22331/q-2023-03-23-961.
Almheiri, A. et al. (2021). “The entropy of Hawking radiation.” Rev. Mod. Phys. 93: 035002.
Penington, G. (2020). “Entanglement Wedge Reconstruction…” JHEP 09: 002.
Almheiri, A. et al. (2020). “Replica Wormholes…” JHEP 05: 013.
Wolpert, D. H. et al. (2024). “Is stochastic thermodynamics the key…?” PNAS 121: e2321112121.
Manzano, G. et al. (2024). “Thermodynamics of Computations…” Phys. Rev. X 14: 021026.

End of Day 007 · 173 descents remain