What Just Happened: A Chronicle From the Information Frontier
When Chaitin came upon Turing’s proof of uncomputability, he thought this must be the key. He also found Shannon and Weaver’s book, The Mathematical Theory of Communication, and was struck by its upside-down seeming reformulation of entropy: an entropy of bits, measuring information on the one hand and disorder on the other. The common element was randomness, Chaitin suddenly thought. Shannon linked randomness, perversely, to information. Physicists had found randomness inside the atom—the kind of randomness that Einstein deplored by complaining about God and dice. All these heroes of science were talking about or around randomness.
It is a simple word, random, and everyone knows what it means. Everyone, that is, and no one. Philosophers and mathematicians struggled endlessly. Wheeler said this much, at least: “Probability, like time, is a concept invented by humans, and humans have to bear the responsibility for the obscurities that attend it.”♦ The toss of a fair coin is random, though every detail of the coin’s trajectory may be determined à la Newton. Whether the population of France is an even or odd number at any given instant is random, but the population of France itself is surely not random: it is a definite fact, even if not knowable.♦ John Maynard Keynes tackled randomness in terms of its opposites, and he chose three: knowledge, causality, and design.♦ What is known in advance, determined by a cause, or organized according to plan cannot be random.
“Chance is only the measure of our ignorance,”♦ Henri Poincaré famously said. “Fortuitous phenomena are by definition those whose laws we do not know.” Immediately he recanted: “Is this definition very satisfactory? When the first Chaldean shepherds watched the movements of the stars, they did not yet know the laws of astronomy, but would they have dreamed of saying that the stars move at random?” For Poincaré, who understood chaos long before it became a science, examples of randomness included such phenomena as the scattering of raindrops, their causes physically determined but so numerous and complex as to be unpredictable. In physics—or wherever natural processes seem unpredictable—apparent randomness may be noise or may arise from deeply complex dynamics.
Ignorance is subjective. It is a quality of the observer. Presumably randomness—if it exists at all—should be a quality of the thing itself. Leaving humans out of the picture, one would like to say that an event, a choice, a distribution, a game, or, most simply, a number is random.
The notion of a random number is full of difficulties. Can there be such thing as a particular random number; a certain random number? This number is arguably random:
10097325337652013586346735487680959091173929274945…♦
Then again, it is special. It begins a book published in 1955 with the title A Million Random Digits. The RAND Corporation generated the digits by means of what it described as an electronic roulette wheel: a pulse generator, emitting 100,000 pulses per second, gated through a five-place binary counter, then passed through a binary-to-decimal converter, fed into an IBM punch, and printed by an IBM model 856 Cardatype.♦ The process took years. When the first batch of digits was tested, statisticians discovered significant biases: digits, or groups of digits, or patterns of digits that appeared too frequently or not frequently enough. Finally, however, the tables were published. “Because of the very nature of the tables,” the editors said wryly, “it did not seem necessary to proofread every page of the final manuscript in order to catch random errors of the Cardatype.”
The book had a market because scientists had a working need for random numbers in bulk, to use in designing statistically fair experiments and building realistic models of complex systems. The new method of Monte Carlo simulation employed random sampling to model phenomena that could not be solved analytically; Monte Carlo simulation was invented and named by von Neumann’s team at the atomic-bomb project, desperately trying to generate random numbers to help them calculate neutron diffusion. Von Neumann realized that a mechanical computer, with its deterministic algorithms and finite storage capacity, could never generate truly random numbers. He would have to settle for pseudorandom numbers: deterministically generated numbers that behaved as if random. They were random enough for practical purposes. “Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin,”♦ said von Neumann.
Randomness might be defined in terms of order—its absence, that is. This orderly little number sequence can hardly be called “random”:
00000
Yet it makes a cameo appearance in the middle of the famous million random digits. In terms of probability, that is to be expected: “00000” is as likely to occur as any of the other 99,999 possible five-digit strings. Elsewhere in the million random digits we find:
010101
This, too, appears patterned.
To pick out fragments of pattern in this jungle of digits requires work by an intelligent observer. Given a long enough random string, every possible short-enough substring will appear somewhere. One of them will be the combination to the bank vault. Another will be the encoded complete works of Shakespeare. But they will not do any good, because no one can find them.
Perhaps we may say that numbers like 00000 and 010101 can be random in a particular context. If a person flips a fair coin (one of the simplest mechanical random-number generators) long enough, at some point the coin is bound to come up heads ten times in a row. When that happens, the random-number seeker will typically discard the result and go for a coffee break. This is one of the ways humans do poorly at generating random numbers, even with mechanical assistance. Researchers have established that human intuition is useless both in predicting randomness and in recognizing it. Humans drift toward pattern willy-nilly. The New York Public Library bought A Million Random Digits and shelved it under Psychology. In 2010 it was still available from Amazon for eighty-one dollars.
A number is (we now understand) information. When we modern people, Shannon’s heirs, think about information in its purest form, we may imagine a string of 0s and 1s, a binary number. Here are two binary strings, fifty digits long:
A: 01010101010101010101010101010101010101010101010101
B: 10001010111110101110100110101000011000100111101111
If Alice (A) and Bob (B) both say they generated their strings by flipping a coin, no one will ever believe Alice. The strings are surely not equally random. Classical probability theory offers no solid reason for claiming that B is more random than A, because a random process could produce either string. Probability is about ensembles, not individual events. Probability theory treats events statistically. It does not like questions in the form “How likely was that to happen?” If it happened, it happened.
To Claude Shannon, these strings would look like messages. He would ask, How much information does each string contain? On their face, they both contain fifty bits. A telegraph operator charging by the digit would measure the length of the messages and give Alice and Bob the same bill. Then again, the two messages seem to differ profoundly. Message A immediately becomes boring: once you see the pattern, further repetitions provide no new information. In message B, every bit is as valuable as every other. Shannon’s first formulation of information theory treated messages statistically, as choices from the ensemble of all possible messages—in the case of A and B, 250 of them. But Shannon also considered redundancy within a message: the pattern, the regularity, the order that makes a message compressible. The more regularity in a message, the more predictable it is. The more predictable, the more redundant. The more redundant a message is, the less information it contains.
The telegraph operator sending message A has a shortcut: he can transmit something like “Repeat ‘01’ twenty-five times.” For longer messages with easy patterns, the savings in keystrokes becomes enormous. Once the pattern is clear, the extra characters are free. The operator for message B must soldier on the hard way, sending every character, because every character is a complete surprise; every character costs one bit. This pair of questions—how random and how much information—turn
out to be one and the same. They have a single answer.
Chaitin was not thinking about telegraphs. The device he could not get out of his head was the Turing machine—that impossibly elegant abstraction, marching back and forth along its infinite paper tape, reading and writing symbols. Free from all the real world’s messiness, free from creaking wheel-work and finical electricity, free from any need for speed, the Turing machine was the ideal computer. Von Neumann, too, had kept coming back to Turing machines. They were the ever-handy lab mice of computer theory. Turing’s U had a transcendent power: a universal Turing machine can simulate any other digital computer, so computer scientists can disregard the messy details of any particular make or model. This is liberating.
Claude Shannon, having moved from Bell Labs to MIT, reanalyzed the Turing machine in 1956. He stripped it down to the smallest possible skeleton, proving that the universal computer could be constructed with just two internal states, or with just two symbols, 0 and 1, or blank and nonblank. He wrote his proof in words more pragmatic than mathematical: he described exactly how the two-state Turing machine would step left and right, “bouncing” back and forth to keep track of the larger numbers of states in a more complex computer. It was all very intricate and specific, redolent of Babbage. For example:
When the reading head moves, the state information must be transferred to the next cell of the tape to be visited using only two internal states in machine B. If the next state in machine A is to be (say) state 17 (according to some arbitrary numbering system) this is transferred in machine B by “bouncing” the reading head back and forth between the old cell and the new one 17 times (actually 18 trips to the new cell and 17 back to the old one).♦
The “bouncing operation” carries the information from cell to cell, and the cells act as “transmitters” and “controllers.”
Turing had titled his great paper “On Computable Numbers,” but of course the real focus was on uncomputable numbers. Could uncomputable numbers and random numbers be related? In 1965 Chaitin was an undergraduate at the City College of New York, writing up a discovery he hoped to submit to a journal; it would be his first publication. He began, “In this paper the Turing machine is regarded as a general purpose computer and some practical questions are asked about programming it.” Chaitin, as a high-school student in the Columbia Science Honors Program, had the opportunity to practice programming in machine language on giant IBM mainframes, using decks of punched cards—one card for each line of a program. He would leave his card deck in the computer center and come back the next day for the program’s output. He could run Turing machines in his head, too: write 0, write 1, write blank, shift tape left, shift tape right.… The universal computer gave him a nice way to distinguish between numbers like Alice and Bob’s A and B. He could write a program to make a Turing machine print out “010101 …” a million times, and he could write down the length of that program—quite short. But given a million random digits—no pattern, no regularity, nothing special at all—there could be no shortcut. The computer program would have to incorporate the entire number. To make the IBM mainframe print out those million digits, he would have to put the whole million digits into the punched cards. To make the Turing machine do it, he would still need the million digits for input.
Here is another number (in decimal this time):
C: 3.1415926535897932384626433832795028841971693993751…
This looks random. Statistically each digit appears with the expected frequency (one in ten); likewise each pair of digits (one in a hundred), each triplet, and so on. A statistician would say it appears to be “normal,” as far as anyone can tell. The next digit is always a surprise. The works of Shakespeare will be in there, eventually. But someone might recognize this as a familiar number, Π. So it is not random after all.
But why do we say Π is not random? Chaitin proposed a clear answer: a number is not random if it is computable—if a definable computer program will generate it. Thus computability is a measure of randomness.
For Turing computability was a yes-or-no quality—a given number either is or is not. But we would like to say that some numbers are more random than others—they are less patterned, less orderly. Chaitin said the patterns and the order express computability. Algorithms generate patterns. So we can gauge computability by looking at the size of the algorithm. Given a number—represented as a string of any length—we ask, what is the length of the shortest program that will generate it? Using the language of a Turing machine, that question can have a definite answer, measured in bits.
Chaitin’s algorithmic definition of randomness also provides an algorithmic definition of information: the size of the algorithm measures how much information a given string contains.
Looking for patterns—seeking the order amid chaos—is what scientists do, too. The eighteen-year-old Chaitin felt this was no accident. He ended this first paper by applying algorithmic information theory to the process of science itself. “Consider a scientist,” he proposed, “who has been observing a closed system that once every second either emits a ray of light or does not.”
He summarizes his observations in a sequence of 0s and 1s in which a 0 represents “ray not emitted” and a 1 represents “ray emitted.” The sequence may start
0110101110 …
and continue for a few thousand more bits. The scientist then examines the sequence in the hope of observing some kind of pattern or law. What does he mean by this? It seems plausible that a sequence of 0s and 1s is patternless if there is no better way to calculate it than just by writing it all out at once from a table giving the whole sequence.♦
But if the scientist could discover a way to produce the same sequence with an algorithm, a computer program significantly shorter than the sequence, then he would surely know the events were not random. He would say that he had hit upon a theory. This is what science always seeks: a simple theory that accounts for a large set of facts and allows for prediction of events still to come. It is the famous Occam’s razor. “We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances,” said Newton, “for nature is pleased with simplicity.”♦ Newton quantified mass and force, but simplicity had to wait.
Chaitin sent his paper to the Journal of the Association for Computing Machinery. They were happy to publish it, but one referee mentioned that he had heard rumors of similar work coming from the Soviet Union. Sure enough, the first issue of a new journal arrived (after a journey of months) in early 1966: , Problems of Information Transmission. It contained a paper titled “Three Approaches to the Definition of the Concept ‘Amount of Information,’ ” by A. N. Kolmogorov. Chaitin, who did not read Russian, had just time to add a footnote.
Andrei Nikolaevich Kolmogorov was the outstanding mathematician of the Soviet era. He was born in Tambov, three hundred miles southeast of Moscow, in 1903; his unwed mother, one of three sisters Kolmogorova, died in childbirth, and his aunt Vera raised him in a village near the river Volga. In the waning years of tsarist Russia, this independent-minded woman ran a village school and operated a clandestine printing press in her home, sometimes hiding forbidden documents under baby Andrei’s cradle.♦
Moscow University accepted Andrei Nikolaevich as a student of mathematics soon after the revolution of 1917. Within ten years he was proving a collection of influential results that took form in what became the theory of probability. His Foundations of the Theory of Probability, published in Russian in 1933 and in English in 1950, remains the modern classic. But his interests ranged widely, to physics and linguistics as well as other fast-growing branches of mathematics. Once he made a foray into genetics but drew back after a dangerous run-in with Stalin’s favorite pseudoscientist, Trofim Lysenko. During World War II Kolmogorov applied his efforts to statistical theory in artillery fire and devised a scheme of stochastic distribution of barrage balloons to protect Moscow from Nazi bombers. Apart from his war work, he studied turbulence and r
andom processes. He was a Hero of Socialist Labor and seven times received the Order of Lenin.
He first saw Claude Shannon’s Mathematical Theory of Communication rendered into Russian in 1953, purged of its most interesting features by a translator working in Stalin’s heavy shadow. The title became Statistical Theory of Electrical Signal Transmission. The word information, , was everywhere replaced with , data. The word entropy was placed in quotation marks to warn the reader against inferring a connection with entropy in physics. The section applying information theory to the statistics of natural language was omitted entirely. The result was technical, neutral, juiceless, and thus unlikely to attract interpretation in the terms of Marxist ideology.♦ These were serious concerns; “cybernetics” was initially defined in the Short Philosophical Dictionary (standard reference of ideological orthodoxy) as a “reactionary pseudoscience” and “an ideological weapon of imperialist reaction.” Kolmogorov leapt upon Shannon’s paper nonetheless; he, at least, was unafraid to use the word information. Working with his students in Moscow, he put forth a rigorous mathematical formulation of information theory, with definitions of the fundamental concepts, careful proofs, and new discoveries—some of which, he soon learned to his sorrow, had appeared in Shannon’s original paper but had been omitted from the Russian version.♦