“At each given moment there is only a fine layer between the ‘trivial’ and the impossible,”♦ Kolmogorov mused in his diary. “Mathematical discoveries are made in this layer.” In the new, quantitative view of information he saw a way to attack a problem that had eluded probability theory, the problem of randomness. How much information is contained in a given “finite object”? An object could be a number (a series of digits) or a message or a set of data.
He described three approaches: the combinatorial, the probabilistic, and the algorithmic. The first and second were Shannon’s, with refinements. They focused on the probability of one object among an ensemble of objects—one particular message, say, chosen from a set of possible messages. How would this work, Kolmogorov wondered, when the object was not just a symbol in an alphabet or a lantern in a church window but something big and complicated—a genetic organism, or a work of art? How would one measure the amount of information in Tolstoy’s War and Peace? “Is it possible to include this novel in a reasonable way in the set of ‘all possible novels’ and further to postulate the existence of a certain probability distribution in this set?”♦ he asked. Or could one measure the amount of genetic information in, say, the cuckoo bird by considering a probability distribution in the set of all possible species?
His third approach to measuring information—the algorithmic—avoided the difficulties of starting with ensembles of possible objects. It focused on the object itself.♦ Kolmogorov introduced a new word for the thing he was trying to measure: complexity. As he defined this term, the complexity of a number, or message, or set of data is the inverse of simplicity and order and, once again, it corresponds to information. The simpler an object is, the less information it conveys. The more complexity, the more information. And, just as Gregory Chaitin did, Kolmogorov put this idea on a solid mathematical footing by calculating complexity in terms of algorithms. The complexity of an object is the size of the smallest computer program needed to generate it. An object that can be produced by a short algorithm has little complexity. On the other hand, an object needing an algorithm every bit as long as the object itself has maximal complexity.
A simple object can be generated—or computed, or described—with just a few bits. A complex object requires an algorithm of many bits. Put this way, it seemed obvious. But until now it had not been understood mathematically. Kolmogorov put it this way:
The intuitive difference between “simple” and “complicated” objects has apparently been perceived a long time ago. On the way to its formalization, an obvious difficulty arises: something that can be described simply in one language may not have a simple description in another and it is not clear what method of description should be chosen.♦
That difficulty is solved by using computer language. It does not matter which computer language, because they are all equivalent, reducible to the language of a universal Turing machine. The Kolmogorov complexity of an object is the size, in bits, of the shortest algorithm needed to generate it. This is also the amount of information. And it is also the degree of randomness—Kolmogorov declared “a new conception of the notion ‘random’ corresponding to the natural assumption that randomness is the absence of regularity.”♦ The three are fundamentally equivalent: information, randomness, and complexity—three powerful abstractions, bound all along like secret lovers.
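The definition can be made concrete, at least approximately: a general-purpose compressor gives a crude upper bound on the size of the shortest program that could regenerate an object. A minimal Python sketch, with zlib's compressed length standing in for "shortest algorithm" (the particular strings are illustrative assumptions, not anyone's canonical example):

```python
import os
import zlib

def complexity_proxy(data: bytes) -> int:
    """Length in bytes of a zlib-compressed description: a crude upper
    bound on the size of the shortest program that outputs `data`."""
    return len(zlib.compress(data, 9))

ordered = b"0" * 100_000          # highly regular: a one-line rule generates it
random_ = os.urandom(100_000)     # patternless: no rule much shorter than the data itself

print(complexity_proxy(ordered))  # tiny: the regularity compresses away
print(complexity_proxy(random_))  # roughly 100,000: almost incompressible
```

The regular string collapses to a few hundred bytes; the random one barely shrinks at all, which is exactly the sense in which it carries maximal information.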
For Kolmogorov, these ideas belonged not only to probability theory but also to physics. To measure the complexity of an orderly crystal or a helter-skelter box of gas, one could measure the shortest algorithm needed to describe the state of the crystal or gas. Once again entropy was the key. Kolmogorov had a useful background in difficult physical problems to which these new methods could be applied. In 1941 he had produced the first useful, though flawed, understanding of the local structure of turbulent flows—equations to predict the distribution of whorls and eddies. He had also worked on perturbations in planetary orbits, another problem surprisingly intractable for classical Newtonian physics. Now he began laying the groundwork for the renaissance in chaos theory to come in the 1970s: analyzing dynamical systems in terms of entropy and information dimension. It made sense now to say that a dynamical system produces information. If it is unpredictable, it produces a great deal of information.
Kolmogorov knew nothing of Gregory Chaitin, nor did either man know of an American probability theorist named Ray Solomonoff, who had developed some of the same ideas. The world was changing. Time, distance, and language still divided mathematicians in Russia from their Western counterparts, but the gulf narrowed every year. Kolmogorov often said that no one should do mathematics after the age of sixty. He dreamed of spending his last years as a buoy keeper on the Volga, making a watery circuit in a boat with oars and a small sail.♦ When the time came, buoy keepers had switched to motorboats, and for Kolmogorov, this ruined the dream.
Now the paradoxes returned.
Zero is an interesting number. Books have been written about it. One is certainly an interesting number—it is the first and the foremost (not counting zero), the singular and unique. Two is interesting in all kinds of ways: the smallest prime, the definitive even number, the number needed for a successful marriage, the atomic number of helium, the number of candles to light on Finnish Independence Day. Interesting is an everyday word, not mathematicians’ jargon. It seems safe to say that any small number is interesting. All the two-digit numbers and many of the three-digit numbers have their own Wikipedia entries.
Number theorists name entire classes of interesting numbers: prime numbers, perfect numbers, squares and cubes, Fibonacci numbers, factorials. The number 593 is more interesting than it looks; it happens to be the sum of nine squared and two to the ninth—thus a “Leyland number” (any number that can be expressed as x^y + y^x). Wikipedia also devotes an article to the number 9,814,072,356. It is the largest holodigital square—which is to say, the largest square number containing each decimal digit exactly once.
What would be an uninteresting number? Presumably a random number. The English number theorist G. H. Hardy randomly rode in taxi No. 1729 on his way to visit the ailing Srinivasa Ramanujan in 1917 and remarked to his colleague that, as numbers go, 1,729 was “rather a dull one.” On the contrary, replied Ramanujan (according to a standard anecdote of mathematicians), it is the smallest number expressible as the sum of two cubes in two different ways.♦ “Every positive integer is one of Ramanujan’s personal friends,” remarked J. E. Littlewood. Due to the anecdote, 1,729 is known nowadays as the Hardy-Ramanujan number. Nor is that all; 1,729 also happens to be a Carmichael number, an Euler pseudoprime, and a Zeisel number.
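Ramanujan's observation about 1,729 is easy to verify by brute force. A small sketch (the search bound of 19 is an assumption; it is comfortably large, since 13 cubed already exceeds 1,729):

```python
from collections import defaultdict

# Map each sum of two positive cubes (with a <= b) to its representations.
sums = defaultdict(list)
for a in range(1, 20):
    for b in range(a, 20):
        sums[a**3 + b**3].append((a, b))

# Smallest total with at least two distinct representations.
taxicab = min(n for n, reps in sums.items() if len(reps) >= 2)
print(taxicab, sums[taxicab])   # 1729 [(1, 12), (9, 10)]
```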
But even the mind of Ramanujan was finite, as is Wikipedia, as is the aggregate sum of human knowledge, so the list of interesting numbers must end somewhere. Surely there must be a number about which there is nothing special to say. Wherever it is, there stands a paradox: the number we may describe, interestingly, as “the smallest uninteresting number.”
This is none other than Berry’s paradox reborn, the one described by Bertrand Russell in Principia Mathematica. Berry and Russell had devilishly asked, What is the least integer not nameable in fewer than nineteen syllables? Whatever this number is, it can be named in eighteen syllables: the least integer not nameable in fewer than nineteen syllables. Explanations for why a number is interesting are ways of naming the number: “the square of eleven,” for example, or “the number of stars in the American flag.” Some of these names do not seem particularly helpful, and some are rather fuzzy. Some are pure mathematical facts: whether, for example, a number is expressible as the sum of two cubes in two different ways. But some are facts about the world, or about language, or about human beings, and they may be accidental and ephemeral—for example, whether a number corresponds to a subway stop or a date in history.
Chaitin and Kolmogorov revived Berry’s paradox in inventing algorithmic information theory. An algorithm names a number. “The paradox originally talks about English, but that’s much too vague,”♦ Chaitin says. “I pick a computer-programming language instead.” Naturally he picks the language of a universal Turing machine.
And then what does it mean, how do you name an integer? Well, you name an integer by giving a way to calculate it. A program names an integer if its output is that integer—you know, it outputs that integer, just one, and then it stops.
Asking whether a number is interesting is the inverse of asking whether it is random. If the number n can be computed by an algorithm that is relatively short, then n is interesting. If not, it is random. The algorithm PRINT 1 AND THEN PRINT 100 ZEROES generates an interesting number (a googol). Similarly, FIND THE FIRST PRIME NUMBER, ADD THE NEXT PRIME NUMBER, AND REPEAT A MILLION TIMES generates a number that is interesting: the sum of the first million primes. It would take a Turing machine a long time to compute that particular number, but a finite time nonetheless. The number is computable.
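Both of those word-algorithms translate directly into short programs. A rough Python sketch (the sieve limit of 16,000,000 is an assumption, chosen to lie comfortably above the millionth prime):

```python
# A googol: "print 1 and then print 100 zeroes."
googol = int("1" + "0" * 100)        # equivalently 10**100
print(googol)

# The sum of the first million primes, via a simple sieve of Eratosthenes.
LIMIT = 16_000_000
sieve = bytearray([1]) * LIMIT
sieve[0] = sieve[1] = 0
for i in range(2, int(LIMIT ** 0.5) + 1):
    if sieve[i]:
        sieve[i*i::i] = bytearray(len(range(i*i, LIMIT, i)))

primes = [i for i in range(LIMIT) if sieve[i]]
print(sum(primes[:1_000_000]))       # a huge but perfectly definite number
```

Each program is a few lines long, yet each names an enormous number; that gap between the size of the description and the size of the thing described is what makes the numbers "interesting."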
But if the most concise algorithm for n is “PRINT [n]”—an algorithm incorporating the entire number, with no shorthand—then we may say that there is nothing interesting about n. In Kolmogorov’s terms, this number is random—maximally complex. It will have to be patternless, because any pattern would provide a way to devise a shorthand algorithm. “If there is a small, concise computer program that calculates the number, that means it has some quality or characteristic that enables you to pick it out and to compress it into a smaller algorithmic description,” Chaitin says. “So that’s unusual; that’s an interesting number.”
But is it unusual? Looking generally at all the numbers, how can a mathematician know whether the interesting ones are rare or common? For that matter, looking at any one number, can a mathematician ever know for sure whether a smaller algorithm might be found? For Chaitin, these were the critical questions.
He answered the first with a counting argument. The vast majority of numbers have to be uninteresting because there cannot possibly be enough concise computer programs to go around. Count them. Given 1,000 bits (say), one has 2^1,000 numbers; but not nearly that many useful computer programs can be written in 1,000 bits. “There are a lot of positive integers,” Chaitin says. “If the programs have to be smaller, then there just aren’t enough of them to name all those different positive integers.” So most n’s of any given length are random.
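The arithmetic behind the counting argument can be made explicit, under the simplifying assumption that programs are themselves just bit strings:

```python
# Bit strings of exactly 1,000 bits:
strings = 2 ** 1000

# Programs (bit strings) shorter than 1,000 bits: 2^0 + 2^1 + ... + 2^999
programs = 2 ** 1000 - 1          # still fewer than the 1,000-bit strings

# Programs of fewer than 990 bits, i.e. descriptions that would
# compress a 1,000-bit string by more than 10 bits:
short_programs = 2 ** 990 - 1

# So at most this fraction of 1,000-bit strings can be compressed that much:
print(short_programs / strings)   # < 2**-10: fewer than one string in a thousand
```

Every extra bit of required compression halves the supply of candidate programs, so nearly all strings are stuck at, or near, maximal complexity.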
The next question was far more troubling. Knowing that most numbers are random, and given any particular number n, can mathematicians prove it to be random? They cannot tell by looking at it. They can often prove the opposite, that n is interesting: in that case they just have to find a short algorithm that generates n. (Technically, it must be shorter than log₂ n bits, the number needed to write n in binary.) Proving the negative is a different story. “Even though most positive integers are uninteresting,” Chaitin declared, “you can never be sure.… You can only prove it in a small number of cases.” One could imagine trying to do it by brute force, writing down every possible algorithm and testing them one by one. But a computer would have to perform the tests—an algorithm testing other algorithms—and soon, Chaitin demonstrated, a new version of Berry’s paradox appears. Instead of “the smallest uninteresting number,” one inevitably encounters a statement in the form of “the smallest number that we can prove cannot be named in fewer than n syllables.” (We are not really talking about syllables any more, of course, but Turing-machine states.)♦ It is another recursive, self-looping twist. This was Chaitin’s version of Gödel’s incompleteness. Complexity, defined in terms of program size, is generally uncomputable. Given an arbitrary string of a million digits, a mathematician knows that it is almost certainly random, complex, and patternless—but cannot be absolutely sure.
Chaitin did this work in Buenos Aires. When he was still a teenager, before he could graduate from City College, his parents moved back to their home in Argentina, and he got a job there with IBM World Trade. He continued to nurse his obsession with Gödel and incompleteness and to send papers to the American Mathematical Society and the Association for Computing Machinery. Eight years later, Chaitin returned to the United States to visit IBM’s research center in Yorktown Heights, New York, and placed a telephone call to his hero, then nearing seventy at the Institute for Advanced Study in Princeton. Gödel answered, and Chaitin introduced himself and said he had a new approach to incompleteness, based on Berry’s paradox instead of the liar paradox.
“It doesn’t make any difference which paradox you use,”♦ said Gödel.
“Yes, but …” Chaitin said he was on the trail of a new “information-theoretic” view of incompleteness and asked if he could call on Gödel in Princeton. He was staying in the YMCA in White Plains and would take the train, changing in New York City. Gödel agreed, but when the day came, he canceled. It was snowing, and he was fearful for his health. Chaitin never did meet him. Gödel, increasingly unstable, afraid of poisoning, died in the winter of 1978 of self-starvation.
Chaitin spent the rest of his career at the IBM Watson Research Center, one of the last great scientists to be so well supported in work of no plausible use to his corporate patron. He sometimes said he was “hiding” in a physics department; he felt that more conventional mathematicians dismissed him as “a closet physicist” anyway. His work treated mathematics as a sort of empirical science—not a Platonic pipeline to absolute truth, but a research program subject to the world’s contingencies and uncertainties. “In spite of incompleteness and uncomputability and even algorithmic randomness,” he said, “mathematicians don’t want to give up absolute certainty. Why? Well, absolute certainty is like God.”♦
In quantum physics and later in chaos, scientists found the limits to their knowledge. They explored the fruitful uncertainty that at first so vexed Einstein, who did not want to believe that God plays dice with the universe. Algorithmic information theory applies the same limitations to the universe of whole numbers—an ideal, mental universe. As Chaitin put it, “God not only plays dice in quantum mechanics and nonlinear dynamics, but even in elementary number theory.”♦
Among its lessons were these:
Most numbers are random. Yet very few of them can be proved random.
A chaotic stream of information may yet hide a simple algorithm. Working backward from the chaos to the algorithm may be impossible.
Kolmogorov-Chaitin (KC) complexity is to mathematics what entropy is to thermodynamics: the antidote to perfection. Just as we can have no perpetual-motion machines, there can be no complete formal axiomatic systems.
Some mathematical facts are true for no reason. They are accidental, lacking a cause or deeper meaning.
Joseph Ford, a physicist studying the behavior of unpredictable dynamical systems in the 1980s, said that Chaitin had “charmingly captured the essence of the matter”♦ by showing the path from Gödel’s incompleteness to chaos. This was the “deeper meaning of chaos,” Ford declared:
Chaotic orbits exist but they are Gödel’s children, so complex, so overladen with information that humans can never comprehend them. But chaos is ubiquitous in nature; therefore the universe is filled with countless mysteries that man can never understand.
Yet one still tries to take their measure.
How much information …?
When an object (a number or a bitstream or a dynamical system) can be expressed a different way in fewer bits, it is compressible. A frugal telegraph operator prefers to send the compressed version. Because the spirit of frugal telegraph operators kept the lights on at Bell Labs, it was natural for Claude Shannon to explore data compression, both theory and practice. Compression was fundamental to his vision: his war work on cryptography analyzed the disguising of information at one end and the recovery of the information at the other; data compression likewise encodes the information, with a different motivation—the efficient use of bandwidth. Satellite television channels, pocket music players, efficient cameras and telephones and countless other modern appurtenances depend on coding algorithms to compress numbers—sequences of bits—and those algorithms trace their lineage to Shannon’s original 1948 paper.
The first of these, now called Shannon-Fano coding, came from his colleague Robert M. Fano. It began with the simple idea of assigning short codes to frequent symbols, as in Morse code. They knew their method was not optimal, however: it could not be relied on to produce the shortest possible messages. Within three years it was surpassed by work of a graduate student of Fano’s at MIT, David Huffman. In the decades since, versions of the Huffman coding algorithm have squeezed many, many bytes.
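The heart of Huffman's idea, repeatedly merging the two least frequent symbols so that frequent ones end up with short codewords, fits in a few lines. A minimal sketch, not the historical implementation (the sample sentence is an arbitrary assumption):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free code: frequent symbols get short codewords."""
    # Heap entries: (frequency, tiebreaker, {symbol: codeword-so-far}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # the two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))  # ...become one subtree
        tie += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)                          # the most frequent symbols get the shortest codewords
print(len(encoded), 8 * len(text))    # Huffman bits vs. plain 8-bit encoding
```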
Ray Solomonoff, a child of Russian immigrants who studied at the University of Chicago, encountered Shannon’s work in the early 1950s and began thinking about what he called the Information Packing Problem: how much information could one “pack” into a given number of bits, or conversely, given some information, how could one pack it into the fewest possible bits.♦ He had majored in physics, studied mathematical biology and probability and logic on the side, and gotten to know Marvin Minsky and John McCarthy, pioneers in what would soon be called artificial intelligence. Then he read Noam Chomsky’s offbeat and original paper “Three Models for the Description of Language,”♦ applying the new information-theoretic ideas to the formalization of structure in language. All this was bouncing around in Solomonoff’s mind; he was not sure where it led, but he found himself focusing on the problem of induction. How do people create theories to account for their experience of the world? They have to make generalizations, find patterns in data that are always influenced by randomness and noise. Could one enable a machine to do that? In other words, could a computer be made to learn from experience?
He worked out an elaborate answer and published it in 1964. It was idiosyncratic, and hardly anyone noticed until the 1970s, when both Chaitin and Kolmogorov discovered that Solomonoff had anticipated the essential features of what by then was called algorithmic information theory. In effect, Solomonoff, too, had been figuring out how a computer might look at sequences of data—number sequences or bit strings—and measure their randomness and their hidden patterns. When humans or computers learn from experience, they are using induction: recognizing regularities amid irregular streams of information. From this point of view, the laws of science represent data compression in action. A theoretical physicist acts like a very clever coding algorithm. “The laws of science that have been discovered can be viewed as summaries of large amounts of empirical data about the universe,”♦ wrote Solomonoff. “In the present context, each such law can be transformed into a method of compactly coding the empirical data that gave rise to that law.” A good scientific theory is economical. This was yet another way of saying so.
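Solomonoff's remark can be illustrated with a toy experiment: measurements generated by a simple law plus noise take far fewer bits stored as "law plus residuals" than as raw numbers. A sketch with assumed toy data (the linear law and the noise level are inventions for illustration, not Solomonoff's own example):

```python
import random
import zlib

random.seed(0)

# Toy "measurements": a simple law (y = 7x + 3) plus small observational noise.
xs = range(5000)
ys = [7 * x + 3 + random.randint(-2, 2) for x in xs]

# Naive encoding: write every measurement down.
raw = "\n".join(str(y) for y in ys).encode()

# "Scientific" encoding: state the law, then record only the small residuals.
residuals = [y - (7 * x + 3) for x, y in zip(xs, ys)]
law_plus_residuals = ("y = 7x + 3\n" + "\n".join(str(r) for r in residuals)).encode()

print(len(zlib.compress(raw, 9)))                 # many bytes
print(len(zlib.compress(law_plus_residuals, 9)))  # far fewer: the law did the compressing
```

The discovered law shoulders most of the descriptive burden; only the unexplained noise still has to be written out in full.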