The Code Book
The immediate reaction of any cryptanalyst upon seeing such a ciphertext is to analyze the frequency of all the letters, which results in Table 2. Not surprisingly, the letters vary in their frequency. The question is, can we identify what any of them represent, based on their frequencies? The ciphertext is relatively short, so we cannot slavishly apply frequency analysis. It would be naive to assume that the commonest letter in the ciphertext, O, represents the commonest letter in English, e, or that the eighth most frequent letter in the ciphertext, Y, represents the eighth most frequent letter in English, h. An unquestioning application of frequency analysis would lead to gibberish. For example, the first word PCQ would be deciphered as aov.
However, we can begin by focusing attention on the only three letters that appear more than thirty times in the ciphertext, namely O, X and P. It is fairly safe to assume that the commonest letters in the ciphertext probably represent the commonest letters in the English alphabet, but not necessarily in the right order. In other words, we cannot be sure that O = e, X = t, and P = a, but we can make the tentative assumption that:
O = e, t or a, X = e, t or a, P = e, t or a.
Table 2 Frequency analysis of enciphered message.
Letter Frequency
Occurrences Percentage
A 3 0.9
B 25 7.4
C 27 8.0
D 14 4.1
E 5 1.5
F 2 0.6
G 1 0.3
H 0 0.0
I 11 3.3
J 18 5.3
K 26 7.7
L 25 7.4
M 11 3.3
N 3 0.9
O 38 11.2
P 31 9.2
Q 2 0.6
R 6 1.8
S 7 2.1
T 0 0.0
U 6 1.8
V 18 5.3
W 1 0.3
X 34 10.1
Y 19 5.6
Z 5 1.5
In order to proceed with confidence, and pin down the identity of the three most common letters, O, X and P, we need a more subtle form of frequency analysis. Instead of simply counting the frequency of the three letters, we can focus on how often they appear next to all the other letters. For example, does the letter O appear before or after several other letters, or does it tend to neighbor just a few special letters? Answering this question will be a good indication of whether O represents a vowel or a consonant. If O represents a vowel it should appear before and after most of the other letters, whereas if it represents a consonant, it will tend to avoid many of the other letters. For example, the letter e can appear before and after virtually every other letter, but the letter t is rarely seen before or after b, d, g, j, k, m, q or v.
The table below takes the three most common letters in the ciphertext, O, X and P, and lists how frequently each appears before or after every letter. For example, O appears before A on 1 occasion, but never appears immediately after it, giving a total of 1 in the first box. The letter O neighbors the majority of letters, and there are only 7 that it avoids completely, represented by the 7 zeros in the O row. The letter X is equally sociable, because it too neighbors most of the letters, and avoids only 8 of them. However, the letter P is much less friendly. It tends to lurk around just a few letters, and avoids 15 of them. This evidence suggests that O and X represent vowels, while P represents a consonant.
Now we must ask ourselves which vowels are represented by O and X. They are probably e and a, the two most popular vowels in the English language, but does O = e and X = a, or does O = a and X = e? An interesting feature in the ciphertext is that the combination OO appears twice, whereas XX does not appear at all. Since the letters ee appear far more often than aa in plaintext English, it is likely that O = e and X = a.
At this point, we have confidently identified two of the letters in the ciphertext. Our conclusion that X = a is supported by the fact that X appears on its own in the ciphertext, and a is one of only two English words that consist of a single letter. The only other letter that appears on its own in the ciphertext is Y, and it seems highly likely that this represents the only other one-letter English word, which is i. Focusing on words with only one letter is a standard cryptanalytic trick, and I have included it among a list of cryptanalytic tips in Appendix B. This particular trick works only because this ciphertext still has spaces between the words. Often, a cryptographer will remove all the spaces to make it harder for an enemy interceptor to unscramble the message.
Although we have spaces between words, the following trick would also work where the ciphertext has been merged into a single string of characters. The trick allows us to spot the letter h, once we have already identified the letter e. In the English language, the letter h frequently goes before the letter e (as in the, then, they, etc.), but rarely after e. The table below shows how frequently the O, which we think represents e, goes before and after all the other letters in the ciphertext. The table suggests that B represents h, because it appears before 0 on 9 occasions, but it never goes after it. No other letter in the table has such an asymmetric relationship with O.
Each letter in the English language has its own unique personality, which includes its frequency and its relation to other letters. It is this personality that allows us to establish the true identity of a letter, even when it has been disguised by monoalphabetic substitution.
We have now confidently established four letters, O = e, X = a, Y = i and B = h, and we can begin to replace some of the letters in the ciphertext with their plaintext equivalents. I shall stick to the convention of keeping ciphertext letters in upper case, while putting plaintext letters in lower case. This will help to distinguish between those letters we still have to identify, and those that have already been established.
PCQ VMJiPD LhiK LiSe KhahJaWaV haV ZCJPe EiPD KhahJiUaJ LhJee KCPK. CP Lhe LhCMKaPV aPV IiJKL PiDhL, QheP Khe haV ePVeV Lhe LaRe CI Sa’aJMI, Khe JCKe aPV EiKKev Lhe DJCMPV ZeICJe h i S, KaUiPD: “DJeaL EiPD, ICJ a LhCMKaPV aPV CPe PiDhLK i haNe ZeeP JeACMPLiPD LC UCM Lhe IaZReK CI FaKL aDeK aPV Lhe ReDePVK CI aPAiePL EiPDK. SaU i SaEe KC ZCRV aK LC AJaNe a IaNCMJ CI UCMJ SaGeKLU?”
eFiRCDMe, LaReK IJCS Lhe LhCMKaPV aPV CPe PiDhLK
This simple step helps us to identify several other letters, because we can guess some of the words in the ciphertext. For example, the most common three-letter words in English are the and and, and these are relatively easy to spot-Lhe, which appears six times, and aPV, which appears five times. Hence, L probably represents t, P probably represents n, and V probably represents d. We can now replace these letters in the ciphertext with their true values:
nCQ dMJinD thiK tiSe KhahJaWad had ZCJne EinD KhahJiUaJ thJee KCnK. Cn the thCMKand and IiJKt niDht, Qhen Khe had ended the taRe CI Sa’aJMI, Khe JCKe and EiKKed the DJCMnd ZeICJe hiS, KaUinD: “DJeat EinD, ICJ a thCMKand and Cne niDhtK i haNe Zeen JeACMntinD tC UCM the IaZReK CI FaKt aDeK and the ReDendK CI anAient EinDK. SaU i SaEe KC ZCRd aK tC AJaNe a IaNCMJ CI UCMJ SaGeKtU?”
eFiRCDMe, taReK IJCS the thCMKand and Cne niDhtK
Once a few letters have been established, cryptanalysis progresses very rapidly. For example, the word at the beginning of the second sentence is Cn. Every word has a vowel in it, so C must be a vowel. There are only two vowels that remain to be identified, u and o; u does not fit, so C must represent o. We also have the word Khe, which implies that K represents either t or s. But we already know that L = t, so it becomes clear that K = s. Having identified these two letters, we insert them into the ciphertext, and there appears the phrase thoMsand and one niDhts. A sensible guess for this would be thousand and one nights, and it seems likely that the final line is telling us that this is a passage from Tales from the Thousand and One Nights. This implies that M = u, I = f, J = r, D = g, R = l, and S = m.
We could continue trying to establish other letters by guessing other words, but instead let us have a look at what we know about the plain alphabet and cipher alphabet. These two alphabets form the key, and they were used by the cryptographer in order to perform
the substitution that scrambled the message. Already, by identifying the true values of letters in the ciphertext, we have effectively been working out the details of the cipher alphabet. A summary of our achievements, so far, is given in the plain and cipher alphabets below.
By examining the partial cipher alphabet, we can complete the cryptanalysis. The sequence VOIDBY in the cipher alphabet suggests that the cryptographer has chosen a keyphrase as the basis for the key. Some guesswork is enough to suggest the keyphrase might be A VOID BY GEORGES PEREC, which is reduced to AVOID BY GERSPC after removing spaces and repetitions. Thereafter, the letters continue in alphabetical order, omitting any that have already appeared in the keyphrase. In this particular case, the cryptographer took the unusual step of not starting the keyphrase at the beginning of the cipher alphabet, but rather starting it three letters in. This is possibly because the keyphrase begins with the letter A, and the cryptographer wanted to avoid encrypting a as A. At last, having established the complete cipher alphabet, we can unscramble the entire ciphertext, and the cryptanalysis is complete.
Now during this time Shahrazad had borne King Shahriyar three sons. On the thousand and first night, when she had ended the tale of Ma’aruf, she rose and kissed the ground before him, saying: “Great King, for a thousand and one nights I have been recounting to you the fables of past ages and the legends of ancient kings. May I make so bold as to crave a favor of your majesty?”
Epilogue, Tales from the Thousand and One Nights
Renaissance in the West
Between A.D. 800 and 1200, Arab scholars enjoyed a vigorous period of intellectual achievement. At the same time, Europe was firmly stuck in the Dark Ages. While al-Kindī was describing the invention of cryptanalysis, Europeans were still struggling with the basics of cryptography. The only European institutions to encourage the study of secret writing were the monasteries, where monks would study the Bible in search of hidden meanings, a fascination that has persisted through to modern times (see Appendix C).
Medieval monks were intrigued by the fact that the Old Testament contained deliberate and obvious examples of cryptography. For example, the Old Testament includes pieces of text encrypted with atbash, a traditional form of Hebrew substitution cipher. Atbash involves taking each letter, noting the number of places it is from the beginning of the alphabet, and replacing it with a letter that is an equal number of places from the end of the alphabet. In English this would mean that a, at the beginning of the alphabet, is replaced by Z, at the end of the alphabet, b is replaced by Y, and so on. The term atbash itself hints at the substitution it describes, because it consists of the first letter of the Hebrew alphabet, aleph, followed by the last letter taw, and then there is the second letter, beth, followed by the second to last letter shin. An example of atbash appears in Jeremiah 25: 26 and 51: 41, where “Babel” is replaced by the word “Sheshach”; the first letter of Babel is beth, the second letter of the Hebrew alphabet, and this is replaced by shin, the second-to-last letter; the second letter of Babel is also beth, and so it too is replaced by shin; and the last letter of Babel is lamed, the twelfth letter of the Hebrew alphabet, and this is replaced by kaph, the twelfth-to-last letter.
Atbash and other similar biblical ciphers were probably intended only to add mystery, rather than to conceal meaning, but they were enough to spark an interest in serious cryptography. European monks began to rediscover old substitution ciphers, they invented new ones, and, in due course, they helped to reintroduce cryptography into Western civilization. The first known European book to describe the use of cryptography was written in the thirteenth century by the English Franciscan monk and polymath Roger Bacon. Epistle on the Secret Works of Art and the Nullity of Magic included seven methods for keeping messages secret, and cautioned: “A man is crazy who writes a secret in any other way than one which will conceal it from the vulgar.”
By the fourteenth century the use of cryptography had become increasingly widespread, with alchemists and scientists using it to keep their discoveries secret. Although better known for his literary achievements, Geoffrey Chaucer was also an astronomer and a cryptographer, and he is responsible for one of the most famous examples of early European encryption. In his Treatise on the Astrolabe he provided some additional notes entitled “The Equatorie of the Planetis,” which included several encrypted paragraphs. Chaucer’s encryption replaced plaintext letters with symbols, for example b with . A ciphertext consisting of strange symbols rather than letters may at first sight seem more complicated, but it is essentially equivalent to the traditional letter-for-letter substitution. The process of encryption and the level of security are exactly the same.
By the fifteenth century, European cryptography was a burgeoning industry. The revival in the arts, sciences and scholarship during the Renaissance nurtured the capacity for cryptography, while an explosion in political machinations offered ample motivation for secret communication. Italy, in particular, provided the ideal environment for cryptography. As well as being at the heart of the Renaissance, it consisted of independent city states, each trying to outmaneuver the others. Diplomacy flourished, and each state would send ambassadors to the courts of the others. Each ambassador received messages from his respective head of state, describing details of the foreign policy he was to implement. In response, each ambassador would send back any information that he had gleaned. Clearly there was a great incentive to encrypt communications in both directions, so each state established a cipher office, and each ambassador had a cipher secretary.
At the same time that cryptography was becoming a routine diplomatic tool, the science of cryptanalysis was beginning to emerge in the West. Diplomats had only just familiarized themselves with the skills required to establish secure communications, and already there were individuals attempting to destroy this security. It is quite probable that cryptanalysis was independently discovered in Europe, but there is also the possibility that it was introduced from the Arab world. Islamic discoveries in science and mathematics strongly influenced the rebirth of science in Europe, and cryptanalysis might have been among the imported knowledge.
Arguably the first great European cryptanalyst was Giovanni Soro, appointed as Venetian cipher secretary in 1506. Soro’s reputation was known throughout Italy, and friendly states would send intercepted messages to Venice for cryptanalysis. Even the Vatican, probably the second most active center of cryptanalysis, would send Soro seemingly impenetrable messages that had fallen into its hands. In 1526, Pope Clement VII sent him two encrypted messages, and both were returned having been successfully cryptanalyzed. And when one of the Pope’s own encrypted messages was captured by the Florentines, the Pope sent a copy to Soro in the hope that he would be reassured that it was unbreakable. Soro claimed that he could not break the Pope’s cipher, implying that the Florentines would also be unable to decipher it. However, this may have been a ploy to lull the Vatican cryptographers into a false sense of security-Soro might have been reluctant to point out the weaknesses of the Papal cipher, because this would only have encouraged the Vatican to switch to a more secure cipher, one that Soro might not have been able to break.
Elsewhere in Europe, other courts were also beginning to employ skilled cryptanalysts, such as Philibert Babou, cryptanalyst to King Francis I of France. Babou gained a reputation for being incredibly persistent, working day and night and persevering for weeks on end in order to crack an intercepted message. Unfortunately for Babou, this gave the king ample opportunity to carry on a long-term affair with his wife. Toward the end of the sixteenth century the French consolidated their codebreaking prowess with the arrival of François Viète, who took particular pleasure in cracking Spanish ciphers. Spain’s cryptographers, who appear to have been naive compared with their rivals elsewhere in Europe, could not believe it when they discovered that their messages were transparent to the French. King Philip II of Spain went as far as petitioning the Vatican, claiming that the only explanation for Viète’s cryptanalysis
was that he was an “archfiend in league with the devil.” Philip argued that Viète should be tried before a Cardinal’s Court for his demonic deeds; but the Pope, who was aware that his own cryptanalysts had been reading Spanish ciphers for years, rejected the Spanish petition. News of the petition soon reached cipher experts in various countries, and Spanish cryptographers became the laughingstock of Europe.
The Spanish embarrassment was symptomatic of the state of the battle between cryptographers and cryptanalysts. This was a period of transition, with cryptographers still relying on the monoalphabetic substitution cipher, while cryptanalysts were beginning to use frequency analysis to break it. Those yet to discover the power of frequency analysis continued to trust monoalphabetic substitution, ignorant of the extent to which cryptanalysts such as Soro, Babou and Viète were able to read their messages.
Meanwhile, countries that were alert to the weakness of the straightforward monoalphabetic substitution cipher were anxious to develop a better cipher, something that would protect their own nation’s messages from being unscrambled by enemy cryptanalysts. One of the simplest improvements to the security of the monoalphabetic substitution cipher was the introduction of nulls, symbols or letters that were not substitutes for actual letters, merely blanks that represented nothing. For example, one could substitute each plain letter with a number between 1 and 99, which would leave 73 numbers that represent nothing, and these could be randomly sprinkled throughout the ciphertext with varying frequencies. The nulls would pose no problem to the intended recipient, who would know that they were to be ignored. However, the nulls would baffle an enemy interceptor because they would confuse an attack by frequency analysis. An equally simple development was that cryptographers would sometimes deliberately misspell words before encrypting the message. Thys haz thi ifekkt off diztaughting thi ballans off frikwenseas—making it harder for the cryptanalyst to apply frequency analysis. However, the intended recipient, who knows the key, can unscramble the message and then deal with the bad, but not unintelligible, spelling.