Another demonstration that speech perception is not the same thing as fleshing out expectations comes from an illusion that the columnist Jon Carroll has called the mondegreen, after his mis-hearing of the folk ballad “The Bonnie Earl O’Moray”:
Oh, ye hielands and ye lowlands,
Oh, where hae ye been?
They have slain the Earl of Moray,
And laid him on the green.
He had always thought that the lines were “They have slain the Earl of Moray, And Lady Mondegreen.” Mondegreens are fairly common (they are an extreme version of the Pullet Surprises and Pencil Vaneas mentioned earlier); here are some examples:
A girl with colitis goes by. [A girl with kaleidoscope eyes. From the Beatles song “Lucy in the Sky with Diamonds.”]
Our father wishart in heaven; Harold be thy name…Lead us not into Penn Station. [Our father which art in Heaven; hallowed be thy name…Lead us not into temptation. From the Lord’s Prayer.]
He is trampling out the vintage where the grapes are wrapped and stored. […grapes of wrath are stored. From “The Battle Hymn of the Republic.”]
Gladly the cross-eyed bear. [Gladly the cross I’d bear.]
I’ll never be your pizza burnin’. […your beast of burden. From the Rolling Stones song.]
It’s a happy enchilada, and you think you’re gonna drown. [It’s a half an inch of water…From the John Prine song “That’s the Way the World Goes ’Round.”]
The interesting thing about mondegreens is that the mishearings are generally less plausible than the intended lyrics. In no way do they bear out any sane listener’s general expectations of what a speaker is likely to say or mean. (In one case a student stubbornly misheard the Shocking Blue hit song “I’m Your Venus” as “I’m Your Penis” and wondered how it was allowed on the radio.) The mondegreens do conform to English phonology, English syntax (sometimes), and English vocabulary (though not always, as in the word mondegreen itself). Apparently, listeners lock in to some set of words that fit the sound and that hang together more or less as English words and phrases, but plausibility and general expectations are not running the show.
The history of artificial speech recognizers offers a similar moral. In the 1970s a team of artificial intelligence researchers at Carnegie-Mellon University headed by Raj Reddy designed a computer program called HEARSAY that interpreted spoken commands to move chess pieces. Influenced by the top-down theory of speech perception, they designed the program as a “community” of “expert” subprograms cooperating to give the most likely interpretation of the signal. There were subprograms that specialized in acoustic analysis, in phonology, in the dictionary, in syntax, in rules for the legal moves of chess, even in chess strategy as applied to the game in progress. According to one story, a general from the defense agency that was funding the research came up for a demonstration. As the scientists sweated he was seated in front of a chessboard and a microphone hooked up to the computer. The general cleared his throat. The program printed “Pawn to King 4.”
The recent program DragonDictate, mentioned earlier in the chapter, places the burden more on good acoustic, phonological, and lexical analyses, and that seems to be responsible for its greater success. The program has a dictionary of words and their sequences of phonemes. To help anticipate the effects of phonological rules and coarticulation, the program is told what every English phoneme sounds like in the context of every possible preceding phoneme and every possible following phoneme. For each word, these phonemes-in-context are arranged into a little chain, with a probability attached to each transition from one sound unit to the next. This chain serves as a crude model of the speaker, and when a real speaker uses the system, the probabilities in the chain are adjusted to capture that person’s manner of speaking. The entire word, too, has a probability attached to it, which depends on its frequency in the language and on the speaker’s habits. In some versions of the program, the probability value for a word is adjusted depending on which word precedes it; this is the only top-down information that the program uses. All this knowledge allows the program to calculate which word is most likely to have come out of the mouth of the speaker given the input sound. Even then, DragonDictate relies more on expectancies than an able-eared human does. In the demonstration I saw, the program had to be coaxed into recognizing word and worm, even when they were pronounced as clear as a bell, because it kept playing the odds and guessing higher-frequency were instead.
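The calculation described above, choosing the word most likely to have produced the input sound, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DragonDictate's actual code: the function name `recognize` and every number below are invented for the example, and the acoustic fit of each word is given directly as a single made-up value rather than computed from a chain of phonemes-in-context.

```python
import math

# Hypothetical values: how well a clearly spoken "word" fits each candidate's
# sound model, i.e. P(sound | candidate). Invented for illustration only.
ACOUSTIC_FIT = {"word": 0.6, "were": 0.1, "worm": 0.05}

# Hypothetical word probabilities reflecting frequency in the language,
# i.e. P(candidate). Also invented for illustration.
WORD_PRIOR = {"word": 0.0005, "were": 0.005, "worm": 0.0001}


def recognize(acoustic_fit, word_prior):
    """Return the candidate maximizing P(sound | word) * P(word).

    Log-probabilities are summed instead of multiplying raw probabilities,
    which avoids numerical underflow when many small values are combined.
    """
    scores = {
        word: math.log(acoustic_fit[word]) + math.log(word_prior[word])
        for word in acoustic_fit
    }
    return max(scores, key=scores.get)


# With frequency-based priors, the common word can beat the better acoustic fit.
print(recognize(ACOUSTIC_FIT, WORD_PRIOR))

# With every candidate equally probable, the acoustic evidence decides alone.
uniform_prior = {word: 1 / len(ACOUSTIC_FIT) for word in ACOUSTIC_FIT}
print(recognize(ACOUSTIC_FIT, uniform_prior))
```

With these invented numbers the first call picks were, even though word fits the sound better, because the frequency advantage outweighs the acoustic evidence; the second call, with equal priors, picks word. That is the playing-the-odds behavior seen in the demonstration, reduced to its arithmetic.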
Now that you know how individual speech units are produced, how they are represented in the mental dictionary, and how they are rearranged and smeared before they emerge from the mouth, you have reached the prize at the bottom of this chapter: why English spelling is not as deranged as it first appears.
The complaint about English spelling, of course, is that it pretends to capture the sounds of words but does not. There is a long tradition of doggerel making this point, of which this stanza is a typical example:
Beware of heard, a dreadful word
That looks like beard and sounds like bird,
And dead: it’s said like bed, not bead—
For goodness’ sake don’t call it “deed”!
Watch out for meat and great and threat
(They rhyme with suite and straight and debt).
George Bernard Shaw led a vigorous campaign to reform the English alphabet, a system so illogical, he said, that it could spell fish as “ghoti”—gh as in tough, o as in women, ti as in nation. (“Mnomnoupte” for minute and “mnopspteiche” for mistake are other examples.) In his will Shaw bequeathed a cash prize to be awarded to the designer of a replacement alphabet for English, in which each sound in the spoken language would be recognizable by a single symbol. He wrote:
To realize the annual difference in favour of a forty-two letter phonetic alphabet…you must multiply the number of minutes in the year, the number of people in the world who are continuously writing English words, casting types, manufacturing printing and writing machines, by which time the total figure will have become so astronomical that you will realize that the cost of spelling even one sound with two letters has cost us centuries of unnecessary labour. A new British 42 letter alphabet would pay for itself a million times over not only in hours but in moments. When this is grasped, all the useless twaddle about enough and cough and laugh and simplified spelling will be dropped, and the economists and statisticians will be set to work to gather in the orthographic Golconda.
My defense of English spelling will be halfhearted. For although language is an instinct, written language is not. Writing was invented a small number of times in history, and alphabetic writing, where one character corresponds to one sound, seems to have been invented only once. Most societies have lacked written language, and those that have it inherited it or borrowed it from one of the inventors. Children must be taught to read and write in laborious lessons, and knowledge of spelling involves no daring leaps from the training examples like the leaps we saw in Simon, Mayela, and the Jabba and mice-eater experiments in Chapters 3 and 5. And people do not uniformly succeed. Illiteracy, the result of insufficient teaching, is the rule in much of the world, and dyslexia, a presumed congenital difficulty in learning to read even with sufficient teaching, is a severe problem even in industrial societies, found in five to ten percent of the population.
But though writing is an artificial contraption connecting vision and language, it must tap into the language system at well-demarcated points, and that gives it a modicum of logic. In all known writing systems, the symbols designate only three kinds of linguistic structure: the morpheme, the syllable, and the phoneme. Mesopotamian cuneiform, Egyptian hieroglyphs, Chinese logograms, and Japanese kanji encode morphemes. Cherokee, Ancient Cypriot, and Japanese kana are syllable-based. All modern phonemic alphabets appear to be descended from a system invented by the Canaanites around 1700 B.C. No writing system has symbols for actual sound units that can be identified on an oscilloscope or spectrogram, such as a phoneme as it is pronounced in a particular context or a syllable chopped in half.
Why has no writing system ever met Shaw’s ideal of one symbol per sound? As Shaw himself said elsewhere, “There are two tragedies in life. One is not to get your heart’s desire. The other is to get it.” Just think back to the workings of phonology and coarticulation. A true Shavian alphabet would mandate different vowels in write and ride, different consonants in write and writing, and different spellings for the past-tense suffix in slapped, sobbed, and sorted. Cape Cod would lose its visual alliteration. A horse would be spelled differently from its horseshoe, and National Public Radio would have the enigmatic abbreviation MPR. We would need brand-new letters for the n in month and the d in width. I would spell often differently from orphan, but my neighbors here in the Hub would not, and their spelling of career would be my spelling of Korea and vice versa.
Obviously, alphabets do not and should not correspond to sounds; at best they correspond to the phonemes specified in the mental dictionary. The actual sounds are different in different contexts, so true phonetic spelling would only obscure their underlying identity. The surface sounds are predictable by phonological rules, though, so there is no need to clutter up the page with symbols for the actual sounds; the reader needs only the abstract blueprint for a word and can flesh out the sound if needed. Indeed, for about eighty-four percent of English words, spelling is completely predictable from regular rules. Moreover, since dialects separated by time and space often differ most in the phonological rules that convert mental dictionary entries into pronunciations, a spelling corresponding to the underlying entries, not the sounds, can be widely shared. The words with truly weird spellings (like of, people, women, have, said, do, done, and give) generally are the commonest ones in the language, so there is ample opportunity for everyone to memorize them.
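The claim that surface sounds are predictable by rule can be made concrete with the past-tense suffix in slapped, sobbed, and sorted. The Python sketch below is a simplified illustration, not a full phonology of English: the function name, the informal sound labels (plain letters rather than phonetic symbols), and the small table of stem-final sounds are all assumptions made for the example.

```python
# Informal labels for voiceless consonants that can end an English verb stem.
# These are illustrative stand-ins, not phonetic notation.
VOICELESS = {"p", "k", "f", "s", "sh", "ch", "th"}


def past_tense_suffix_sound(final_sound):
    """Predict how the suffix spelled -ed is pronounced after a given stem-final sound."""
    if final_sound in {"t", "d"}:
        # After t or d, a vowel is inserted so the suffix can be heard: sort-id.
        return "id"
    if final_sound in VOICELESS:
        # After a voiceless consonant the suffix is voiceless too: slap-t.
        return "t"
    # After vowels and voiced consonants the suffix is voiced: sob-d.
    return "d"


# Hypothetical mini-dictionary: each stem paired with its final sound.
STEM_FINAL_SOUND = {"slap": "p", "sob": "b", "sort": "t"}

for stem, final_sound in STEM_FINAL_SOUND.items():
    print(stem, "+ -ed is pronounced with", past_tense_suffix_sound(final_sound))
```

One spelling, -ed, covers all three pronunciations because the reader can apply the rule; a spelling that recorded the three surface sounds separately would hide the fact that they are the same suffix.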
Even the less predictable aspects of spelling bespeak hidden linguistic regularities. Consider the following pairs of words where the same letters get different pronunciations:
electric-electricity
photograph-photography
grade-gradual
history-historical
revise-revision
adore-adoration
bomb-bombard
nation-national
critical-criticize
mode-modular
resident-residential
declare-declaration
muscle-muscular
condemn-condemnation
courage-courageous
romantic-romanticize
industry-industrial
fact-factual
inspire-inspiration
sign-signature
malign-malignant
Once again the similar spellings, despite differences in pronunciation, are there for a reason: they are identifying two words as being based on the same root morpheme. This shows that English spelling is not completely phonemic; sometimes letters encode phonemes, but sometimes a sequence of letters is specific to a morpheme. And a morphemic writing system is more useful than you might think. The goal of reading, after all, is to understand the text, not to pronounce it. A morphemic spelling can help a reader distinguish homophones, like meet and mete. It can also tip off a reader that one word contains another (and not just a phonologically identical impostor). For example, spelling tells us that overcome contains come, so we know that its past tense must be overcame, whereas succumb just contains the sound “kum,” not the morpheme come, so its past tense is not succame but succumbed. Similarly, when something recedes, one has a recession, but when someone re-seeds a lawn, we have a re-seeding.
In some ways, a morphemic writing system has served the Chinese well, despite the inherent disadvantage that readers are at a loss when they face a new or rare word. Mutually unintelligible dialects can share texts (even if their speakers pronounce the words very differently), and many documents that are thousands of years old are readable by modern speakers. Mark Twain alluded to such inertia in our own Roman writing system when he wrote, “They spell it Vinci and pronounce it Vinchy; foreigners always spell better than they pronounce.”
Of course English spelling could be better than it is. But it is already much better than people think it is. That is because writing systems do not aim to represent the actual sounds of talking, which we do not hear, but the abstract units of language underlying them, which we do hear.
Talking Heads
For centuries, people have been terrified that their programmed creations might outsmart them, overpower them, or put them out of work. The fear has long been played out in fiction, from the medieval Jewish legend of the Golem, a clay automaton animated by an inscription of the name of God placed in its mouth, to HAL, the mutinous computer of 2001: A Space Odyssey. But when the branch of engineering called “artificial intelligence” (AI) was born in the 1950s, it looked as though fiction was about to turn into frightening fact. It is easy to accept a computer calculating pi to a million decimal places or keeping track of a company’s payroll, but suddenly computers were also proving theorems in logic and playing respectable chess. In the years following there came computers that could beat anyone but a grand master, and programs that outperformed most experts at recommending treatments for bacterial infections and investing pension funds. With computers solving such brainy tasks, it seemed only a matter of time before a C3PO or a Terminator would be available from the mail-order catalogues; only the easy tasks remained to be programmed. According to legend, in the 1970s Marvin Minsky, one of the founders of AI, assigned “vision” to a graduate student as a summer project.
But household robots are still confined to science fiction. The main lesson of thirty-five years of AI research is that the hard problems are easy and the easy problems are hard. The mental abilities of a four-year-old that we take for granted—recognizing a face, lifting a pencil, walking across a room, answering a question—in fact solve some of the hardest engineering problems ever conceived. Do not be fooled by the assembly-line robots in the automobile commercials; all they do is weld and spray-paint, tasks that do not require these clumsy Mr. Magoos to see or hold or place anything. And if you want to stump an artificial intelligence system, ask it questions like, Which is bigger, Chicago or a breadbox? Do zebras wear underwear? Is the floor likely to rise up and bite you? If Susan goes to the store, does her head go with her? Most fears of automation are misplaced. As the new generation of intelligent devices appears, it will be the stock analysts and petrochemical engineers and parole board members who are in danger of being replaced by machines. The gardeners, receptionists, and cooks are secure in their jobs for decades to come.
Understanding a sentence is one of these hard easy problems. To interact with computers we still have to learn their languages; they are not smart enough to learn ours. In fact, it is all too easy to give computers more credit at understanding than they deserve.
Recently an annual competition was set up for the computer program that can best fool users into thinking that they are conversing with another human. The competition for the Loebner Prize was intended to implement a suggestion made by Alan Turing in a famous 1950 paper. He suggested that the philosophical question “Can machines think?” could best be answered in an imitation game, where a judge converses with a person over one terminal and with a computer programmed to imitate a person on another. If the judge cannot guess which is which, Turing suggested, there is no basis for denying that the computer can think. Philosophical questions aside, it was apparent to the committee charged with overseeing the competition that no program could come close to winning the $100,000 prize, so they devised a $1,500 version that would be fairer to the state of the art. Each of the judges had to stick to a single topic of conversation selected by the programmer or by the human foil, whichever it was, and the judge was not allowed to engage in any “trickery or guile” such as repeating a question ten times or asking whether zebras wear underwear; the conversation had to be “natural.” After interacting with several programs and human foils for about seven minutes apiece, the judges ranked all the humans as more humanlike than any of the computers. About half the judges did, however, misidentify the winning program as human.
The accomplishment is less impressive than it sounds. The rules handcuffed the judges: “unnatural trickery or guile” is another way of referring to any attempt to determine whether one is conversing with a human or a machine, which is the whole point of the test! Also, the winning programmer shrewdly exploited the opportunity to designate the topic of conversation for his program. He chose “whimsical conversation,” which is a dubious example of a “topic,” and which, by definition, can be full of non sequiturs: