prize price
five fife
jibe hype
geiger biker
Does this mean there are five different rules that alter i—one for z versus s, one for v versus f, and so on? Surely not. The change-triggering consonants t, s, f, p, and k all differ in the same way from their counterparts d, z, v, b, and g: they are unvoiced, whereas the counterparts are voiced. We need only one rule, then: change i whenever it appears before an unvoiced consonant. The proof that this is the real rule in people’s heads (and not just a way to save ink by replacing five rules with one) is that if an English speaker succeeds in pronouncing the German ch in the Third Reich, that speaker will pronounce the ei as in write, not as in ride. The consonant ch is not in the English inventory, so English speakers could not have learned any rule specifically applying to it. But it is an unvoiced consonant, and if the rule applies to any unvoiced consonant, an English speaker knows exactly what to do.
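To make the contrast between five rules and one concrete, here is a minimal Python sketch. The feature table, the symbol "x" standing in for the German ch, and the function name are all illustrative assumptions. The point is only that a rule stated over the voicing feature automatically covers a consonant the speaker was never taught.

```python
# Hypothetical feature table: True means voiced, False means unvoiced.
# "x" is an assumed stand-in symbol for the German ch, which English lacks.
VOICED = {
    "d": True, "z": True, "v": True, "b": True, "g": True,
    "t": False, "s": False, "f": False, "p": False, "k": False,
    "x": False,
}


def vowel_before(consonant: str) -> str:
    """Say whether the i vowel is altered before the given consonant.

    The rule consults only the voicing feature, never the consonant's identity.
    """
    return "unaltered" if VOICED[consonant] else "altered"


if __name__ == "__main__":
    for word, consonant in [("ride", "d"), ("write", "t"), ("Reich", "x")]:
        print(word, vowel_before(consonant))
```

Because the function looks only at the feature, adding a new unvoiced consonant to the table requires no new rule.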
This selectivity works not only in English but in all languages. Phonological rules are rarely triggered by a single phoneme; they are triggered by an entire class of phonemes that share one or more features (like voicing, stop versus fricative manner, or which organ is doing the articulating). This suggests that rules do not “see” the phonemes in a string but instead look right through them to the features they are made from.
And it is features, not phonemes, that are manipulated by the rules. Pronounce the following past-tense forms:
walked
slapped
passed
jogged
sobbed
fizzed
In walked, slapped, and passed, the -ed is pronounced as a t; in jogged, sobbed, and fizzed, it is pronounced as a d. By now you can probably figure out what is behind the difference: the t pronunciation comes after voiceless consonants like k, p, and s; the d comes after voiced ones like g, b, and z. There must be a rule that adjusts the pronunciation of the suffix -ed by peering back into the final phoneme of the stem and checking to see if it has the voicing feature. We can confirm the hunch by asking people to pronounce Mozart out-Bached Bach. The verb to out-Bach contains the sound ch, which does not exist in English. Nonetheless everyone pronounces the -ed as a t, because the ch is unvoiced, and the rule puts a t next to any unvoiced consonant. We can even determine whether people store the -ed suffix as a t in memory and use the rule to convert it to a d for some words, or the other way around. Words like play and row have no consonant at the end, and everyone pronounces their past tenses like plade and rode, not plate and rote. With no stem consonant triggering a rule, we must be hearing the suffix in its pure, unaltered form in the mental dictionary, that is, d. It is a nice demonstration of one of the main discoveries of modern linguistics: a morpheme may be stored in the mental dictionary in a different form from the one that is ultimately pronounced.
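The suffix rule just described can be sketched in a few lines of Python. The phoneme symbols, the set of unvoiced consonants, and the function name are assumptions made for illustration; "x" again stands in for the ch of Bach. The sketch stores the suffix as d and changes it only when the stem ends in an unvoiced consonant. It leaves out stems that themselves end in t or d, which the passage does not discuss.

```python
# Assumed symbols for unvoiced stem-final consonants; "x" is the ch of Bach.
UNVOICED_CONSONANTS = {"k", "p", "s", "f", "x"}

# The form of -ed assumed to be stored in the mental dictionary.
STORED_SUFFIX = "d"


def past_tense_suffix(final_phoneme: str) -> str:
    """Return the pronounced form of -ed given the stem's final phoneme."""
    if final_phoneme in UNVOICED_CONSONANTS:
        return "t"  # voicelessness spreads from the stem onto the suffix
    return STORED_SUFFIX  # voiced consonant or vowel: the stored form surfaces


if __name__ == "__main__":
    stems = {"walk": "k", "slap": "p", "pass": "s",
             "jog": "g", "sob": "b", "fizz": "z",
             "out-Bach": "x", "play": "ey", "row": "ow"}
    for stem, final in stems.items():
        print(f"{stem} + ed -> {past_tense_suffix(final)}")
```

Treating d as the default and t as the derived form mirrors the argument from play and row: when nothing triggers the rule, the stored form is what comes out.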
Readers with a taste for theoretical elegance may want to bear with me for one more paragraph. Note that there is an uncanny pattern in what the d-to-t rule is doing. First, d itself is voiced, and it ends up next to voiced consonants, whereas t is unvoiced, and it ends up next to unvoiced consonants. Second, except for voicing, t and d are the same; they use the same speech organ, the tongue tip, and that organ moves in the same way, namely sealing up the mouth at the gum ridge and then releasing. So the rule is not just tossing phonemes around arbitrarily, like changing a p to an l following a high vowel or any other substitution one might pick at random. It is doing delicate surgery on the -ed suffix, adjusting it to be the same in voicing as its neighbor, but leaving the rest of its features alone. That is, in converting slap + ed to slapt, the rule is “spreading” the voicing instruction, packaged with the p at the end of slap, onto the -ed suffix, like this:
The voicelessness of the t in slapped matches the voicelessness of the p in slapped because they are the same voicelessness; they are mentally represented as a single feature linked to two segments. This happens very often in the world’s languages. Features like voicing, vowel quality, and tones can spread sideways or sprout connections to several phonemes in a word, as if each feature lived on its own horizontal “tier,” rather than being tethered to one and only one phoneme.
So phonological rules “see” features, not phonemes, and they adjust features, not phonemes. Recall, too, that languages tend to arrive at an inventory of phonemes by multiplying out the various combinations of some set of features. These facts show that features, not phonemes, are the atoms of linguistic sound stored and manipulated in the brain. A phoneme is merely a bundle of features. Thus even in dealing with its smallest units, the features, language works by using a combinatorial system.
Every language has phonological rules, but what are they for? You may have noticed that they often make articulation easier. Flapping a t or a d between two vowels is faster than keeping the tongue in place long enough for air pressure to build up. Spreading voicelessness from the end of a word to its suffix spares the talker from having to turn the larynx off while pronouncing the end of the stem and then turn it back on again for the suffix. At first glance, phonological rules seem to be a mere summary of articulatory laziness. And from here it is a small step to notice phonological adjustments in some dialect other than one’s own and conclude that they typify the slovenliness of the speakers. Neither side of the Atlantic is safe. George Bernard Shaw wrote:
The English have no respect for their language and will not teach their children to speak it. They cannot spell it because they have nothing to spell it with but an old foreign alphabet of which only the consonants—and not all of them—have any agreed speech value. Consequently it is impossible for an Englishman to open his mouth without making some other Englishman despise him.
In his article “Howta Reckanize American Slurvian,” Richard Lederer writes:
Language lovers have long bewailed the sad state of pronunciation and articulation in the United States. Both in sorrow and in anger, speakers afflicted with sensitive ears wince at such mumblings as guvmint for government and assessories for accessories. Indeed, everywhere we turn we are assaulted by a slew of slurrings.
But if their ears were even more sensitive, these sorrowful speakers might notice that in fact there is no dialect in which sloppiness prevails. Phonological rules give with one hand and take away with the other. The same bumpkins who are derided for dropping g’s in Nothin’ doin’ are likely to enunciate the vowels in pólice and accidént that pointy-headed intellectuals reduce to a neutral “uh” sound. When the Brooklyn Dodgers pitcher Waite Hoyt was hit by a ball, a fan in the bleachers shouted, “Hurt’s hoit!” Bostonians who pahk their cah in Hahvahd Yahd name their daughters Sheiler and Linder. In 1992 an ordinance was proposed that would have banned the hiring of any immigrant teacher who “speaks with an accent” in—I am not making this up—Westfield, Massachusetts. An incredulous woman wrote to the Boston Globe recalling how her native New England teacher defined “homonym” using the example orphan and often. Another amused reader remembered incurring the teacher’s wrath when he spelled “cuh-rée-uh” k-o-r-e-a and “cuh-rée-ur” c-a-r-e-e-r, rather than vice versa. The proposal was quickly withdrawn.
There is a good reason why so-called laziness in pronunciation is in fact tightly regulated by phonological rules, and why, as a consequence, no dialect allows its speakers to cut corners at will. Every act of sloppiness on the part of a speaker demands a compensating measure of mental effort on the part of the conversational partner. A society of lazy talkers would be a society of hard-working listeners. If speakers were to have their way, all rules of phonology would spread and reduce and delete. But if listeners were to have their way, phonology would do the opposite: it would enhance the acoustic differences between confusable phonemes by forcing speakers to exaggerate or embroider them. And indeed, many rules of phonology do that. (For example, there is a rule that forces English speakers to round their lips while saying sh but not while saying s. The benefit of forcing everyone to make this extra gesture is that the long resonant chamber formed by the pursed lips enhances the lower-frequency noise that distinguishes sh from s, allowing for easier identification of the sh by the listener.) Although every speaker soon becomes a listener, human hypocrisy would make it unwise to depend on the speaker’s foresight and consideration. Instead, a single, partly arbitrary set of phonological rules, some reducing, some enhancing, is adopted by every member of a linguistic community when he or she acquires the local dialect as a child.
Phonological rules help listeners even when they do not exaggerate some acoustic difference. By making speech patterns predictable, they add redundancy to a language; English text has been estimated as being between two and four times as long as it has to be for its information content. For example, this book takes up about 900,000 characters on my computer disk, but my file compression program can exploit the redundancy in the letter sequences and squeeze it into about 400,000 characters; computer files that do not contain English text cannot be squished nearly that much. The logician Quine explains why many systems have redundancy built in:
It is the judicious excess over minimum requisite support. It is why a good bridge does not crumble when subjected to stress beyond what reasonably could have been foreseen. It is fallback and failsafe. It is why we address our mail to city and state in so many words, despite the zip code. One indistinct digit in the zip code would spoil everything…. A kingdom, legend tells us, was lost for want of a horseshoe nail. Redundancy is our safeguard against such instability.
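The claim that English text compresses well because it is redundant can be checked with a rough Python sketch using the standard zlib module. The sample sentences, the use of seeded pseudo-random bytes as the non-English comparison, and the function name are assumptions for illustration. Random bytes are an extreme case of a file with no exploitable patterns, and a short sample will not reproduce the figures quoted for a whole book.

```python
import random
import zlib

# A short English sample (assumed for illustration), encoded as bytes.
ENGLISH = (
    "Phonological rules help listeners even when they do not exaggerate "
    "some acoustic difference. By making speech patterns predictable, "
    "they add redundancy to a language. In the comprehension of speech, "
    "the redundancy conferred by phonological rules can compensate for "
    "some of the ambiguity in the sound wave."
).encode("ascii")

# Seeded pseudo-random bytes of the same length, with no letter patterns.
_rng = random.Random(0)
NOISE = bytes(_rng.getrandbits(8) for _ in range(len(ENGLISH)))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower means more redundancy."""
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    print("English:", round(compression_ratio(ENGLISH), 2))
    print("Noise:  ", round(compression_ratio(NOISE), 2))
```

The English sample should come out well below 1, while the noise stays at or slightly above 1, since the compressor finds nothing predictable to exploit.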
Thanks to the redundancy of language, yxx cxn xndxrstxnd whxt x xm wrxtxng xvxn xf x rxplxcx xll thx vxwxls wxth xn “x” (t gts lttl hrdr f y dn’t vn kn whr th vwls r). In the comprehension of speech, the redundancy conferred by phonological rules can compensate for some of the ambiguity in the sound wave. For example, a listener can know that “thisrip” must be this rip and not the srip because the English consonant cluster sr is illegal.
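Both demonstrations in this paragraph can be sketched in Python. The function names are assumptions, and the set of illegal word-initial clusters contains only the one example given here; a fuller list for English would be longer. The first function masks vowels the way the sentence above does, and the second discards candidate word divisions in which any word begins with an illegal cluster.

```python
import re

# Only the cluster named in the text; a realistic list would have more entries.
ILLEGAL_ONSETS = {"sr"}


def mask_vowels(text: str) -> str:
    """Replace every vowel letter with x, after lowercasing the text."""
    return re.sub(r"[aeiou]", "x", text.lower())


def starts_legally(word: str) -> bool:
    """True if the word does not begin with an illegal consonant cluster."""
    return not any(word.startswith(onset) for onset in ILLEGAL_ONSETS)


def plausible_parses(candidates):
    """Keep only the word divisions in which every word starts legally."""
    return [parse for parse in candidates
            if all(starts_legally(word) for word in parse)]


if __name__ == "__main__":
    print(mask_vowels("you can understand"))
    print(plausible_parses([("this", "rip"), ("the", "srip")]))
```

The filter never needs to know what the words mean: the ban on sr at the start of a word is enough to rule out the second division.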
So why is it that a nation that can put a man on the moon cannot build a computer that can take dictation? According to what I have explained so far, each phoneme should have a telltale acoustic signature: a set of resonances for vowels, a noise band for fricatives, a silence-burst-transition sequence for stops. The sequences of phonemes are massaged in predictable ways by ordered phonological rules, whose effects could presumably be undone by applying them in reverse.
The reason that speech recognition is so hard is that there’s many a slip ’twixt brain and lip. No two people’s voices are alike, either in the shape of the vocal tract that sculpts the sounds, or in the person’s precise habits of articulation. Phonemes also sound very different depending on how much they are stressed and how quickly they are spoken; in rapid speech, many are swallowed outright.
But the main reason an electric stenographer is not just around the corner has to do with a general phenomenon in muscle control called coarticulation. Put a saucer in front of you and a coffee cup a foot or so away from it on one side. Now quickly touch the saucer and pick up the cup. You probably touched the saucer at the edge nearest the cup, not dead center. Your fingers probably assumed the handle-grasping posture while your hand was making its way to the cup, well before it arrived. This graceful smoothing and overlapping of gestures is ubiquitous in motor control. It reduces the forces necessary to move body parts around and lessens the wear and tear on the joints. The tongue and throat are no different. When we want to articulate a phoneme, our tongue cannot assume the target posture instantaneously; it is a heavy slab of meat that takes time to heft into place. So while we are moving it, our brains are anticipating the next posture in planning the trajectory, just like the cup-and-saucer maneuver. Among the range of positions in the mouth that can define a phoneme, we place the tongue in the one that offers the shortest path to the target for the next phoneme. If the current phoneme does not specify where a speech organ should be, we anticipate where the next phoneme wants it to be and put it there in advance. Most of us are completely unaware of these adjustments until they are called to our attention. Say Cape Cod. Until now you probably never noticed that your tongue body is in different positions for the two k sounds. In horseshoe, the first s becomes a sh; in NPR, the n becomes an m; in month and width, the n and d are articulated at the teeth, not the usual gum ridge.
Because sound waves are minutely sensitive to the shapes of the cavities they pass through, this coarticulation wreaks havoc with the speech sound. Each phoneme’s sound signature is colored by the phonemes that come before and after, sometimes to the point of having nothing in common with its sound signature in the company of a different set of phonemes. That is why you cannot cut up a tape of the sound cat and hope to find a beginning piece that contains the k alone. As you make earlier and earlier cuts, the piece may go from sounding like ka to sounding like a chirp or whistle. This shingling of phonemes in the speech stream could, in principle, be a boon to an optimally designed speech recognizer. Consonants and vowels are being signaled simultaneously, greatly increasing the rate of phonemes per second, as I noted at the beginning of this chapter, and there are many redundant sound cues to a given phoneme. But this advantage can be enjoyed only by a high-tech speech recognizer, one that has some kind of knowledge of how vocal tracts blend sounds.
The human brain, of course, is a high-tech speech recognizer, but no one knows how it succeeds. For this reason psychologists who study speech perception and engineers who build speech recognition machines keep a close eye on each other’s work. Speech recognition may be so hard that there are only a few ways it could be solved in principle. If so, the way the brain does it may offer hints as to the best way to build a machine to do it, and how a successful machine does it may suggest hypotheses about how the brain does it.
Early in the history of speech research, it became clear that human listeners might somehow take advantage of their expectations of the kinds of things a speaker is likely to say. This could narrow down the alternatives left open by the acoustic analysis of the speech signal. We have already noted that the rules of phonology provide one sort of redundancy that can be exploited, but people might go even farther. The psychologist George Miller played tapes of sentences in background noise and asked people to repeat back exactly what they heard. Some of the sentences followed the rules of English syntax and made sense.
Furry wildcats fight furious battles.
Respectable jewelers give accurate appraisals.
Lighted cigarettes create smoky fumes.
Gallant gentlemen save distressed damsels.
Soapy detergents dissolve greasy stains.
Others were created by scrambling the words within phrases to create colorless-green-ideas sentences, grammatical but nonsensical:
Furry jewelers create distressed stains.
Respectable cigarettes save greasy battles.
Lighted gentlemen dissolve furious appraisals.
Gallant detergents fight accurate fumes.
Soapy wildcats give smoky damsels.
A third kind was created by scrambling the phrase structure but keeping related words together, as in
Furry fight furious wildcat battles.
Jewelers respectable appraisals accurate give.
Finally, some sentences were utter word salad, like
Furry create distressed jewelers stains.
Cigarettes respectable battles greasy save.
People did best with the grammatical sensible sentences, worse with the grammatical nonsense and the ungrammatical sense, and worst of all with the ungrammatical nonsense. A few years later the psychologist Richard Warren taped sentences like The state governors met with their respective legislatures convening in the capital city, excised the first s from legislatures, and spliced in a cough. Listeners could not tell that any sound was missing.
If one thinks of the sound wave as sitting at the bottom of a hierarchy from sounds to phonemes to words to phrases to the meanings of sentences to general knowledge, these demonstrations seem to imply that human speech perception works from the top down rather than just from the bottom up. Maybe we are constantly guessing what a speaker will say next, using every scrap of conscious and unconscious knowledge at our disposal, from how coarticulation distorts sounds, to the rules of English phonology, to the rules of English syntax, to stereotypes about who tends to do what to whom in the world, to hunches about what our conversational partner has in mind at that very moment. If the expectations are accurate enough, the acoustic analysis can be fairly crude; what the sound wave lacks, the context can fill in. For example, if you are listening to a discussion about the destruction of ecological habitats, you might be on the lookout for words pertaining to threatened animals and plants, and then when you hear speech sounds whose phonemes you cannot pick out like “eesees,” you would perceive it correctly as species—unless you are Emily Litella, the hearing-impaired editorialist on Saturday Night Live who argued passionately against the campaign to protect endangered feces. (Indeed, the humor in the Gilda Radner character, who also fulminated against saving Soviet jewelry, stopping violins in the streets, and preserving natural racehorses, comes not from her impairment at the bottom of the speech-processing system but from her ditziness at the top, the level that should have prevented her from arriving at her interpretations.)
The top-down theory of speech perception exerts a powerful emotional tug on some people. It confirms the relativist philosophy that we hear what we expect to hear, that our knowledge determines our perception, and ultimately that we are not in direct contact with any objective reality. In a sense, perception that is strongly driven from the top down would be a barely controlled hallucination, and that is the problem. A perceiver forced to rely on its expectations is at a severe disadvantage in a world that is unpredictable even under the best of circumstances. There is a reason to believe that human speech perception is, in fact, driven quite strongly by acoustics. If you have an indulgent friend, you can try the following experiment. Pick ten words at random out of a dictionary, phone up the friend, and say the words clearly. Chances are the friend will reproduce them perfectly, relying only on the information in the sound wave and knowledge of English vocabulary and phonology. The friend could not have been using any higher-level expectations about phrase structure, context, or story line because a list of words blurted out of the blue has none. Though we may call upon high-level conceptual knowledge in noisy or degraded circumstances (and even here it is not clear whether the knowledge alters perception or just allows us to guess intelligently after the fact), our brains seem designed to squeeze every last drop of phonetic information out of the sound wave itself. Our sixth sense may perceive speech as language, not as sound, but it is a sense, something that connects us to the world, and not just a form of suggestibility.