.EQ delim $$ .EN .CH "1 WHY SPEECH OUTPUT?" .ds RT "Why speech output? .ds CX "Principles of computer speech .pp Speech is our everyday, informal, communication medium. But although we use it a lot, we probably don't assimilate as much information through our ears as we do through our eyes, by reading or looking at pictures and diagrams. You go to a technical lecture to get the feel of a subject \(em the overall arrangement of ideas and the motivation behind them \(em and fill in the details, if you still want to know them, from a book. You probably find out more about the news from ten minutes with a newspaper than from a ten-minute news broadcast. So it should be emphasized from the start that speech output from computers is not a panacea. It doesn't solve the problems of communicating with computers; it simply enriches the possibilities for communication. .pp What, then, are the advantages of speech output? One good reason for listening to a radio news broadcast instead of spending the time with a newspaper is that you can listen while shaving, doing the housework, or driving the car. Speech leaves hands and eyes free for other tasks. Moreover, it is omnidirectional, and does not require a free line of sight. Related to this is the use of speech as a secondary medium for status reports and warning messages. Occasional interruptions by voice do not interfere with other activities, unless they demand unusual concentration, and people can assimilate spoken messages and queue them for later action quite easily and naturally. .pp The second key feature of speech communication stems from the telephone. It is the universality of the telephone receiver itself that is important here, rather than the existence of a world-wide distribution network; for with special equipment (a modem and a VDU) one does not need speech to take advantage of the telephone network for information transfer. But speech needs no tools other than the telephone, and this gives it a substantial advantage. You can go into a phone booth anywhere in the world, carrying no special equipment, and have access to your computer within seconds. The problem of data input is still there: perhaps your computer system has a limited word recognizer, or you use the touchtone telephone keypad (or a portable calculator-sized tone generator). Easy remote access without special equipment is a great, and unique, asset to speech communication. .pp The third big advantage of speech output is that it is potentially very cheap. Being all-electronic, except for the loudspeaker, speech systems are well suited to high-volume, low-cost, LSI manufacture. Other computer output devices are at present tied either to mechanical moving parts or to the CRT. This was realized quickly by the computer hobbies market, where speech output peripherals have been selling like hot cakes since the mid 1970's. .pp A further point in favour of speech is that it is natural-seeming and somehow cuddly when compared with printers or VDU's. It would have been much more difficult to make this point before the advent of talking toys like Texas Instruments' "Speak 'n Spell" in 1978, but now it is an accepted fact that friendly computer-based gadgets can speak \(em there are talking pocket-watches that really do "tell" the time, talking microwave ovens, talking pinball machines, and, of course, talking calculators. 
It is, however, difficult to assess whether the appeal stems from mechanical speech's novelty \(em it is still a gimmick \(em and also to what extent it is tied up with economic factors. After all, most of the population don't use high-quality VDU's, and their major experience of real-time interactive computing is through the very limited displays and keypads provided on video games and teletext systems. .pp Articles on speech communication with computers often list many more advantages of voice output (see Hill 1971, Turn 1974, Lea 1980). .[ Hill 1971 Man-machine interaction using speech .] .[ Lea 1980 .] .[ Turn 1974 Speech as a man-computer communication channel .] For example, speech .LB .NP can be used in the dark .NP can be varied from a (confidential) whisper to a (loud) shout .NP requires very little energy .NP is not appreciably affected by weightlessness or vibration. .LE However, these either derive from the three advantages we have discussed above, or relate mainly to exotic applications in space modules and divers' helmets. .pp Useful as it is at present, speech output would be even more attractive if it could be coupled with speech input. In many ways, speech input is its "big brother". Many of the benefits of speech output are even more striking for speech input. Although people can assimilate information faster through the eyes than the ears, the majority of us can generate information faster with the mouth than with the hands. Rapid typing is a relatively uncommon skill, and even high typing rates are much slower than speaking rates (although whether we can originate ideas quickly enough to keep up with fast speech is another matter!) To take full advantage of the telephone for interaction with machines, machine recognition of speech is obviously necessary. A microwave oven, calculator, pinball machine, or alarm clock that responds to spoken commands is certainly more attractive than one that just generates spoken status messages. A book that told you how to recognize speech by machine would undoubtedly be more useful than one like this that just discusses how to synthesize it! But the technology of speech recognition is nowhere near as advanced as that of synthesis \(em it's a much more difficult problem. However, because speech input is obviously complementary to speech output, and even very limited input capabilities will greatly enhance many speech output systems, it is worth summarizing the present state of the art of speech recognition. .pp Commercial speech recognizers do exist. Almost invariably, they accept words spoken in isolation, with gaps of silence between them, rather than connected utterances. It is not difficult to discriminate with high accuracy up to a hundred different words spoken by the same speaker, especially if the vocabulary is carefully selected to avoid words which sound similar. If several different speakers are to be comprehended, performance can be greatly improved if the machine is given an opportunity to calibrate their voices in a training session, and is informed at recognition time which one is to speak. With a large population of unknown speakers, accurate recognition is difficult for vocabularies of more than a few carefully-chosen words. .pp A half-way house between isolated word discrimination and recognition of connected speech is the problem of spotting known words in continuous speech. This allows much more natural input, if the dialogue is structured as keywords which may be interspersed by unimportant "noise words". 
To speak in truly isolated words requires a great deal of self-discipline and concentration \(em it is surprising how much of ordinary speech is accounted for by vague sounds like um's and aah's, and false starts. Word spotting disregards these and so permits a more relaxed style of speech. Some progress has been made on it in research laboratories, but the vocabularies that can be accommodated are still very small. .pp The difficulty of recognizing connected speech depends crucially on what is known in advance about the dialogue: its pragmatic, semantic, and syntactic constraints. Highly structured dialogues constrain very heavily the choice of the next word. Recognizers which can deal with vocabularies of over 1000 words have been built in research laboratories, but the structure of the input has been such that the average "branching factor" \(em the size of the set out of which the next word must be selected \(em is only around 10 (Lea, 1980). .[ Lea 1980 .] Whether such highly constrained languages would be acceptable in many practical applications is a moot point. One commercial recognizer, developed in 1978, can cope with up to five words spoken continuously from a basic 120-word vocabulary. .pp There has been much debate about whether it will ever be possible for a speech recognizer to step outside rigid constraints imposed on the utterances it can understand, and act, say, as an automatic dictation machine. Certainly the most advanced recognizers to date depend very strongly on a tight context being available. Informed opinion seems to accept that in ten years' time, voice data entry in the office will be an important and economically feasible prospect, but that it would be rash to predict the appearance of unconstrained automatic dictation by then. .pp Let's return now to speech output and take a look at some systems which use it, to illustrate the advantages and disadvantages of speech in practical applications. .sh "1.1 Talking calculator" .pp Figure 1.1 shows a calculator that speaks. .FC "Figure 1.1" Whenever a key is pressed, the device confirms the action by saying the key's name. The result of any computation is also spoken aloud. For most people, the addition of speech output to a calculator is simply a gimmick. (Note incidentally that speech .ul input is a different matter altogether. The ability to dictate lists of numbers and commands to a calculator, without lifting one's eyes from the page, would have very great advantages over keypad input.) Used-car salesmen find that speech output sometimes helps to clinch a deal: they key in the basic car price and their bargain-basement deductions, and the customer is so bemused by the resulting price being spoken aloud to him by a machine that he signs the cheque without thinking! More seriously, there may be some small advantage to be gained when keying a list of figures by touch from having their values read back for confirmation. For blind people, however, such devices are a boon \(em and there are many other applications, like talking elevators and talking clocks, which benefit from even very restricted voice output. Much more sophisticated is a typewriter with audio feedback, designed by IBM for the blind. Although blind typists can remember where the keys on a typewriter are without difficulty, they rely on sighted proof-readers to help check their work. This device could make them more useful as office typists and secretaries.
As well as verbalizing the material (including punctuation) that has been typed, either by attempting to pronounce the words or by spelling them out as individual letters, it prompts the user through the more complex action sequences that are possible on the typewriter. .pp The vocabulary of the talking calculator comprises the 24 words of Table 1.1. .RF .nr x1 2.0i+\w'percent'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 2.0i zero percent one low two over three root four em (m) five times six point seven overflow eight minus nine plus times-minus clear equals swap .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 1.1 Vocabulary of a talking calculator" This represents a total of about 13 seconds of speech. It is stored electronically in read-only memory (ROM), and Figure 1.2 shows the circuitry of the speech module inside the calculator. .FC "Figure 1.2" There are three large integrated circuits. Two of them are ROMs, and the other is a special synthesis chip which decodes the highly compressed stored data into an audio waveform. Although the mechanisms used for storing speech by commercial devices are not widely advertised by the manufacturers, the talking calculator almost certainly uses linear predictive coding \(em a technique that we will examine in Chapter 6. The speech quality is very poor because of the highly compressed storage, and words are spoken in a grating monotone. However, because of the very small vocabulary, the quality is certainly good enough for reliable identification. .sh "1.2 Computer-generated wiring instructions" .pp I mentioned earlier that one big advantage of speech over visual output is that it leaves the eyes free for other tasks. When wiring telephone equipment during manufacture, the operator needs to use his hands as well as eyes to keep his place in the task. For some time tape-recorded instructions have been used for this in certain manufacturing plants. For example, the instruction .LB .NI Red 2.5 11A terminal strip 7A tube socket .LE directs the operator to cut 2.5" of red wire, attach one end to a specified point on the terminal strip, and attach the other to a pin of the tube socket. The tape recorder is fitted with a pedal switch to allow a sequence of such instructions to be executed by the operator at his own pace. .pp The usual way of recording the instruction tape is to have a human reader dictate them from a printed list. The tape is then checked against the list by another listener to ensure that the instructions are correct. Since wiring lists are usually stored and maintained in machine-readable form, it is natural to consider whether speech synthesis techniques could be used to generate the acoustic tape directly by a computer (Flanagan .ul et al, 1972). .[ Flanagan Rabiner Schafer Denman 1972 .] .pp Table 1.2 shows the vocabulary needed for this application. 
.RF .nr x1 2.0i+2.0i+\w'tube socket'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 2.0i +2.0i A green seventeen black left six bottom lower sixteen break make strip C nine ten capacitor nineteen terminal eight one thirteen eighteen P thirty eleven point three fifteen R top fifty red tube socket five repeat coil twelve forty resistor twenty four right two fourteen seven upper .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 1.2 Vocabulary needed for computer-generated wiring instructions" It is rather larger than that of the talking calculator \(em about 25 seconds of speech \(em but well within the limits of single-chip storage in ROM, compressed by the linear predictive technique. However, at the time that the scheme was investigated (1970\-71) the method of linear predictive coding had not been fully developed, and the technology for low-cost microcircuit implementation was not available. But this is not important for this particular application, for there is no need to perform the synthesis on a miniature low-cost computer system, nor need it be accomplished in real time. In fact a technique of concatenating spectrally-encoded words was used (described in Chapter 7), and it was implemented on a minicomputer. Operating much slower than real-time, the system calculated the speech waveform and wrote it to disk storage. A subsequent phase read the pre-computed messages and recorded them on a computer-controlled analogue tape recorder. .pp Informal evaluation showed the scheme to be quite successful. Indeed, the synthetic speech, whose quality was not high, was actually preferred to natural speech in the noisy environment of the production line, for each instruction was spoken in the same format, with the same programmed pause between the items. A list of 58 instructions of the form shown above was recorded and used to wire several pieces of apparatus without errors. .sh "1.3 Telephone enquiry service" .pp The computer-generated wiring scheme illustrates how speech can be used to give instructions without diverting visual attention from the task at hand. The next system we examine shows how speech output can make the telephone receiver into a remote computer terminal for a variety of purposes (Witten and Madams, 1977). .[ Witten Madams 1977 Telephone Enquiry Service .] The caller employs the touch-tone keypad shown in Figure 1.3 for input, and the computer generates a synthetic voice response. .FC "Figure 1.3" Table 1.3 shows the process of making contact with the system. .RF .fi .nh .na .in 0.3i .nr x0 \w'COMPUTER: ' .nr x1 \w'CALLER: ' .in+\n(x0u .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Dials the service. .ti-\n(x0u COMPUTER: Answers telephone. "Hello, Telephone Enquiry Service. Please enter your user number". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters user number. .ti-\n(x0u COMPUTER: "Please enter your password". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters password. .ti-\n(x0u COMPUTER: Checks validity of password. If invalid, the user is asked to re-enter his user number. Otherwise, "Which service do you require?" .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters service number. .in 0 .nf .FG "Table 1.3 Making contact with the telephone enquiry system" .pp Advantage is taken of the disparate speeds of input (keyboard) and output (speech) to hasten the dialogue by imposing a question-answer structure on it, with the computer taking the initiative. 
The machine can afford to be slightly verbose if by so doing it makes the caller's response easier, and therefore more rapid. Moreover, operators who are experienced enough with the system to anticipate questions can easily forestall them just by typing ahead, for the computer is programmed to examine its input buffer before issuing prompts and to suppress them if input has already been provided. .pp An important aim of the system is to allow application programmers with no special knowledge of speech to write independent services for it. Table 1.4 shows an example of the use of one such application program, .RF .fi .nh .na .in 0.3i .nr x0 \w'COMPUTER: ' .nr x1 \w'CALLER: ' .in+\n(x0u .ti-\n(x0u COMPUTER: "Stores Information Service. Please enter component name". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "SN7406#". .ti-\n(x0u COMPUTER: "The component name is SN7406. Is this correct?" .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*1#" (system convention for "yes"). .ti-\n(x0u COMPUTER: "This component is in stores". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*7#" (command for "price"). .ti-\n(x0u COMPUTER: "The component price is 35 pence". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*8#" (command for "minimum number"). .ti-\n(x0u COMPUTER: "The minimum number of this component kept in stores is 10". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "SN7417#". .ti-\n(x0u COMPUTER: "The component name is SN7417. Is this correct?" .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*1#". .ti-\n(x0u COMPUTER: "This component is not in stores". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*9#" (command for "delivery time"). .ti-\n(x0u COMPUTER: "The expected delivery time is 14 days". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*0#". .ti-\n(x0u COMPUTER: "Which service do you require?" .in 0 .nf .FG "Table 1.4 The Stores Information Service" the Stores Information Service, which permits enquiries to be made of a database holding information on electronic components kept in stock. This subsystem is driven by .ul alphanumeric data entered on the touch-tone keypad. Two or three letters are associated with each digit, in a manner which is fairly standard in touch-tone telephone applications. These are printed on a card overlay that fits the keypad (see Figure 1.3). Although true alphanumeric data entry would require a multiple key press for each character, the ambiguity inherent in a single-key-per-character convention can usually be resolved by the computer, if it has a list of permissible entries. For example, the component names SN7406 and ZTX300 are read by the machine as "767406" and "189300", respectively. Confusion rarely occurs if the machine is expecting a valid component code. The same holds true of people's names, and file names \(em although with these one must take care not to identify a series of files by similar names, like TX38A, TX38B, TX38C. It is easy for the machine to detect the rare cases where ambiguity occurs, and respond by requesting further information: "The component name is SN7406. Is this correct?" (In fact, the Stores Information Service illustrated in Table 1.4 is defective in that it .ul always requests confirmation of an entry, even when no ambiguity exists.) The use of a telephone keypad for data entry will be taken up again in Chapter 10. .pp A distinction is drawn throughout the system between data entries and commands, the latter being prefixed by a "*". 
In this example, the programmer chose to define a command for each possible question about a component, so that a new component name can be entered at any time without ambiguity. The price paid for the resulting brevity of dialogue is the burden of memorizing the meaning of the commands. This is an inherent disadvantage of a one-dimensional auditory display over the more conventional graphical output: presenting menus by speech is tedious and long-winded. In practice, however, for a simple task such as the Stores Information Service it is quite convenient for the caller to search for the appropriate command by trying out all possibilities \(em there are only a few. .pp The problem of memorizing commands is alleviated by establishing some system-wide conventions. Each input is terminated by a "#", and the meaning of standard commands is given in Table 1.5. .RF .fi .nh .na .in 0.3i .nr x0 \w'# alone ' .nr x1 \w'\(em ' .ta \n(x0u +\n(x1u .nr x2 \n(x0+\n(x1 .in+\n(x2u .ti-\n(x2u *# \(em Erase this input line, regardless of what has been typed before the "*". .ti-\n(x2u *0# \(em Stop. Used to exit from any service. .ti-\n(x2u *1# \(em Yes. .ti-\n(x2u *2# \(em No. .ti-\n(x2u *3# \(em Repeat question or summarize state of current transaction. .ti-\n(x2u # alone \(em Short form of repeat. Repeats or summarizes in an abbreviated fashion. .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .nf .FG "Table 1.5 System-wide conventions for the service" .pp A summary of services available on the system is given in Table 1.6. .RF .fi .na .in 0.3i .nr x0 \w'000 ' .nr x1 \w'\(em ' .nr x2 \n(x0+\n(x1 .in+\n(x2u .ta \n(x0u +\n(x1u .ti-\n(x2u \0\01 \(em tells the time .ti-\n(x2u \0\02 \(em Biffo (a game of NIM) .ti-\n(x2u \0\03 \(em MOO (a game similar to that marketed under the name "Mastermind") .ti-\n(x2u \0\04 \(em error demonstration .ti-\n(x2u \0\05 \(em speak a file in phonetic format .ti-\n(x2u \0\06 \(em listening test .ti-\n(x2u \0\07 \(em music (allows you to enter a tune and play it) .ti-\n(x2u \0\08 \(em gives the date .sp .ti-\n(x2u 100 \(em squash ladder .ti-\n(x2u 101 \(em stores information service .ti-\n(x2u 102 \(em computes means and standard deviations .ti-\n(x2u 103 \(em telephone directory .sp .ti-\n(x2u 411 \(em user information .ti-\n(x2u 412 \(em change password .ti-\n(x2u 413 \(em gripe (permits feedback on services from caller) .sp .ti-\n(x2u 600 \(em first year laboratory marks entering service .sp .ti-\n(x2u 910 \(em repeat utterance (allows testing of system) .ti-\n(x2u 911 \(em speak utterance (allows testing of system) .ti-\n(x2u 912 \(em enable/disable user 100 (a no-password guest user number) .ti-\n(x2u 913 \(em mount a magnetic tape on the computer .ti-\n(x2u 914 \(em set/reset demonstration mode (prohibits access by low-priority users) .ti-\n(x2u 915 \(em inhibit games .ti-\n(x2u 916 \(em inhibit the MOO game .ti-\n(x2u 917 \(em disable password checking when users log in .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .nf .FG "Table 1.6 Summary of services on a telephone enquiry system" They range from simple games and demonstrations, through serious database services, to system maintenance facilities. A priority structure is imposed upon them, with higher service numbers being available only to higher priority users. Services in the lowest range (1\-99) can be obtained by all, while those in the highest range (900\-999) are maintenance services, available only to the system designers. 
Access to the lower-numbered "games" services can be inhibited by a priority user \(em this was found necessary to prevent over-use of the system! Another advantage of telephone access to an information retrieval system is that some day-to-day maintenance can be done remotely, from the office telephone. .pp This telephone enquiry service, which was built in 1974, demonstrated that speech synthesis had moved from a specialist phonetic discipline into the province of engineering practicability. The speech was generated "by rule" from a phonetic input (the method is covered in Chapters 7 and 8), which has very low data storage requirements of around 75\ bit/s of speech. Thus an enormous vocabulary and range of services could be accommodated on a small computer system. Despite the fairly low quality of the speech, the response from callers was most encouraging. Admittedly the user population was a self-selected body of University staff, which one might suppose to have high tolerance to new ideas, and a system designed for the general public would require more effort to be spent on developing speech of greater intelligibility. Although it was observed that some callers failed to understand parts of the responses, even after repetition, communication was largely unhindered in most cases, with users driven by a high motivation to help the system help them. .pp The use of speech output in conjunction with a simple input device requires careful thought for interaction to be successful and comfortable. It is necessary that the computer direct the conversation as much as possible, without seeming to take charge. Provision for eliminating prompts which are unwanted by sophisticated users is essential to avoid frustration. We will return to the topic of programming techniques for speech interaction in Chapter 10. .pp Making a computer system available over the telephone results in a sudden vast increase in the user population. Although people's reaction to a new computer terminal in every office was overwhelmingly favourable, careful resource allocation was essential to prevent the service being hogged by a persistent few. As with all multi-access computer systems, it is particularly important that error recovery is effected automatically and gracefully. .sh "1.4 Speech output in the telephone exchange" .pp The telephone enquiry service was an experimental vehicle for research on speech interaction, and was developed in 1974. Since then, speech has begun to be used in real commercial applications. One example is System\ X, the British Post Office's computer-controlled telephone exchange. This incorporates many features not found in conventional telephone exchanges. For example, if a number is found to be busy, the call can be attempted again by a "repeat last call" command, without having to re-dial the full number. Alternatively, the last number can be stored for future re-dialling, freeing the phone for other calls. "Short code dialling" allows a customer to associate short codes with commonly-dialled numbers. Alarm calls can be booked at specified times, and are made automatically without human intervention. Incoming calls can be barred, as can outgoing ones. A diversion service allows all incoming calls to be diverted to another telephone, either immediately, or if a call to the original number remains unanswered for a specified period of time, or if the original number is busy. Three-party calls can be set up automatically, without involving the operator.
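.pp
The diversion service just described is, from the point of view of the exchange software, a small piece of decision logic. The fragment below is a hypothetical sketch (in Python); the mode names, the default time-out, and the function itself are invented for illustration and are not taken from the System\ X implementation.
.nf
.in 0.3i
# Hypothetical sketch of the three diversion modes described above.
# Nothing here is drawn from the actual System X software.
def divert_target(mode, divert_to, busy, seconds_unanswered, timeout=15):
    # Return the number the call should be sent to, or None to ring normally.
    if mode == "immediate":                 # divert every incoming call at once
        return divert_to
    if mode == "on_busy" and busy:          # divert only when the line is engaged
        return divert_to
    if mode == "on_no_reply" and seconds_unanswered >= timeout:
        return divert_to                    # divert after the specified period
    return None

# Example: a busy line whose owner has asked for diversion on busy
print(divert_target("on_busy", "2345", busy=True, seconds_unanswered=0))
.in 0
.fi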
.pp Making use of these facilities presents the caller with something of a problem. With conventional telephone exchanges, feedback is provided on what is happening to a call by the use of four tones \(em the dial tone, the busy tone, the ringing tone, and the number unavailable tone. For the more sophisticated interaction which is expected on the advanced exchange, a much greater variety of status signals is required. The obvious solution is to use computer-generated spoken messages to inform the caller when these services are invoked, and to guide him through the sequences of actions needed to set up facilities like call re-direction. For example, the messages used by the exchange when a user accesses the alarm call service are .LB .NI Alarm call service. Dial the time of your alarm call followed by square\u\(dg\d. .FN 1 \(dg\d"Square" is the term used for the "#" key on the touch-tone telephone.\u .EF .NI You have booked an alarm call for seven thirty hours. .NI Alarm call operator. At the third stroke it will be seven thirty. .LE .pp Because of the rather small vocabulary, the number of messages that can be stored in their entirety rather than being formed by concatenation of smaller units, and the short time which was available for development, System\ X stores speech as a time waveform, slightly compressed by a time-domain encoding operation (such techniques are described in Chapter 3). Utterances which contain variable parts, like the time of alarm in the messages above, are formed by inserting separately-recorded digits in a fixed "carrier" message. No attempt is made to apply uniform intonation contours to the synthetic utterances. The resulting speech is of excellent quality (being a slightly compressed recording of a human voice), but sometimes exhibits somewhat anomalous pitch contours. For example, the digits comprising numbers often sound rather jerky and out-of-context \(em which indeed they are. .pp Even more advanced facilities can be expected on telephone exchanges in the future. A message storage capability is one example. Although automatic call recording machines have been available for years, a centralized facility could time and date a message, collect the caller's identity (using the telephone keypad), and allow the recipient to select messages left for him through an interactive dialogue so that he could control the order in which he listens to them. He could choose to leave certain messages to be dealt with later, or re-route them to a colleague. He may even wish to leave reminders for himself, to be dialled automatically at specified times (like alarm calls with user-defined information attached). The sender of a message could be informed automatically by the system when it is delivered. None of this requires speech recognition, but it does need economical speech .ul storage, and also speech .ul synthesis (for time and date tags). .sh "1.5 Travel consultant" .pp Some current research in speech communication with computers is aimed at embedding the dialogue in a task environment which provides it with an overall contextual framework. This in fact was the motivation for the experimental telephone enquiry service described above, and it was observed that people are able to tolerate rather low quality speech if they feel that the interaction is worthwhile. One step further is the use of speech .ul input as well as output, and here the context of the conversation which is provided by the task environment can be used to improve recognition accuracy significantly. 
A system set up at Bell Telephone Laboratories indicates what we can expect to find operating commercially in the future (Levinson and Shipley, 1980). .[ Levinson Shipley 1980 .] It provides timetable information about airline flights, and allows the user to make reservations. .pp An example dialogue is given in Table 1.7. .RF .fi .nh .na .in 0.3i .nr x0 \w'COMPUTER: ' .nr x1 \w'CALLER: ' .in+\n(x0u .ti-\n(x0u COMPUTER: This is the Bell Laboratories flight information system. May I help you? .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I would like to go to Chicago. .ti-\n(x0u COMPUTER: Please specify the day of your departure. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I want to leave at four p.m.. .ti-\n(x0u COMPUTER: Please specify the day of your departure. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I want to depart on the two five May. .ti-\n(x0u COMPUTER: Flight number nine two leaves New York at four p.m. arrives in Chicago at five twenty five p.m.. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' What plane is on the flight? .ti-\n(x0u COMPUTER: The aircraft is a Boeing seven twenty seven. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' How much is the fare? .ti-\n(x0u COMPUTER: The round trip coach fare is one hundred forty two dollars. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I want a first class seat. .ti-\n(x0u COMPUTER: One seat thank you. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' What is the fare? .ti-\n(x0u COMPUTER: The round trip first class fare is one hundred eighty four dollars. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Please repeat the arrival time. .ti-\n(x0u COMPUTER: The arrival time is five twenty five p.m.. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I will take the flight. .ti-\n(x0u COMPUTER: Your reservation on flight number nine two to Chicago on Friday is confirmed. Thank you. .in 0 .nf .MT 2 Table 1.7 A conversation with an airline flight information service (from Levinson and Shipley, 1980) .TE .[ Levinson Shipley 1980 .] The user dials the system from an ordinary telephone. The recognition side must be trained by each user, and accepts isolated words spoken with brief pauses between them. The voice response unit has a vocabulary of around 200 words, and synthesizes its answers by slotting words into "templates" evoked by the speech understanding part in response to a query. For example, .LB .NI This flight makes \(em stops .NI Flight number \(em leaves \(em at \(em , arrives in \(em at \(em .LE are templates which when called with specific slot fillers could produce the utterances .LB .NI This flight makes three stops .NI Flight number nine two leaves New York at four p.m., arrives in Chicago at five twenty-five p.m. .LE The chief research interest of the system is in its speech understanding capabilities, and the method used for speech output is relatively straightforward. The templates and words are recorded, digitized, compressed slightly, and stored on disk files (totalling a few hundred thousand bytes of storage), using techniques similar to those of System\ X. Again, no independent manipulation of pitch is possible, and so the utterances sound intelligible but the transition between templates and slot fillers is not completely fluent. However, the overall context of the interaction means that the communication is not seriously disrupted even if the machine occasionally misunderstands the man or vice versa. The user's attention is drawn away from recognition accuracy and focussed on the exchange of information with the machine. 
The authors conclude that progress in speech recognition can best be made by studying it in the context of communication rather than in a vacuum or as part of a one-way channel, and the same is undoubtedly true of speech synthesis as well. .sh "1.6 Reading machine for the blind" .pp Perhaps the most advanced attempt to provide speech output from a computer is the Kurzweil reading machine for the blind, first marketed in the late 1970's (Figure 1.4). .FC "Figure 1.4" This device reads an ordinary book aloud. Users adjust the reading speed according to the content of the material and their familiarity with it, and the maximum rate has recently been improved to around 225 words per minute \(em perhaps half as fast again as normal human speech rates. .pp As well as generating speech from text, the machine has to scan the document being read and identify the characters presented to it. A scanning camera is used, controlled by a program which searches for and tracks the lines of text. The output of the camera is digitized, and the image is enhanced using signal-processing techniques. Next each individual letter must be isolated, and its geometric features identified and compared with a pre-stored table of letter shapes. Isolation of letters is not at all trivial, for many type fonts have "ligatures" which are combinations of characters joined together (for example, the letters "fi" are often run together.) The machine must cope with many printed type fonts, as well as typewritten ones. The text-recognition side of the Kurzweil reading machine is in fact one of its most advanced features. .pp We will discuss the problem of speech generation from text in Chapter 9. It has many facets. First there is pronunciation, the translation of letters to sounds. It is important to take into account the morphological structure of words, dividing them into "root" and "endings". Many words have concatenated suffixes (like "like-li-ness"). These are important to detect, because a final "e" which appears on a root word is not pronounced itself but affects the pronunciation of the previous vowel. Then there is the difficulty that some words look the same but are pronounced differently, depending on their meaning or on the syntactic part that they play in the sentence. Appropriate intonation is extremely difficult to generate from a plain textual representation, for it depends on the meaning of the text and the way in which emphasis is given to it by the reader. Similarly the rhythmic structure is important, partly for correct pronunciation and partly for purposes of emphasis. Finally the sounds that have been deduced from the text need to be synthesized into acoustic form, taking due account of the many and varied contextual effects that occur in natural speech. This by itself is a challenging problem. .pp The performance of the Kurzweil reading machine is not good. While it seems to be true that some blind people can make use of it, it is far from comprehensible to an untrained listener. For example, it will miss out words and even whole phrases, hesitate in a stuttering manner, blatantly mis-pronounce many words, fail to detect "e"s which should be silent, and give completely wrong rhythms to words, making them impossible to understand. Its intonation is decidedly unnatural, monotonous, and often downright misleading. 
When it reads completely new text to people unfamiliar with its quirks, they invariably fail to understand more than an odd word here and there, and do not improve significantly when the text is repeated more than once. Naturally performance improves if the material is familiar or expected in some way. One useful feature is the machine's ability to spell out difficult words on command from the user. .pp While not wishing to denigrate the Kurzweil machine, which is a remarkable achievement in that it integrates together many different advanced technologies, there is no doubt that the state of the art in speech synthesis directly from unadorned text is extremely primitive, at present. It is vital not to overemphasize the potential usefulness of abysmal speech, which takes a great deal of training on the part of the user before it becomes at all intelligible. To make a rather extreme analogy, Morse code could be used as audio output, requiring a great deal of training, but capable of being understood at quite high rates by an expert. It could be generated very cheaply. But clearly the man in the street would find it quite unacceptable as an audio output medium, because of the excessive effort required to learn to use it. In many applications, very bad synthetic speech is just as useless. However, the issue is complicated by the fact that for people who use synthesizers regularly, synthetic speech becomes quite easily comprehensible. We will return to the problem of evaluating the quality of artificial speech later in the book (Chapter 8). .sh "1.7 System considerations for speech output" .pp Fortunately, very many of the applications of speech output from computers do not need to read unadorned text. In all the example systems described above (except the reading machine), it is enough to be able to store utterances in some representation which can include pre-programmed cues for pronunciation, rhythm, and intonation in a much more explicit way than ordinary text does. .pp Of course, techniques for storing audio information have been in use for decades. For example, a domestic cassette tape recorder stores speech at much better than telephone quality at very low cost. The method of direct recording of an analogue waveform is currently used for announcements in the telephone network to provide information such as the time, weather forecasts, and even bedtime stories. However, it is difficult to provide rapid access to messages stored in analogue form, and although some computer peripherals which use analogue recordings for voice-response applications have been marketed \(em they are discussed briefly at the beginning of Chapter 3 \(em they have been superseded by digital storage techniques. .pp Although direct storage of a digitized audio waveform is used in some voice-response systems, the approach has certain limitations. The most obvious one is the large storage requirement: suitable coding can reduce the data-rate of speech to as little as one hundredth of that needed by direct digitization, and textual representations reduce it by another factor of ten or twenty. (Of course, the speech quality is inevitably compromised somewhat by data-compression techniques.) However, the cost of storage is dropping so fast that this is not necessarily an overriding factor. A more fundamental limitation is that utterances stored directly cannot sensibly be modified in any way to take account of differing contexts. 
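.pp
Before turning to more flexible representations, it is worth putting the reduction factors just mentioned alongside the vocabularies met earlier in the chapter. The short calculation below is a sketch in Python; the 64\ kbit/s starting point assumes telephone-quality digitization at 8000 eight-bit samples per second, a typical figure rather than one quoted in this chapter. The other numbers come from the examples above (13 seconds of calculator speech, 25 seconds of wiring vocabulary, a hundredfold reduction from heavy coding, and 75\ bit/s for a phonetic representation):
.nf
.in 0.3i
# Rough storage arithmetic for the figures quoted in this chapter.
DIRECT = 8000 * 8          # bit/s: assumed telephone-quality digitization
CODED = DIRECT / 100       # "as little as one hundredth" of the direct rate
PHONETIC = 75              # bit/s: synthesis-by-rule input, Section 1.3

def kbits(seconds, rate):
    return seconds * rate / 1000.0

for name, seconds in [("talking calculator", 13), ("wiring vocabulary", 25)]:
    print(name,
          "direct %.0f kbit," % kbits(seconds, DIRECT),
          "coded %.0f kbit," % kbits(seconds, CODED),
          "phonetic %.1f kbit" % kbits(seconds, PHONETIC))
.in 0
.fi
Even the heavily coded waveform occupies several times the storage of the phonetic text, which is why the telephone enquiry service of Section 1.3 could hold an enormous vocabulary on a small computer system.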
.pp If the results of certain kinds of analyses of utterances are stored, instead of simply the digitized waveform, a great deal more flexibility can be gained. It is possible to separate out the features of intonation and amplitude from the articulation of the speech, and this raises the attractive possibility of regenerating utterances with pitch contours different from those with which they were recorded. The primary analysis technique used for this purpose is .ul linear prediction of speech, and this is treated in some detail in Chapter 6. It also reduces drastically the data-rate of speech, by a factor of around 50. It is likely that many voice-response systems in the short- and medium-term future will use linear predictive representations for utterance storage. .pp For maximum flexibility, however, it is preferable to store a textual representation of the utterance. There is an important distinction between speech .ul storage, where an actual human utterance is recorded, perhaps processed to lower the data-rate, and stored for subsequent regeneration when required, and speech .ul synthesis, where the machine produces its own individual utterances which are not based on recordings of a person saying the same thing. The difference is summarized in Figure 1.5. .FC "Figure 1.5" In both cases something is stored: for the first it is a direct representation of an actual human utterance, while for the second it is a typed .ul description of the utterance in terms of the sounds, or phonemes, which constitute it. The accent and tone of voice of the human speaker will be apparent in the stored speech output, while for synthetic speech the accent is the machine's and the tone of voice is determined by the synthesis program. .pp Probably the most attractive representation of utterances in man-machine systems is ordinary English text, as used by the Kurzweil reading machine. But, as noted above, this poses extraordinarily difficult problems for the synthesis procedure, and these inevitably result in severely degraded speech. Although in the very long term these problems may indeed be solved, most speech output systems can adopt as their representation of an utterance a description of it which explicitly conveys the difficult features of intonation, rhythm, and even pronunciation. In the kind of applications described above (barring the reading machine), input will be prepared by a programmer as he builds the software system which supports the interactive dialogue. Although it is important that the method of specifying utterances be easily learned, it is not necessary that plain English is used. It should be simple for the programmer to enter new utterances and modify them on-line in cut-and-try attempts to render the man-machine dialogue as natural as possible. A phonetic input can be quite adequate for this, especially if the system allows the programmer to hear immediately the synthesized version of the message he types. Furthermore, markers which indicate rhythm and intonation can be added to the message so that the system does not have to deduce these features by attempting to "understand" the plain text. .pp This brings us to another disadvantage of speech storage as compared with speech synthesis. To provide utterances for a voice response system using stored human speech, one must assemble together special input hardware, a quiet room, and (probably) a dedicated computer. 
If the speech is to be heavily encoded, either expensive special hardware is required or the encoding process, if performed by software on a general-purpose computer, will take a considerable length of time (perhaps hundreds of times real-time). In either case, time-consuming editing of the speech will be necessary, with follow-up recordings to clarify sections of speech which turn out to be unsuitable or badly recorded. If at a later date the voice response system needs modification, it will be necessary to recall the same speaker, or re-record the entire utterance set. This discourages the application programmer from adjusting his dialogue in the light of experience. Synthesizing from a textual representation, on the other hand, allows him to change a speech prompt as simply as he could a VDU one, and evaluate its effect immediately. .pp We will return to methods of digitizing and compacting speech in Chapters 3 and 4, and carry on to consider speech synthesis in subsequent chapters. Firstly, however, it is necessary to take a look at what speech is and how people produce it. .sh "1.8 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "1.9 Further reading" .pp There are remarkably few general books on speech output, although a substantial specialist literature exists for the subject. In addition to the references listed above, I suggest that you look at the following. .LB "nn" .\"Ainsworth-1976-1 .]- .ds [A Ainsworth, W.A. .ds [D 1976 .ds [T Mechanisms of speech recognition .ds [I Pergamon .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n A nice, easy-going introduction to speech recognition, this book covers the acoustic structure of the speech signal in a way which makes it useful as background reading for speech synthesis as well. It complements Lea, 1980, cited above; which presents more recent results in greater depth. .in-2n .\"Flanagan-1973-2 .]- .ds [A Flanagan, J.L. .as [A " and Rabiner, L.R. (Editors) .ds [D 1973 .ds [T Speech synthesis .ds [I Wiley .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n This is a collection of previously-published research papers on speech synthesis, rather than a unified book. It contains many of the classic papers on the subject from 1940\ -\ 1972, and is a very useful reference work. .in-2n .\"LeBoss-1980-3 .]- .ds [A LeBoss, B. .ds [D 1980 .ds [K * .ds [T Speech I/O is making itself heard .ds [J Electronics .ds [O May\ 22 .ds [P 95-105 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n The magazine .ul Electronics is an excellent source of up-to-the-minute news, product announcements, titbits, and rumours in the commercial speech technology world. This particular article discusses the projected size of the voice output market and gives a brief synopsis of the activities of several interested companies. .in-2n .\"Witten-1980-5 .]- .ds [A Witten, I.H. .ds [D 1980 .ds [T Communicating with microcomputers .ds [I Academic Press .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n A recent book on microcomputer technology, this is unusual in that it contains a major section on speech communication with computers (as well as ones on computer buses, interfaces, and graphics). .in-2n .LE "nn" .EQ delim $$ .EN .CH "2 WHAT IS SPEECH?" .ds RT "What is speech? .ds CX "Principles of computer speech .pp People speak by using their vocal cords as a sound source, and making rapid gestures of the articulatory organs (tongue, lips, jaw, and so on). 
The resulting changes in shape of the vocal tract allow production of the different sounds that we know as the vowels and consonants of ordinary language. .pp What is it necessary to learn about this process for the purposes of speech output from computers? That depends crucially upon how speech is represented in the system. If utterances are stored as time waveforms \(em and this is what we will be discussing in the next chapter \(em the structure of speech is not important. If frequency-related parameters of particular natural utterances are stored, then it is advantageous to take into account some of the acoustic properties of the speech waveform. .pp This point can be brought into focus by contrasting the transmission (or storage) of speech with that of real-life television pictures, as has been proposed for a videophone service. Massive data reductions, of the order of 50:1, can be achieved for speech, using techniques that are described in later chapters. For pictures, data reduction is still an important issue \(em even more so for the videophone than for the telephone, because of the vastly higher information rates involved. Unfortunately, the potential for data reduction is much smaller \(em nothing like the 50:1 figure quoted above. This is because speech sounds have definite characteristics, imparted by the fact that they are produced by a human vocal tract, which can be exploited for data reduction. Television pictures have no equivalent generative structure, for they show just those things that the camera points at. .pp Moving up from frequency-related parameters of .ul particular utterances, it is possible to store such parameters in a .ul general form which characterizes the sound segments that appear in spoken language. This immediately raises the issue of .ul classification of sound segments, to form a basis for storing generalized acoustic information and for retrieval of the information needed to synthesize any particular utterance. Speech is by nature continuous, and any synthesis system based upon discrete classification must come to terms with this by tackling the problems of transition from one segment to another, and local modification of sound segments as a function of their context. .pp This brings us to another level of representation. So far we have talked of the .ul acoustic nature of speech, but when we have to cope with transitions between discrete sound segments it may be fruitful to consider .ul articulatory properties as well. Any model of the speech production process is in effect a model of the articulatory process that generates the speech. Some speech research is concerned with modelling the vocal tract directly, rather than modelling the acoustic output from it. One might specify, for example, position of tongue and posture of jaw and lips for a vowel, instead of giving frequency-related characteristics of it. This is a potent tool in linguistic research, for it brings one closer to human production of speech \(em in particular to the connection between brain and articulators. .pp Articulatory synthesis holds a promise of high-quality speech, for the transitional effects caused by tongue and jaw inertia can be modelled directly. However, this potential has not yet been realized. Speech from current articulatory models is of much poorer quality than that from acoustically-based synthesis methods. 
The major problem is in gaining data about articulatory behaviour during running speech \(em it is much easier to perform acoustic analysis on the resulting sound than it is to examine the vocal organs in action. Because of this, the subject is not treated in this book. We will only look at articulatory properties insofar as they help us to understand, in a qualitative way, the acoustic nature of speech. .pp Speech, however, is much more than mere articulation. Consider \(em admittedly a rather extreme and chauvinistic example \(em the number of ways a girl can say "yes". Breathy voice, slow tempo, low pitch \(em these are all characteristics which affect the utterance as a whole, rather than being classifiable into individual sound segments. Linguists call them "prosodic" or "suprasegmental" features, for they relate to overall aspects of the utterance, and distinguish them from "segmental" ones which concern the articulation of individual segments of syllables. The most important prosodic features are pitch, or fundamental frequency of the voice, and rhythm. .pp This chapter provides a brief introduction to the nature of the speech signal. Depending upon what speech output techniques we use, it may be necessary to understand something of the acoustic nature of the speech signal; the system that generates it (the vocal tract); commonly-used classifications of sound segments; and the prosodic aspects of speech. This material is little used in the early chapters of the book, but becomes increasingly important as the story unfolds. Hence you may skip the remainder of this chapter if you wish, but should return to it later to pick up more background whenever it becomes necessary. .sh "2.1 The anatomy of speech" .pp The so-called "voiced" sounds of speech \(em like the sound you make when you say "aaah" \(em are produced by passing air up from the lungs through the larynx or voicebox, which is situated just behind the Adam's apple. The vocal tract from the larynx to the lips acts as a resonant cavity, amplifying certain frequencies and attenuating others. .pp The waveform generated by the larynx, however, is not simply sinusoidal. (If it were, the vocal tract resonances would merely give a sine wave of the same frequency but amplified or attenuated according to how close it was to the nearest resonance.) The larynx contains two folds of skin \(em the vocal cords \(em which blow apart and flap together again in each cycle of the pitch period. The pitch of a male voice in speech varies from as low as 50\ Hz (cycles per second) to perhaps 250\ Hz, with a typical median value of 100\ Hz. For a female voice the range is higher, up to about 500\ Hz in speech. Singing can go much higher: a top C sung by a soprano has a frequency of just over 1000\ Hz, and some opera singers can reach substantially higher than this. .pp The flapping action of the vocal cords gives a waveform which can be approximated by a triangular pulse (this and other approximations will be discussed in Chapter 5). It has a rich spectrum of harmonics, decaying at around 12\ dB/octave, and each harmonic is affected by the vocal tract resonances. .rh "Vocal tract resonances." A simple model of the vocal tract is an organ-pipe-like cylindrical tube (Figure 2.1), with a sound source at one end (the larynx) and open at the other (the lips). 
.FC "Figure 2.1" This has resonances at wavelengths $4L$, $4L/3$, $4L/5$, ..., where $L$ is the length of the tube; and these correspond to frequencies $c/4L$, $3c/4L$, $5c/4L$, ...\ Hz, $c$ being the speed of sound in air. Calculating these frequencies, using a typical figure for the distance between larynx and lips of 17\ cm, and $c = 340$\ m/s for the speed of sound, leads to resonances at approximately 500\ Hz, 1500\ Hz, 2500\ Hz, ... . .pp When excited by the harmonic-rich waveform of the larynx, the vocal tract resonances produce peaks known as .ul formants in the energy spectrum of the speech wave (Figure 2.2). .FC "Figure 2.2" The lowest formant, called formant one, varies from around 200\ Hz to 1000\ Hz during speech, the exact range depending on the size of the vocal tract. Formant two varies from around 500 to 2500\ Hz, and formant three from around 1500 to 3500\ Hz. .pp You can easily hear the lowest formant by whispering the vowels in the words "heed", "hid", "head", "had", "hod", "hawed", and "who'd". They appear to have a steadily descending pitch, yet since you are whispering there is no fundamental frequency. What you hear is the lowest resonance of the vocal tract \(em formant one. Some masochistic people can play simple tunes with this formant by putting their mouth in successive vowel shapes and knocking the top of their head with their knuckles \(em hard! .pp A difficulty occurs when trying to identify the lower formants for speakers with high-pitched voices. When a formant frequency falls below the fundamental excitation frequency of the voice, its effect is diminished \(em although it is still present. The vibrato used by opera singers provides a very low-frequency excitation (at the vibrato rate) which helps to illuminate the lower formants even when the pitch of the voice is very high. .pp Of course, speech is not a static phenomenon. The organ-pipe model describes the speech spectrum during a continuously held vowel with the mouth in a neutral position such as for "aaah". But in real speech the tongue and lips are in continuous motion, altering the shape of the vocal tract and hence the positions of the resonances. It is as if the organ-pipe were being squeezed and expanded in different places all the time. Say .ul ee as in "heed" and feel how close your tongue is to the roof of your mouth, causing a constriction near the front of the vocal cavity. .pp Linguists and speech engineers use a special frequency analyser called a "sound spectrograph" to make a three-dimensional plot of the variation of the speech energy spectrum with time. Figure 2.3 shows a spectrogram of the utterance "go away". .FC "Figure 2.3" Frequency is given on the vertical axis, and bands are shown at the beginning to indicate the scale. Time is plotted horizontally, and energy is given by the darkness of any particular area. The lower few formants can be seen as dark bands extending horizontally, and they are in continuous motion. In the neutral first vowel of "away", the formant frequencies pass through approximately the 500\ Hz, 1500\ Hz, and 2500\ Hz that we calculated earlier. (In fact, formants two and three are somewhat lower than these values.) .pp The fine vertical striations in the spectrogram correspond to single openings of the vocal cords. Pitch changes continuously throughout an utterance, and this can be seen on the spectrogram by the differences in spacing of the striations. Pitch change, or .ul intonation, is singularly important in lending naturalness to speech. 
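.pp
Before leaving the organ-pipe model, it is worth checking the arithmetic behind the formant figures quoted above. The following fragment (a sketch in Python, using the 17\ cm tract length and 340\ m/s speed of sound given earlier) evaluates the odd quarter-wavelength resonances:
.nf
.in 0.3i
# Resonances of a tube closed at the larynx and open at the lips:
# wavelengths 4L, 4L/3, 4L/5, ..., i.e. frequencies (2n-1)c/4L.
SPEED_OF_SOUND = 340.0   # m/s
TRACT_LENGTH = 0.17      # m, typical larynx-to-lips distance

for n in range(1, 4):
    frequency = (2 * n - 1) * SPEED_OF_SOUND / (4 * TRACT_LENGTH)
    print("resonance", n, "is about", round(frequency), "Hz")
# prints 500, 1500 and 2500 Hz, the values quoted for the neutral vowel
.in 0
.fi
Halving the effective length of the tube roughly doubles all three figures, which is one reason why the formant ranges quoted above depend on the size of the vocal tract.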
.pp On a spectrogram, a continuously held vowel shows up as a static energy spectrum. But beware \(em what we call a vowel in everyday language is not the same thing as a "vowel" in phonetic terms. Say "I" and feel how the tongue moves continuously while you're speaking. Technically, this is a .ul diphthong or slide between two vowel positions, and not a single vowel. If you say .ul ar as in "hard", and change slowly to .ul ee as in "heed", you will obtain a diphthong not unlike that in "I". And there are many more phonetically different vowel sounds than the a, e, i, o, and u that we normally think of. The words "hood" and "mood" have different vowels, for example, as do "head" and "mead". The principal acoustic difference between the various vowel sounds is in the frequencies of the first two formants. .pp A further complication is introduced by the nasal tract. This is a large cavity which is coupled to the oral tract by a passage at the back of the mouth. The passage is guarded by a flap of skin called the "velum". You know about this because inadvertent opening of the velum while swallowing causes food or drink to go up your nose. The nasal cavity is switched in and out of the vocal tract by the velum during speech. It is used for consonants .ul m, .ul n, and the .ul ng sound in the word "singing". Vowels are frequently nasalized too. A very effective demonstration of the amount of nasalization in ordinary speech can be obtained by cutting a nose-shaped hole in a large baffle which divides a room, speaking normally with one's nose in the hole, and having someone listen on the other side. The frequency of occurrence of nasal sounds, and the volume of sound that is emitted through the nose, are both surprisingly large. Interestingly enough, when we say in conversation that someone sounds "nasal", we usually mean "non-nasal". When the nasal passages are blocked by a cold, nasal sounds are missing \(em .ul n\c \&'s turn into .ul d\c \&'s, and .ul m\c \&'s to .ul b\c \&'s. .pp When the nasal cavity is switched in to the vocal tract, it introduces formant resonances, just as the oral cavity does. Although we cannot alter the shape of the nasal tract significantly, the nasal formant pattern is not fixed, because the oral tract does play a part in nasal resonances. If you say .ul m, .ul n, and .ul ng continuously, you can hear the difference and feel how it is produced by altering the combined nasal/oral tract resonances with your tongue position. The nasal cavity operates in parallel with the oral one: this causes the two resonance patterns to be summed together, with resulting complications which will be discussed in Chapter 5. .rh "Sound sources." Speech involves sounds other than those caused by regular vibration of the larynx. When you whisper, the folds of the larynx are held slightly apart so that the air passing between them becomes turbulent, causing a noisy excitation of the resonant cavity. The formant peaks are still present, superimposed on the noise. Such "aspirated" sounds occur in the .ul h of "hello", and for a very short time after the lips are opened at the beginning of "pit". .pp Constrictions made in the mouth produce hissy noises such as .ul ss, .ul sh, and .ul f. For example, in .ul ss the tip of the tongue is high up, very close to the roof of the mouth. Turbulent air passing through this constriction causes a random noise excitation, known as "frication". Actually, the roof of the mouth is quite a complicated object. 
You can feel with your tongue a bony hump or ridge just behind the front teeth, and it is this that forms a constriction with the tongue for .ul s. In .ul sh, the tongue is flattened close to the roof of the mouth slightly farther back, in a position rather similar to that for .ul ee but with a narrower constriction, while .ul f is produced with the upper teeth and lower lip. Because they are made near the front of the mouth, the resonances of the vocal tract have little effect on these fricative sounds. .pp To distinguish them from aspiration and frication, the ordinary speech sounds (like "aaah") which have their source in larynx vibration are known technically as "voiced". Aspirated and fricative sounds are called "unvoiced". Thus the three different sound types can be classified as .LB .NP voiced .NP unvoiced (fricative) .NP unvoiced (aspirated). .LE Can any of these three types occur together? It would seem that voicing and aspiration cannot, for the former requires the larynx to be vibrating regularly, but for the latter it must be generating turbulent noise. However, there is a condition known technically as "breathy voice" which occurs when the vocal cords are slightly apart, still vibrating, but with a large volume of air passing between to create turbulence. Voicing can easily occur in conjunction with frication. Corresponding to .ul s, .ul sh, and .ul f we get the .ul voiced fricatives .ul z, the sound in the middle of words like "vision" which I will call .ul zh, and .ul v. A simple illustration of voicing is to say "ffffvvvvffff\ ...". During the voiced part you can feel the larynx vibrations with a finger on your Adam's apple, and it can be heard quite clearly if you stop up your ears. Technically, there is nothing to prevent frication and aspiration from occurring together \(em they do, for example, when a voiced fricative is whispered \(em but the combination is not an important one. .pp The complicated acoustic effects of noisy excitations in speech can be seen in the spectrogram in Figure 2.4 of "high altitude jets whizz past screaming". .FC "Figure 2.4" .rh "The source-filter model of speech production." We have been talking in terms of a sound source (be it voiced or unvoiced) exciting the resonances of the oral (and possibly the nasal) tract. This model, which is used extensively in speech analysis and synthesis, is known as the source-filter model of speech production. The reason for its success is that the effect of the resonances can be modelled as a frequency-selective filter, operating on an input which is the source excitation. Thus the frequency spectrum of the source is modified by multiplying it by the frequency characteristic of the filter (or adding it, if amplitudes are expressed logarithmically). This can be seen in Figure 2.5, which shows a source spectrum and filter characteristic which combine to give the overall spectrum of Figure 2.2. .FC "Figure 2.5" .pp Although, as mentioned above, the various fricatives are not subjected to the resonances of the vocal tract to the same extent that voiced and aspirated sounds are, they can still be modelled as a noise source followed by a filter to give them their different sound qualities. .pp The source-filter model is an oversimplification of the actual speech production system. There is inevitably some coupling between the vocal tract and the lungs, through the glottis, during the period when it is open. This effectively makes the filter characteristics change during each individual cycle of the excitation.
However, although the effect is of interest to speech researchers, it is probably not of great significance for practical speech output. .pp One very interesting implication of the source-filter model is that the prosodic features of pitch and amplitude are largely properties of the source; while segmental ones are introduced by the filter. This makes it possible to separate some aspects of overall prosody from the actual segmental content of an utterance, so that, for example, a human utterance can be stored initially and then spoken by a machine with a variety of different intonations. .sh "2.2 Classification of speech sounds" .pp The need to classify sound segments as a basis for storing generalized acoustic information and retrieving it was mentioned earlier. There is a real difficulty here because speech is by nature continuous and classifications are discrete. It is important to remember this difficulty because it is all too easy to criticize the complex and often confusing attempts of linguists to tackle the classification task. .pp Linguists call a written representation of the .ul sounds of an utterance a "phonetic transcription" of it. The same utterance can be transcribed at different levels of detail: simple transcriptions are called "broad" and more specific ones are called "narrow". Perhaps the most logically satisfying kind of transcription employs units termed "phonemes". This is the broadest transcription, and is sometimes called a .ul phonemic transcription to emphasize that it is in terms of phonemes. Unfortunately, the word "phoneme" is often used somewhat loosely. In its true sense, a phoneme is a .ul logical unit, rather than a physical, acoustic, one, and is defined in relation to a particular language by reference to its use in discriminating different words. Classifications of sounds which are based on their semantic role as word-discriminators are called .ul phonological classifications: we could ensure that there is no ambiguity in the sense with which we use the term "phoneme" by calling it a phonological unit, and the phonemic transcription could be called a phonological one. .rh "Broad phonetic transcription." A phoneme is an abstract unit representing a set of different sounds. The issue is confused by the fact that the members of the set actually sound very similar, if not identical, to the untrained ear \(em precisely because the difference between them plays no part in distinguishing words from each other in the particular language concerned. .pp Take the words "key" and "caw", for example. Despite the difference in spelling, both of them begin with a .ul k sound that belongs (in English) to the same phoneme set, called .ul k. However, say them two or three times each, concentrating on the position of the tongue during the .ul k. It is quite different in each case. For "key", it is raised, close to the roof of the mouth, in preparation for the .ul ee, whereas in "caw" it is much lower down. The sound of the .ul k is actually quite different in the two cases. Yet they belong to the same phoneme, for there is no pair of words which relies on this difference to distinguish them \(em "key" and "caw" are obviously distinguished by their vowels, not by the initial consonant. You probably cannot hear clearly the difference between the two .ul k\c \&'s, precisely because they belong to the same phoneme and so the difference is not important (for English).
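.pp
The word-discrimination criterion can be made quite concrete. The sketch below (in Python, with a toy lexicon and invented sound labels that are ours, not standard phonetic notation) simply asks whether substituting one sound for another ever turns a word of the lexicon into a different word; if it never does, the two sounds are candidates for membership of the same phoneme.
.LB
.nf
# Toy lexicon: words as tuples of sound labels (invented for illustration).
LEXICON = {
    ("k_front", "ee"): "key",
    ("k_back", "aw"):  "caw",
    ("p", "i", "g"):   "pig",
    ("d", "i", "g"):   "dig",
}

def discriminates(sound_a, sound_b, lexicon):
    """True if replacing sound_a by sound_b in some word gives a different word."""
    for phones, word in lexicon.items():
        swapped = tuple(sound_b if p == sound_a else p for p in phones)
        if swapped != phones and swapped in lexicon and lexicon[swapped] != word:
            return True
    return False

print(discriminates("p", "d", LEXICON))             # True:  "pig" versus "dig"
print(discriminates("k_front", "k_back", LEXICON))  # False: no pair of words relies on it
.fi
.LE
Of course, a real analysis would range over the whole vocabulary of the language and would test substitution in both directions; the sketch only illustrates the logical form of the test.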
.pp The point is sharpened by considering another language where we make a distinction \(em and hence can hear the difference \(em between two sounds that belong, in the language, to the same phoneme. Japanese does not distinguish .ul r from .ul l. Japanese people .ul do not hear the difference between "lice" and "rice", in the same way that you do not hear the difference between the two .ul k\c \&'s above. Cockneys do not hear, except with a special effort, the difference between "has" and "as", or "haitch" and "aitch", for the Cockney dialect does not recognize initial .ul h\c \&'s. .pp So what is a phoneme? It is a set of sounds whose members do not discriminate between any words in the language under consideration. If you are mathematically minded you could think of it as an equivalence class of sounds, determined by the relationship .LB $sound sub 1$ is related to $sound sub 2$ if $sound sub 1$ and $sound sub 2$ do not discriminate any pair of words in the language. .LE The .ul p and .ul d in "pig" and "dig" belong to different phonemes (in English), because they discriminate the two words. .ul b, .ul f, and .ul j belong to different phonemes again. .ul i and .ul a in "hid" and "had" belong to different phonemes too. Proceeding like this, a list of phonemes can be drawn up. .pp Such a list is shown in Table 2.1, for British English. (The layout of the list does have some significance in terms of different categories of phonemes, which will be explained later.) In fact, linguists use an assortment of English letters, foreign letters, and special symbols to represent phonemes. In this book we use one- or two-letter codes, partly because they are more mnemonic, and partly because they are more suitable for communication to computers using standard peripheral devices. They are a direct transliteration of linguists' standard International Phonetic Association symbols. .RF .nr x1 3m+1.0i+0.5i+0.5i+0.5i+\w'y'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 3m +1.0i +0.5i +0.5i +0.5i +0.5i +0.5i \fIuh\fR (the) \fIp\fR \fIt\fR \fIk\fR \fIa\fR (bud) \fIb\fR \fId\fR \fIg\fR \fIe\fR (head) \fIm\fR \fIn\fR \fIng\fR \fIi\fR (hid) \fIo\fR (hod) \fIr\fR \fIw\fR \fIl\fR \fIy\fR \fIu\fR (hood) \fIaa\fR (had) \fIs\fR \fIz\fR \fIee\fR (heed) \fIsh\fR \fIzh\fR \fIer\fR (heard) \fIf\fR \fIv\fR \fIuu\fR (food) \fIth\fR \fIdh\fR \fIar\fR (hard) \fIch\fR \fIj\fR \fIaw\fR (hoard) \fIh\fR .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 2.1 The phonemes of British English" .pp We will discuss the sounds which make up each of these phoneme classes shortly. First, however, it is worthwhile pointing out some rather tricky points in the definition of these phonemes. .rh "Phonological difficulties." There are snags with phonological classification, as there are in any area where attempts are made to make completely logical statements about human activity. Consider .ul h and the .ul ng in "singing". (\c .ul ng is certainly not an .ul n sound followed by a .ul g sound, although it is true that in some English accents "singing" is rendered with the .ul ng followed by a .ul g at each of its two occurrences.) No words end with .ul h, and none begin with .ul ng. (Notice that we are still talking about British English. In Chinese, the sound .ul ng is a word in its own right, and is a common family name. But we must stick with one language for phonological classification.) Hence it follows that there is no pair of words which is distinguished by the difference between .ul h and .ul ng. 
Technically, they belong to the same phoneme. However, technical considerations in this case must take second place to common sense! .pp The .ul j in "jig" is another interesting case. It can be considered to belong to a .ul j phoneme, or to be a sequence of two phonemes, .ul d followed by .ul zh (the sound in "vision"). There is disagreement on this point in phonetics textbooks, and we do not have the time (nor, probably, the inclination!) to consider the pros and cons of this moot point. I have resolved the matter arbitrarily by writing it as a separate phoneme. The .ul ch in "choose" is a similar case (\c .ul t followed by the .ul sh in "shoes"). .pp Another difficulty, this time where Table 2.1 does not show how to distinguish between two sounds which .ul do discriminate words in many people's English, is the .ul w in "witch" and that in "which". The latter is conventionally transcribed as a sequence of two phonemes, .ul h w. .pp The last few difficulties are all to do with deciding whether a sound belongs to a single phoneme class, or comprises a sequence of sounds each of which belongs to a phoneme. Are the .ul j in "jug", the .ul ch in "chug", and the .ul w in "which", single phonemes or not? The definition above of a phoneme as a "set of sounds whose members do not discriminate any words in the language" does not help us to answer this question. As far as this definition is concerned, we could go so far as to call each and every word of the language an individual phoneme! It is clear that some acoustic evidence, and quite a lot of judgement, is being used when phonemes such as those of Table 2.1 are defined. .pp So much for the consonants. This same problem occurs in vowel sounds, particularly in diphthongs, which are sequences of two vowel-like sounds. Do the vowels of "main" and "man" belong to different phonemes? Clearly so, if they are both transcribed as single units, for they distinguish the two words. Notwithstanding the fact that they are sequences of separate sounds, a logically consistent system could be constructed which gave separate, unitary, symbols to each diphthong. However, it is usual to employ a compound symbol which indicates explicitly the character of the two vowel-like sounds involved. We will transcribe the diphthong of "main" as a sequence of two vowels, .ul e (as in "head") and .ul i (as in "hid", not "I"). This is done primarily for economy of symbols, choosing the constituent sounds on the basis of the closest match to existing vowel sounds. (Note that this again violates purely .ul logical criteria for identifying phonemes.) .rh "Categories of speech sounds." A phoneme is defined as a set of sounds whose members do not discriminate between any words in the language under consideration. The phonemes themselves can be classified into groups which reflect similarities between them. This can be done in many different ways, using various criteria for classification. In fact, one branch of linguistic research is concerned with defining a set of "distinctive features" such that a phoneme class is uniquely identified by the values of the features. Distinctive features are binary, and include such things as voiced\(emunvoiced, fricative\(emnot\ fricative, aspirated\(emunaspirated. We will not be concerned here with such detailed classifications, but it is as well to know that they exist. .pp There is an everyday distinction between vowels and consonants. A vowel forms the nucleus of every syllable, and one or more consonants may optionally surround the vowel.
But the distinction sometimes becomes a little ambiguous. Syllables like .ul sh are commonly uttered and certainly do not contain a vowel. Furthermore, when we say "vowel" in everyday language we usually refer to the .ul written vowels a, e, i, o, and u; there are many more vowel sounds. A vowel in orthography is different to a vowel as a phoneme. Is a diphthong a phonetic vowel? \(em certainly, by the syllable-nucleus criterion; but it is a little different from ordinary vowels because it is a changing sound rather than a constant one. .pp Table 2.2 shows one classification of the phonemes of Table 2.1, which will be useful in our later studies of speech synthesis from phonetics. It shows twelve vowels, including the rather peculiar one .ul uh (which corresponds to the first vowel in the word "above"). This is the sound produced by the vocal tract when it is in a relaxed, neutral position; and it never occurs in prominent, stressed, syllables. The vowels later in the list are almost always longer than the earlier ones. In fact, the first six (\c .ul uh, a, e, i, o, u\c ) are often called "short" vowels, and the last five (\c .ul ee, er, uu, ar, aw\c ) "long" ones. The shortness or longness of the one in the middle (\c .ul aa\c ) is rather ambiguous. .RF .nr x0 \w'000unvoiced fricative 'u .nr x1 \n(x0+\w'[not classified as individual phonemes]'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta \n(x0u .fi vowel \c .ul uh a e i o u aa ee er uu ar aw .br diphthong [not classified as individual phonemes] .br glide (or liquid) \c .ul r w l y .br stop .br \0\0\0unvoiced stop \c .ul p t k .br \0\0\0voiced stop \c .ul b d g .br nasal \c .ul m n ng .br fricative .br \0\0\0unvoiced fricative \c .ul s sh f th .br \0\0\0voiced fricative \c .ul z zh v dh .br affricate .br \0\0\0unvoiced affricate \c .ul ch .br \0\0\0voiced affricate \c .ul j .br aspirate \c .ul h .nf .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 2.2 Phoneme categories" .pp Diphthongs pose no problem here because we have not classified them as single phonemes. .pp The remaining categories are consonants. The glides are quite similar to vowels and diphthongs, though; for they are voiced, continuous sounds. You can say them and prolong them. (This is also true of the fricatives.) .ul r is interesting because it can be realized acoustically in very different ways. Some people curl the tip of the tongue back \(em a so-called retroflex action of the tongue. Many people cannot do this, and their .ul r\c \&'s sound like .ul w\c \&'s. The stage Scotsman's .ul r is a trill where the tip of the tongue vibrates against the roof of the mouth. .ul l is also slightly unusual, for it is the only English phoneme which is "lateral" \(em air passes either side of it, in two separate passages. Welsh has another lateral sound, a fricative, which is written "ll" as in "Llandudno". .pp The next category is the stops. These are formed by stopping up the mouth, so that air pressure builds up behind the lips, and releasing this pressure suddenly. The result is a little explosion (and the stops are often called "plosives"), which usually creates a very short burst of fricative noise (and, in some cases, aspiration as well). They are further subdivided into voiced and unvoiced stops, depending upon whether voicing starts as soon as the plosion occurs (sometimes even before) or well after it. 
If you put your hand in front of your mouth when saying "pit" you can easily feel the puff of air that signals the plosion on the .ul p, and probably on the .ul t as well. .pp In a sense, nasals are really stops as well (and they are often called stops), for the oral tract is blocked although the nasal one is not. The peculiar fact that the nasal .ul ng never occurs at the beginning of a word (in English) was mentioned earlier. Notice that for stops and nasals there is a similarity in the .ul vertical direction of Table 2.2, between .ul p, .ul b, and .ul m; .ul t, .ul d, and .ul n; and .ul k, .ul g, and .ul ng. .ul p is an unvoiced version of .ul b (try saying them), and .ul m is a nasalized version (for .ul b is what you get when you have a cold and try to say .ul m\c ). These three sounds are all made at the front of the mouth, while .ul t, .ul d, and .ul n, which bear the same resemblance to each other, are made in the middle; and .ul k, .ul g, and .ul ng are made at the back. This introduces another possible classification, according to .ul place of articulation. .pp The unvoiced fricatives are quite straightforward, except perhaps for .ul th, which is the sound at the beginning of "thigh". They are paired with the voiced fricatives on the basis of place of articulation. The voiced version of .ul th is the .ul dh at the beginning of "thy". .ul zh is a fairly rare phoneme, which is heard in the middle of "vision". Affricates are similar to fricatives but begin with a stopped posture, and we mentioned earlier the controversy as to whether they should be considered to be single phonemes, or sequences of stop phonemes and fricatives. Finally comes the lonely aspirate, .ul h. Aspiration does occur elsewhere in speech, during the plosive burst of unvoiced stops. .rh "Narrow phonetic transcription." The phonological classification outlined above is based upon a clear rationale for distinguishing between sounds according to how they affect meaning \(em although the rationale does become somewhat muddied in difficult cases. Narrower transcriptions are not so systematic. They use units called .ul allophones, which are defined by reference to physical, acoustic, criteria rather than purely logical ones. ("Phone" is a more old-fashioned term for the same thing, and the misused word "phoneme" is often employed where allophone is meant, that is, as a physical rather than a logical unit.) Each phoneme has several allophones, more or less depending on how narrow or broad the transcription is, and the allophones are different acoustic realizations of the same logical unit. For example, the .ul k\c \&'s in "key" and "caw" may be considered as different allophones (in a slightly narrow transcription). Although we will not use symbols for allophones here, they are often indicated by diacritical marks in a text which modify the basic phoneme classes. For example, a tilde (~) over a vowel means that it is nasalized, while a small circle underneath a consonant means that it is devoiced. .pp Allophonic variation in speech is governed by a mechanism called .ul coarticulation, where a sound is affected by those that come either side of it. "Key"\-"caw" is a clear example of this, where the tongue position in the .ul k anticipates that of the following vowel \(em high in the first case, low in the second. Most allophonic variation in English is anticipatory, in that the sound is influenced by the following articulation rather than by preceding ones. 
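.pp
The way such context-dependent variation is written down as rules can be illustrated with a small sketch. The two rules below are toy ones of our own devising, not a serious fragment of English phonology: one fronts .ul k before a front vowel (the "key" case above), the other nasalizes a vowel when a nasal consonant follows.
.LB
.nf
# Toy anticipatory allophone rules, for illustration only.
FRONT_VOWELS = {"ee", "i", "e"}
NASALS = {"m", "n", "ng"}
VOWELS = {"uh", "a", "e", "i", "o", "u", "aa", "ee", "er", "uu", "ar", "aw"}

def to_allophones(phonemes):
    """Realize a phoneme sequence as allophones by looking one segment ahead."""
    out = []
    for i, p in enumerate(phonemes):
        following = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if p == "k" and following in FRONT_VOWELS:
            out.append("k_fronted")        # tongue anticipates the front vowel
        elif p in VOWELS and following in NASALS:
            out.append(p + "_nasalized")   # velum opens early, before the nasal
        else:
            out.append(p)
    return out

print(to_allophones(["k", "ee"]))       # ['k_fronted', 'ee']
print(to_allophones(["k", "aw"]))       # ['k', 'aw']
print(to_allophones(["m", "aa", "n"]))  # ['m', 'aa_nasalized', 'n']
.fi
.LE
Real phonological rule sets are, of course, far larger and far more carefully conditioned than this, but they share the same shape: a segment, a context, and a rewriting.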
.pp Nasalization is a feature which applies to vowels in English through anticipatory coarticulation. In many languages (for example, French) it is a .ul distinctive feature for vowels in that it serves to distinguish one vowel phoneme class from another. That this is not so in English sometimes tempts us to assume, incorrectly, that nasalization does not occur in vowels. It does, typically when the vowel is followed by a nasal consonant, and it is important for synthesis that nasalized vowel allophones are recognized and treated accordingly. .pp Coarticulation can be predicted by phonological rules, which show how a phonemic sequence will be realized by allophones. Such rules have been studied extensively by linguists. .pp The reason for coarticulation, and for the existence of allophones, lies in the physical constraints imposed by the motion of the articulatory organs \(em particularly their acceleration and deceleration. An immensely crude model is that the brain decides what phonemes to say (for it is concerned with semantic things, and the definition of a phoneme is a semantic one). It then takes this sequence and translates it into neural commands which actually move the articulators into target positions. However, other commands may be issued, and executed, before these targets are reached, and this accounts for coarticulation effects. Phonological rules for converting a phonemic sequence to an allophonic one are a sort of discrete model of the process. Particularly for work involving computers, it is possible that this rule-based approach will be overtaken by potentially more accurate methods which attempt to model the continuous articulatory phenomena directly. .sh "2.3 Prosody" .pp The phonetic classification introduced above divides speech into segments and classifies these into phonemes or allophones. Riding on top of this stream of segments are other, more global, attributes that dictate the overall prosody of the utterance. Prosody is defined by the Oxford English Dictionary as the "science of versification, laws of metre," which emphasizes the aspects of stress and rhythm that are central to classical verse. There are, however, many other features which are more or less global. These are collectively called prosodic or, equivalently, suprasegmental, features, for they lie above the level of phoneme or syllable segments. .pp Prosodic features can be split into two basic categories: features of voice quality and features of voice dynamics. Variations in voice quality, which are sometimes called "paralinguistic" phenomena, are accounted for by anatomical differences and long-term muscular idiosyncrasies (like a sore throat), and have little part to play in the kind of applications for speech output that have been sketched in Chapter 1. Variations in voice dynamics occur in three dimensions: pitch or fundamental frequency of the voice, time, and amplitude. Within the first, the pattern of pitch variation, or .ul intonation, can be distinguished from the overall range within which that variation occurs. The time dimension encompasses the rhythm of the speech, pauses, and the overall tempo \(em whether it is uttered quickly or slowly. The third dimension, amplitude, is of relatively minor importance. Intonation and rhythm work together to produce an effect commonly called "stress", and we will elaborate further on the nature of stress and discuss algorithms for synthesizing intonation and rhythm in Chapter 8. 
.pp These features have a very important role to play in communicating meaning. They are not fancy, optional components. It is their neglect which is largely responsible for the layman's stereotype of computer speech, a caricature of living speech \(em abrupt, arrhythmic, and in a grating monotone \(em which was well characterized by Isaac Asimov when he wrote of speaking "all in capital letters". .pp Timing has a syntactic function in that it sometimes helps to distinguish nouns from verbs (\c .ul ex\c tract versus ex\c .ul tract\c ) and adjectives from verbs (app\c .ul rox\c imate versus approxi\c .ul mate\c ) \(em although segmental aspects play a part here too, for the vowel qualities differ in each pair of words. Nevertheless, if you make a mistake when assigning stress to words like these in conversation you are very likely to be queried as to what you actually said. .pp Intonation has a big effect on meaning too. Pitch often \(em but by no means always \(em rises on a question, the extent and abruptness of the rise depending on features like whether a genuine information-bearing reply or merely confirmation is expected. A distinctive pitch pattern accompanies the introduction of a new topic. In conjunction with rhythm, intonation can be used to bring out contrasts as in .LB .NI "He didn't have a .ul red car, he had a .ul black one." .LE In general, the intonation patterns used by a reader depend not only on the text itself, but on his interpretation of it, and also on his expectation of the listener's interpretation of it. For example: .LB .NI "He had a .ul red car" (I think you thought it was black), .NI "He had a red .ul bi\c cycle" (I think you thought it was a car). .LE .pp In natural speech, prosodic features are significantly influenced by whether the utterance is generated spontaneously or read aloud. The variations in spontaneous speech are enormous. There are all sorts of emotions which are plainly audible in everyday speech: sarcasm, excitement, rudeness, disagreement, sadness, fright, love. Variations in voice quality certainly play a part here. Even with "ordinary" cooperative friendly conversation, the need to find words and somehow fit them into an overall utterance produces great diversity of prosodic structures. Applications for speech output from computers do not, however, call for spontaneous conversation, but for a controlled delivery which is like that when reading aloud. Here, the speaker is articulating utterances which have been set out for him, reducing his cognitive load to one of understanding and interpreting the text rather than generating it. Unfortunately for us, linguists are (quite rightly) primarily interested in living, spontaneous speech rather than pre-prepared readings. .pp Nevertheless, the richness of prosody in speech even when reading from a book should not be underestimated. Read aloud to an audience and listen to the contrasts in voice dynamics deliberately introduced for variety's sake. If stories are to be read there is even a case for controlling voice .ul quality to cope with quotations and affective imitations. .pp We saw earlier that the source-filter model is particularly helpful in distinguishing prosodic features, which are largely properties of the source, from segmental ones, which belong to the filter. Pitch and amplitude are primarily source properties. Rhythm and speed of speaking are not, but neither are they filter properties, for they belong to the source-filter system as a whole and not specifically to either part of it.
The difficult notion of stress is, from an acoustic point of view, a combination of pitch, rhythm, and amplitude. Even some features of voice quality can be attributed to the source (like laryngitis), although others \(em cleft palate, badly-fitting dentures \(em affect segmental features as well. .sh "2.4 Further reading" .pp This chapter has been no more than a cursory introduction to some of the difficult problems of linguistics and phonetics. Here are some readable books which discuss these problems further. .LB "nn" .\"Abercrombie-1967-1 .ds [F 1 .]- .ds [A Abercrombie, D. .ds [D 1967 .ds [T Elements of general phonetics .ds [I Edinburgh University Press .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is an excellent book which covers all of the areas of this chapter, in much more detail than has been possible here. .in-2n .\"Brown-1980-2 .ds [F 2 .]- .ds [A Brown, Gill .as [A ", Currie, K.L. .as [A ", and Kenworthy, J. .ds [D 1980 .ds [T Questions of intonation .ds [I Croom Helm .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n An intensive study of the prosodics of colloquial, living speech is presented, with particular reference to intonation. Although not particularly relevant to speech output from computers, this book gives great insight into how conversational speech differs from reading aloud. .in-2n .\"Fry-1979-1 .ds [F 1 .]- .ds [A Fry, D.B. .ds [D 1979 .ds [T The physics of speech .ds [I Cambridge University Press .ds [C Cambridge, England .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is a simple and readable account of speech science, with a good and completely non-mathematical introduction to frequency analysis. .in-2n .\"Ladefoged-1975-4 .ds [F 4 .]- .ds [A Ladefoged, P. .ds [D 1975 .ds [T A course in phonetics .ds [I Harcourt Brace Jovanovich .ds [C New York .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Usually books entitled "A course on ..." are dreadfully dull, but this is a wonderful exception. An exciting, readable, almost racy introduction to phonetics, full of little experiments you can try yourself. .in-2n .\"Lehiste-1970-5 .ds [F 5 .]- .ds [A Lehiste, I. .ds [D 1970 .ds [T Suprasegmentals .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This fairly comprehensive study of the prosodics of speech complements Ladefoged's book, which is mainly concerned with segmental phonetics. .in-2n .\"O'Connor-1973-1 .ds [F 1 .]- .ds [A O'Connor, J.D. .ds [D 1973 .ds [T Phonetics .ds [I Penguin .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is another introductory book on phonetics. It is packed with information on all aspects of the subject. .in-2n .LE "nn" .EQ delim $$ .EN .CH "3 SPEECH STORAGE" .ds RT "Speech storage .ds CX "Principles of computer speech .pp The most familiar device that produces speech output is the ordinary tape recorder, which stores information in analogue form on magnetic tape. However, this is unsuitable for speech output from computers. One reason is that it is difficult to access different utterances quickly. Although random-access tape recorders do exist, they are expensive and subject to mechanical breakdown because of the stresses associated with frequent starting and stopping. .pp Storing speech on a rotating drum instead of tape offers the possibility of access to any track within one revolution time. For example, the IBM 7770 Audio Response Unit employs drums rotating twice a second which are able to store up to 32 words of 500\ msec each.
These can be accessed randomly, within half a second at most. Although one can arrange to store longer words by allowing overflow on to an adjacent track at the end of the rotation period, the discrete time-slots provided by this system make it virtually impossible for it to generate connected utterances by assembling appropriate words from the store. .pp The Cognitronics Speechmaker has a similar structure, but with the analogue speech waveform recorded on photographic film. Storing audio waveforms optically is not an unusual technique, for this is how soundtracks are recorded on ordinary movie films. The original version of the "speaking clock" of the British Post Office used optical storage in concentric tracks on flat glass discs. It is described by Speight and Gill (1937), who include a fascinating account of how the utterances are synchronized. .[ Speight Gill 1937 .] A 4\ Hz signal from a pendulum clock was used to supply current to an electric motor, which drove a shaft equipped with cams and gears that rotated the glass discs containing utterances for seconds, minutes, and hours at appropriate speeds! .pp A second reason for avoiding analogue storage is price. It is difficult to see how a random-access tape recorder could be incorporated into a talking pocket calculator or child's toy without considerably inflating the cost. Solid-state electronics is much cheaper than mechanics. .pp But the best reason is that, in many of the applications we have discussed, it is necessary to form utterances by concatenating separately-recorded parts. It is totally infeasible, for example, to store each and every possible telephone number as an individual recording! And utterances that are formed by concatenating individual words which were recorded in isolation, or in a different context, do not sound completely natural. For example, in an early experiment, Stowe and Hampton (1961) recorded individual words on acoustic tape, spliced the tape with the words in a different order to make sentences, and played the result to subjects who were scored on the number of key words which they identified correctly. .[ Stowe Hampton 1961 .] The overall conclusion was that while embedding a word in normally-spoken sentences .ul increases the probability of recognition (because the extra context gives clues about the word), embedding a word in a constructed sentence, where intonation and rhythm are not properly rendered, .ul decreases the probability of recognition. When the speech was uttered slowly, however, a considerable improvement was noticed, indicating that if the listener has more processing time he can overcome the lack of proper intonation and rhythm. .pp Nevertheless, many present-day voice response systems .ul do store what amounts to a direct recording of the acoustic wave. However, the storage medium is digital rather than analogue. This means that standard computer storage devices can be used, providing rapid access to any segment of the speech at relatively low cost \(em for the economics of mass-production ensures a low price for random-access digital devices compared with random-access analogue ones. Furthermore, it reduces the amount of special equipment needed for speech output. One can buy very cheap speech input/output interfaces for home computers which connect to standard hobby buses. Another advantage of digital over analogue recording is that integrated circuit read-only memories (ROMs) can be used for hand-held devices which need small quantities of speech. 
Hence this chapter begins by showing how waveforms are stored digitally, and then describes some techniques for reducing the data needed for a given utterance. .sh "3.1 Storing waveforms digitally" .pp When an analogue signal is converted to digital form, it is made discrete both in time and in amplitude. Discretization in time is the operation of .ul sampling, whilst in amplitude it is .ul quantizing. It is worth pointing out that the transmission of analogue information by digital means is called "PCM" (standing for "pulse code modulation") in telecommunications jargon. Much of the theory of digital signal processing investigates signals which are sampled but not quantized (or quantized into sufficiently many levels to avoid inaccuracies). The operation of quantization, being non-linear, is not very amenable to theoretical analysis. Quantization introduces issues such as accumulation of round-off noise in arithmetic operations, which, although they are very important in practical implementations, can only be treated theoretically under certain somewhat unrealistic assumptions (in particular, independence of the quantization error from sample to sample). .rh "Sampling." A fundamental theorem of telecommunications states that a signal can only be reconstructed accurately from a sampled version if it does not contain components whose frequency is greater than half the frequency at which the sampling takes place. Figure 3.1(a) shows how a component of slightly greater than half the sampling frequency can masquerade, as far as an observer with access only to the sampled data can tell, as a component at slightly less than half the sampling frequency. .FC "Figure 3.1" Call the sampling interval $T$ seconds, so that the sampling frequency is $1/T$\ Hz. Then components at $1/2T+f$, $3/2T-f$, $3/2T+f$ and so on all masquerade as a component at $1/2T-f$. Similarly, components at frequencies just under the sampling frequency masquerade as very low-frequency components, as shown in Figure 3.1(b). This phenomenon is often called "aliasing". .pp Thus the continuous, infinite, frequency axis for the unsampled signal, where two components at different frequencies can always be distinguished, maps into a repetitive frequency axis when the signal is sampled. As depicted in Figure 3.2, the frequency interval $[1/T,~ 2/T)$ \u\(dg\d .FN 3 .sp \u\(dg\dIntervals are specified in brackets, with a square bracket representing a closed end of the interval and a round one representing an open one. Thus the interval $[1/T,~ 2/T)$ specifies the range $1/T ~ <= ~ frequency ~ < ~ 2/T$. .EF is mapped back into the band $[0,~ 1/T)$, as are the intervals $[2/T,~ 3/T)$, $[3/T,~ 4/T)$, and so on. .FC "Figure 3.2" Furthermore, the interval $[1/2T,~ 1/T)$ between half the sampling frequency and the sampling frequency, is mapped back into the interval below half the sampling frequency; but this time the mapping is backwards, with frequencies at just under $1/T$ being mapped to frequencies slightly greater than zero, and frequencies just over $1/2T$ being mapped to ones just under $1/2T$. The best way to represent a repeating frequency axis like this is as a circle. Figure 3.3 shows how the linear frequency axis for continuous systems maps on to a circular axis for sampled systems. 
.FC "Figure 3.3" For present purposes it is easiest to imagine the bottom half of the circle as being reflected into the top half, so that traversing the upper semicircle in the anticlockwise direction corresponds to frequencies increasing from 0 to $1/2T$ (half the sample frequency), and returning along the lower semicircle is actually the same as coming back round the upper one, and corresponds to frequencies from $1/2T$ to $1/T$ being mapped into the range $1/2T$ to 0. .pp As far as speech is concerned, then, we must ensure that before sampling a signal no significant components at greater than half the sample frequency are present. Furthermore, the sampled signal will only contain information about frequency components less than this, so the sample frequency must be chosen as twice the highest frequency of interest. For example, consider telephone-quality speech. Telephones provide a familiar standard of speech quality which, although it can only be an approximate "standard", will be much used throughout this book. The telephone network aims to transmit only frequencies lower than 3.4\ kHz. We saw in the previous chapter that this region will contain the information-bearing formants, and some \(em but not all \(em of the fricative and aspiration energy. Actually, transmitting speech through the telephone system degrades its quality very significantly, probably more than you realize since everyone is so accustomed to telephone speech. Try the dial-a-disc service and compare it with high-fidelity music for a striking example of the kind of degradation suffered. .pp For telephone speech, the sampling frequency must be chosen to be at least 6.8\ kHz. Since speech contains significant amounts of energy above 3.4\ kHz, it should be filtered before sampling to remove this; otherwise the higher components would be mapped back into the baseband and distort the low-frequency information. Because it is difficult to make filters that cut off very sharply, the sampling frequency is chosen rather greater than twice the highest frequency of interest. For example, the digital telephone network samples at 8\ kHz. The pre-sampling filter should have a cutoff frequency of 4\ kHz; aim for negligible distortion below 3.4\ kHz; and transmit negligible components above 4.6\ kHz \(em for these are reflected back into the band of interest, namely 0 to 3.4\ kHz. Figure 3.4 shows a block diagram for the input hardware. .FC "Figure 3.4" .rh "Quantization." Before considering specifications for the pre-sampling filter, let us turn from discretization in time to discretization in amplitude, that is, quantization. This is performed by an A/D converter (analogue-to-digital), which takes as input a constant analogue voltage (produced by the sampler) and generates a corresponding binary value as output. The simplest correspondence is .ul uniform quantization, where the amplitude range is split into equal regions by points termed "quantization levels", and the output is a binary representation of the nearest quantization level to the input voltage. Typically, 11-bit conversion is used for speech, giving 2048 quantization levels, and the signal is adjusted to have zero mean so that half the levels correspond to negative input voltages and the other half to positive ones. .pp It is, at first sight, surprising that as many as 11 bits are needed for adequate representation of speech signals. 
Research on the digital telephone network, for example, has concluded that a signal-to-noise ratio of some 26\-27\ dB is enough to avoid undue harshness of quality, loss of intelligibility, and listener fatigue for speech at a comfortable level in an otherwise reasonably good channel. Rabiner and Schafer (1978) suggest that about 36\ dB signal-to-noise ratio would "most likely provide adequate quality in a communications system". .[ Rabiner Schafer 1978 Digital processing of speech signals .] But 11-bit quantization seems to give a very much better signal-to-noise ratio than these figures. To estimate its magnitude, note that for N-bit quantization the error for each sample will lie between .LB $ - ~ 1 over 2 ~. 2 sup -N$ and $+ ~ 1 over 2 ~. 2 sup -N . $ .LE Assuming that it is uniformly distributed in this range \(em an assumption which is likely to be justified if the number of levels is sufficiently large \(em leads to a mean-squared error of .LB .EQ integral from {-2 sup -N-1} to {2 sup -N-1} ~e sup 2 p(e) de, .EN .LE where $p(e)$, the probability density function of the error $e$, is a constant which satisfies the usual probability normalization constraint, namely .LB .EQ integral from {-2 sup -N-1} to {2 sup -N-1} ~ p(e) de ~~=~ 1. .EN .LE Hence $p(e)=2 sup N $, and so the mean-squared error is $2 sup -2N /12$. This is $10 ~ log sub 10 (2 sup -2N /12)$\ dB, or around \-77\ dB for 11-bit quantization. .pp This noise level is relative to the maximum amplitude range of the conversion. A maximum-amplitude sine wave has a power of \-9\ dB relative to the same reference, giving a signal-to-noise ratio of some 68\ dB. This is far in excess of that needed for telephone-quality speech. However, look at the very peaky nature of the typical speech waveform given in Figure 3.5. .FC "Figure 3.5" If clipping is to be avoided, the maximum amplitude level of the A/D converter must be set at a value which makes the power of the speech signal very much less than a maximum-amplitude sine wave. Furthermore, different people speak at very different volumes, and the overall level fluctuates constantly with just one speaker. Experience shows that while 8- or 9-bit quantization may provide sufficient signal-to-noise ratio to preserve telephone-quality speech if the overall speaker levels are carefully controlled, about 11 bits are generally required to provide high-quality representation of speech with a uniform quantization. With 11 bits, a sine wave whose amplitude is only 1/32 of the full-scale value would be digitized with a signal-to-noise ratio of around 36\ dB, the most pessimistic figure quoted above for adequate quality. Even then it is useful if the speaker is provided with an indication of the amplitude of his speech: a traffic-light indicator with red signifying clipping overload, orange a suitable level, and green too low a value, is often convenient for this. .rh "Logarithmic quantization." For the purposes of speech .ul processing, it is essential to have the signal quantized uniformly. This is because all of the theory applies to linear systems, and nonlinearities introduce complexities which are not amenable to analysis. Uniform quantization, although a nonlinear operation, is linear in the limiting case as the number of levels becomes large, and for most purposes its effect can be modelled by assuming that the quantized signal is obtained from the original analogue one by the addition of a small amount of uniformly-distributed quantizing noise, as in fact was done above. 
Usually the quantization noise is disregarded in subsequent analysis. .pp However, the peakiness of the speech signal illustrated in Figure 3.5 leads one to suspect that a non-linear representation, for example a logarithmic one, could provide a better signal-to-noise ratio over a wider range of input amplitudes, and hence be more useful than linear quantization \(em at least for speech storage (and transmission). And indeed this is the case. Linear quantization has the unfortunate effect that the absolute noise level is independent of the signal level, so that an excessive number of bits must be used if a reasonable signal-to-noise ratio is to be achieved for peaky signals. It can be shown that a logarithmic representation like .LB .EQ y ~ = ~ 1 ~ + ~ k ~ log ~ x, .EN .LE where $x$ is the original signal and $y$ is the value which is to be quantized, gives a signal-to-noise .ul ratio which is independent of the input signal level. This relationship cannot be realized physically, for it is undefined when the signal is negative and diverges when it is zero. However, realizable approximations to it can be made which retain the advantages of constant signal-to-noise ratio within a useful range of signal amplitudes. Figure 3.6 shows the logarithmic relation with one widely-used approximation to it, called the A-law. .FC "Figure 3.6" The idea of non-linearly quantizing a signal to achieve adequate signal-to-noise ratios for a wide variety of amplitudes is called "companding", a contraction of "compressing-expanding". The original signal can be retrieved from its A-law compression by antilogarithmic expansion. .pp Figure 3.6 also shows one common coding scheme which is a piecewise linear approximation to the A-law. This provides an 8-bit code, and gives the equivalent of 12-bit linear quantization for small signal levels. It approximates the A-law in 16 linear segments, 8 for positive and 8 for negative inputs. Consider the positive part of the curve. The first two segments, which are actually collinear, correspond exactly to 12-bit linear conversion. Thus the output codes 0 to 31 correspond to inputs from 0 to 31/2048, in equal steps. (Remember that both positive and negative signals must be converted, so a 12-bit linear converter will allocate 2048 levels for positive signals and 2048 for negative ones.) The next segment provides 11-bit linear quantization, output codes 32 to 47 corresponding to inputs from 16/1024 to 31/1024. Similarly, the next segment corresponds to 10-bit quantization, covering inputs from 16/512 to 31/512. And so on, the last section giving 6-bit quantization of inputs from 16/32 to 31/32, the full-scale positive value. Negative inputs are converted similarly. For signal levels of less than 32/2048, that is, $2 sup -6$, this implementation of the A-law provides full 12-bit precision. As the signal level increases, the precision decreases gradually to 6 bits at maximum amplitudes. .pp Logarithmic encoding provides what is in effect a floating-point representation of the input. The conventional floating-point format, however, is not used because many different codes can represent the same value. For example, with a 4-bit exponent preceding a 4-bit mantissa, the words 0000:1000, 0001:0100, 0010:0010, and 0011:0001 represent the numbers $0.1 ~ times ~ 2 sup 0$, $0.01 ~ times ~ 2 sup 1 $, $0.001 ~ times ~ 2 sup 2$, \c and $0.0001 ~ times ~ 2 sup 3$ respectively, which are the same.
(Some floating-point conventions assume that an unwritten "1" bit precedes the mantissa, except when the whole word is zero; but this gives decreased resolution around zero \(em which is exactly where we want the resolution to be greatest.) Table 3.1 shows the 8-bit A-law codes, .RF .in+0.7i .ta 1.6i +\w'bits 1-3 'u 8-bit codeword: bit 0 sign bit bits 1-3 3-bit exponent bits 4-7 4-bit mantissa .sp2 .ta 1.6i 3.5i .ul codeword interpretation .sp 0000 0000 \h'\w'\0-\0 + 'u'$.0000 ~ times ~ 2 sup -7$ \0\0\0... \0\0\0\0... 0000 1111 \h'\w'\0-\0 + 'u'$.1111 ~ times ~ 2 sup -7$ 0001 0000 $2 sup -7 ~~ + ~~ .0000 ~ times ~ 2 sup -7$ \0\0\0... \0\0\0\0... 0001 1111 $2 sup -7 ~~ + ~~ .1111 ~ times ~ 2 sup -7$ 0010 0000 $2 sup -6 ~~ + ~~ .0000 ~ times ~ 2 sup -6$ \0\0\0... \0\0\0\0... 0010 1111 $2 sup -6 ~~ + ~~ .1111 ~ times ~ 2 sup -6$ 0011 0000 $2 sup -5 ~~ + ~~ .0000 ~ times ~ 2 sup -5$ \0\0\0... \0\0\0\0... 0011 1111 $2 sup -5 ~~ + ~~ .1111 ~ times ~ 2 sup -5$ 0100 0000 $2 sup -4 ~~ + ~~ .0000 ~ times ~ 2 sup -4$ \0\0\0... \0\0\0\0... 0100 1111 $2 sup -4 ~~ + ~~ .1111 ~ times ~ 2 sup -4$ 0101 0000 $2 sup -3 ~~ + ~~ .0000 ~ times ~ 2 sup -3$ \0\0\0... \0\0\0\0... 0101 1111 $2 sup -3 ~~ + ~~ .1111 ~ times ~ 2 sup -3$ 0110 0000 $2 sup -2 ~~ + ~~ .0000 ~ times ~ 2 sup -2$ \0\0\0... \0\0\0\0... 0110 1111 $2 sup -2 ~~ + ~~ .1111 ~ times ~ 2 sup -2$ 0111 0000 $2 sup -1 ~~ + ~~ .0000 ~ times ~ 2 sup -1$ \0\0\0... \0\0\0\0... 0111 1111 $2 sup -1 ~~ + ~~ .1111 ~ times ~ 2 sup -1$ 1000 0000 \h'\w'\0-\0 'u'$- ~~ .0000 ~ times ~ 2 sup -7$ negative numbers treated as \0\0\0... \0\0\0\0... above, with a sign bit of 1 1111 1111 \h'-\w'\- 'u'\- $2 sup -1 ~~ - ~~ .1111 ~ times ~ 2 sup -1$ .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 3.1 8-bit A-law codes, with their floating-point equivalents" according to the piecewise linear approximation of Figure 3.6, written in a notation which suggests floating point. Each linear segment has a different exponent except the first two segments, which as explained above are collinear. .pp Logarithmic encoders and decoders are available from many semiconductor manufacturers as single-chip devices called "codecs" (for "coder/decoder"). Intended for use on digital communication links, these generally provide a serial output bit-stream, which should be converted to parallel by a shift register if the data is intended for a computer. Because of the potentially vast market for codecs in telecommunications, they are made in great quantities and are consequently very cheap. Estimates of the speech quality necessary for telephone applications indicate that somewhat less than this accuracy is needed \(em 7-bit logarithmic encoding was used in early digital communications links, and it may be that even 6 bits are adequate. However, during the transition period when digital networks must coexist with the present analogue one, it is anticipated that a particular telephone call may have to pass through several links, some using analogue technology and some being digital. The possibility of several successive encodings and decodings has led telecommunications engineers to standardize on 8-bit representations, leaving some margin before additional degradation of signal quality becomes unduly distracting. .pp Unfortunately, world telecommunications authorities cannot agree on a single standard for logarithmic encoding. 
The A-law, which we have described, is the European standard, but there is another system, called the $mu$-law, which is used universally in North America. It also is available in single-chip form with an 8-bit code. It has very similar quantization error characteristics to the A-law, and would be indistinguishable from it on the scale of Figure 3.6. .rh "The pre-sampling filter." Now that we have some idea of the accuracy requirements for quantization, let us discuss quantitative specifications for the pre-sampling filter. Figure 3.7 sketches the characteristics of this filter. .FC "Figure 3.7" Assume a sampling frequency of 8\ kHz and a range of interest from 0 to 3.4\ kHz. Although all components at frequencies above 4\ kHz will fold back into the 0\ \-\ 4\ kHz baseband, those below 4.6\ kHz fold back above 3.4\ kHz and are therefore outside the range of interest. This gives a "guard band" between 3.4 and 4.6\ kHz which separates the passband from the stopband. The filter should transmit negligible components in the stopband above 4.6\ kHz. To reduce the harmonic distortion caused by aliasing to the same level as the quantization noise in 11-bit linear conversion, the stopband attenuation should be around \-68\ dB (the signal-to-noise ratio for a full-scale sine wave). Passband ripple is not so critical, for two reasons. First, whilst the presence of aliased components means that information has been lost about the frequency components within the range of interest, passband ripple does not actually cause a loss of information but only a distortion, and could, if necessary, be compensated by a suitable filter acting on the digitized waveform. Secondly, distortion of the passband spectrum is not nearly so audible as the frequency images caused by aliasing. Hence one usually aims for a passband ripple of around 0.5\ dB. .pp The pass and stopband targets we have mentioned above can be achieved with a 9'th order elliptic filter. While such a filter is often used in high-quality signal-processing systems, for telephone-quality speech much less stringent specifications seem to be sufficient. Figure 3.8, for example, shows a template which has been recommended by telecommunications authorities. .FC "Figure 3.8" A 5'th order elliptic filter can easily meet this specification. Such filters, implemented by switched-capacitor means, are available in single-chip form. Integrated CCD (charge-coupled device) filters which meet the same specification are also marketed. Indeed, some codecs provide input filtering on the same chip as the A/D converter. .pp Instead of implementing a filter by analogue means to meet the aliasing specifications, digital filtering can be used. A high sample-rate A/D converter, operating at, say, 32\ kHz, and preceded by a very simple low-pass pre-sampling filter, is followed by a digital filter which meets the desired specification, and its output is subsampled to provide an 8\ kHz sample rate. While such implementations may be economic where a multichannel digitizing capability is required, as in local telephone exchanges where the subscriber connection is an analogue one, they are unlikely to prove cost-effective for a single channel. .rh "Reconstructing the analogue waveform." Once a signal has been digitized and stored, it needs to be passed through a D/A converter (digital-to-analogue) and a low-pass filter when it is replayed. D/A converters are cheaper than A/D converters, and the characteristics of the low-pass filter for output can be the same as those for input.
However, the desampling operation introduces an additional distortion, which has an effect on the component at frequency $f$ of .LB .EQ { sin ( pi f/f sub s )} over { pi f/f sub s } ~ , .EN .LE where $f sub s$ is the sampling frequency. An "aperture correction" filter is needed to compensate for this, although many systems simply do without it. Such a filter is sometimes incorporated into the codec chip. .rh "Summary." For telephone-quality speech, existing codec chips, coupled if necessary with integrated pre-sampling filters, can be used, at a remarkably low cost. For higher-quality speech storage the analogue interface can become quite complex. A comprehensive study of the problems as they relate to digitization of audio, which demands much greater fidelity than speech, has been made by Blesser (1978). .[ Blesser 1978 .] He notes the following sources of error (amongst others): .LB .NP slew-rate distortion in the pre-sampling filter for signals at the upper end of the audio band; .NP insufficient filtering of high-frequency input signals; .NP noise generated by the sample-and-hold amplifier or pre-sampling filter; .NP acquisition errors because of the finite settling time of the sample-and-hold circuit; .NP insufficient settling time in the A/D conversion; .NP errors in the quantization levels of the A/D and D/A converters; .NP noise in the converters; .NP jitter on the clock used for timing input or output samples; .NP aperture distortion in the output sampler; .NP noise in the output filter as a result of limited dynamic range of the integrated circuits; .NP power-supply noise injection or ground coupling; .NP changes in characteristics as a result of temperature or ageing. .LE Care must be taken with the analogue interface to ensure that the precision implied by the resolution of the A/D and D/A converters is not compromised by inadequate analogue circuitry. It is especially important to eliminate high-frequency noise caused by fast edges on nearby computer buses. .sh "3.2 Coding in the time domain" .pp There are several methods of coding the time waveform of a speech signal to reduce the data rate for a given signal-to-noise ratio, or alternatively to reduce the signal-to-noise ratio for a given data rate. They almost all require more processing, both at the encoding (for storage) and decoding (for regeneration) ends of the digitization process. They are sometimes used to economize on memory in systems using stored speech, for example the System\ X telephone exchange and the travel consultant described in Chapter 1, and so will be described here. However, it is to be expected that simple time-domain coding techniques will be superseded by the more complex linear predictive method, which is covered in Chapter 6, because this can give a much more substantial reduction in the data rate for only a small degradation in speech quality. Hence the aim of this section is to introduce the ideas in a qualitative way: theoretical development and summaries of results of listening tests can be found elsewhere (eg Rabiner and Schafer, 1978). .[ Rabiner Schafer 1978 Digital processing of speech signals .] The methods we will examine are summarized in Table 3.2. 
.RF .nr x0 \w'linear PCM 'u .nr x1 \n(x0+\w' adaptive quantization, or adaptive prediction,'u .nr x2 (\n(.l-\n(x1)/2 .in \n(x2u .ta \n(x0u \l'\n(x1u\(ul' .sp linear PCM linearly-quantized pulse code modulation .sp log PCM logarithmically-quantized pulse code modulation (instantaneous companding) .sp APCM adaptively quantized pulse code modulation (usually syllabic companding) .sp DPCM differential pulse code modulation .sp ADPCM differential pulse code modulation with either adaptive quantization, or adaptive prediction, or both .sp DM delta modulation (1-bit DPCM) .sp ADM delta modulation with adaptive quantization \l'\n(x1u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 3.2 Time-domain encoding techniques" .rh "Syllabic companding." We have already studied one time-domain encoding technique, namely logarithmic quantization, or log PCM (sometimes called "instantaneous companding"). A more sophisticated encoder could track slowly varying trends in the overall amplitude of the speech signal and use this information to adjust the quantization levels dynamically. Speech coding methods based on this principle are called adaptive pulse code modulation systems (APCM). Because the overall amplitude changes slowly, it is sufficient to adjust the quantization relatively infrequently (compared with the sampling rate), and this is often done at rates approximating the syllable rate of running speech, leading to the term "syllabic companding". A block floating-point format can be used, with a common exponent being stored every M samples (with M, say, 125 for a 100\ msec block rate at 8\ kHz sampling), but the mantissa being stored at the regular sample rate. The overall energy in the block, .LB $sum from n=h to h+M-1 ~x(n) sup 2$ ($M = 125$, say), .LE is used to determine a suitable exponent, and every sample in the block \(em namely $x(h)$, $x(h+1)$, ..., $x(h+M-1)$ \(em is scaled according to that exponent. Note that for speech transmission systems this method necessitates a delay of $M$ samples at the encoder, and indeed some methods base the exponent on the energy in the last block to avoid this. For speech storage, however, the delay is irrelevant. A rather different, nonsyllabic, method of adaptive PCM is continually to change the step size of a uniform quantizer, by multiplying it by a constant at each sample which is based on the magnitude of the previous code word. .pp Adaptive quantization exploits information about the amplitude of the signal, and, as a rough generalization, yields a reduction of one bit per sample in the data rate for telephone-quality speech over ordinary logarithmic quantization, for a given signal-to-noise ratio. Alternatively, for the same data rate an improvement of 6\ dB in signal-to-noise ratio can be obtained. Some results for actual schemes are given by Rabiner and Schafer (1978). .[ Rabiner Schafer 1978 Digital processing of speech signals .] However, there is other information in the time waveform of speech, namely, the sample-to-sample correlation, which can be exploited to give further reductions. .rh "Differential coding." Differential pulse code modulation (DPCM), in its simplest form, uses the present speech sample as a prediction of the next one, and stores the prediction error \(em that is, the sample-to-sample difference. This is a simple case of predictive encoding. 
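.pp
To make the scheme concrete, here is a minimal sketch of first-order differential encoding and decoding, written in Python purely for illustration; the function names and the step size are our own choices and correspond to no particular system described in this chapter. As is usual in practice, the encoder predicts from the value the decoder will reconstruct, so that quantization errors do not accumulate.
.LB
.nf
# First-order DPCM: store a quantized version of the difference between
# each sample and the prediction (the previously reconstructed sample).
def dpcm_encode(samples, step=64):
    codes = []
    prediction = 0
    for x in samples:
        code = int(round((x - prediction) / step))   # crude uniform quantizer
        codes.append(code)
        prediction += code * step                    # what the decoder will hold
    return codes

def dpcm_decode(codes, step=64):
    samples = []
    value = 0
    for code in codes:
        value += code * step
        samples.append(value)
    return samples
.fi
.LE
Because the differences cover a smaller range than the samples themselves, the codes can be held in fewer bits for a given accuracy.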
Referring back to the speech waveform displayed in Figure 3.5, it seems plausible that the data rate can be reduced by transmitting the difference between successive samples instead of their absolute values: fewer bits are required for the difference signal for a given overall accuracy because it does not assume such extreme values as the absolute signal level. Actually, the improvement is not all that great \(em about 4\ \-\ 5\ dB in signal-to-noise ratio, or just under one bit per sample for a given signal-to-noise ratio \(em for the difference signal can be nearly as large as the absolute signal level. .pp If DPCM is used in conjunction with adaptive quantization, giving one form of adaptive differential pulse code modulation (ADPCM), both the overall amplitude variation and the sample-to-sample correlation are exploited, leading to a combined gain of 10\ \-\ 11\ dB in signal-to-noise ratio (or just under two bits reduction per sample for telephone-quality speech). Another form of adaptation is to alter the predictor by multiplying the previous sample value by a parameter which is adjusted for best performance. Then the transmitted signal at time $n$ is .LB .EQ e(n) ~~ = ~~ x(n)~ - ~ax(n-1), .EN .LE where the parameter $a$ is adapted (and stored) on a syllabic time-scale. This leads to a slight improvement in signal-to-noise ratio, which can be combined with that achieved by adaptive quantization. Much more substantial benefits can be realized by using a weighted sum of the past several (up to 15) speech samples, and adapting all the weights. This is the basic idea of linear prediction, which is developed in Chapter 6. .rh "Delta modulation." The coding methods presented so far all increase the complexity of the analogue-to-digital interface (or, if the sampled waveform is coded digitally, they increase the processing required before and after storage). One method which considerably .ul simplifies the interface is the limiting case of DPCM with just 1-bit quantization. Only the sign of the difference between the current and last values is transmitted. Figure 3.9 shows the conversion hardware. .FC "Figure 3.9" The encoding part is essentially the same as a tracking A/D converter, where the value in a counter is forced to track the analogue input by incrementing or decrementing the counter according as the input exceeds or falls short of the analogue equivalent of the counter's contents. However, for this encoding scheme, called "delta modulation", the increment-decrement signal itself forms the discrete representation of the waveform, instead of the counter's contents. The analogue waveform can be reconstituted from the bit stream with another counter and D/A converter. Alternatively, an all-analogue implementation can be used, both for the encoder and decoder, with a capacitor as integrator whose charging current is controlled digitally. This is a much cheaper realization. .pp It is fairly obvious that the sampling frequency for delta modulation will need to be considerably higher than for straightforward PCM. Figure 3.10 shows an effect called "slope overload" which occurs when the sampling rate is too low. .FC "Figure 3.10" Either a higher sample rate or a larger step size will reduce the overload; however, larger steps increase the noise level of the alternate 1's and \-1's that occur when no input is present \(em called "granular noise". A compromise is necessary between slope overload and granular noise for a given bit rate.
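.pp
The digital part of this arrangement is small enough to sketch in a few lines of Python (a toy illustration with an arbitrary fixed step size, not a model of any particular chip). The encoder keeps a running estimate of the input and transmits only the sign of the error; the decoder integrates the bit stream in exactly the same way.
.LB
.nf
# Delta modulation: one bit per sample, the sign of the difference between
# the input and a running estimate of it.  Too small a step gives slope
# overload; too large a step gives granular noise in quiet passages.
def dm_encode(samples, step=32):
    bits = []
    estimate = 0
    for x in samples:
        bit = 1 if x >= estimate else 0
        bits.append(bit)
        estimate += step if bit else -step    # the counter tracking the input
    return bits

def dm_decode(bits, step=32):
    samples = []
    estimate = 0
    for bit in bits:
        estimate += step if bit else -step
        samples.append(estimate)
    return samples
.fi
.LE
The adaptive variant described below does nothing more than let the step size grow when successive bits agree and shrink when they alternate.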
Delta modulation results in lower data rates than logarithmic quantization for a given signal-to-noise ratio if that ratio is low (poor-quality speech). As the desired speech quality is increased, its data rate grows faster than that of logarithmic PCM. The crossover point occurs at a quality well below that of telephone speech, and so although delta modulation is used for some applications where the permissible data rate is severely constrained, it is not really suitable for speech output from computers. .pp It is profitable to adjust the step size, leading to .ul adaptive delta modulation. A common strategy is to increase or decrease the step size by a multiplicative constant, which depends on whether the new transmitted bit will be equal to or different from the last one. That is, .LB "nnnn" .NI "nn" $stepsize(n+1) = stepsize(n) times 2$ if $x(n+1)>x(n)>x(n-1)$ or $x(n+1)<x(n)<x(n-1)$ .br (slope overload condition); .NI "nn" $stepsize(n+1) = stepsize(n)/2$ if $x(n+1),~x(n-1)<x(n)$ or $x(n+1),~x(n-1)>x(n)$ .br (granular noise condition). .LE "nnnn" Despite these adaptive equations, the step size should be constrained to lie between a predetermined fixed maximum and minimum, to prevent it from becoming so large or so small that rapid accommodation to changing input signals is impossible. Then, in a period of potential slope overload the step size will grow, preventing overload, possibly reaching its maximum value, at which point overload may resume. In a quiet period it will decrease to its minimum value, which determines the granular noise in the idle condition. Note that the step size need not be stored, for it can be deduced from the bit changes in the digitized data. Although adaptation improves the performance of delta modulation, it is still inferior to PCM at telephone qualities. .rh "Summary." It seems that ADPCM, with adaptive quantization and adaptive prediction, can provide a worthwhile advantage for speech storage, reducing the number of bits needed per sample of telephone-quality speech from 7 for logarithmic PCM to perhaps 5, and the data rate from 56\ Kbit/s to 40\ Kbit/s. Disadvantages are additional complexity in the encoding and decoding processes, and the fact that byte-oriented storage, with 8 bits/sample in logarithmic PCM, is more convenient for computer use. For low-quality speech where hardware complexity is to be minimized, adaptive delta modulation could prove worthwhile \(em although the ready availability of PCM codec chips reduces the cost advantage. .sh "3.3 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "3.4 Further reading" .pp Probably the best single reference on time-domain coding of speech is the book by Rabiner and Schafer (1978), cited above. However, this does not contain a great deal of information on practical aspects of the analogue-to-digital conversion process; this is covered by Blesser (1978) above, who is especially interested in high-quality conversion for digital audio applications, and Garrett (1978) below. There are many textbooks in the telecommunications area which are relevant to the subject of the chapter, although they concentrate primarily on fundamental theoretical aspects rather than the practical application of the technology. .LB "nn" .\"Cattermole-1969-1 .]- .ds [A Cattermole, K.W. .ds [D 1969 .ds [T Principles of pulse code modulation .ds [I Iliffe .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is a standard, definitive, work on PCM, and provides a good grounding in the theory. It goes into the subject in much more depth than we have been able to here.
.in-2n .\"Garrett-1978-1 .]- .ds [A Garrett, P.H. .ds [D 1978 .ds [T Analog systems for microprocessors and minicomputers .ds [I Reston Publishing Company .ds [C Reston, Virginia .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Garrett discusses the technology of data conversion systems, including A/D and D/A converters and basic analogue filter design, in a clear and practical manner. .in-2n .\"Inose-1979-2 .]- .ds [A Inose, H. .ds [D 1979 .ds [T An introduction to digital integrated communications systems .ds [I Peter Peregrinus .ds [C Stevenage, England .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Inose's book is a recent one which covers the whole area of digital transmission and switching technology. It gives a good idea of what is happening to the telephone networks in the era of digital communications. .in-2n .\"Steele-1975-3 .]- .ds [A Steele, R. .ds [D 1975 .ds [T Delta modulation systems .ds [I Pentech Press .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Again a standard work, this time on delta modulation techniques. Steele gives an excellent and exhaustive treatment of the subject from a communications viewpoint. .in-2n .LE "nn" .EQ delim $$ .EN .CH "4 SPEECH ANALYSIS" .ds RT "Speech analysis .ds CX "Principles of computer speech .pp Digital recordings of speech provide a jumping-off point for further processing of the audio waveform, which is usually necessary for the purpose of speech output. It is difficult to synthesize natural sounds by concatenating individually-spoken words. Pitch is perhaps the most perceptually significant contextual effect which must be taken into account when forming connected speech out of isolated words. The intonation of an utterance, which manifests itself as a continually changing pitch, is a holistic property of the utterance and not the sum of components determined by the individual words alone. Happily, and quite coincidentally, communications engineers in their quest for reduced-bandwidth telephony have invented methods of coding speech that separate the pitch information from that carried by the articulation. .pp Although these analysis techniques, which were first introduced in the late 1930's (Dudley, 1939), were originally implemented by analogue means \(em and in many systems still are (Blankenship, 1978, describes a recent switched-capacitor realization) \(em there is a continuing trend towards digital implementations, particularly for the more sophisticated coding schemes. .[ Dudley 1939 .] .[ Blankenship 1978 .] It is hard to see how the technique of linear prediction of speech, which is described in detail in Chapter 6, could be accomplished in the absence of digital processing. Some groundwork is laid for the theory of digital signal analysis in this chapter. The ideas are not presented in a formal, axiomatic way; but are developed as and when they are needed to examine some of the structures that turn out to be useful in speech processing. .pp Most speech analysis views speech according to the source-filter model which was introduced in Chapter 2, and aims to separate the effects of the source from those of the filter. The frequency spectrum of the vocal tract filter is of great interest, and the technique of discrete Fourier transformation is discussed in this chapter. For many purposes it is better to extract the formant frequencies from the spectrum and use these alone (or in conjunction with their bandwidths) to characterize it. 
As far as the signal source in the source-filter model is concerned, its most interesting features are pitch and amplitude \(em the latter being easy to estimate. Hence we go on to look at pitch extraction. Related to this is the problem of deciding whether a segment of speech has voiced or unvoiced excitation, or both. .pp Estimating formant and pitch parameters is one of the messiest areas of speech processing. There is a delightful paper which points this out (Schroeder, 1970), entitled "Parameter estimation in speech: a lesson in unorthodoxy". .[ Schroeder 1970 .] It emphasizes that the most successful estimation procedures "have often relied on intuition based on knowledge of speech signals and their production in the human vocal apparatus rather than routine applications of well-established theoretical methods". Fortunately, the emphasis of the present book is on speech .ul output, which involves parameter estimation only in so far as it is needed to produce coded speech for storage, and to illuminate the acoustic nature of speech for the development of synthesis by rule from phonetics or text. Hence the many methods of formant and pitch estimation are treated rather cursorily and qualitatively here: our main interest is in how to .ul use such information for speech output. .pp If the incoming speech can be analysed into its formant frequencies, amplitude, excitation mode, and pitch (if voiced), it is quite easy to resynthesize it directly from these parameters. Speech synthesizers are described in the next chapter. They can be realized in either analogue or digital hardware, the former being predominant in production systems and the latter in research systems \(em although, as in other areas of electronics, the balance is changing in favour of digital implementations. .sh "4.1 The channel vocoder" .pp A direct representation of the frequency spectrum of a signal can be obtained by a bank of bandpass filters. This is the basis of the .ul channel vocoder, which was the first device that attempted to take advantage of the source-filter model for speech coding (Dudley, 1939). .[ Dudley 1939 .] The word "vocoder" is a contraction of .ul vo\c ice .ul coder. The energy in each filter band is estimated by rectification and smoothing, and the resulting approximation to the frequency spectrum is transmitted or stored. The source properties are represented by the type of excitation (voiced or unvoiced), and if voiced, the pitch. It is not necessary to include the overall amplitude of the speech explicitly, because this is conveyed by the energy levels from the separate bandpass filters. .pp Figure 4.1 shows the encoding part of a channel vocoder which has been used successfully for many years (Holmes, 1980). .[ Holmes 1980 JSRU channel vocoder .] .FC "Figure 4.1" We will discuss the block labelled "pre-emphasis" shortly. The shape of the spectrum is estimated by 19 bandpass filters, whose spacing and bandwidth decrease slightly with decreasing frequency to obtain the rather greater resolution that is needed in the lower frequency region, as shown in Table 4.1. 
.RF .nr x0 4n+2.6i+\w'\0\0'u+(\w'bandwidth'/2) .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 4n +1.3i +1.3i \l'\n(x0u\(ul' .sp .nr x1 (\w'channel'/2) .nr x2 (\w'centre'/2) .nr x3 (\w'analysis'/2) \0\h'-\n(x1u'channel \0\h'-\n(x2u'centre \0\0\h'-\n(x3u'analysis .nr x1 (\w'number'/2) .nr x2 (\w'frequency'/2) .nr x3 (\w'bandwidth'/2) \0\h'-\n(x1u'number \0\0\h'-\n(x2u'frequency \0\0\h'-\n(x3u'bandwidth .nr x2 (\w'(Hz)'/2) \0\h'-\n(x2u'(Hz) \0\0\h'-\n(x2u'(Hz) \l'\n(x0u\(ul' .sp \01 \0240 \0120 \02 \0360 \0120 \03 \0480 \0120 \04 \0600 \0120 \05 \0720 \0120 \06 \0840 \0120 \07 1000 \0150 \08 1150 \0150 \09 1300 \0150 10 1450 \0150 11 1600 \0150 12 1800 \0200 13 2000 \0200 14 2200 \0200 15 2400 \0200 16 2700 \0200 17 3000 \0300 18 3300 \0300 19 3750 \0500 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 4.1 Filter specifications for a vocoder analyser (after Holmes, 1980)" .[ Holmes 1980 JSRU channel vocoder .] The 3\ dB points of adjacent filters are halfway between their centre frequencies, so that there is some overlap between bands. The filter characteristics do not need to have very sharp edges, because the energy in neighbouring bands is fairly highly correlated. Indeed, there is a disadvantage in making them too sharp, because the phase delays associated with sharp cutoff filters induce "smearing" of the spectrum in the time domain. This particular channel vocoder uses second-order Butterworth bandpass filters. .pp For regenerating speech stored in this way, an excitation of unit impulses at the specified pitch period (for voiced sounds) or white noise (for unvoiced sounds) is produced and passed through a bank of bandpass filters similar to the analysis ones. The excitation has a flat spectrum, for regular impulses have harmonics at multiples of the repetition frequency which are all of the same size, and so the spectrum of the output signal is completely determined by the filter bank. The gain of each filter is controlled by the stored magnitude of the spectrum at that frequency. .pp The frequency spectrum and voicing pitch of speech change at much slower rates than the time waveform. The changes are due to movements of the articulatory organs (tongue, lips, etc) in the speaker, and so are limited in their speed by physical constraints. A typical rate of production of phonemes is 15 per second, but in fact the spectrum can change quite a lot within a single phoneme (especially a stop sound). Between 10 and 25\ msec (100\ Hz and 40\ Hz) is generally thought to be a satisfactory interval for transmitting or storing the spectrum, to preserve a reasonably faithful representation of the speech. Of course, the entire spectrum, as well as the source characteristics, must be stored at this rate. The channel vocoder described by Holmes (1980) uses 48 bits to encode the information. .[ Holmes 1980 JSRU channel vocoder .] Repeated every 20\ msec, this gives a data rate of 2400\ bit/s \(em very considerably less than any of the time-domain encoding techniques. .pp It needs some care to encode the output of 19 filters, the excitation type, and the pitch into 48 bits of information. Holmes uses 6 bits for pitch, logarithmically encoded, and one bit for excitation type. This leaves 41 bits to encode the output of the 19 filters, and so a differential technique is used which transmits just the difference between adjacent channels \(em for the spectrum does not change abruptly in the frequency domain. 
Three bits are used for the absolute level in channel 1, and two bits for each channel-to-channel difference, giving a total of 39 bits for the whole spectrum. The remaining two bits per frame are reserved for signalling or monitoring purposes. .pp A 2400 bit/s channel vocoder degrades the speech in a telephone channel quite perceptibly. It is sufficient for interactive communication, where if you do not understand something you can always ask for it to be repeated. It is probably not good enough for most voice response applications. However, the vocoder principle can be used with larger filter banks and much higher bit rates, and still reduce the data rate substantially below that required by log PCM. .sh "4.2 Pre-emphasis" .pp There is an overall \-6\ dB/octave trend in speech radiated from the lips, as frequency increases. We will discuss why this is so in the next chapter. Notice that this trend means that the signal power is reduced by a factor of 4, or the signal amplitude by a factor of 2, for each doubling in frequency. For vocoders, and indeed for other methods of spectral analysis of speech, it is usually desirable to equalize this by a +6\ dB/octave lift prior to processing, so that the channel outputs occupy a similar range of levels. On regeneration, the output speech is passed through an inverse filter which provides 6\ dB/octave of attenuation. .pp For a digital system, such pre-emphasis can either be implemented as an analogue circuit which precedes the presampling filter and digitizer, or as a digital operation on the sampled and quantized signal. In the former case, the characteristic is usually flat up to a certain breakpoint, which occurs somewhere between 100\ Hz and 1\ kHz \(em the exact position does not seem to be critical \(em at which point the +6\ dB/octave lift begins. Although de-emphasis on output ought to have an exactly inverse characteristic, it is sometimes modified or even eliminated altogether in an attempt to counteract approximately the $sin( pi f/f sub s )/( pi f/f sub s )$ distortion introduced by the desampling operation, which was discussed in an earlier section. Above half the sampling frequency, the characteristic of the pre-emphasis is irrelevant because any effect will be suppressed by the presampling filter. .pp The effect of a 6\ dB/octave lift can also be achieved digitally, by differencing the input. The operation .LB .EQ y(n)~~ = ~~ x(n)~ -~ ax(n-1) .EN .LE is suitable, where the constant parameter $a$ is usually chosen between 0.9 and 1. The latter value gives straightforward differencing, and this amounts to creating a DPCM signal as input to the spectral analysis. Figure 4.2 plots the frequency response of this operation, with a sample frequency of 8\ kHz, for two values of the parameter; together with that of a 6\ dB/octave lift above 100\ Hz. .FC "Figure 4.2" The vertical positions of the plots have been adjusted to give the same gain, 20\ dB, at 1\ kHz. The difference at 3.4\ kHz, the upper end of the telephone spectrum, is just over 2\ dB. At frequencies below the breakpoint, in this case 100\ Hz, the difference between analogue and digital pre-emphasis can be very great. For $a=0.9$ the attenuation at DC (zero frequency) is 18\ dB below that at 1\ kHz, which happens to be close to that of the analogue filter for frequencies below the breakpoint. However, if the breakpoint had been at 1\ kHz there would have been 20\ dB difference between the analogue and $a=0.9$ plots at DC.
And of course the $a=1$ characteristic has infinite attenuation at DC. In practice, however, the exact form of the pre-emphasis does not seem to be at all critical. .pp The above remarks apply only to voiced speech. For unvoiced speech there appears to be no real need for pre-emphasis; indeed, it may do harm by reinforcing the already large high-frequency components. There is a case for altering the parameter $a$ according to the excitation mode of the speech: $a=1$ for voiced excitation and $a=0$ for unvoiced gives pre-emphasis just when it is needed. This can be achieved by expressing the parameter in terms of the autocorrelation of the incoming signal, as .LB .EQ a ~~ = ~~ R(1) over R(0) ~ , .EN .LE where $R(1)$ is the correlation of the signal with itself delayed by one sample, and $R(0)$ is the correlation without delay (that is, the signal variance). This is reasonable intuitively because high sample-to-sample correlation is to be expected in voiced speech, so that $R(1)$ is very nearly as great as $R(0)$ and the ratio becomes 1; whereas little or no sample-to-sample correlation will be present in unvoiced speech, making the ratio close to 0. Such a scheme is reminiscent of ADPCM with adaptive prediction. .pp However, this sophisticated pre-emphasis method does not seem to be worthwhile in practice. Usually the breakpoint in an analogue pre-emphasis filter is chosen to be rather greater than 100\ Hz to limit the amplification of fricative energy. In fact, the channel vocoder described by Holmes (1980) has the breakpoint at 1\ kHz, limiting the gain to 12\ dB at 4\ kHz, two octaves above. .[ Holmes 1980 JSRU channel vocoder .] .sh "4.3 Digital signal analysis" .pp You may be wondering how the frequency response for the digital pre-emphasis filters, displayed in Figure 4.2, can be calculated. Suppose a digitized sinusoid is applied as input to the filter .LB .EQ y(n) ~~ = ~~ x(n)~ - ~ax(n-1). .EN .LE A sine wave of frequency $f$ has equation $x(t) ~ = ~ sin ~ 2 pi ft$, and when sampled at $t=0,~ T,~ 2T,~ ...$ (where $T$ is the sampling interval, 125\ $mu$sec for an 8\ kHz sample rate), this becomes $x(n) ~ = ~ sin ~ 2 pi fnT.$ It is much more convenient to consider a complex exponential input, $e sup { j2 pi fnT}$ \(em the response to a sinusoid can then be derived by taking imaginary parts, if necessary. The output for this input is .LB .EQ y(n) ~~ = ~~ e sup {j2 pi fnT} ~~-~ae sup {j2 pi f(n-1)T} ~~ = ~~ (1~-~ae sup {-j2 pi fT} )~e sup {j2 pi fnT} , .EN .LE a sinusoid at the same frequency as the input. The factor $1~-~ae sup {-j2 pi fT}$ is complex, with both amplitude and phase components. Thus the output will be a phase-shifted and scaled version of the input. The amplitude response at frequency $f$ is therefore .LB .EQ |1~ - ~ ae sup {-j2 pi fT} | ~~ = ~~ [1~ +~ a sup 2 ~-~ 2a~cos~2 pi fT ] sup 1/2 , .EN .LE or .LB .EQ 10 ~ log sub 10 (1~ +~ a sup 2 ~ - ~ 2a~ cos 2 pi fT) .EN dB. .LE Normalizing to 20\ dB at 1\ kHz, and assuming 8\ kHz sampling, yields .LB .EQ 20~ + ~~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ { pi f} over 4000 ) ~~ -~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ pi over 4 ) .EN dB. .LE With $a=0.9$ and 1 this gives the graphs of Figure 4.2. .pp Frequency responses for analogue filters are often plotted with a logarithmic frequency scale, as well as a logarithmic amplitude one, to bring out the asymptotes in dB/octave as straight lines. For digital filters the response is usually drawn on a .ul linear frequency axis extending to half the sampling frequency.
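.pp
The normalized response expression just derived is easy to evaluate numerically. The short Python sketch below is for illustration only: the function names are ours, the analogue lift is modelled as a simple first-order characteristic with its breakpoint at 100\ Hz, and 8\ kHz sampling is assumed. It gives figures in line with those quoted above for $a=0.9$.
.LB
.nf
import math

FS = 8000.0     # sampling frequency assumed throughout this chapter

def digital_gain_db(f, a):
    # 10 log10(1 + a*a - 2a cos(2 pi f T)), before normalization
    return 10.0 * math.log10(1.0 + a * a - 2.0 * a * math.cos(2.0 * math.pi * f / FS))

def analogue_gain_db(f, breakpoint=100.0):
    # a first-order +6 dB/octave lift above the breakpoint
    return 10.0 * math.log10(1.0 + (f / breakpoint) ** 2)

def normalized_digital(f, a):
    return 20.0 + digital_gain_db(f, a) - digital_gain_db(1000.0, a)

def normalized_analogue(f):
    return 20.0 + analogue_gain_db(f) - analogue_gain_db(1000.0)

for f in (0.0, 100.0, 1000.0, 3400.0):
    print(f, round(normalized_digital(f, 0.9), 1), round(normalized_analogue(f), 1))
.fi
.LE
On the linear frequency axis just mentioned, such a plot need only extend to half the sampling frequency, 4\ kHz here.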
The response is symmetric about this point. .pp Analyses like the above are usually expressed in terms of the $z$-transform. Denote the unit delay operation by $z sup -1$. The choice of the inverse rather than $z$ itself is of course an arbitrary matter, but the convention has stuck. Then the filter can be characterized by Figure 4.3, which signifies that the output is the input minus a delayed and scaled version of itself. .FC "Figure 4.3" The transfer function of the filter is .LB .EQ H(z) ~~ = ~~ 1~ -~ az sup -1 , .EN .LE and we have seen that the effect of the system on a (complex) exponential of frequency $f$ is to multiply it by .LB .EQ 1~ -~ ae sup {-j2 pi fT}. .EN .LE To get the frequency response from the transfer function, replace $z sup -1$ by $e sup {-j2 pi fT}$. Amplitude and phase responses can then be found by taking the modulus and angle of the complex frequency response. .pp If $z sup -1$ is treated as an .ul operator, it is quite in order to summarize the action of the filter by .LB .EQ y(n) ~~ = ~~ x(n)~ - ~az sup -1 x(n) ~~ = ~~ (1~ -~ az sup -1 )x(n). .EN .LE However, it is usual to derive from the sequence $x(n)$ a .ul transform $X(z)$ upon which $z sup -1$ acts as a .ul multiplier. If the transform of $x(n)$ is defined as .LB .EQ X(z) ~~ = ~~ sum from {n=- infinity} to infinity ~x(n) z sup -n , .EN .LE then on multiplication by $z sup -1$ we get a new transform, say $V(z)$: .LB .EQ V(z) ~~ = ~~ z sup -1 X(z) ~~ = ~~ z sup -1 sum from {n=- infinity} to infinity ~x(n) z sup -n ~~ = ~~ sum ~x(n)z sup -n-1 ~~ = ~~ sum ~x(n-1)z sup -n . .EN .LE $V(z)$ can also be expressed as the transform of a new sequence, say $v(n)$, by .LB .EQ V(z) ~~ = ~~ sum from {n=- infinity} to infinity ~v(n) z sup -n , .EN .LE from which it becomes apparent that .LB .EQ v(n) ~~ = ~~ x(n-1). .EN .LE Thus $v(n)$ is a delayed version of $x(n)$, and we have accomplished what we set out to do, namely to show that the delay .ul operator $z sup -1$ can be treated as an ordinary .ul multiplier in the $z$-transform domain, where $z$-transforms are defined as the infinite sums given above. .pp In terms of $z$-transforms, the filter can be written .LB .EQ Y(z) ~~ = ~~ (1~ -~ az sup -1 )X(z), .EN .LE where $z sup -1$ is now treated as a multiplier. The transfer function of the filter is .LB .EQ H(z) ~~ = ~~ Y(z) over X(z) ~~ = ~~ 1 - az sup -1 , .EN .LE the ratio of the output to the input transform. .pp It may seem that little has been gained by inventing this rather abstract notion of transform, simply to change an operator to a multiplier. After all, the equation of the filter is no simpler in the transform domain than it was in the time domain using $z sup -1$ as an operator. However, we will need to go on to examine more complex filters. Consider, for example, the transfer function .LB .EQ H(z) ~~ = ~~ {1~+~az sup -1 ~+~bz sup -2} over {1~+~cz sup -1 ~+~dz sup -2} ~ . .EN .LE If $z sup -1$ is treated as an operator, it is not immediately obvious how this transfer function can be realized by a time-domain recurrence relation. However, with $z sup -1$ as an ordinary multiplier in the transform domain, we can make purely mechanical manipulations with infinite sums to see what the transfer function means as a recurrence relation. .pp It is worth noting the similarity between the $z$-transform in the discrete domain and the Fourier and Laplace transforms in the continuous domains. 
In fact, the $z$-transform plays an analogous role in digital signal processing to the Laplace transform in continuous theory, for the delay operator $z sup -1$ performs a similar service to the differentiation operator $s$. Recall first the continuous Fourier transform, .LB $ G(f) ~~ = ~~ integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt $, where $f$ is real, .LE and the Laplace transform, .LB $ F(s) ~~ = ~~ integral from 0 to infinity ~f(t)~e sup -st dt $, where $s$ is complex. .LE The main difference between these two transforms is that the range of integration begins at -$infinity$ for the Fourier transform and at 0 for the Laplace. Advocates of the Fourier transform, which typically include people involved with telecommunications, enjoy the freedom from initial conditions which is bestowed by an origin way back in the mists of time. Advocates of Laplace, including most analogue filter theorists, invariably consider systems where all is quiet before $t=0$ \(em altering the origin of measurement of time to achieve this if necessary \(em and welcome the opportunity to include initial conditions explicitly .ul without having to worry about what happens in the mists of time. Although there is a two-sided Laplace transform where the integration begins at -$infinity$, it is not generally used because it causes some convergence complications. Ignoring this difference between the transforms (by considering signals which are zero when $t<0$), the Fourier spectrum can be found from the Laplace transform by writing $s=j2 pi f$; that is, by considering values of $s$ which lie on the imaginary axis. .pp The $z$-transform is .LB $ H(z) ~~ = ~~ sum from n=0 to infinity ~h(n)~z sup -n $, or $ H(z) ~~ = ~~ sum from {n=- infinity} to infinity ~h(n)~z sup -n , $ .LE depending on whether a one-sided or two-sided transform is used. The advantages and disadvantages of one- and two-sided transforms are the same as in the analogue case. $z$ plays the role of $e sup sT $, and so it is not surprising that the response to a (sampled) sinusoid input can be found by setting .LB .EQ z ~~ = ~~ e sup {j2 pi fT} .EN .LE in $H(z)$, as we proved explicitly above for the pre-emphasis filter. .pp The above relation between $z$ and $f$ means that real-valued frequencies correspond to points where $|z|=1$, that is, the unit circle in the complex $z$-plane. As you travel anticlockwise around this unit circle, starting from the point $z=1$, the corresponding frequency increases from 0, to $1/2T$ half-way round ($z=-1$), to $1/T$ when you get back to the beginning ($z=1$) again. Frequencies greater than the sampling frequency are aliased back into the sampling band, corresponding to further circuits of $|z|=1$ with frequency going from $1/T$ to $2/T$, $2/T$ to $3/T$, and so on. In fact, this is the circle of Figure 3.3 which was used earlier to explain how sampling affects the frequency spectrum! .sh "4.4 Discrete Fourier transform" .pp Let us return from this brief digression into techniques of digital signal analysis to the problem of determining the frequency spectrum of speech. Although a bank of bandpass filters such as is used in the channel vocoder is the perhaps most straightforward way to obtain a frequency spectrum, there are other techniques which are in fact more commonly used in digital speech processing. .pp It is possible to define the Fourier transform of a discrete sequence of points. 
To motivate the definition, consider first the ordinary Fourier transform (FT), which is .LB $ g(t) ~~ = ~~ integral from {- infinity} to infinity ~G(f)~e sup {+j2 pi ft} df ~~~~~~~~~~~~~~~~ G(f) ~~ = ~~ integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt . $ .LE This takes a continuous time domain into a continuous frequency domain. Sometimes you see a normalizing factor $1/2 pi$ multiplying the integral in either the forward or the reverse transform. This is only needed when the frequency variable is expressed in radians/s, and we will find it more convenient to express frequencies in\ Hz. .pp The Fourier series (FS), which should also be familiar to you, operates on a periodic time waveform (or, equivalently, one that only exists for a finite period of time, which is notionally extended periodically). If a period lies in the time range $[0,b)$, then the transform is .LB $ g(t) ~~ = ~~ sum from {r = - infinity} to infinity ~G(r)~e sup {+j2 pi rt/b} ~~~~~~~~~~~~~~~~ G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} dt . $ .LE The Fourier series takes a periodic time-domain function into a discrete frequency-domain one. Because of the basic duality between the time and frequency domains in the Fourier transforms, it is not surprising that another version of the transform can be defined which takes a periodic .ul frequency\c -domain function into a discrete .ul time\c -domain one. .pp Fourier transforms can only deal with a finite stretch of a time signal by assuming that the signal is periodic, for if $g(t)$ is evaluated from its transform $G(r)$ according to the formula above, and $t$ is chosen outside the interval $[0,b)$, then a periodic extension of the function $g(t)$ is obtained automatically. Furthermore, periodicity in one domain implies discreteness in the other. Hence if we transform a .ul finite stretch of a .ul discrete time waveform, we get a frequency-domain representation which is also finite (or, equivalently, periodic), and discrete. This is the discrete Fourier transform (DFT), and takes a discrete periodic time-domain function into a discrete periodic frequency-domain one as illustrated in Figure 4.4. .FC "Figure 4.4" It is defined by .LB $ g(n) ~~ = ~~ 1 over N ~ sum from r=0 to N-1~G(r)~e sup { + j2 pi rn/N} ~~~~~~~~~~~~~~~~ G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~e sup { - j2 pi rn/N} , $ .LE or, writing $W=e sup {-j2 pi /N}$, .LB $ g(n) ~~ = ~~ 1 over N ~ sum from r=0 to N-1~G(r)~W sup -rn ~~~~~~~~~~~~~~~~ G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup rn . $ .LE .sp The $1/N$ in the first equation is the same normalizing factor as the $1/b$ in the Fourier series, for the finite time domain is $[0,N)$ in the discrete case and $[0,b)$ in the Fourier series case. It does not matter whether it is written into the forward or the reverse transform, but it is usually placed as shown above as a matter of convention. .pp As illustrated by Figure 4.5, discrete Fourier transforms take an input of $N$ real values, representing equally-spaced time samples in the interval $[0,b)$, and produce as output $N$ complex values, representing equally-spaced frequency samples in the interval $[0,N/b)$. .FC "Figure 4.5" Note that the end-point of this frequency interval is the sampling frequency. It seems odd that the input is real and the output is the same number of .ul complex quantities: we seem to be getting some numbers for nothing! 
However, this isn't so, for it is easy to show that if the input sequence is real, the output frequency spectrum has a symmetry about its mid-point (half the sampling frequency). This can be expressed as .LB DFT symmetry:\0\0\0\0\0\0 $ ~ mark G( half N +r) ~=~ G( half N -r) sup *$ if $g$ is real-valued, .LE where $*$ denotes the conjugate of a complex quantity (that is, $(a+jb) sup * = a-jb$). .pp It was argued above that the frequency spectrum in the DFT is periodic, with the spectrum from 0 to the sampling frequency being repeated regularly up and down the frequency axis. It can easily be seen from the DFT equation that this is so. It can be written .LB DFT periodicity:$ lineup G(N+r) ~=~ G(r)$ always. .LE Figure 4.6 illustrates the properties of symmetry and periodicity. .FC "Figure 4.6" .sh "4.5 Estimating the frequency spectrum of speech using the DFT" .pp Speech signals are not exactly periodic. Although the waveform in a particular pitch period will usually resemble those in the preceding and following pitch periods, it will certainly not be identical to them. As the articulation of the speech changes, the formant positions will alter. As we saw in Chapter 2, the pitch itself is certainly not constant. Hence the fundamental assumption of the DFT, that the waveform is periodic, is not really justified. However, the signal is quasi-periodic, for changes from period to period will not usually be very great. One way of computing the short-term frequency spectrum of speech is to use .ul pitch-synchronous Fourier transformation, where single pitch periods are isolated from the waveform and processed with the DFT. This gives a rather accurate estimate of the spectrum. Unfortunately, it is difficult to determine the beginning and end of each pitch cycle, as we shall see later in this chapter when discussing pitch extraction techniques. .pp If a finite stretch of a speech waveform is isolated and Fourier transformed, without regard to pitch of the speech, then the periodicity assumption will be grossly violated. Figure 4.7 illustrates that the effect is the same as multiplying the signal by a rectangular .ul window function, which is 0 except during the period to be analysed, where it is 1. .FC "Figure 4.7" The windowed sequence will almost certainly have discontinuities at its edges, and these will affect the resulting spectrum. The effect can be analysed quite easily, but we will not do so here. It is enough to say that the high frequencies associated with the edges of the window cause considerable distortion of the spectrum. The effect can be alleviated by using a smoother window than a rectangular one, and several have been investigated extensively. The commonly-used windows of Bartlett, Blackman, and Hamming are illustrated in Figure 4.8. .FC "Figure 4.8" .pp Because the DFT produces the same number of frequency samples, equally spaced, as there were points in the time waveform, there is a tradeoff between frequency resolution and time resolution (for a given sampling rate). For example, a 256-point transform with a sample rate of 8\ kHz gives the 256 equally-spaced frequency components between 0 and 8\ kHz that are shown in Table 4.2. 
.RF .nr x0 (\w'time domain'/2) .nr x1 (\w'frequency domain'/2) .in+1.0i .ta 1.0i 3.0i 4.0i \h'0.5i+2n-\n(x0u'time domain\h'|3.5i+2n-\n(x1u'frequency domain .sp sample time sample \h'-3n'frequency number number .nr x0 1i+\w'00000' \l'\n(x0u\(ul' \l'\n(x0u\(ul' .sp \0\0\00 \0\0\0\00 $mu$sec \0\0\00 \0\0\0\00 Hz \0\0\01 \0\0125 \0\0\01 \0\0\031 \0\0\02 \0\0250 \0\0\02 \0\0\062 \0\0\03 \0\0375 \0\0\03 \0\0\094 \0\0\04 \0\0500 \0\0\04 \0\0125 .nr x2 (\w'...'/2) \h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... \h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... \h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... .sp \0254 31750 \0254 \07938 \0255 31875 $mu$sec \0255 \07969 Hz \l'\n(x0u\(ul' \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .MT 2 Table 4.2 Time domain and frequency domain samples for a 256-point DFT, with 8\ kHz sampling .TE The top half of the frequency spectrum is of no interest, because it contains the complex conjugates of the bottom half (in reverse order), corresponding to frequencies greater than half the sampling frequency. Thus for a 30\ Hz resolution in the frequency domain, 256 time samples, or a 32\ msec stretch of speech, needs to be transformed. A common technique is to take overlapping periods in the time domain to give a new frequency spectrum every 16\ msec. From the acoustic point of view this is a reasonable rate to re-compute the spectrum, for as noted above when discussing channel vocoders the rate of change in the spectrum is limited by the speed at which the speaker can move his vocal organs, and anything between 10 and 25\ msec is a reasonable figure for transmitting or storing the spectrum. .pp The DFT is a complex transform, and speech is a real signal. It is possible to do two DFT's at once by putting one time waveform into the real parts of the input and another into the imaginary parts. This destroys the DFT symmetry property, for it only holds for real inputs. But given the DFT of a complex sequence formed in this way, it is easy to separate out the DFT's of the two real time sequences. If the two time sequences are $x(n)$ and $y(n)$, then the transform of the complex sequence .LB .EQ g(n) ~~ = ~~ x(n) ~+~ jy(n) .EN .LE is .LB .EQ G(r) ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn ~+~ jy(n)W sup rn ] . .EN .LE It follows that the complex conjugate of the aliased parts of the spectrum, in the upper frequency region, is .LB .EQ G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup -(N-r)n ~-~ jy(n)W sup -(N-r)n ] , .EN .LE and this is the same as .LB .EQ G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn ~-~ jy(n)W sup rn ] , .EN .LE because $W sup N$ is 1 (recall the definition of $W$), and so $W sup -Nn$ is 1 for any $n$. Thus .LB .EQ X(r) ~~ = ~~ {G(r) ~+~ G(N-r) sup * } over 2 ~~~~~~~~~~~~~~~~ Y(r) ~~ = ~~ {G(r) ~-~ G(N-r) sup * } over 2j .EN .LE extracts the transforms $X(r)$ and $Y(r)$ of the original sequences $x$ and $y$. .pp With speech, this trick is frequently used to calculate two spectra at once. Using 256-point transforms, a new estimate of the spectrum can be obtained every 16\ msec by taking overlapping 32\ msec stretches of speech, with a computational requirement of one 256-point transform every 32\ msec. .sh "4.6 The fast Fourier transform" .pp Straightforward calculation of the DFT, expressed as .LB .EQ G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup nr , .EN .LE for $r=0,~ 1,~ 2,~ ...,~ N-1$, takes $N sup 2$ operations, where each operation is a complex multiply and add (for $W$ is, of course, a complex number).
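.pp
As a point of reference, this definition can be transcribed directly into a few lines of Python using the built-in complex arithmetic (an illustrative sketch; the function name is ours). It performs exactly the $N sup 2$ complex multiply-and-add operations just counted.
.LB
.nf
import cmath

def slow_dft(g):
    # direct transcription of G(r) = sum of g(n) W**(r n), W = exp(-j 2 pi / N)
    N = len(g)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(g[n] * W ** (r * n) for n in range(N)) for r in range(N)]
.fi
.LE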
There is a better way, invented in the early sixties, which reduces this to $N ~ log sub 2 N$ operations \(em a very considerable improvement. Dubbed the "fast Fourier transform" (FFT) for historical reasons, it would actually be better called the "Fourier transform", with the straightforward method above known as the "slow Fourier transform"! There is no reason nowadays to use the slow method, except for tiny transforms. It is worth describing the basic principle of the FFT, for it is surprisingly simple. More details on actual implementations can be found in Brigham (1974). .[ Brigham 1974 .] .pp It is important to realize that the FFT involves no approximation. It is an .ul exact calculation of the values that would be obtained by the slow method (although it may be affected differently by round-off errors). Problems of aliasing and windowing occur in all discrete Fourier transforms, and they are neither alleviated nor exacerbated by the FFT. .pp To gain insight into the working of the FFT, imagine the sequence $g(n)$ split into two halves, containing the even and odd points respectively. .LB even half $e(n)$ is $g(0)~ g(2)~ .~ .~ .~ g(N-2)$ .br odd half $o(n)$ is $g(1)~ g(3)~ .~ .~ .~ g(N-1)$. .LE Then it is easy to show that if $G$ is the transform of $g$, $E$ the transform of $e$, and $O$ that of $o$, then .LB $ G(r) ~~ = ~~ E(r) ~+~ W sup r O(r)$ for $r=0,~ 1,~ ...,~ half N -1$, .LE and .LB $ G( half N +r ) ~~ = ~~ E(r) ~+~ W sup { half N +r} O(r)$ for $ r = 0,~ 1,~ ...,~ half N -1$. .LE Calculation of the $E$ and $O$ transforms involves $( half N) sup 2$ operations each, while combining them together according to the above relationship occupies $N$ operations. Thus the total is $N + half N sup 2 $ operations, which is considerably less than $N sup 2$. .pp But don't stop there! The even half can itself be broken down into even and odd parts to expedite its calculation, and the same with the odd half. The only constraint is that the number of elements in the sequences splits exactly into two at each stage. Providing $N$ is a power of 2, then, we are left at the end with some 1-point transforms to do. But transforming a single point leaves it unaffected! (Check the definition of the DFT.) A quick calculation shows that the number of operations needed is not $N + half N sup 2$, but $N~ log sub 2 N$. Figure 4.9 compares this with $N sup 2$, the number of operations for straightforward DFT calculation, and it can be seen that the FFT is very much faster. .FC "Figure 4.9" .pp The only restriction on the use of the FFT is that $N$ must be a power of two. If it is not, alternative, more complicated, algorithms can be used which give comparable computational advantages. However, for speech processing the number of samples that are transformed is usually arranged to be a power of two. If a pitch synchronous analysis is undertaken, the time stretch that is to be transformed is dictated by the length of the pitch period, and will vary from time to time. Then, it is usual to pad out the time waveform with zeros to bring the number of samples up to a power of two; otherwise, if different-length time stretches were transformed the scale of the resulting frequency components would vary too. .pp The FFT provides very worthwhile cost savings over the use of a bank of bandpass filters for spectral analysis. Take the example of a 256-point transform with 8\ kHz sampling, giving 128 frequency components spaced by 31.25\ Hz from 0 up to almost 4\ kHz. 
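.pp
The decimation into even and odd halves translates directly into a recursive routine, sketched below in Python (again purely for illustration, with $N$ restricted to a power of two as discussed above; production implementations are iterative and more careful about storage, but the arithmetic is the same).
.LB
.nf
import cmath

def fft(g):
    # recursive radix-2 decimation in time; len(g) must be a power of two
    N = len(g)
    if N == 1:
        return list(g)       # a 1-point transform leaves the point unchanged
    E = fft(g[0::2])         # transform of the even-numbered points
    O = fft(g[1::2])         # transform of the odd-numbered points
    G = [0] * N
    for r in range(N // 2):
        Wr = cmath.exp(-2j * cmath.pi * r / N)
        G[r] = E[r] + Wr * O[r]
        G[N // 2 + r] = E[r] - Wr * O[r]   # W**(N/2 + r) is -W**r
    return G
.fi
.LE
Applied to the 256-point, 8\ kHz example above it delivers the same 128 useful frequency components as the direct method, for a small fraction of the work.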
This can be computed on overlapping 32\ msec stretches of the time waveform, giving a new spectrum every 16\ msec, by a single FFT calculation every 32\ msec (putting successive pairs of time stretches in the real and imaginary parts of the complex input sequence, as described earlier). The FFT algorithm requires $N~ log sub 2 N$ operations, which is 2048 when $N=256$. An additional 512 operations are required for the windowing calculation. Repeated every 32\ msec, this gives a rate of 80,000 operations per second. To achieve a much lower frequency resolution with 20 bandpass filters, each of which is fourth-order, will require a great many more operations. Each filter will need between 4 and 8 multiplications per sample, depending on its exact digital implementation. But new samples appear every 125 .ul micro\c seconds, and so somewhere around a million operations will be required every second. If we increased the frequency resolution to that obtained by the FFT, 128 filters would be needed, requiring between 4 and 8 million operations every second! .sh "4.7 Formant estimation" .pp Once the frequency spectrum of a speech signal has been calculated, it may seem a simple matter to estimate the positions of the formants. But it is not! Spectra obtained in practice are not usually like the idealized ones of Figure 2.2. One reason for this is that, unless the analysis is pitch-synchronous, the frequency spectrum of the excitation source is mixed in with that of the vocal tract filter. There are other reasons, which will be discussed later in this section. But first, let us consider how to extract the vocal tract filter characteristics from the combined spectrum of source and filter. To do so we must begin to explore the theory of linear systems. .rh "Discrete linear systems." Figure 4.10 shows an input signal exciting a filter to produce an output signal. .FC "Figure 4.10" For present purposes, imagine the input to be a glottal waveform, the filter a vocal tract one, and the output a speech signal (which is then subjected to high-frequency de-emphasis by radiation from the lips). We will consider here .ul discrete systems, so that the input $x(n)$ and output $y(n)$ are sampled signals, defined only when $n$ is integral. The theory is quite similar for continuous systems. .pp Assume that the system is .ul linear, that is, if input $x sub 1 (n)$ produces output $y sub 1 (n)$ and input $x sub 2 (n)$ produces output $y sub 2 (n)$, then the sum of $x sub 1 (n)$ and $x sub 2 (n)$ will produce the sum of $y sub 1 (n)$ and $y sub 2 (n)$. It is easy to show from this that, for any constant multiplier $a$, the input $ax(n)$ will produce output $ay(n)$ \(em it is pretty obvious when $a=2$, or indeed any positive integer; for then $ax(n)$ can be written as $x(n)+x(n)+...$ . Assume further that the system is .ul time-invariant, that is, if input $x(n)$ produces output $y(n)$ then a time-shifted version of $x$, say $x(n+n sub 0 )$ for some constant $n sub 0$, will produce the same output, only time-shifted; namely $y(n+n sub 0)$. .pp Now consider the discrete delta function $delta (n)$, which is 0 except at $n=0$ when it is 1. If this single impulse is presented as input to the system, the output is called the .ul impulse response, and will be denoted by $h(n)$. The fact that the system is time-invariant guarantees that the response does not depend upon the particular time at which the impulse occurred, so that, for example, the impulsive input $delta (n+n sub 0 )$ will produce output $h(n+n sub 0 )$.
A delta-function input and corresponding impulse response are shown in Figure 4.10. .pp The impulse response of a linear, time-invariant system is an extremely useful thing to know, for it can be used to calculate the output of the system for any input at all! Specifically, an input signal $x(n)$ can be written .LB .EQ x(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) delta (n-k) , .EN .LE because $delta (n-k)$ is non-zero only when $k=n$, and so for any particular value of $n$, the summation contains only one non-zero term \(em that is, $x(n)$. The action of the system on each term of the sum is to produce an output $x(k)h(n-k)$, because $x(k)$ is just a constant, and the system is linear. Furthermore, the complete input $x(n)$ is just the sum of such terms, and since the system is linear, the output is the sum of $x(k)h(n-k)$. Hence the response of the system to an arbitrary input is .LB .EQ y(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) h(n-k) . .EN .LE This is called a .ul convolution sum, and is sometimes written .LB .EQ y(n)~ =~ x(n) ~*~ h(n). .EN .LE .pp Let's write this in terms of $z$-transforms. The (two-sided) $z$-transform of $y(n)$ is .LB .EQ Y(z)~ = ~~ sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ = ~~ sum from n ~ sum from k ~x(k)h(n-k) ~z sup -n . .EN .LE Writing $z sup -n$ as $z sup -(n-k) z sup -k$, and interchanging the order of summation, this becomes .LB .EQ Y(z)~ mark = ~~ sum from k ~[~ sum from n ~ h(n-k)z sup -(n-k) ~]~x(k)z sup -k .EN .br .EQ lineup = ~~ sum from k ~H(z)~x(k)z sup -k ~~ = ~~ H(z)~ sum from k ~x(k)z sup -k ~~=~~H(z)X(z) . .EN .LE Thus convolution in the time domain is the same as multiplication in the $z$-transform domain; a very important result. Applied to the linear system of Figure 4.10, this means that the output $z$-transform is the input $z$-transform multiplied by the $z$-transform of the system's impulse response. .pp What we really want to do is to relate the frequency spectrum of the output to the response of the system and the spectrum of the input. In fact, frequency spectra are very closely connected with $z$-transforms. A periodic signal $x(n)$ which repeats every $N$ samples has DFT .LB .EQ sum from n=0 to N-1 ~x(n)~e sup {-j2 pi rn/N} , .EN .LE and its $z$-transform is .LB .EQ sum from {n=- infinity} to infinity ~x(n) ~z sup -n . .EN .LE Hence the DFT is the same as the $z$-transform of a single cycle of the signal, evaluated at the points $z= e sup {j2 pi r/N}$ for $r=0,~ 1,~ ...~ ,~ N-1$. In other words, the frequency components are samples of the $z$-transform at $N$ equally-spaced points around the unit circle. Hence the frequency spectrum at the output of a linear system is the product of the input spectrum and the frequency response of the system itself (that is, the transform of its impulse response function). It should be admitted that this statement is somewhat questionable, because to get from $z$-transforms to DFT's we have assumed that a single cycle only is transformed \(em and the impulse response function of a system is not necessarily periodic. The real action of the system is to multiply $z$-transforms, not DFT's. However, it is useful in imagining the behaviour of the system to think in terms of products of DFT's; and in practice it is always these rather than $z$-transforms which are computed because of the existence of the FFT algorithm. .pp Figure 4.11 shows the frequency spectrum of a typical voiced speech signal.
.FC "Figure 4.11" The overall shape shows humps at the formant positions, like those in the idealized Figure 2.2. However, superimposed on this is an "oscillation" (in the frequency domain!) at the pitch frequency. This occurs because the transform of the vocal tract filter has been multiplied by that of the pitch pulse, the latter having components at harmonics of the pitch frequency. The oscillation must be suppressed before the formants can be estimated to any degree of accuracy. .pp One way of eliminating the oscillation is to perform pitch-synchronous analysis. This removes the influence of pitch from the frequency domain by dealing with it in the time domain! The snag is, of course, that it is not easy to estimate the pitch frequency: some techniques for doing so are discussed in the next main section. Another way is to use linear predictive analysis, which really does get rid of pitch information without having to estimate the pitch period first. A smooth frequency spectrum can be produced using the analysis techniques described in Chapter 6, which provides a suitable starting-point for formant frequency estimation. The third method is to remove the pitch ripple from the frequency spectrum directly. This will be discussed in an intuitive rather than a theoretical way, because linear predictive methods are becoming dominant in speech processing. .rh "Cepstral processing of speech." Suppose the frequency spectrum of Figure 4.11 were actually a time waveform. To remove the high-frequency pitch ripple is easy: just filter it out! However, filtering removes .ul additive ripples, whereas this is a .ul multiplicative ripple. To turn multiplication into addition, take logarithms. Then the procedure would be .LB .NP compute the DFT of the speech waveform (windowed, overlapped); .NP take the logarithm of the transform; .NP filter out the high-frequency part, corresponding to pitch ripple. .LE .pp Filtering is often best done using the DFT. If the rippled waveform of Figure 4.11 is transformed, a strong component could be expected at the ripple frequency, with weaker ones at its harmonics. These components can be simply removed by setting them to zero, and inverse-transforming the result to give a smoothed version of the original frequency spectrum. A spectrum of the logarithm of a frequency spectrum is often called a .ul cepstrum \(em a sort of backwards spectrum. The horizontal axis of the cepstrum, having the dimension of time, is called "quefrency"! Note that high-frequency signals have low quefrencies and vice versa. In practice, because the pitch ripple is usually well above the quefrency of interest for formants, the upper end of the cepstrum is often simply cut off from a fixed quefrency which corresponds to the maximum pitch expected. However, identifying the pitch peaks of the cepstrum has the useful byproduct of giving the pitch period of the original speech. 
.pp To summarize, then, the procedure for spectral smoothing by the cepstral method is .LB .NP compute the DFT of the speech waveform (windowed, overlapped); .NP take the logarithm of the transform; .NP take the DFT of this log-transform, calling it the cepstrum; .NP identify the lowest-quefrency peak in the cepstrum as the pitch, confirming it by examining its harmonics, which should be equally spaced at the pitch quefrency; .NP remove pitch effects from the cepstrum by cutting off its high-quefrency part above either the pitch quefrency or some constant representing the maximum expected pitch (which is the minimum expected pitch quefrency); .NP inverse DFT the resulting cepstrum to give a smoothed spectrum. .LE .rh "Estimating formant frequencies from smoothed spectra." The difficulties of formant extraction are not over even when a smooth frequency spectrum has been obtained. A simple peak-picking algorithm which identifies a peak at the $k$'th frequency component whenever .LB $ X(k-1) ~<~ X(k) $ and $ X(k) ~>~ X(k+1) $ .LE will quite often identify formants incorrectly. It helps to specify in advance minimum and maximum formant frequencies \(em say 100\ Hz and 3\ kHz for three-formant identification, and ignore peaks lying outside these limits. It helps to estimate the bandwidth of the peaks and reject those with bandwidths greater than 500\ Hz \(em for real formants are never this wide. However, if two formants are very close, then they may appear as a single, wide, peak and be rejected by this criterion. It is usual to take account of formant positions identified in previous frames under these conditions. .pp Markel and Gray (1976) describe in detail several estimation algorithms. .[ Markel Gray 1976 Linear prediction of speech .] Their simplest uses the number of peaks identified in the raw spectrum (under 3\ kHz, and with bandwidths less than 500\ Hz) to determine what to do. If exactly three peaks are found, they are used as the formant positions. It is claimed that this happens about 85% to 90% of the time. If only one peak is found, the present frame is ignored and the previously-identified formant positions are used (this happens less than 1% of the time). The remaining cases are two peaks \(em corresponding to omission of one formant \(em and four peaks \(em corresponding to an extra formant being included. (More than four peaks never occurred in their data.) Under these conditions, a nearest-neighbour measure is used for disambiguation. The measure is .LB .EQ v sub ij ~ = ~ |{ F sup * } sub i (k) ~-~ F sub j (k-1)| , .EN .LE where $F sub j (k-1)$ is the $j$'th formant frequency defined in the previous frame $k-1$ and ${ F sup * } sub i (k)$ is the $i$'th raw data frequency estimate for frame $k$. If only two peaks are found, this measure is used to match them with the two closest formants of the previous frame; the remaining formant of that frame is then taken to supply the missing formant position. If four peaks are found, the measure is used to determine which of them is furthest from the previous formant values, and this one is discarded. .pp This procedure works forwards, using the previous frame to disambiguate peaks given in the current one. More sophisticated algorithms work backwards as well, identifying .ul anchor points in the data which have clearly-defined formant positions, and moving in both directions from these to disambiguate neighbouring frames of data.
Finally, absolute limits can be imposed upon the magnitude of formant movements between frames to give an overall smoothing to the formant tracks. .pp Very often, people will refine the result of such automatic formant estimation procedures by hand, looking at the tracks, knowing what was said, and making adjustments in the light of their experience of how formants move in speech. Unfortunately, it is difficult to obtain high-quality formant tracks by completely automatic means. .pp One of the most difficult cases in formant estimation is where two formants are so close together that the individual peaks cannot be resolved. One simple solution to this problem is to employ "analysis-by-synthesis", whereby once a formant is identified, a standard formant shape at this position is synthesized and subtracted from the logarithmic spectrum (Coker, 1963). .[ Coker 1963 .] Then, even if two formants are right on top of each other, the second is not missed because it remains after the first one has been subtracted. .pp Unfortunately, however, the single peak which appears when two formants are close together usually does not correspond exactly with the position of either one. There is one rather advanced signal-processing technique that can help in this case. The frequency spectrum of speech is determined by .ul poles which lie in the complex $z$-plane inside the unit circle. (They must be inside the unit circle if the system is stable. Those familiar with Laplace analysis of analogue systems may like to note that the left half of the $s$-plane corresponds with the inside of the unit circle in the $z$-plane.) As shown earlier, computing a DFT is tantamount to evaluating the $z$-transform at equally-spaced points around the unit circle. However, better resolution is obtained by evaluating around a circle which lies .ul inside the unit circle, but .ul outside the outermost pole position. Such a circle is sketched in Figure 4.12. .FC "Figure 4.12" .pp Recall that the FFT is a fast way of calculating the DFT of a sequence. Is there a similarly fast way of evaluating the $z$-transform inside the unit circle? The answer is yes, and the technique is known as the "chirp $z$-transform", because it involves considering a signal whose frequency increases linearly \(em just like a radar chirp signal. The chirp method allows the $z$-transform to be computed quickly at equally-spaced points along spirally-shaped contours around the origin of the $z$-plane \(em corresponding to signals of linearly increasing complex frequency. The spiral nature of these curves is not of particular interest in speech processing. What .ul is of interest, though, is that the spiral can begin at any point in the $z$-plane, and its pitch can be set arbitrarily. If we begin spiralling at $z=0.9$, say, and set the pitch to zero, the contour becomes a circle inside the unit one, with radius 0.9. Such a circle is exactly what is needed to refine formant resolution. .sh "4.8 Pitch extraction" .pp The last section discussed how to characterize the vocal tract filter in the source-filter model of speech production: this one looks at how the most important property of the source \(em that is, the pitch period \(em can be derived. In many ways pitch extraction is more important from a practical point of view than is formant estimation. In a voice-output system, formant estimation is only necessary if speech is to be stored in formant-coded form.
For linear predictive storage of speech, or for speech synthesis from phonetics or text, formant extraction is unnecessary \(em although of course general information about formant frequencies and formant tracks in natural speech is needed before a synthesis-from-phonetics system can be built. However, knowledge of the pitch contour is needed for many different purposes. For example, compact encoding of linearly predicted speech relies on the pitch being estimated and stored as a parameter separate from the articulation. Significant improvements in frequency analysis can be made by performing pitch-synchronous Fourier transformations, because the need to window is eliminated. Many synthesis-from-phonetics systems require the pitch contour for utterances to be stored rather than computed from markers in the phonetic text. .pp Another issue which is closely bound up with pitch extraction is the voiced-unvoiced distinction. A good pitch estimator ought to fail when presented with aperiodic input such as an unvoiced sound, and so give a reliable indication of whether the frame of speech is voiced or not. .pp One method of pitch estimation, which uses the cepstrum, has been outlined above. It involves a substantial amount of computation and is quite intricate. However, if implemented properly it gives excellent results, because the source-filter structure of the speech is fully utilized. Another method, using the linear prediction residual, will be described in Chapter 6. Again, this requires a great deal of computation of a fairly sophisticated nature, and gives good results \(em although it relies on a somewhat more restricted version of the source-filter model than cepstral analysis. .rh "Autocorrelation methods." The most reliable way of estimating the pitch of a periodic signal which is corrupted by noise is to examine its short-time autocorrelation function. The autocorrelation of a signal $x(n)$ with lag $k$ is defined as .LB .EQ phi (k) ~~ = ~~ sum from {n=- infinity} to infinity ~ x(n)x(n+k) . .EN .LE If the signal is quasi-periodic, with slowly varying period, a finite stretch of it can be isolated with a window $w(i)$, which is 0 when $i$ is outside the range $[0,N)$. Beginning this window at sample $m$ gives the windowed signal .LB .EQ x(n)w(n-m), .EN .LE whose autocorrelation, the .ul short-time autocorrelation of the signal $x$ at point $m$, is .LB .EQ phi sub m (k)~ = ~~ sum from n ~ x(n)w(n-m)x(n+k)w(n-m+k) . .EN .LE .pp The autocorrelation function exhibits peaks at lags which correspond to the pitch period and multiples of it. At such lags, the signal is in phase with a delayed version of itself, giving high correlation. The pitch of natural speech ranges over about three octaves, from 50\ Hz (low-pitched men) to around 400\ Hz (children). To ensure that at least two pitch cycles are seen, even at the low end, the window needs to be at least 40\ msec long, and the autocorrelation function calculated for lags up to 20\ msec. The peaks which occur at lags corresponding to multiples of the pitch become smaller as the multiple increases, because the speech waveform will change slightly and the pitch period is not perfectly constant. If signals at the high end of the pitch range, 400\ Hz, are viewed through a 40\ msec autocorrelation window, considerable smearing of pitch resolution in the time domain is to be expected. Finally, for unvoiced speech, no substantial peaks of autocorrelation will occur.
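.pp
A minimal sketch of this idea in Python is given below; the window length, lag range and voicing threshold are illustrative assumptions, not recommended values.
.LB
.nf
import numpy as np

def autocorrelation_pitch(frame, fs=8000, f_low=50.0, f_high=400.0):
    # Window the frame (40 msec at 8 kHz is 320 samples).
    x = frame * np.hamming(len(frame))

    # Short-time autocorrelation for lags up to the longest expected period.
    max_lag = int(fs / f_low)                 # 20 msec at 8 kHz
    min_lag = int(fs / f_high)
    phi = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])

    # The largest peak in the plausible lag range gives the pitch period.
    k_best = min_lag + int(np.argmax(phi[min_lag:]))

    # A weak peak relative to phi(0) suggests an unvoiced frame.
    if phi[k_best] < 0.3 * phi[0]:
        return None
    return fs / k_best
.fi
.LE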
.pp If all deviations from perfect periodicity can be attributed to additive, white, Gaussian noise, then it can be shown from standard detection theory that autocorrelation methods are appropriate for pitch identification. Unfortunately, this is certainly not the case for speech signals. Although the short-time autocorrelation of voiced speech exhibits peaks at multiples of the pitch period, it is not clear that it is any easier to detect these peaks in the autocorrelation function than it is in the original time waveform! To take a simple example, if a signal contains a fundamental and its second and third harmonics, all in phase, .LB .EQ x(n)~ =~ a sin 2 pi fnT ~+~ b sin 4 pi fnT ~+~ c sin 6 pi fnT , .EN .LE then its autocorrelation function is .LB .EQ phi (k) ~=~~ {a sup 2 ~cos~2 pi fkT~+~b sup 2 ~cos~4 pi fkT~+~c sup 2 ~cos~6 pi fkT} over 2 ~ . .EN .LE There is no reason to believe that detection of the fundamental period of this signal will be any easier in the autocorrelation domain than in the time domain. .pp The most common error of pitch detection by autocorrelation analysis is that the periodicities of the formants are confused with the pitch. This typically leads to the repetition time being identified as $T sub pitch ~ +- ~ T sub formant1$, where the $T$'s are the periods of the pitch and first formant. Fortunately, there are simple ways of processing the signal non-linearly to reduce the effect of formants on pitch estimation using autocorrelation. .pp One way is to low-pass filter the signal with a cut-off above the maximum pitch frequency, say 600\ Hz. However, formant 1 is often below this value. A different technique, which may be used in conjunction with filtering, is to "centre-clip" the signal as shown in Figure 4.13. .FC "Figure 4.13" This removes many of the ripples which are associated with formants. However, it entails the use of an adjustable clipping threshold to cater for speech of varying amplitudes. Sondhi (1968), who introduced the technique, set the clipping level at 30% of the maximum amplitude. .[ Sondhi 1968 .] An alternative, which achieves much the same effect without the need to fiddle with thresholds, is to cube the signal, or raise it to some other high (odd!) power, before taking the autocorrelation. This highlights the peaks and suppresses the effect of low-amplitude parts. .pp For very accurate pitch detection, it is best to combine the evidence from several different methods of analysis of the time waveform. The autocorrelation function provides one source of evidence; and the cepstrum provides another. A third source comes from the time waveform itself. McGonegal .ul et al (1975) have described a semi-automatic method of pitch detection which uses human judgement to make a final decision based upon these three sources of evidence. .[ McGonegal Rabiner Rosenberg 1975 SAPD .] This appears to provide highly accurate pitch contours at the expense of considerable human effort \(em it takes an experienced user 30 minutes to process each second of speech. .rh "Speeding up autocorrelation." Calculating the autocorrelation function is an arithmetic-intensive procedure. For large lags, it can best be done using FFT methods; although there are simpler arithmetic tricks which speed it up without going to such complexity. However, with the availability of analogue delay lines using charge-coupled devices, autocorrelation can now be done effectively and cheaply by analogue, sampled-data, hardware.
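.pp
As an illustration of the FFT route just mentioned, the following Python sketch computes all lags of the autocorrelation at once by transforming the signal, taking the squared magnitude, and transforming back; the zero-padding is there to stop the circular wrap-around of the DFT from contaminating the lags of interest.
.LB
.nf
import numpy as np

def autocorrelation_fft(x, max_lag):
    # Pad so that the circular autocorrelation of the padded sequence
    # agrees with the ordinary one for lags up to max_lag.
    n = len(x) + max_lag
    X = np.fft.rfft(x, n)

    # Back-transforming the power spectrum gives the autocorrelation.
    phi = np.fft.irfft(np.abs(X) ** 2, n)
    return phi[:max_lag + 1]
.fi
.LE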
.pp Nevertheless, some techniques to speed up digital calculation of short-time autocorrelations are in wide use. It is tempting to hard-limit the signal so that it becomes binary (Figure 4.14(a)), thus eliminating multiplication. .FC "Figure 4.14" This can be disastrous, however, because hard-limited speech is known to retain considerable intelligibility and therefore the formant structure is still there. A better plan is to take centre-clipped speech and hard-limit that to a ternary signal (Figure 4.14(b)). This simplifies the computation considerably with essentially no degradation in performance (Dubnowski .ul et al, 1976). .[ Dubnowski Schafer Rabiner 1976 Digital hardware pitch detector .] .pp A different approach to reducing the amount of calculation is to perform a kind of autocorrelation which does not use multiplications. The "average magnitude difference function", which is defined by .LB .EQ d(k)~ = ~~ sum from {n=- infinity} to infinity ~ |x(n)-x(n+k)| , .EN .LE has been used for this purpose with some success (Ross .ul et al, 1974). .[ Ross Schafer Cohen Freuberg Manley 1974 .] It exhibits dips at pitch periods (instead of the peaks of the autocorrelation function). .rh "Feature-extraction methods." Another possible way of extracting pitch in the time domain is to try to integrate information from different sources to give reliable pitch estimates. Several features of the time waveform can be defined, each of which provides an estimate of the pitch period, and an overall estimate can be obtained by majority vote. .pp For example, suppose that the only feature of the speech waveform which is retained is the height and position of the peaks, where a "peak" is defined by the simplistic criterion .LB $ x(n-1) ~<~ x(n) $ and $ x(n) ~>~ x(n+1) $. .LE Having found a peak which is thought to represent a pitch pulse, one could define a "blanking period", based upon the current pitch estimate, within which the next pitch pulse could not occur. When this period has expired, the next pitch pulse is sought. At first, a stringent criterion should be used for identifying the next peak as a pitch pulse; but it can gradually be relaxed if time goes on without a suitable pulse being located. Figure 4.15 shows a convenient way of doing this: a decaying exponential threshold is begun at the end of the blanking period, and when a peak rises above it, that peak is identified as a pitch pulse. .FC "Figure 4.15" One big advantage of this type of algorithm is that the data is greatly reduced by considering peaks only \(em which can be detected by simple hardware. Thus it can permit real-time operation on a small processor with minimal special-purpose hardware. .pp Such a pitch pulse detector is exceedingly simplistic, and will often identify the pitch incorrectly. However, it can be used in conjunction with other features to produce good pitch estimates. Gold and Rabiner (1969), who pioneered the approach, used six features: .[ Gold Rabiner 1969 Parallel processing techniques for pitch periods .] .LB .NP peak height .NP valley depth .NP valley-to-peak height .NP peak-to-valley depth .NP peak-to-peak height (if greater than 0) .NP valley-to-valley depth (if greater than 0). .LE The features are symmetric with regard to peaks and valleys. The first feature is the one described above, and the second one works in exactly the same way. The third feature records the height between each valley and the succeeding peak, and the fourth uses the depth between each peak and the succeeding valley.
The purpose of the final two detectors is to eliminate secondary, but rather large, peaks from consideration. Figure 4.16 shows the kind of waveform on which the other features might incorrectly double the pitch, but which the last two features identify correctly. .FC "Figure 4.16" .pp Gold and Rabiner also included the last two pitch estimates from each feature detector. Furthermore, for each feature, the present estimate was added to the previous one to make a fourth, and the previous one to the one before that to make a fifth, and all three were added together to make a sixth; so that for each feature there were 6 separate estimates of pitch. The reason for this is that if three consecutive estimates of the fundamental period are $T sub 0$, $T sub 1$ and $T sub 2$, then, if some peaks are being falsely identified, the actual period could be any of .LB .EQ T sub 0 ~+~ T sub 1 ~~~~ T sub 1 ~+~ T sub 2 ~~~~ T sub 0 ~+~ T sub 1 ~+~ T sub 2 . .EN .LE It is essential to do this, because a feature of a given type can occur more than once in a pitch period \(em secondary peaks usually exist. .pp Six features, each contributing six separate estimates, make 36 estimates of pitch in all. An overall figure was obtained from this set by selecting the most popular estimate (within some pre-specified tolerance). The complete scheme has been evaluated extensively (Rabiner .ul et al, 1976) and compares favourably with other methods. .[ Rabiner Cheng Rosenberg McGonegal 1976 .] .pp However, it must be admitted that this procedure seems to be rather .ul ad hoc (as are many other successful speech parameter estimation algorithms!). Specifically, it is not easy to predict what kinds of waveforms it will fail on, and evaluation of it can only be pragmatic. When used to estimate the pitch of musical instruments and singers over a 6-octave range (40\ Hz to 2.5\ kHz), instances were found where it failed dramatically (Tucker and Bates, 1978). .[ Tucker Bates 1978 .] This is, of course, a much more difficult problem than pitch estimation for speech, where the range is typically 3 octaves. In fact, for speech the feature detectors are usually preceded by a low-pass filter to attenuate the myriad of peaks caused by higher formants, and this is inappropriate for musical applications. .pp There is evidence which shows that additional features can assist with pitch identification. The above features are all based upon the signal amplitude, and could be described as .ul secondary features derived from a single .ul primary feature. Other primary features can easily be defined. Tucker and Bates (1978) used a centre-clipped waveform, and considered only the peaks rising above the central region. .[ Tucker Bates 1978 .] They defined two further primary features, in addition to the peak amplitude: the .ul time width of a peak (period for which it is outside the clipping level), and its .ul energy (again, outside the clipping level). The primary features are shown in Figure 4.17. .FC "Figure 4.17" Secondary features are defined, based on these three primary ones, and pitch estimates are made for each one. A further innovation was to combine the individual estimates in a way which is based upon autocorrelation analysis, reducing to some degree the .ul ad-hocery of the pitch detection process. .sh "4.9 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "4.10 Further reading" .pp There are a lot of books on digital signal analysis, although in general I find them rather turgid and difficult to read.
.LB "nn" .\"Ackroyd-1973-1 .]- .ds [A Ackroyd, M.H. .ds [D 1973 .ds [T Digital filters .ds [I Butterworths .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Here is the exception to prove the rule. This book .ul is easy to read. It provides a good introduction to digital signal processing, together with a wealth of practical design information on digital filters. .in-2n .\"Committee.I.D.S.P-1979-3 .]- .ds [A IEEE Digital Signal Processing Committee .ds [D 1979 .ds [T Programs for digital signal processing .ds [I Wiley .ds [C New York .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n This is a remarkable collection of tried and tested Fortran programs for digital signal analysis. They are all available from the IEEE in machine-readable form on magnetic tape. Included are programs for digital filter design, discrete Fourier transformation, and cepstral analysis, as well as others (like linear predictive analysis; see Chapter 6). Each program is accompanied by a concise, well-written description of how it works, with references to the relevant literature. .in-2n .\"Oppenheim-1975-4 .]- .ds [A Oppenheim, A.V. .as [A " and Schafer, R.W. .ds [D 1975 .ds [T Digital signal processing .ds [I Prentice Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is one of the standard texts on most aspects of digital signal processing. It treats the $z$-transform, digital filters, and discrete Fourier transformation in far more detail than we have been able to here. .in-2n .\"Rabiner-1975-5 .]- .ds [A Rabiner, L.R. .as [A " and Gold, B. .ds [D 1975 .ds [T Theory and application of digital signal processing .ds [I Prentice Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is the other standard text on digital signal processing. It covers the same ground as Oppenheim and Schafer (1975) above, but with a slightly faster (and consequently more difficult) presentation. It also contains major sections on special-purpose hardware for digital signal processing. .in-2n .\"Rabiner-1978-1 .]- .ds [A Rabiner, L.R. .as [A " and Schafer, R.W. .ds [D 1978 .ds [T Digital processing of speech signals .ds [I Prentice Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Probably the best single reference for digital speech analysis, as it is for the time-domain encoding techniques of the last chapter. Unlike the books cited above, it is specifically oriented to speech processing. .in-2n .LE "nn" .EQ delim $$ .EN .CH "5 RESONANCE SPEECH SYNTHESIZERS" .ds RT "Resonance speech synthesizers .ds CX "Principles of computer speech .pp This chapter considers the design of speech synthesizers which implement a direct electrical analogue of the resonance properties of the vocal tract by providing a filter for each formant whose resonant frequency is to be controlled. Another method is the channel vocoder, with a bank of fixed filters whose gains are varied to match the spectrum of the speech as described in Chapter 4. This is not generally used for synthesis from a written representation, however, because it is hard to get good quality speech. It .ul is used sometimes for low-bandwidth transmission and storage, for it is fairly easy to analyse natural speech into fixed frequency bands. A second alternative to the resonance synthesizer is the linear predictive synthesizer, which at present is used quite extensively and is likely to become even more popular. This is covered in the next chapter. 
Another alternative is the articulatory synthesizer, which attempts to model the vocal tract directly, rather than modelling the acoustic output from it. Although, as noted in Chapter 2, articulatory synthesis holds a promise of high-quality speech \(em for the coarticulation effects caused by tongue and jaw inertia can be modelled directly \(em this has not yet been realized. .pp The source-filter model of speech production indicates that an electrical analogue of the vocal tract can be obtained by considering the source excitation and the filter that produces the formant frequencies separately. This approach was pioneered by Fant (1960), and we shall present much of his work in this chapter. .[ Fant 1960 Acoustic theory of speech production .] There has been some discussion over whether the source-filter model really is a good one, and some synthesizers explicitly introduce an element of "sub-glottal coupling", which simulates the effect of the lung cavity on the vocal tract transfer function during the periods when the glottis is open (for an example see Rabiner, 1968). .[ Rabiner 1968 Digital formant synthesizer JASA .] However, this is very much a low-order effect when considering speech synthesized by rule from a written representation, for the software which calculates parameter values to drive the synthesizer is a far greater source of degradation in speech quality. .sh "5.1 Overall spectral considerations" .pp Figure 5.1 shows the source-filter model of speech production. .FC "Figure 5.1" For voiced speech, the excitation source produces a waveform whose frequency components decay at about 12\ dB/octave, as we shall see in a later section. The excitation passes into the vocal tract filter. Conceptually, this can best be viewed as an infinite series of formant filters, although for implementation purposes only the first few are modelled explicitly and the effect of the rest is lumped together into a higher-formant compensation network. In either case the overall frequency profile of the filter is a flat one, upon which humps are superimposed at the various formant frequencies. Thus the output of the vocal tract filter falls off at 12\ dB/octave just as the input does. However, measurements of actual speech show a 6\ dB/octave decay with increasing frequency. This is explained by the effect of radiation of speech from the lips, which in fact has a "differentiating" action, producing a 6\ dB/octave rise in the frequency spectrum. This 6\ dB/octave lift is similar to that provided by a treble boost control on a radio or amplifier. Speech synthesized without it sounds unnaturally heavy and bassy. .pp These overall spectral shapes, which are derived from considering the human vocal tract, are summarized in the upper annotations in Figure 5.1. But there is no real necessity for a synthesizer to model the frequency characteristics of the human vocal tract at intermediate points: only the output speech is of any concern. Because the system is a linear one, the filter blocks in the figure can be shuffled around to suit engineering requirements. One such requirement is the desire to minimize internally-generated noise in the electrical implementation, most of which will arise in the vocal tract filter (because it is much more complicated than the other components). For this reason an excitation source with a flat spectrum is often preferred, as shown in the lower annotations. 
This can be generated either by taking the desired glottal pulse shape, with its 12\ dB/octave fall-off, and passing it through a filter giving 12\ dB/octave lift at higher frequencies; or, if the pulse shape is to be stored digitally, by storing its second derivative instead. Then the radiation compensation, which is now more properly called "spectral equalization", will comprise a 6\ dB/octave fall-off to give the required trend in the output spectrum. .pp For a given pitch period, this scheme yields exactly the same spectral characteristics as the original system which modelled the human vocal tract. However, when the pitch varies there will be a difference, for components at higher excitation frequencies are attenuated at 6\ dB/octave by the final spectral equalization in the new system, whereas they are boosted at 6\ dB/octave by the radiation characteristic in the old. In practice, the pitch of the human voice lies quite low in the frequency range \(em usually below 400\ Hz \(em and if all filter characteristics begin their roll-off at this frequency the two systems will be the same. This simplifies the implementation with a slight compromise in its accuracy in modelling the spectral trend of human speech, for the overall \-6\ dB/octave decay actually begins at a frequency of around 100\ Hz. If this is implemented, some adjustment will need to be made to the amplitudes to ensure that high-pitched sounds are not attenuated unduly. .pp The discussion so far pertains to voiced speech only. The source spectrum of the random excitation in unvoiced sounds is substantially flat, and combines with the radiation from the lips to give a +6\ dB/octave rise in the output spectrum. Hence if spectral equalization is changed to \-6\ dB/octave to accommodate a voiced excitation with flat spectrum, the noise source should show a 12\ dB/octave rise to give the correct overall effect. .sh "5.2 The excitation sources" .pp In human speech, the excitation source for voiced sounds is produced by two flaps of skin called the "vocal cords". These are blown apart by pressure from the lungs. When they come apart the pressure is relieved, and the muscles tensioning the skin cause the flaps to come together again. Subsequently, the lung pressure \(em called "sub-glottal pressure" \(em builds up once more and the process is repeated. The factors which influence the rate and nature of vibration are muscular tension of the cords and the sub-glottal pressure. The detailed shape of the excitation is of considerable importance to speech synthesis because it greatly influences the apparent naturalness of the sound produced. For example, if you have inflamed vocal cords caused by laryngitis the sound quality changes dramatically. Old people who do not have proper muscular control over their vocal cord tension produce a quavering sound. Shouted speech can easily be distinguished from quiet speech even when the volume cue is absent \(em you can verify this by fiddling with the volume control of a tape recorder \(em because when shouting, the vocal cords stay apart for a much smaller fraction of the pitch cycle than at normal volumes. .rh "Voiced excitation in natural speech." There are two basic ways to examine the shape of the excitation source in people. One is to use a dentist's mirror and high-speed photography to observe the vocal cords directly. Although it seems a lot to ask someone to speak naturally with a mirror stuck down the back of his throat, the method has been used and photographs can be found, for example, in Flanagan (1972).
.[ Flanagan 1972 Speech analysis synthesis and perception .] The second technique is to process the acoustic waveform digitally, identifying the formant positions and deducting the formant contributions from the waveform by filtering. This leaves the basic excitation waveform, which can then be displayed. Such techniques lead to excitation shapes like those sketched in Figure 5.2, in which the gradual opening and abrupt closure of the vocal cords can easily be seen. .FC "Figure 5.2" .pp It is a fact that if a periodic function has one or more discontinuities, its frequency spectrum will decay at sufficiently high frequencies at the rate of 6\ dB/octave. For example, the components of the square wave .LB $ g(t) ~~ = ~~ mark 0 $ for $ 0 <= t < h $ .br $ lineup 1 $ for $ h <= t < b $ .LE can be calculated from the Fourier series .LB .EQ G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} ~dt ~~ = ~~ j over {2 pi r} ~(1~-~e sup {-j2 pi rh/b} ) , .EN .LE so $|G(r)|$ decays as $1/r$, apart from a bounded oscillatory factor, and the change in one octave of this $1/r$ trend is .LB .EQ 20~log sub 10 ~ |G(2r)| over |G(r)| ~~=~~20~log sub 10 ~ 1 over 2 ~~ = ~ .EN \-6\ dB. .LE However, if the discontinuities are ones of slope only, then the asymptotic decay at high frequencies is 12\ dB/octave. Thus the glottal excitation of Figure 5.2 will decay at this rate. Note that it is not the .ul number but the .ul type of discontinuities which is important in determining the asymptotic spectral trend. .rh "Voiced excitation in synthetic speech." There are several ways that glottal excitation can be simulated in a synthesizer, four of which are shown in Figure 5.3. .FC "Figure 5.3" The square pulse and the sawtooth pulse both exhibit discontinuities, and so will have the wrong asymptotic rate of decay (6\ dB/octave instead of 12\ dB/octave). A better bet is the triangular pulse. This has the correct decay, for there are only discontinuities of slope. However, although the asymptotic rate of decay is of first importance, the fine structure of the frequency spectrum at the lower end is also significant, and the fact that there are two discontinuities of slope instead of just one in the natural waveform means that the spectra cannot match closely. .pp Rosenberg (1971) has investigated several different shapes using listening tests, and he found that the polynomial approximation sketched in Figure 5.3 was preferred by listeners. .[ Rosenberg 1971 .] This has one slope discontinuity, and comprises three sections: .LB $g(t) ~~ = ~~ 0$ for $0 <= t < t sub 1$ (flat during the period of closure) .sp $g(t) ~~ = ~~ A~ u sup 2 (3 - 2u) $, where $u ~=~ {t-t sub 1} over {t sub 2 -t sub 1} $ , for $t sub 1 <= t < t sub 2$ (opening phase) .sp .sp $g(t) ~~ = ~~ A~ (1 - v sup 2 )$, where $v ~=~ {t-t sub 2} over {b-t sub 2} $ , for $t sub 2 <= t < b$ (closing phase). .LE It is easy to see that the joins between the first and second section, and between the second and third section, are smooth; but that the slope of the third section at the end of the cycle when $t=b$ is .LB .EQ dg over dt ~~ = ~~ -~ {2A} over {b ~-~ t sub 2} ~ . .EN .LE $A$ is the maximum amplitude of the pulse, and is reached when $t=t sub 2$. .pp A much simpler glottal pulse shape to implement is the filtered impulse. Passing an impulse through a filter with characteristic .LB .EQ 1 over {(1+sT) sup 2} .EN .LE imparts a 12\ dB/octave decay after frequency $1/T$. This gives a pulse shape of .LB .EQ g(t) ~~ = ~~ A~ t over T ~e sup {1-t/T} , .EN .LE which is sketched in Figure 5.4.
.FC "Figure 5.4" The pulse is the wrong way round in time when compared with the desired one; but this is not important under most listening conditions because phase differences are not noticeable (this point is discussed further below). The maximum is reached when $t=T$ and has height $A$. The value zero is never actually attained, for the decay to it is asymptotic, and if the slight discontinuity between pulses shown in the Figure is left, the asymptotic rate of decay of the frequency spectrum will be 6\ dB/octave rather than 12\ dB/octave. However, in a real implementation involving filtering an impulse there will be no such discontinuity, for the next pulse will start off where the last one ended. .pp This seems to be an attractive scheme because of its simplicity, and indeed is sometimes used in speech synthesis. However, it does not have the right properties when the pitch is varied, for in real glottal waveforms the maximum occurs at a fixed .ul fraction of the period, whereas the filtered impulse's maximum is at a fixed time, $T$. If $T$ is chosen to make the system correct at high pitch frequencies (say 400\ Hz), then the pulse will be much too narrow at low pitches and sound rather harsh. The only solution is to vary the filter parameters with the pitch, leading to complexity again. .pp Holmes (1973) has made an extensive study of the effect of the glottal waveshape on the naturalness of high-quality synthesized speech. .[ Holmes 1973 Influence of glottal waveform on naturalness .] He employed a rather special speech synthesizer, which provides far more comprehensive and sophisticated control than most. It was driven by parameters which were extracted from natural utterances by hand \(em but the process of generating and tuning them took many months of a skilled person's time. By using the pulse shape extracted from the natural utterance, he found that synthetic and natural versions could actually be made indistinguishable to most people, even under high-quality listening conditions using headphones. Performance dropped quite drastically when one of Rosenberg's pulse shapes, similar to the three-section one given above, was used. Holmes also investigated phase effects and found that whilst different pulse shapes with identical frequency spectra could easily be distinguished when listening over headphones, there was no perceptible difference if the listener was placed at a comfortable distance from a loudspeaker in a room. This is attributable to the fact that the room itself imposes a complex modification to the phase characteristics of the speech signal. .pp Although a great deal of care must be taken with the glottal pulse shape for very high-quality synthetic speech, for speech synthesized by rule from a written representation the degradation which stems from incorrect control of the synthesizer parameters is much greater than that caused by using a slightly inferior glottal pulse. The triangular pulse illustrated in Figure 5.3 has been found quite satisfactory for speech synthesis by rule. .rh "Unvoiced excitation." Speech quality is much less sensitive to the characteristics of the unvoiced excitation. Broadband white noise will serve admirably. It is quite acceptable to generate this digitally, using a pseudo-random feedback shift register. This gives a bit sequence whose autocorrelation is zero except at multiples of the repetition length. 
The repetition length can easily be made as long as the number of states in the shift register (less one) \(em in this case, the configuration is called "maximal length" (Gaines, 1969). .[ Gaines 1969 Stochastic computing advances in information science .] For example, an 18-bit maximal-length shift register will repeat every $2 sup 18 -1$ cycles. If the bit-stream is used as a source of analogue noise, the autocorrelation function will have triangular parts whose width is twice the clock period, as shown in Figure 5.5. .FC "Figure 5.5" According to a well-known result (the Wiener-Khinchine theorem; see for example Chirlian, 1973) the power density of the frequency spectrum is the same as the Fourier transform of the autocorrelation function. .[ Chirlian 1973 .] Since the feedback shift register gives a periodic autocorrelation function, its transform is a Fourier series. The $r$'th frequency component is .LB .EQ G(r) ~~ = ~~ {R sup 2} over {4 pi sup 2 r sup 2 T} ~(1~-~~cos~{{2 pi rT} over R}) ~ . .EN .LE Here, $T$ is the clock period and $R=(2 sup N -1)T$ is the repetition time of an $N$-bit shift register. .pp The spectrum is a bar spectrum, with components spaced at .LB $ {1 over R}~~=~~{1 over {(2 sup N -1)T}}$ Hz. .LE These are very close together \(em with $N=18$ and sampling at 20\ kHz (50\ $mu$sec) the spacing becomes under 0.1\ Hz \(em and so it is reasonable to treat the spectrum as continuous, with .LB .EQ G(f) ~~ = ~~ 1 over {4 pi sup 2 f sup 2 T}~~(1~-~cos 2 pi fT) . .EN .LE This spectrum is sketched in Figure 5.6(a), and the measured result of an actual implementation in Figure 5.6(b). .FC "Figure 5.6" The 3\ dB point occurs when .LB .EQ {G(f) over G(0)} ~~=~~{1 over 2} ~ , .EN .LE and $G(0)$ is $T/2$. Hence, at the 3\ dB point, .LB .EQ {1~-~cos 2 pi fT} over {2 pi sup 2 f sup 2 T sup 2} ~~ = ~~ 1 over 2 ~ , .EN .LE which has solution $f=0.45/T$. Thus a pseudo-random shift register generates noise whose spectrum is substantially flat up to half the clock frequency. Anything over 10\ kHz is therefore a suitable clocking rate for speech-quality noise. Choose 20\ kHz to err on the conservative side. If the repetition occurs in less than 3 or 4 seconds, it can be heard quite clearly; but above this figure it is not noticeable. An 18-bit shift register clocked at 20\ kHz repeats every $(2 sup 18 -1)/20000 ~ = ~ 13$ seconds, which is more than adequate. .sh "5.3 Simulating vocal tract resonances" .pp The vocal tract, from glottis to lips, can be modelled as an unconstricted tube of varying cross-section with no side branches and no sub-glottal coupling. This has an all-pole transfer function, which can be written in the form .LB .EQ H(s) ~~ = ~~ {w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} ~.~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~~ .~ .~ . .EN .LE There is an unspecified (conceptually infinite) number of terms in the product. Each of them produces a peak in the energy spectrum, and these are the formants we observed in Chapter 2. .pp Formants appear even in an over-simplified model of the tract as a tube of uniform cross-section, with a sound source at one end (the larynx) and open at the other (the lips). This extremely crude model was discussed in Chapter 2, and surprisingly, perhaps, it gives a good approximation to the observed formant frequencies for a neutral, relaxed vowel such as that in .ul "a\c bove". .pp Speech is made by varying the postures of the various organs of the vocal tract.
Different vowels, for example, result largely from different tongue positions and lip postures. Naturally, such physical changes alter the frequencies of the resonances, and successful automatic speech synthesis depends upon successful movement of the formants. Fortunately, only the first three or four resonances need to be altered even for extremely realistic synthesis, and virtually all existing synthesizers provide control over these formants only. .rh "Analysis of a single formant." Each formant is modelled as a second-order resonance, with transfer function .LB .EQ H(s) ~~ = ~~ {w sub c sup 2} over {s sup 2 ~+~ b s ~+~ w sub c sup 2} ~ . .EN .LE As will be shown below, $w sub c$ is the nominal resonant frequency in radians/s, and $b$ is the approximate 3\ dB bandwidth of the resonance. The term $w sub c sup 2$ in the numerator adjusts the gain to be unity at DC ($s=0$). .pp To calculate the frequency response of the formant, write $s=jw$. Then the energy spectrum is .LB .EQ |H(jw)| sup 2 ~~ mark = ~~ {w sub c sup 4} over {(w sup 2 - w sub c sup 2 ) sup 2 ~+~ b sup 2 w sup 2} .EN .sp .sp .EQ lineup = ~~ {w sub c sup 4} over {[w sup 2 ~-~(w sub c sup 2 -~ {b sup 2} over 2 )] sup 2 ~~ +~~b sup 2 (w sub c sup 2~-~{{b sup 2} over 4})} ~ . .EN .sp .LE This reaches a maximum when the squared term in the denominator of the second expression is zero, namely when $w=(w sub c sup 2 ~-~ b sup 2 /2) sup 1/2$. However, formant bandwidths are low compared with their centre frequencies, and so to a good approximation the peak occurs at $w=w sub c$ and is of amplitude $w sub c /b$, that is, $20~log sub 10 w sub c /b$\ dB above the DC gain. At frequencies higher than the peak the energy falls off as $1/w sup 4$, a factor of 1/16 for each doubling in frequency, and so the asymptotic decay is 12\ dB/octave. .pp At the points which are 3\ dB below the peak, .LB .EQ |H(jw sub 3dB )| sup 2 ~~ = ~~ 1 over 2 ~|H(jw sub max )| sup 2 ~~ = ~~ 1 over 2 ~ times ~ {w sub c sup 2} over {b sup 2} ~ , .EN .LE and it is easy to show that this is satisfied by $w sub 3dB ~ = ~ w sub c ~ +- ~ b/2$ to a good approximation (neglecting higher powers of $b/w sub c )$. Figure 5.7 summarizes the shape of an individual formant resonance. .FC "Figure 5.7" .pp The bandwidth of a formant is fairly constant, regardless of the formant frequency. This makes the formant filter a slightly unusual one: most engineering applications which use variable-frequency resonances require the bandwidth to be a constant proportion of the resonant frequency \(em the ratio $w sub c /b$, often called the "$Q$" of the filter, is to be constant. For formants, we wish the $Q$ to increase linearly with resonant frequency. Since the amplitude gain of the formant at resonance is $w sub c /b$, this peak gain increases as the formant frequency is increased. .pp Although it is easy to measure formant frequencies on a spectrogram (cf Chapter 2), it is not so easy to measure bandwidths accurately. One rather unusual method was reported by van den Berg (1955), who took a subject who had had a partial laryngectomy, an operation which left an opening into the vocal tract near the larynx position. Into this he inserted a sound source and made a swept-frequency calibration of the vocal tract! .[ Berg van den 1955 .] Almost as bizarre is a technique which involves setting off a spark inside the mouth of a subject as he holds his articulators in a given position.
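.pp
Before turning to measured bandwidth values, these relationships are easy to check numerically. The short Python fragment below evaluates $|H(jw)|$ for a single formant and confirms that the gain at resonance is close to $w sub c /b$ and that the response is about 3\ dB down at $w sub c ~+-~ b/2$; the particular centre frequency and bandwidth are arbitrary illustrative values.
.LB
.nf
import numpy as np

def formant_gain(f, fc, bw):
    # |H(jw)| for H(s) = wc^2/(s^2 + b*s + wc^2), all frequencies in Hz.
    w, wc, b = 2 * np.pi * f, 2 * np.pi * fc, 2 * np.pi * bw
    return wc**2 / np.sqrt((wc**2 - w**2)**2 + (b * w)**2)

fc, bw = 1000.0, 70.0
print(formant_gain(fc, fc, bw))           # close to fc/bw = 14.3
print(formant_gain(fc + bw / 2, fc, bw))  # close to 14.3/sqrt(2)
.fi
.LE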
.pp The results of several different kinds of experiment are reported by Dunn (1961), and are summarized in Table 5.1, along with the formant frequency ranges. .[ Dunn 1961 .] .RF .in+0.5i .ta 1.7i +2.5i .nr x1 (\w'range of formant'/2) .nr x2 (\w'range of bandwidths'/2) \h'-\n(x1u'range of formant \h'-\n(x2u'range of bandwidths .nr x1 (\w'frequencies (Hz)'/2) .nr x2 (\w'as measured in different'/2) \h'-\n(x1u'frequencies (Hz) \h'-\n(x2u'as measured in different .nr x1 (\w'experiments (Hz)'/2) \h'-\n(x1u'experiments (Hz) .nr x1 (\w'0000 \- 0000'/2) .nr x2 (\w'000 \- 000'/2) .nr x0 2.5i+(\w'range of formant'/2)+(\w'as measured in different'/2) .nr x3 (\w'range of formant'/2) \h'-\n(x3u'\l'\n(x0u\(ul' .sp formant 1 \h'-\n(x1u'\0100 \- 1100 \h'-\n(x2u'\045 \- 130 formant 2 \h'-\n(x1u'\0500 \- 2500 \h'-\n(x2u'\050 \- 190 formant 3 \h'-\n(x1u'1500 \- 3500 \h'-\n(x2u'\070 \- 260 \h'-\n(x3u'\l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in-0.5i .MT 2 Table 5.1 Different estimates of formant bandwidths, with range of formant frequencies for reference .TE Note that the bandwidths really are narrow compared with the resonant frequencies of the filters, except at the lower end of the formant 1 range. Choosing the lowest bandwidth estimate leads to an amplification factor at resonance of 50 for formant 2 when its frequency is at the top of its range; and formant 3 happens to give the same value. .rh "Series synthesizers." The simplest realization of the vocal tract filter is a chain of formant filters in series, as illustrated in Figure 5.8. .FC "Figure 5.8" This leads to particular difficulties if the frequencies of two formants stray close together. The worst case occurs if formants 2 and 3 have the same resonant frequencies, at the top of the range of formant 2, namely 2500\ Hz. In this case, and if the bandwidths of the formants are set to the lowest estimates, a combined amplification factor of $(2500/50) times (2500/70)=1800$ is obtained at the point of resonance \(em that is, 65\ dB above the DC value. This is enough to tax most analogue implementations, and can evoke clipping in the formant filters, with a very noticeable effect on speech quality. This extreme case will not occur during synthesis of realistic speech, for although the formant .ul ranges overlap, the values for any particular (human) sound will not coincide exactly. However, it illustrates the difficulty of designing a series synthesizer which copes sensibly with arbitrary parameter settings, and explains why designers often choose formant bandwidths in the top half of the ranges given in Table 5.1. .pp The problem of excessive amplification within a series synthesizer can be alleviated to a small extent by choosing carefully the order in which the filters are placed in the chain. In a linear system, of course, the order in which the components occur does not matter. In physical implementations, however, it is advantageous to minimize extreme amplification at intermediate points. By placing the formant 1 filter between formants 2 and 3, the formant 2 resonance is attenuated somewhat before it reaches formant 3. Continuing with the extreme example above, where both formants 2 and 3 were set to 2500\ Hz; assume that formant 1 is at its nominal value of 500\ Hz. It provides attenuation at approximately 12\ dB/octave above this, and so at the formant 2 peak, 2.3\ octaves higher, the attenuation is 28\ dB. 
Thus the gain at 2500\ Hz, which is $20 ~ log sub 10 ~ 2500/50 ~ = ~ 34$\ dB after passing through the formant 2 filter, is reduced to 6\ dB by formant 1, only to be increased by $20 ~ log sub 10 ~ 2500/70 ~ = ~ 31$\ dB to a value of 37\ dB by formant 3. This avoids the extreme 65\ dB gain of formants 2 and 3 combined. .pp Figure 5.8 shows only three formant filters modelled explicitly. The effect of the rest \(em and they do have an effect, although it is small at low frequencies \(em is incorporated by lumping them together into the "higher-formant correction" filter. To calculate the characteristics of this filter, assume that the lumped formants have the values given by the simple uniform-tube model of Chapter 2, namely 3500\ Hz for formant 4, 4500\ Hz for formant 5, and, in general, $500(2n-1)$\ Hz for formant $n$. The effect of each of these on the spectrum is .LB .EQ 10~ log sub 10 {w sub n sup 4} over {(w sup 2 ~-~w sub n sup 2 ) sup 2 ~~+~~b sub n sup 2 w sup 2} ~~ = ~~ -~ 10~ log sub 10 ~[(1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2 ~~+~~ {{b sub n sup 2 w sup 2} over {w sub n sup 4}}] .EN dB, .LE following from what was calculated above. We will have to approximate this by assuming that $b sub n sup 2 /w sub n sup 2$ is negligible \(em this is quite reasonable for these higher formants because Table 5.1 shows that the bandwidth does not increase in proportion to the formant frequency range \(em and approximate the logarithm by the first term of its series expansion: .LB .EQ -10 ~ log sub 10 ~ (1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2 ~~ = ~~ -20~ log sub 10 ~ e ~ log sub e (1~-~~{{w sup 2} over {w sub n sup 2}}) ~~ = ~~ 20~ log sub 10 ~ e ~ times ~ {w sup 2} over {w sub n sup 2} ~ . .EN .LE .pp Now the total effect of formants 4, 5, ... at frequency $f$\ Hz (as distinct from $w$\ radians/s) is .LB .EQ 20~ log sub 10 ~ e ~ times ~ sum from n=4 to infinity ~{{f sup 2} over {500 sup 2 (2n-1) sup 2}} ~ . .EN .LE This expression is .LB .EQ 20~ log sub 10 ~ e ~ times ~ {{f sup 2} over {500 sup 2}}~~(~sum from n=1 to infinity ~{1 over {(2n-1) sup 2}} ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~) ~ . .EN .LE The infinite sum can actually be calculated in closed form, and is equal to $pi sup 2 /8$. Hence the total correction is .LB .EQ 20~ log sub 10 ~ e ~ times {{f sup 2} over {500 sup 2}} ~~(~{pi sup 2} over 8 ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~) ~~ = ~~ 2.87 times 10 sup -6 f sup 2 .EN dB. .LE .pp Although this may at first seem to be a rather small correction, it is in fact 72\ dB when $f=5$\ kHz! On further reflection this is not an unreasonable figure, for the 12\ dB/octave decays contributed by formants 1, 2, and 3 must all be annihilated by the higher-formant correction to give an overall flat spectral trend. In fact, formant 1 will contribute 12\ dB/octave from 500\ Hz (3.3\ octaves to 5\ kHz, representing 40\ dB); formant 2 will contribute 12\ dB/octave from 1500\ Hz (1.7\ octaves to 5\ kHz, representing 21\ dB); and formant 3 will contribute 12\ dB/octave from 2500\ Hz (1\ octave to 5\ kHz, representing 12\ dB). These sum to 73\ dB. .pp If the first five formants are synthesized explicitly instead of just the first three, the correction is .LB .EQ 20~ log sub 10 ~ e ~ times ~ {{f sup 2} over {500 sup 2}} ~~(~{pi sup 2} over 8 ~-~~ sum from n=1 to 5 ~{1 over {(2n-1) sup 2}}~) ~~ = ~~ 1.73 times 10 sup -6 f sup 2 .EN dB, .LE giving a rather more reasonable value of 43\ dB when $f=5$\ kHz. 
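.pp
As a check on this arithmetic, the Python fragment below evaluates the correction formula just derived for lumping all formants above the $K$'th; it reproduces the figures of roughly 72\ dB and 43\ dB at 5\ kHz for $K=3$ and $K=5$.
.LB
.nf
import math

def higher_formant_correction_db(f_hz, k_explicit):
    # 20 log10(e) * (f/500)^2 * (pi^2/8 - sum over the first K formants)
    tail = math.pi**2 / 8 - sum(1.0 / (2*n - 1)**2
                                for n in range(1, k_explicit + 1))
    return 20 * math.log10(math.e) * (f_hz / 500.0)**2 * tail

print(higher_formant_correction_db(5000, 3))   # roughly 72 dB
print(higher_formant_correction_db(5000, 5))   # roughly 43 dB
.fi
.LE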
In actual implementations, fixed filters are sometimes included explicitly for formants 4 and 5. Although this lowers the gain of the higher-formant correction filter, the total amplification at 5\ kHz of the combined correction is still 72\ dB. If one is less demanding and aims for a synthesizer that produces a correct spectrum only up to 3.5\ kHz, it is 35\ dB. This places quite stringent requirements on the preceding formant filters if the stray noise that they generate internally is not to be amplified to perceptible magnitudes by the correction filter at high frequencies. .pp Explicit inclusion of fixed filters for formants 4 and 5 undoubtedly improves the accuracy of the higher-formant correction. Recall that the above derivation of the correction filter characteristic used the first-order approximation .LB .EQ log sub e (1~-~{{w sup 2} over {w sub n sup 2}}) ~~ = ~~ -~ {w sup 2} over {w sub n sup 2} ~ , .EN .LE which is only valid if $w << w sub n$. Thus it only holds at frequencies less than the highest explicitly synthesized formant, and so with formants 4 (3.5\ kHz) and 5 (4.5\ kHz) included a reasonable correction should be obtained for telephone-quality speech. However, detailed analysis with a second-order approximation shows that the coefficient of the neglected term is in fact small (Fant, 1960). .[ Fant 1960 Acoustic theory of speech production .] A second, perhaps more compelling, reason for explicitly including a couple of fixed formants is that the otherwise enormous amplification provided by the correction can be distributed throughout the formant chain. We saw earlier why there is reason to prefer the order F3\(emF1\(emF2 over F1\(emF2\(emF3. With explicit formants 4 and 5, a suitable order which helps to keep the amplification at intermediate points in the chain within reasonable bounds is F3\(emF5\(emF2\(emF4\(emF1. .rh "Parallel synthesizers." A series synthesizer models the vocal tract resonances by a chain of formant filters in series. A parallel synthesizer utilizes a parallel connection of filters as illustrated in Figure 5.9. .FC "Figure 5.9" .pp Consider a parallel combination of two formants with individually-controllable amplitudes. The combined transfer function is .LB .EQ H(s) ~~ mark = ~~ {A sub 1 w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} ~~+~~{A sub 2 w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} .EN .sp .sp .EQ lineup = ~~ { (A sub 1 w sub 1 sup 2 + A sub 2 w sub 2 sup 2 )s sup 2 ~+~(A sub 1 b sub 2 w sub 1 sup 2 + A sub 2 b sub 1 w sub 2 sup 2 )s ~+~ (A sub 1 +A sub 2 )w sub 1 sup 2 w sub 2 sup 2 } over { (s sup 2 ~+~b sub 1 s~+~w sub 1 sup 2 ) (s sup 2 ~+~b sub 2 s~+~w sub 2 sup 2 ) } .EN .LE If the formant bandwidths $b sub 1$ and $b sub 2$ are equal and the amplitudes are chosen as .LB .EQ A sub 1 ~~=~~ {w sub 2 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} ~~~~~~~~ A sub 2 ~~=~~-~ {w sub 1 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} ~ , .EN .LE then the transfer function becomes the same as that of a two-formant series synthesizer, namely .LB .EQ H(s) ~~ = ~~ {w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} ~ . ~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~ . .EN .LE The argument can be extended to any number of formants, under the assumption that the formant bandwidths are equal. Note that the signs of $A sub 1$ and $A sub 2$ differ: in general the formant amplitudes for a parallel synthesizer alternate in sign. 
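.pp
The equivalence is easy to verify numerically. The Python sketch below evaluates the parallel combination with the amplitudes given above, and the corresponding two-formant series product, at an arbitrary frequency; equal bandwidths are assumed, and the particular formant values are illustrative only.
.LB
.nf
import numpy as np

def parallel_and_series(f_hz, f1=500.0, f2=1500.0, bw=70.0):
    s = 2j * np.pi * f_hz
    w1, w2, b = 2 * np.pi * f1, 2 * np.pi * f2, 2 * np.pi * bw
    a1 = w2**2 / (w2**2 - w1**2)
    a2 = -w1**2 / (w2**2 - w1**2)
    parallel = (a1 * w1**2 / (s**2 + b*s + w1**2) +
                a2 * w2**2 / (s**2 + b*s + w2**2))
    series = (w1**2 / (s**2 + b*s + w1**2)) * (w2**2 / (s**2 + b*s + w2**2))
    return parallel, series

print(parallel_and_series(1000.0))    # the two values agree
.fi
.LE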
.pp In theory, therefore, it would be possible to use five parallel formants to model a five-formant series synthesizer exactly. Then the same higher-formant correction filter would be needed for the parallel synthesizer as for the series one. If the formant amplitudes were set slightly incorrectly, however, the five filters would not combine to give a total of 60\ dB/octave high-frequency decay above the resonances. It is easy to see this in the context of the simplified two-formant combination above: if the amplitudes were not chosen exactly right then the $s sup 2$ term in the numerator would not be quite zero. Then, the decay in the two-formant combination would be \-12\ dB/octave instead of \-24\ dB/octave, and in the five-formant case the decay would in fact still be \-12\ dB/octave. Advantage can be taken of this to equalize the levels within the synthesizer so that large amplitude variations do not occur. This can best be done by associating relatively low-gain fixed correction filters with each formant instead of providing one comprehensive correction to the combined spectrum: these are shown in Figure 5.9. Suitable correction filters have been determined empirically by Holmes (1972). .[ Holmes 1972 Speech synthesis .] They provide a 6\ dB/octave lift above 640\ Hz for formant 1, and 6\ dB/octave lift above 300\ Hz for formant 2. Formants 3 and 4 are uncorrected, whilst for formant 5 the correction begins as a 6\ dB/octave decay above 600\ Hz and increases to an 18\ dB/octave decay above 5.5\ kHz. .pp The disadvantage of a parallel synthesizer is that the amplitudes of the formants must be specified as well as their frequencies. (Furthermore, the formant bandwidths should all be equal, but they are often chosen to be such in series synthesizers because of the uncertainty as to their exact values.) However, the extra amplitude parameters clearly give greater control over the frequency spectrum of the synthesized speech. .pp A good example of how this extra control can usefully be exploited is the synthesis of nasal sounds. Nasalization introduces a cavity parallel to the oral tract, as illustrated in Figure 5.10, and this causes zeros in the transfer function. .FC "Figure 5.10" It is as if two different copies of the vocal tract transfer function, one for the oral and the other for the nasal passage, were added together. We have seen the effect of this above when considering parallel synthesis. The combination .LB .EQ H(s) ~~ = ~~ {A sub 1 w sub o sup 2} over {s sup 2 ~+~ b sub o s ~+~ w sub o sup 2} ~~+~~{A sub 2 w sub n sup 2} over {s sup 2 ~+~ b sub n s ~+~ w sub n sup 2} ~ , .EN .LE where the subscript "$o$" stands for oral and "$n$" for nasal, produces zeros in the numerator (unless the amplitudes are carefully adjusted to avoid them). These cannot be modelled by a series synthesizer, but they obviously can be by a parallel one. .pp Although they are certainly needed for accurate imitation of human speech, transfer function zeros to simulate nasal sounds are not essential for synthesis of intelligible English. It is not difficult to get a sound like a nasal consonant (\c .ul n, or .ul m\c ) with an all-pole synthesizer. Nevertheless, it is certainly true that a parallel synthesizer gives better .ul potential control over the spectrum than a series one. Whether the added flexibility can be used properly by a synthesis-by-rule computer program is another matter. .rh "Implementation of formant filters." Formant filters can be built in either analogue or digital form. 
A second-order resonance is needed, whose centre frequency can be controlled but whose bandwidth is fixed. If the control can be arranged as two tracking resistors, then the simple analogue configuration of Figure 5.11, with two operational amplifiers, will suffice. .FC "Figure 5.11" .pp The transfer function of this arrangement is .LB .EQ - ~~ { 1/C sub 1 R sub 1 C sub 2 R sub 2 } over { s sup 2 ~~+~~ {1 over {C sub 2 R sub 2}}~s ~~+~~{1 over {C sub 1 R' sub 1 C sub 2 R sub 2 }}} ~ , .EN .LE which characterizes it as a low-pass resonator with DC gain of $- R' sub 1 /R sub 1 $, bandwidth of $1/2 pi C sub 2 R sub 2$\ Hz, and centre frequency of $1/2 pi (C sub 1 R' sub 1 C sub 2 R sub 2 ) sup 1/2$\ Hz. Tracking $R' sub 1$ with $R sub 1$ ensures that the DC gain remains constant, and that the centre frequency follows $R sub 1 sup -1/2$. Moreover, neither is especially sensitive to slight departures from exact tracking of $R' sub 1$ with $R sub 1$. Such a filter has been used in a simple hand-controlled speech synthesizer, built for demonstration and amusement (Witten and Madams, 1978). .[ Witten Madams 1978 Chatterbox .] However, the need for tracking resistors, and the inverse square root variation of the formant frequency with $R sub 1$, makes it rather unsuitable for serious applications. .pp A better analogue filter is the ring-of-three configuration shown in Figure 5.12. .FC "Figure 5.12" (Ignore the secondary output for now.) Control is achieved over the centre frequency by two multipliers, driven from the same control input $k$. These have a high-impedance output, producing a current $kx$ if the input voltage is $x$. It is not too difficult to show that the transfer function of the circuit is .LB .EQ - ~~ { {k sup 2} over {C sup 2} } over { s sup 2 ~~+~~ 2 over RC ~s ~~+~~{1+k sup 2 R sup 2} over {R sup 2 C sup 2} } ~ . .EN .LE Suppose that $R$ is chosen so that $k sup 2 R sup 2 ~ >>~ 1$. Then this is a unity-gain resonator with constant bandwidth $1/ pi RC$\ Hz and centre frequency $k/2 pi C$\ Hz. Note that it is the combination of both multipliers that makes the centre frequency grow linearly with $k$: with one multiplier there would be a square-root relationship. .pp The ring-of-three filter of Figure 5.12 is arranged in a slightly unusual way, with an inverting stage at the beginning and the two resonant stages following it. This ensures that the signal level at intermediate points in the filter does not exceed that at the output, and gives the filter the best chance of coping with a wide range of input amplitudes without clipping. This contrasts markedly with the resonator of Figure 5.11, where the voltage at the output of the first integrator is $w/b$ times the final output \(em a factor of 50 in the worst case. .pp For a digital implementation of a formant, consider the recurrence relation .LB .EQ y(n)~ = ~~ a sub 1 y(n-1) ~-~ a sub 2 y(n-2) ~+~ a sub 0 x(n) , .EN .LE where $x(n)$ is the input and $y(n)$ the output at time $n$, $y(n-1)$ and $y(n-2)$ are the previous two values of the output, and $a sub 0$, $a sub 1$, and $a sub 2$ are (real) constants. The minus sign is in front of the second term because it makes $a sub 2$ turn out to be positive. 
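.pp
Anticipating the centre-frequency and bandwidth relations derived in the next
few paragraphs, the recurrence can be sketched directly in Pascal.
The coefficient values below give a 1\ kHz resonance with a 75\ Hz bandwidth
at an 8\ kHz sampling rate (the same figures as the example of Figure 5.13);
the program and variable names are illustrative only.
.LB
.nf
program digitalformant;
{ sketch: the recurrence y(n) = a1*y(n-1) - a2*y(n-2) + a0*x(n),
  with coefficients chosen from the centre-frequency and
  bandwidth formulae given below, excited by a unit impulse }
const
   pi = 3.14159265;
var
   T, F, B, a0, a1, a2, x, y, y1, y2: real;
   n: integer;
begin
   T := 1/8000;  F := 1000;  B := 75;
   a2 := exp(-2*pi*B*T);              { bandwidth = -(1/2piT) ln a2 }
   a1 := 2*sqrt(a2)*cos(2*pi*F*T);    { centre frequency formula, inverted }
   a0 := 1 - a1 + a2;                 { unity gain at DC }
   writeln('a0 = ', a0:8:5, '  a1 = ', a1:8:5, '  a2 = ', a2:8:5);
   y1 := 0;  y2 := 0;
   for n := 0 to 15 do begin
      if n = 0 then x := 1 else x := 0;     { unit impulse input }
      y := a1*y1 - a2*y2 + a0*x;
      writeln(n:3, y:12:6);
      y2 := y1;  y1 := y
   end
end.
.fi
.LE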
To calculate the $z$-transform version of this relationship, multiply through by $z sup -n$ and sum from $n=- infinity$ to $infinity$ : .LB "nn" .EQ sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ mark =~~ a sub 1 sum from {n=- infinity} to infinity ~y(n-1)z sup -n ~~-~ a sub 2 sum from {n=- infinity} to infinity ~y(n-2)z sup -n ~~+~ a sub 0 sum from {n=- infinity} to infinity ~x(n)z sup -n .EN .sp .EQ lineup = ~~ a sub 1 z sup -1 ~ sum ~y(n-1)z sup -(n-1) ~~-~~ a sub 2 z sup -2 ~ sum ~y(n-2)z sup -(n-2) ~~+~~ a sub 0 ~ sum ~x(n)z sup -n ~ . .EN .LE "nn" Writing this in terms of $z$-transforms, .LB .EQ Y(z)~ = ~~ a sub 1 z sup -1 Y(z) ~-~ a sub 2 z sup -2 Y(z) ~+~ a sub 0 X(z) . .EN .LE Thus the input-output transfer function of the system is .LB .EQ H(z)~ = ~~ Y(z) over X(z) ~~=~~ {a sub 0 } over {1~-~a sub 1 z sup -1 ~+~a sub 2 z sup -2} ~ . .EN .LE .pp We learned in the previous chapter that the frequency response is obtained from the $z$-transform of a system by replacing $z sup -1$ by $e sup {-j2 pi fT}$, where $f$ is the frequency variable in\ Hz. Hence the amplitude response of the digital formant filter is .LB .EQ |H(e sup {j2 pi fT} )| sup 2 ~~ = ~~ left [ {a sub 0} over {1~-~a sub 1 e sup {-j2 pi fT} ~+~a sub 2 e sup {-j4 pi fT} } ~ right ] sup 2 ~ . .EN .sp .LE It is fairly obvious from this that a DC gain of 1 is obtained if .LB .EQ a sub 0 ~ = ~~ 1 ~-~ a sub 1 ~+~ a sub 2 , .EN .LE for $e sup {-j2 pi fT}$ is 1 at a frequency of 0\ Hz. Some manipulation is required to show that, under the usual assumption that the bandwidth is small, the centre frequency is .LB .EQ 1 over {2 pi T} ~~ cos sup -1 ~ {a sub 1} over {2 a sub 2 sup 1/2} ~ .EN Hz. .LE Furthermore, the 3\ dB bandwidth of the resonance is given approximately by .LB .EQ -~ 1 over {2 pi T} ~~ log sub e a sub 2 ~ .EN Hz. .LE .pp As an example, Figure 5.13 shows an amplitude response for this digital filter. .FC "Figure 5.13" The parameters $a sub 0$, $a sub 1$ and $a sub 2$ were generated from the above relationships for a sampling frequency of 8\ kHz, centre frequency of 1\ kHz, and bandwidth of 75\ Hz. It exhibits a peak of approximately the right bandwidth at the correct frequency, 1\ kHz. Note that the response is flat at half the sampling frequency, for the frequency response from 4\ kHz to 8\ kHz is just a reflection of that up to 4\ kHz. This contrasts sharply with that of an analogue formant filter, also shown in Figure 5.13, which slopes at \-12\ dB/octave at frequencies above resonance. .pp The behaviour of a digital formant filter at frequencies above resonance actually makes it preferable to an analogue implementation. We saw earlier that considerable trouble must be taken with the latter to compensate for the cumulative effect of \-12\ dB/octave at higher frequencies for each of the formants. This is not necessary with digital implementations, for the response of a digital formant filter is flat at half the sampling frequency. In fact, further study shows that digital synthesizers without any higher-pole correction give a closer approximation to the vocal tract than analogue ones with higher-pole correction (Gold and Rabiner, 1968). .[ Gold Rabiner 1968 Analysis of digital and analogue formant synthesizers .] .rh "Time-domain methods." An interesting alternative to frequency-domain speech synthesis is to construct the formants in the time domain. When a second-order resonance is excited by an impulse, an exponentially decaying sinusoid is produced, as illustrated by Figure 5.14.
.FC "Figure 5.14" The oscillation occurs at the resonant frequency of the filter, while the decay is related to the bandwidth. In fact, if the formant filter has transfer function .LB .EQ {w sup 2} over {s sup 2 ~+~ b s ~+~ w sup 2} ~ , .EN .LE the time waveform for impulsive excitation is .LB .EQ x(t)~ = ~~ w~ e sup -bt/2 ~ sin ~ wt ~~~~~~~~ .EN (neglecting $b sup 2 /w sup 2$). .LE It is the combination of several such time waveforms, coupled with the regular reappearance of excitation at the pitch period, that produces the characteristic wiggly waveform of voiced speech. .pp Now suppose we take a sine wave of frequency $w$ and multiply it by a decaying exponential $e sup -bt/2$. This gives a signal .LB .EQ x(t)~ = ~~ e sup -bt/2 ~ sin ~ wt , .EN .LE which is identical with the filtered impulse except for a factor $w$. If there are several formants in parallel, all with the same bandwidth, the exponential factor is the same for each: .LB .EQ x(t)~ = ~~ e sup -bt/2 ~ (A sub 1 ~ sin ~ w sub 1 t ~~+ ~~ A sub 2 ~ sin ~ w sub 2 t ~~ + ~~ A sub 3 ~ sin ~ w sub 3 t) . .EN .LE $A sub 1$, $A sub 2$, and $A sub 3$ control the formant amplitudes, as in an ordinary parallel synthesizer; except that they need adjusting to account for the missing factors $w sub 1$, $w sub 2$, and $w sub 3$. .pp A neat way of implementing such a synthesizer digitally is to store one cycle of a sine wave in a read-only memory (ROM). Then, the formant frequencies can be controlled by reading the ROM at different rates. For example, if twice the basic frequency is desired, every second value should be read. Multiplication is needed for amplitude control of each formant: this can be accomplished by shifting the digital word (each place shifted accounts for 6\ dB of attenuation). Finally, the exponential damping factor can be provided in analogue hardware by a single capacitor after the D/A converter. This implementation gives a system for hardware-software synthesis which involves an absolutely minimal amount of extra hardware apart from the computer, and does not need hardware multiplication for real-time operation. It could easily be made to work in real time with a microprocessor coupled to a D/A converter, damping capacitor, and fixed tone-control filter to give the required spectral equalization. .pp Because the overall spectral decay of an impulse exciting a second-order formant filter is 12\ dB/octave, the appropriate equalization is +6\ dB/octave lift at high frequencies, to give an overall \-6\ dB/octave spectral trend. .pp Note, however, that this synthesis model is an extremely basic one. Only impulsive excitation can be accomodated. For fricatives, which we will discuss in more detail below, a different implementation is needed. A hardware noise generator, with a few fixed filters \(em one for each fricative type \(em will suffice for a simple system. More damaging is the lack of aspiration, where random noise excites the vocal tract resonances. This cannot be simulated in the model. The .ul h sound can be provided by treating it as a fricative, and although it will not sound completely realistic, because there will be no variation with the formant positions of adjacent phonemes, this can be tolerated because .ul h is not too important for speech intelligibility. A bigger disadvantage is the lack of proper aspiration control for producing unvoiced stops, which as mentioned in Chapter 2 consist of an silent phase followed by a burst of aspiration. 
Experience has shown that although it is difficult to drive such a synthesizer from a software synthesis-by-rule system, quite intelligible output can be obtained if parameters are derived from real speech and tweaked by hand. Then, for each aspiration burst the most closely-matching fricative sound can be used. .sh "5.4 Aspiration and frication" .pp The model of the vocal tract as a filter which affects the frequency spectrum of the basic voiced excitation breaks down if there are constrictions in it, for these introduce new sound sources caused by turbulent air. The generation of unvoiced excitation has been discussed earlier in this chapter: now we must consider how to simulate the filtering action of the vocal tract for unvoiced sounds. .pp Aspiration and frication need to be dealt with separately. The former is caused by excitation at the vocal cords \(em the cords are held so close together that turbulent noise is produced. This noise passes through the same vocal tract filter that modifies voiced sounds, and the same kind of formant structure can be observed. All that is needed to simulate it is to replace the voiced excitation source by white noise, as shown in the upper part of Figure 5.15. .FC "Figure 5.15" .pp Speech can be whispered by substituting aspiration for voicing throughout. Of course, there is no fundamental frequency associated with aspiration. An interesting way of assessing informally the degradation caused by inadequate pitch control in a speech synthesis-by-rule system is to listen to whispered speech, in which pitch variations play no part. .pp Voiced and aspirative excitation are rarely produced at the same time in natural speech (but see the discussion in Chapter 2 about breathy voice). However, the excitation can change from one to the other quite quickly, and when this happens there is no discontinuity in the formant structure. .pp Fricative, or sibilant, excitation is quite different from aspiration, because it introduces a new sound source at a different place from the vocal cords. The constriction which produces the sound may be at the lips, the teeth, the hard ridge just behind the top front teeth, or further back along the palate. These positions each produce a different sound (\c .ul f, .ul th, .ul s, and .ul sh respectively). However, smooth transitions from one of these sounds to another do not occur in natural speech; and dynamical movement of the frequency spectrum during a fricative is unnecessary for speech synthesis. .pp It is necessary, however, to be able to produce an approximation to the noise spectrum for each of these sound types. This is commonly achieved by a single high-pass resonance whose centre frequency can be controlled. This is the purpose of the secondary output of the formant filter of Figure 5.12. Taking the output from this point gives a high-pass instead of a low-pass resonance, and this same filter configuration is quite acceptable for fricatives. Figure 5.15 shows the fricative sound path as a noise generator followed by such a filter. .pp Unlike aspiration, fricative excitation is frequently combined with voicing. This gives the voiced fricative sounds .ul v, .ul dh, .ul z, and .ul zh. It is possible to produce frication and aspiration together, and although there are no examples of this in English, speech synthesis-by-rule programs often use a short burst of aspiration .ul and frication when simulating the opening of unvoiced stops. 
Separate amplitude controls are therefore needed for voicing and frication, but the former can be used for aspiration as well, with a "glottal excitation type" switch to indicate aspiration rather than voicing. .sh "5.5 Summary" .pp A resonance speech synthesizer consists of a vocal tract filter, excited by either a periodic pitch pulse or aspiration noise. In addition, a set of sibilant sounds must be provided. The vocal tract filter is dynamic, with three controllable resonances. These, coupled with some fixed spectral compensation, give it a fairly high order \(em about 10 complex poles are needed. Although several different sibilant sound types must be simulated, dynamical movement is less important in fricative sound spectra than for voiced and aspirated sounds because smooth transitions between one fricative and another are not important in speech. However, fricative timing and amplitude must be controlled rather precisely. .pp The speech synthesizer is controlled by several parameters. These include fundamental frequency (if voiced), amplitude of voicing, frequency of the first few \(em typically three \(em formants, aspiration amplitude, sibilance amplitude, and frequency of one (or more) sibilance filters. Additionally, if the synthesizer is a parallel one, parameters for the amplitudes of individual formants will need to be included. It may be that some control over formant bandwidths is provided too. Thus synthesizers have from eight up to about 20 parameters (Klatt, 1980, describes one with 20 parameters). .[ Klatt 1980 Software for a cascade/parallel formant synthesizer .] .pp The parameters are supplied to the synthesizer at regular intervals of time. For a 10-parameter synthesizer, the control can be thought of as a set of 10 graphs, each representing the time evolution of one parameter. They are usually called parameter .ul tracks, the terminology dating from the days when a track was painted on a glass slide for each parameter to provide dynamic control of the synthesizer (Lawrence, 1953). .[ Lawrence 1953 .] The pitch track is often called a pitch .ul contour; this is a common phonetician's usage. Do not confuse this with the everyday meaning of "contour" as a line joining points of equal height on a map \(em a pitch contour is just the time evolution of the pitch frequency. .pp For computer-controlled synthesizers, of course, the parameter tracks are sampled, typically every 5 to 20\ msec. The rate is determined by the need to generate fast amplitude transitions for nasals and stop consonants. Contrast it with the 125\ $mu$sec sampling period needed to digitize telephone-quality speech. The raw data rate for a 10-parameter synthesizer updated every 10 msec is 1,000 parameters/sec, or 6\ Kbit/s if each parameter is represented by 6\ bits. This is a substantial reduction over the 56\ Kbit/s needed for PCM representation. For speech synthesis by rule (Chapter 7), these parameter tracks are generated by a computer program from a phonetic (or English) version of the utterance, lowering the data rate by a further one or two orders of magnitude. .pp Filters for speech synthesizers can be implemented in either analogue or digital form. High-order filters are usually broken down into second-order sections in parallel or in series. A third possibility, which has not been discussed above, is to implement a single high-order filter directly. Finally, the action of formant filters can be synthesized in the time domain. This gives eight possibilities which are summarized in Table 5.2. 
.RF .in +0.5i .ta 2.1i +2.0i .nr x1 (\w'Analogue'/2) .nr x2 (\w'Digital'/2) \h'-\n(x1u'Analogue \h'-\n(x2u'Digital .nr x0 2.0i+(\w'Liljencrants (1968)'/2)+(\w'Morris and Paillet (1972)'/2) .nr x3 (\w'Liljencrants (1968)'/2) \h'-\n(x3u'\l'\n(x0u\(ul' .sp .nr x1 (\w'Rice (1976)'/2) .nr x2 (\w'Rabiner \fIet al\fR'/2) Series \h'-\n(x1u'Rice (1976) \h'-\n(x2u'Rabiner \fIet al\fR .nr x1 (\w'Liljencrants (1968)'/2) .nr x2 (\w'Holmes (1973)'/2) Parallel \h'-\n(x1u'Liljencrants (1968) \h'-\n(x2u'Holmes (1973) .nr x1 (\w'unpublished'/2) .nr x2 (\w'unpublished'/2) Time-domain \h'-\n(x1u'unpublished \h'-\n(x2u'unpublished .nr x1 (\w'\(em'/2) .nr x2 (\w'Morris and Paillet (1972)'/2) High-order filter \h'-\n(x1u'\(em \h'-\n(x2u'Morris and Paillet (1972) \h'-\n(x3u'\l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in-0.5i .FG "Table 5.2 Implementation options for resonance speech synthesizers" .[ Rice 1976 Byte .] .[ Rabiner Jackson Schafer Coker 1971 .] .[ Liljencrants 1968 .] .[ Holmes 1973 Influence of glottal waveform on naturalness .] .[ Morris and Paillet 1972 .] All but one have certainly been used as the basis for synthesis, and the table includes reference to published descriptions. .pp Each method has advantages and disadvantages. Series decomposition obviates the need for control over the amplitudes of individual formants, but does not allow synthesis of sounds which use the nasal tract as well as the oral one; for these are in parallel. Analogue implementation of series synthesizers is complicated by the need for higher-pole correction, and the fact that the gains at different frequencies can vary widely throughout the system. Higher-pole correction is not so important for digital synthesizers. Parallel decomposition eliminates some of these problems: higher-pole correction can be implemented individually for each formant. However, the formant amplitudes must be controlled rather precisely to simulate the vocal tract, which is essentially serial. Time-domain synthesis is associated with low hardware costs but does not easily allow proper control over the excitation sources. In particular, it cannot simulate dynamical movement of the spectrum during aspiration. Implementation of the entire vocal tract model as a single high-order filter, without breaking it down into individual formants in series or parallel, is attractive from the computational point of view because fewer arithmetic operations are required. It is best analysed in terms of linear predictive coding, which is the subject of the next chapter. .sh "5.6 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "5.7 Further reading" .pp Historically-minded readers should look at the early speech synthesizer designed by Lawrence (1953). This and other classic papers on the subject are reprinted in Flanagan and Rabiner (1973). A good description of a quite sophisticated parallel synthesizer can be found in Holmes (1973), above, and another of a switchable series/parallel one in Klatt (1980), who even includes a listing of the Fortran program that implements it. Here are some useful books on speech synthesizers. .LB "nn" .\"Fant-1960-1 .]- .ds [A Fant, G. .ds [D 1960 .ds [T Acoustic theory of speech production .ds [I Mouton .ds [C The Hague .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Fant really started the study of the vocal tract as an acoustic system, and this book marks the beginning of modern speech synthesis. .in-2n .\"Flanagan-1972-1 .]- .ds [A Flanagan, J.L.
.ds [D 1972 .ds [T Speech analysis, synthesis, and perception (2nd, expanded, edition) .ds [I Springer Verlag .ds [C Berlin .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This book is the speech researcher's bible, and like the bible, it's not all that easy to read. However, it is an essential reference source for speech acoustics and speech synthesis (as well as for human speech perception). .in-2n .\"Flanagan-1973-2 .]- .ds [A Flanagan, J.L. .as [A " and Rabiner, L.R. (Editors) .ds [D 1973 .ds [T Speech synthesis .ds [I Dowden, Hutchinson and Ross .ds [C Stroudsburg, Pennsylvania .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n I recommended this book at the end of Chapter 1 as a collection of classic papers on the subject of speech synthesis and synthesizers. .in-2n .\"Holmes-1972-3 .]- .ds [A Holmes, J.N. .ds [D 1972 .ds [T Speech synthesis .ds [I Mills and Boon .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This little book, by one of Britain's foremost workers in the field, introduces the subject of speech synthesis and speech synthesizers. It has a particularly good discussion of parallel synthesizers. .in-2n .LE "nn" .EQ delim $$ .EN .CH "6 LINEAR PREDICTION OF SPEECH" .ds RT "Linear prediction of speech .ds CX "Principles of computer speech .pp The speech coding techniques which were discussed in Chapter 3 operate in the time domain, while the analysis and synthesis techniques of Chapters 4 and 5 are based in the frequency domain. Linear prediction is a relatively new method of speech analysis-synthesis, introduced in the early 1970's and used extensively since then, which is primarily a time-domain coding method but can be used to give frequency-domain parameters like formant frequency, bandwidth, and amplitude. .pp It has several advantages over other speech analysis techniques, and is likely to become increasingly dominant in speech output systems. As well as bridging the gap between time- and frequency-domain techniques, it is of equal value for both speech storage and speech synthesis, and forms an extremely convenient basis for speech-output systems which use high-quality stored speech for routine messages and synthesis from phonetics or text for unusual or exceptional conditions. Linear prediction can be used to separate the excitation source properties of pitch and amplitude from the vocal tract filter which governs phoneme articulation, or, in other words, to separate much of the prosodic from the segmental information. Hence it makes it easy to use stored segmentals with synthetic prosody, which is just what is needed to enhance the flexibility of stored speech by providing overall intonation contours for utterances formed by word concatenation (see Chapter 7). .pp The frequency-domain analysis technique of Fourier transformation necessarily involves approximation because it applies only to periodic waveforms, and so the artificial operation of windowing is required to suppress the aperiodicity of real speech. In contrast, the linear predictive technique, being a time-domain method, can \(em in certain forms \(em deal more rationally with aperiodic signals. .pp The basic idea of linear predictive coding is exactly the same as one form of adaptive differential pulse code modulation which was introduced briefly in Chapter 3. There it was noted that a speech sample $x(n)$ can be predicted quite closely by the previous sample $x(n-1)$. The prediction can be improved by multiplying the previous sample by a number, say $a sub 1$, which is adapted on a syllabic time-scale.
This can be utilized for speech coding by transmitting only the prediction error .LB .EQ e(n)~=~~x(n)~-~a sub 1 x(n-1), .EN .LE and using it (and the value of $a sub 1$) to reconstitute the signal $x(n)$ at the receiver. It is worthwhile noting that exactly the same relationship was used for digital preemphasis in Chapter 4, with the value of $a sub 1$ being constant at about 0.9 \(em although the possibility of adapting it to take into account the difference between voiced and unvoiced speech was discussed. .pp An obvious extension is to use several past values of the signal to form the prediction, instead of just one. Different multipliers for each would be needed, so that the prediction error could be written as .LB .EQ e(n)~~ mark =~~x(n)~-~a sub 1 x(n-1)~-~a sub 2 x(n-2)~-~...~-~a sub p x(n-p) .EN .sp .EQ lineup =~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k). .EN .LE The multipliers $a sub k$ should be adapted to minimize the error signal, and we will consider how to do this in the next section. It turns out that they must be re-calculated and transmitted on a time-scale that is rather faster than syllabic but much slower than the basic sampling rate: intervals of 10\-25\ msec are usually used (compare this with the 125\ $mu$sec sampling rate for telephone-quality speech). A configuration for high-order adaptive differential pulse code modulation is shown in Figure 6.1. .FC "Figure 6.1" .pp Figure 6.2 shows typical time waveforms for each of the ten coefficients over a 1-second stretch of speech. .FC "Figure 6.2" Notice that they vary much more slowly than, say, the speech waveform of Figure 3.5. .pp Turning the above relationship into $z$-transforms gives .LB .EQ E(z)~~=~~X(z)~-~~sum from k=1 to p ~a sub k z sup -k ~X(z)~~=~~(1~-~~ sum from k=1 to p ~a sub k z sup -k )~X(z). .EN .LE Rewriting the speech signal in terms of the error, .LB .EQ X(z)~~=~~1 over {1~-~~ sum ~a sub k z sup -k }~.~E(z) . .EN .LE .pp Now let us bring together some facts from the previous chapter which will allow the time-domain technique of linear prediction to be interpreted in terms of the frequency-domain formant model of speech. Recall that speech can be viewed as an excitation source passing through a vocal tract filter, followed by another filter to model the effect of radiation from the lips. The overall spectral levels can be reassigned as in Figure 5.1 so that the excitation source has a 0\ dB/octave spectral profile, and hence is essentially impulsive. Considering the vocal tract filter as a series connection of digital formant filters, its transfer function is the product of terms like .LB .EQ 1 over {1~-~b sub 1 z sup -1 ~+~b sub 2 z sup -2}~ , .EN .LE where $b sub 1$ and $b sub 2$ control the position and bandwidth of the formant resonances. The \-6\ dB/octave spectral compensation can be modelled by the first-order digital filter .LB .EQ 1 over {1~-~bz sup -1}~ . .EN .LE The product of all these terms, when multiplied out, will have the form .LB .EQ 1 over {1~-~c sub 1 z sup -1 ~-~c sub 2 z sup -2 ~-~...~-~ c sub q z sup -q }~ , .EN .LE where $q$ is twice the number of formants plus one, and the $c$'s are calculated from the positions and bandwidths of the formant resonances and the spectral compensation parameter. Hence the $z$-transform of the speech is .LB .EQ X(z)~=~~1 over {1~-~~ sum from k=1 to q ~c sub k z sup -k }~.~I(z) , .EN .LE where $I(z)$ is the transform of the impulsive excitation. .pp This is remarkably similar to the linear prediction relation given earlier! 
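.pp
The multiplication of these factors is easy to carry out explicitly.
The sketch below is illustrative only (the names and coefficient values are
ours; the second-order coefficients correspond roughly to formants near 500,
1500 and 2500\ Hz at an 8\ kHz sampling rate); it convolves three
second-order formant factors with one first-order compensation factor to
give the combined polynomial.
.LB
.nf
program multiplyfactors;
{ sketch: multiplies (1 - b z^-1) by three factors of the form
  (1 - b1 z^-1 + b2 z^-2), giving the combined denominator
  1 - c1 z^-1 - ... - cq z^-q with q = 7; the coefficients c(k)
  of the text are the negatives of d[k] below, for k >= 1 }
const
   q = 7;
var
   d, t: array [0..q] of real;        { running product and a copy }
   b1, b2: array [1..3] of real;
   b: real;
   i, k, m: integer;
begin
   b1[1] := 1.79;  b2[1] := 0.94;     { illustrative formant factors }
   b1[2] := 0.74;  b2[2] := 0.94;
   b1[3] := -0.74; b2[3] := 0.94;
   b := 0.9;                          { -6 dB/octave compensation }
   for k := 0 to q do d[k] := 0;
   d[0] := 1;  d[1] := -b;            { start with the first-order factor }
   m := 1;                            { current degree of the product }
   for i := 1 to 3 do begin
      for k := 0 to q do t[k] := d[k];
      for k := 0 to m + 2 do begin
         d[k] := t[k];
         if k >= 1 then d[k] := d[k] - b1[i]*t[k-1];
         if k >= 2 then d[k] := d[k] + b2[i]*t[k-2]
      end;
      m := m + 2
   end;
   for k := 0 to q do writeln('d[', k:1, '] = ', d[k]:10:5)
end.
.fi
.LE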
If $p$ and $q$ are the same, then the linear predictive coefficients $a sub k$ form a $p$'th order polynomial which is the same as that obtained by multiplying together the second-order polynomials representing the individual formants (together with the first-order one for spectral compensation). Furthermore, the predictive error $E(z)$ can be identified with the impulsive excitation $I(z)$. This raises the very interesting possibility of parametrizing the error signal by its frequency and amplitude \(em two relatively slowly-varying quantities \(em instead of transmitting it sample-by-sample (at an 8\ kHz rate). This is how linear prediction separates out the excitation properties of the source from the vocal tract filter: the source parameters can be derived from the error signal and the vocal tract filter is represented by the linear predictive coefficients. Figure 6.3 shows how this can be used for speech transmission. .FC "Figure 6.3" Note that .ul no signals need now be transmitted at the speech sampling rate; for the source parameters vary relatively slowly. This leads to an extremely low data rate. .pp Practical linear predictive coding schemes operate with a value of $p$ between 10 and 15, corresponding approximately to 4-formant and 7-formant synthesis respectively. The $a sub k$'s are re-calculated every 10 to 25\ msec, and transmitted to the receiver. Also, the pitch and amplitude of the speech are estimated and transmitted at the same rate. If the speech is unvoiced, there is no pitch value: an "unvoiced flag" is transmitted instead. Because the linear predictive coefficients are intimately related to formant frequencies and bandwidths, a "frame rate" in the region of 10 to 25\ msec is appropriate because this approximates the maximum rate at which acoustic events happen in speech production. .pp At the receiver, the excitation waveform is reconstituted. For voiced speech, it is impulsive at the specified frequency and with the specified amplitude, while for unvoiced speech it is random, with the specified amplitude. This signal $e(n)$, together with the transmitted parameters $a sub 1$, ..., $a sub p$, is used to regenerate the speech waveform by .LB .EQ x(n)~=~~e(n)~+~~sum from k=1 to p ~a sub k x(n-k) , .EN .LE \(em which is the inverse of the transmitter's formula for calculating $e(n)$, namely .LB .EQ e(n)~=~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k) . .EN .LE This relies on knowing the past $p$ values of the speech samples. Many systems set these past values to zero at the beginning of each pitch cycle. .pp Linear prediction can also be used for speech analysis, rather than for speech coding, as shown in Figure 6.4. .FC "Figure 6.4" Instead of transmitting the coefficients $a sub k$, they are used to determine the formant positions and bandwidths. We saw above that the polynomial .LB .EQ 1~-~a sub 1 z sup -1 ~-~a sub 2 z sup -2 ~-~...~-~a sub p z sup -p , .EN .LE when factored into a product of second-order terms, gives the formant characteristics (as well as the spectral compensation term). Factoring is equivalent to finding the complex roots of the polynomial, and this is fairly demanding computationally \(em especially if done at a high rate. Consequently, peak-picking algorithms are sometimes used instead. The absolute value of the polynomial gives the frequency spectrum of the vocal tract filter, and the formants appear as peaks \(em just as they do in cepstrally smoothed speech (see Chapter 4). 
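.pp
Returning to the reconstruction formula above, the receiver's synthesis loop
is simple enough to sketch directly.
In the sketch below the coefficients, pitch and frame length are placeholders
only (a single non-zero coefficient is used so that the example stays short);
a real system would of course load them from the transmitted frames.
.LB
.nf
program lpcsynthesis;
{ sketch: regenerates one voiced frame of speech from an impulsive
  excitation and a set of prediction coefficients, using
  x(n) = e(n) + sum of a(k)*x(n-k) }
const
   p = 10;
   framelen = 80;                  { 10 msec at 8 kHz }
var
   a: array [1..p] of real;
   x: array [-10..79] of real;
   k, n, pitch: integer;
   e, amp: real;
begin
   for k := 1 to p do a[k] := 0;       { placeholder coefficients }
   a[1] := 0.9;                        { a one-pole "tract", for illustration }
   for n := -p to -1 do x[n] := 0;     { past samples, here simply zeroed }
   pitch := 64;  amp := 1.0;           { 125 Hz pitch at 8 kHz sampling }
   for n := 0 to framelen - 1 do begin
      if n mod pitch = 0 then e := amp else e := 0;   { impulsive excitation }
      x[n] := e;
      for k := 1 to p do x[n] := x[n] + a[k]*x[n-k];
      writeln(n:4, x[n]:12:6)
   end
end.
.fi
.LE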
.pp The chief deficiency in the linear predictive method, whether it is used for speech coding or for speech analysis, is that \(em like a series synthesizer \(em it implements an all-pole model of the vocal tract. We mentioned in Chapter 5 that this is rather simplistic, especially for nasalized sounds which involve a cavity in parallel with the oral one. Some research has been done on incorporating zeros into a linear predictive model, but it complicates the problem of calculating the parameters enormously. For most purposes people seem to be able to live with the limitations of the all-pole model. .sh "6.1 Linear predictive analysis" .pp The key problem in linear predictive coding is to determine the values of the coefficients $a sub 1$, ..., $a sub p$. If the error signal is to be transmitted on a sample-by-sample basis, as it is in adaptive differential pulse code modulation, then it can be most economically encoded if its mean power is as small as possible. Thus the coefficients are chosen to minimize .LB .EQ sum ~e(n) sup 2 .EN .LE over some period of time. The period of time used is related to the frame rate at which the coefficients are transmitted or stored, although there is no need to make it exactly the same as one frame interval. As mentioned above, the frame size is usually chosen to be in the region of 10 to 25\ msec. Some schemes minimize the error signal over as few as 30 samples (corresponding to 3\ msec at a 10\ kHz sampling rate). Others take longer; up to 250 samples (25\ msec). .pp However, if the error signal is to be considered as impulsive and parametrized by its frequency and amplitude before transmission, or if the coefficients $a sub k$ are to be used for spectral calculations, then it is not immediately obvious how the coefficients should be calculated. In fact, it is still best to choose them to minimize the above sum. This is at least plausible, for an impulsive excitation will have a rather small mean power \(em most of the samples are zero. It can be justified theoretically in terms of .ul spectral whitening, for it can be shown that minimizing the mean-squared error produces an error signal whose spectrum is maximally flat. Now the only two waveforms whose spectra are absolutely flat are a single impulse and white noise. Hence if the speech is voiced, minimizing the mean-squared error will lead to an error signal which is as nearly impulsive as possible. Provided the time-frame for minimizing is short enough, the impulse will correspond to a single excitation pulse. If the speech is unvoiced, minimization will lead to an error signal which is as nearly white noise as possible. .pp How does one choose the linear predictive coefficients to minimize the mean-squared error? The total squared prediction error is .LB .EQ M~=~~sum from n ~e(n) sup 2~~=~~sum from n ~[x(n)~-~ sum from k=1 to p ~a sub k x sub n-k ] sup 2 , .EN .LE leaving the range of summation unspecified for the moment. To minimize $M$ by choice of the coefficients $a sub j$, differentiate with respect to each of them and set the resulting derivatives to zero. .LB .EQ dM over {da sub j} ~~=~~-2 sum from n ~x(n-j)[x(n)~-~~ sum from k=1 to p ~a sub k x(n-k)]~~=~0~, .EN .LE so .LB .EQ sum from k=1 to p ~a sub k ~ sum from n ~x(n-j)x(n-k)~~=~~ sum from n ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p. .EN .LE .pp This is a set of $p$ linear equations for the $p$ unknowns $a sub 1$, ..., $a sub p$. Solving it is equivalent to inverting a $p times p$ matrix. 
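.pp
For $p=1$ the set collapses to a single equation,
$a sub 1 ~ sum ~x(n-1)x(n-1) ~=~ sum ~x(n)x(n-1)$,
so the coefficient is just the ratio of two sums.
The sketch below computes it for a short synthetic signal (the signal and all
names are illustrative only) and prints the resulting error power alongside
the signal power.
.LB
.nf
program firstorder;
{ sketch: the p = 1 case of the normal equations,
  a1 = (sum of x(n)x(n-1)) / (sum of x(n-1)x(n-1)) }
const
   N = 64;
var
   x: array [0..64] of real;
   num, den, a1, e, errpower: real;
   n: integer;
begin
   for n := 0 to N do                        { synthetic "speech" }
      x[n] := exp(-0.05*n)*cos(0.3*n);
   num := 0;  den := 0;
   for n := 1 to N do begin
      num := num + x[n]*x[n-1];
      den := den + x[n-1]*x[n-1]
   end;
   a1 := num/den;
   errpower := 0;
   for n := 1 to N do begin
      e := x[n] - a1*x[n-1];
      errpower := errpower + e*e
   end;
   writeln('a1 = ', a1:8:4);
   writeln('error power ', errpower:10:6, '   signal power ', den:10:6)
end.
.fi
.LE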
This job must be repeated at the frame rate, and so if real-time operation is desired quite a lot of calculation is needed. .rh "The autocorrelation method." So far, the range of the $n$-summation has been left open. The coefficients of the matrix equation have the form .LB .EQ sum from n ~x(n-j)x(n-k). .EN .LE If a doubly-infinite summation were made, with $x(n)$ being defined as zero whenever $n<0$, we could make use of the fact that .sp .ce .EQ sum from {n=- infinity} to infinity ~x(n-j)x(n-k)~=~~ sum from {n=- infinity} to infinity ~x(n-j+1)x(n-k+1)~=~...~=~~ sum from {n=- infinity} to infinity ~x(n)x(n+j-k) .EN .sp to simplify the matrix equation. This just states that the autocorrelation of an infinite sequence depends only on the lag at which it is computed, and not on absolute time. .pp Defining $R(m)$ as the autocorrelation at lag $m$, that is, .LB .EQ R(m)~=~ sum from n ~x(n)x(n+m), .EN .LE the matrix equation becomes .LB .ne7 .nf .EQ R(0)a sub 1 ~+~R(1)a sub 2 ~+~R(2)a sub 3 ~+~...~~=~R(1) .EN .EQ R(1)a sub 1 ~+~R(0)a sub 2 ~+~R(1)a sub 3 ~+~...~~=~R(2) .EN .EQ R(2)a sub 1 ~+~R(1)a sub 2 ~+~R(0)a sub 3 ~+~...~~=~R(3) .EN .EQ etc .EN .fi .LE An elegant method due to Durbin and Levinson exists for solving this special system of equations. It requires much less computational effort than is generally needed for symmetric matrix equations. .pp Of course, an infinite range of summation can not be used in practice. For one thing, the power spectrum is changing, and only the data from a short time-frame should be used for a realistic estimate of the optimum linear predictive coefficients. Hence a windowing procedure, .LB .EQ x(n) sup * ~=~w sub n x(n), .EN .LE is used to reduce the signal to zero outside a finite range of interest. Windows were discussed in Chapter 4 from the point of view of Fourier analysis of speech signals, and the same sort of considerations apply to choosing a window for linear prediction. .pp This is known as the .ul autocorrelation method of computing prediction parameters. Typically a window of 100 to 250 samples is used for analysis of one frame of speech. .rh "Algorithm for the autocorrelation method." The algorithm for obtaining linear prediction coefficients by the autocorrelation method is quite simple. It is straightforward to compute the matrix coefficients $R(m)$ from the speech samples and window coefficients. The Durbin-Levinson method of solving matrix equations operates directly on this $R$-vector to produce the coefficient vector $a sub k$. The complete procedure is given as Procedure 6.1, and is shown diagrammatically in Figure 6.5. 
.FC "Figure 6.5" .RF .fi .na .nh .ul const N=256; p=15; .ul type svec = .ul array [0..N\-1] .ul of real; cvec = .ul array [1..p] .ul of real; .sp .ul procedure autocorrelation(signal: vec; window: svec; .ul var coeff: cvec); .sp {computes linear prediction coefficients by autocorrelation method in coeff[1..p]} .sp .ul var R, temp: .ul array [0..p] .ul of real; n: [0..N\-1]; i,j: [0..p]; E: real; .sp .ul begin {window the signal} .in+6n .ul for n:=0 .ul to N\-1 .ul do signal[n] := signal[n]*window[n]; .sp {compute autocorrelation vector} .br .ul for i:=0 .ul to p .ul do begin .in+2n R[i] := 0; .br .ul for n:=0 .ul to N\-1\-i .ul do R[i] := R[i] + signal[n]*signal[n+i] .in-2n .ul end; .sp {solve the matrix equation by the Durbin-Levinson method} .br E := R[0]; .br coeff[1] := R[1]/E; .br .ul for i:=2 .ul to p .ul do begin .in+2n E := (1\-coeff[i\-1]*coeff[i\-1])*E; .br coeff[i] := R[i]; .br .ul for j:=1 .ul to i\-1 .ul do coeff[i] := coeff[i] \- R[i\-j]*coeff[j]; .br coeff[i] := coeff[i]/E; .br .ul for j:=1 .ul to i\-1 .ul do temp[j] := coeff[j] \- coeff[i]*coeff[i\-j]; .br .ul for j:=1 .ul to i\-1 .ul do coeff[j] := temp[j] .in-2n .ul end .in-6n .ul end. .nf .FG "Procedure 6.1 Pascal algorithm for the autocorrelation method" .pp This algorithm is not quite as efficient as it might be, for some multiplications are repeated during the calculation of the autocorrelation vector. Blankinship (1974) shows how the number of multiplications can be reduced by about half. .[ Blankinship 1974 .] .pp If the algorithm is performed in fixed-point arithmetic (as it often is in practice because of speed considerations), some scaling must be done. The maximum and minimum values of the windowed signal can be determined within the window calculation loop, and one extra pass over the vector will suffice to scale it to maximum significance. (Incidentally, if all sample values are the same the procedure cannot produce a solution because $E$ becomes zero, and this can easily be checked when scaling.) .pp The absolute value of the $R$-vector has no significance, and since $R(0)$ is always the greatest element, this can be set to the largest fixed-point number and the other $R$'s scaled down appropriately after they have been calculated. These scaling operations are shown as dashed boxes in Figure 6.5. $E$ decreases monotonically as the computation proceeds, so it is safe to initialize it to $R(0)$ without extra scaling. The remainder of the scaling is straightforward, with the linear prediction coefficients $a sub k$ appearing as fractions. .rh "The covariance method." One of the advantages of linear predictive methods that was promised earlier was that it allows us to escape from the problem of windowing. To do this, we must abandon the requirement that the coefficients of the matrix equation have the symmetry property of autocorrelations. Instead, suppose that the range of $n$-summation uses a fixed number of elements, say N, starting at $n=h$, to estimate the prediction coefficients between sample number $h$ and sample number $h+N$. .pp This leads to the matrix equation .LB .EQ sum from k=1 to p ~a sub k sum from n=h to h+N-1 ~x(n-j)x(n-k) ~~=~~ sum from n=h to h+N-1 ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p. .EN .LE Alternatively, we could write .LB .EQ sum from k=1 to p ~a sub k ~ Q sub jk sup h~~=~~Q sub 0j sup h ~~~~j~=~1,~2,~...,~p; .EN .LE where .LB .EQ Q sub jk sup h~~=~~sum from n=h to h+N-1 ~x(n-j)x(n-k). 
.EN .LE Note that some values of $x(n)$ outside the range $h ~ <= ~ n ~ < ~ h+N$ are required: these are shown diagrammatically in Figure 6.6. .FC "Figure 6.6" .pp Now $Q sub jk sup h ~=~ Q sub kj sup h$, so the equation has a diagonally symmetric matrix; and in fact the matrix $Q sup h$ can be shown to be positive semidefinite \(em and is almost always positive definite in practice. Advantage can be taken of these facts to provide a computationally efficient method for solving the equation. According to a result called Cholesky's theorem, a positive definite symmetric matrix $Q$ can be factored into the form $Q ~ = ~ LL sup T$, where $L$ is a lower triangular matrix. This leads to an efficient solution algorithm. .pp This method of computing prediction coefficients has become known as the .ul covariance method. It does not use windowing of the speech signal, and can give accurate estimates of the prediction coefficients with a smaller analysis frame than the autocorrelation method. Typically, 50 to 100 speech samples might be used to estimate the coefficients, and they are re-calculated every 100 to 250 samples. .rh "Algorithm for the covariance method." An algorithm for the covariance method is given in Procedure 6.2, .RF .fi .na .nh .ul const N=100; p=15; .ul type svec = .ul array [\-p..N\-1] .ul of real; cvec = .ul array [1..p] .ul of real; .sp .ul procedure covariance(signal: svec; .ul var coeff: cvec); .sp {computes linear prediction coefficients by covariance method in coeff[1..p]} .sp .ul var Q: .ul array [0..p,0..p] .ul of real; n: 0..N\-1; i,j,r: 0..p; X: real; .sp .ul begin {calculate upper-triangular covariance matrix in Q} .in+6n .ul for i:=0 .ul to p .ul do .in+2n .ul for j:=i .ul to p .ul do begin .in+2n Q[i,j]:=0; .br .ul for n:=0 .ul to N\-1 .ul do .in+2n Q[i,j] := Q[i,j] + signal[n\-i]*signal[n\-j] .in-2n .in-2n .ul end; .in-2n .sp {calculate the square root of Q} .br .ul for r:=2 .ul to p .ul do .in+2n .ul begin .in+2n .ul for i:=2 .ul to r\-1 .ul do .in+2n .ul for j:=1 .ul to i\-1 .ul do .in+2n Q[i,r] := Q[i,r] \- Q[j,i]*Q[j,r]; .in-2n .ul for j:=1 .ul to r\-1 .ul do .in+2n .ul begin .in+2n X := Q[j,r]; .br Q[j,r] := Q[j,r]/Q[j,j]; .br Q[r,r] := Q[r,r] \- Q[j,r]*X .in-2n .ul end .in-2n .in-2n .in-2n .ul end; .in-2n .sp {calculate coeff[1..p]} .br .ul for r:=2 .ul to p .ul do .in+2n .ul for i:=1 .ul to r\-1 .ul do Q[0,r] := Q[0,r] \- Q[i,r]*Q[0,i]; .in-2n .ul for r:=1 .ul to p .ul do Q[0,r] := Q[0,r]/Q[r,r]; .br .ul for r:=p\-1 .ul downto 1 .ul do .in+2n .ul for i:=r+1 .ul to p .ul do Q[0,r] := Q[0,r] \- Q[r,i]*Q[0,i]; .in-2n .ul for r:=1 .ul to p .ul do coeff[r] := Q[0,r] .in-6n .ul end. .nf .FG "Procedure 6.2 Pascal algorithm for the covariance method" and is shown diagrammatically in Figure 6.7. .FC "Figure 6.7" The algorithm shown is not terribly efficient from a computation and storage point of view, although it is workable. For one thing, it uses the obvious method for computing the covariance matrix by calculating .EQ Q sub 01 sup h , .EN .EQ Q sub 02 sup h , ~ ..., .EN .EQ Q sub 0p sup h , .EN .EQ Q sub 11 sup h , ..., .EN in turn, which repeats most of the multiplications $p$ times \(em not an efficient procedure. A simple alternative is to precompute the necessary multiplications and store them in a $(N+h) times (p+1)$ diagonally symmetric table, but even apart from the extra storage required for this, the number of additions which must be performed subsequently to give the $Q$'s is far larger than necessary.
It is possible, however, to write a procedure which is both time- and space-efficient (Witten, 1980). .[ Witten 1980 Algorithms for linear prediction .] .pp The scaling problem is rather more tricky for the covariance method than for the autocorrelation method. The $x$-vector should be scaled initially in the same way as before, but now there are $p+1$ diagonal elements of the covariance matrix, any of which could be the greatest element. Of course, .LB .EQ Q sub jk ~~ <= ~~ Max ( Q sub 11 , Q sub 22 , ..., Q sub pp ), .EN .LE but despite the considerable communality in the summands of the diagonal elements, there are no .ul a priori bounds on the ratios between them. .pp The only way to scale the $Q$ matrix properly is to calculate each of its $p$ diagonal elements and use the greatest as a scaling factor. Alternatively, the fact that .LB .EQ Q sub jk ~~ <= ~~ N times Max( x sub n sup 2 ) .EN .LE can be used to give a bound for scaling purposes; however, this is usually a rather conservative bound, and as $N$ is often around 100, several bits of significance will be lost. .pp Scaling difficulties do not cease when $Q$ has been determined. It is possible to show that the elements of the lower-triangular matrix $L$ which represents the square root of $Q$ are actually .ul unbounded. In fact there is a slightly different variant of the Cholesky decomposition algorithm which guarantees bounded coefficients but suffers from the disadvantage that it requires square roots to be taken (Martin .ul et al, 1965). .[ Martin Peters Wilkinson 1965 .] However, experience with the method indicates that it is rare for the elements of $L$ to exceed 16 times the maximum element of $Q$, and the possibility of occasional failure to adjust the coefficients may be tolerable in a practical linear prediction system. .rh "Comparison of autocorrelation and covariance analysis." There are various factors which should be taken into account when deciding whether to use the autocorrelation or covariance method for linear predictive analysis. Furthermore, there is a rather different technique, called the "lattice method", which will be discussed shortly. The autocorrelation method involves windowing, which means that in practice a rather longer stretch of speech should be used for analysis. We have illustrated this by setting $N$=256 in the autocorrelation algorithm and 100 in the covariance one. Offsetting the extra calculation that this entails is the fact that the Durbin-Levinson method of inverting a matrix is much more efficient than Cholesky decomposition. In practice, this means that similar amounts of computation are needed for each method \(em a detailed comparison is made in Witten (1980). .[ Witten 1980 Algorithms for linear prediction .] .pp A factor which weighs against the covariance method is the difficulty of scaling intermediate quantities within the algorithm. The autocorrelation method can be implemented quite satisfactorily in fixed-point arithmetic, and this makes it more suitable for hardware implementation. Furthermore, serious instabilities sometimes arise with the covariance method, whereas it can be shown that the autocorrelation one is always stable. Nevertheless, the approximations inherent in the windowing operation, and the smearing effect of taking a larger number of sample points, mean that covariance-method coefficients tend to represent the speech more accurately, if they can be obtained. 
.pp One way of using the covariance method which has proved to be rather satisfactory in practice is to synchronize the analysis frame with the beginning of a pitch period, when the excitation is strongest. Pitch synchronous techniques were discussed in Chapter 4 in the context of discrete Fourier transformation of speech. The snag, of course, is that pitch peaks do not occur uniformly in time, and furthermore it is difficult to estimate their locations precisely. .sh "6.2 Linear predictive synthesis" .pp If the linear predictive coefficients and the error signal are available, it is easy to regenerate the original speech by .LB .EQ x(n)~=~~e(n)~+~~ sum from k=1 to p ~a sub k x(n-k) . .EN .LE If the error signal is parametrized into the sound source type (voiced or unvoiced), amplitude, and pitch (if voiced), it can be regenerated by an impulse repeated at the appropriate pitch frequency (if voiced), or white noise (if unvoiced). .pp However, it may be that the filter represented by the coefficients $a sub k$ is unstable, causing the output speech signal to oscillate wildly. In fact, it is only possible for the covariance method to produce an unstable filter, and not the autocorrelation method \(em although even with the latter, truncation of the $a sub k$'s for transmission may turn a stable filter into an unstable one. Furthermore, the coefficients $a sub k$ are not suitable candidates for quantization, because small changes in them can have a dramatic effect on the characteristics of the synthesis filter. .pp Both of these problems can be solved by using a different set of numbers, called .ul reflection coefficients, for quantization and transmission. Thus, for example, in Figures 6.1 and 6.3 these reflection coefficients could be derived at the transmitter, quantized, and used by the receiver to reproduce the speech waveform. They can be related to reflection and transmission parameters at the junctions of an acoustic tube model of the vocal tract; hence the name. Procedure 6.3 shows an algorithm for calculating the reflection coefficients from the filter coefficients $a sub k$. .RF .fi .na .nh .ul const p=15; .ul type cvec = .ul array [1..p] .ul of real; .sp .ul procedure reflection(coeff: cvec; .ul var refl: cvec); .sp {computes reflection coefficients in refl[1..p] corresponding to linear prediction coefficients in coeff[1..p]} .sp .ul var temp: cvec; i, m: 1..p; .sp .ul begin .in+6n .ul for m:=p .ul downto 1 .ul do begin .in+2n refl[m] := coeff[m]; .br .ul for i:=1 .ul to m\-1 .ul do temp[i] := coeff[i]; .br .ul for i:=1 .ul to m\-1 .ul do .ti+2n coeff[i] := .ti+4n (coeff[i] + refl[m]*temp[m\-i]) / (1 \- refl[m]*refl[m]); .in-2n .ul end .in-6n .ul end. .nf .MT 2 Procedure 6.3 Pascal algorithm for producing reflection coefficients from filter coefficients .TE .pp Although we will not go into the theoretical details here, reflection coefficients are bounded by $+-$1 for stable filters, and hence form a useful test for stability. Having a limited range makes them easy to quantize for transmission, and in fact they behave better under quantization than do the filter coefficients. One could resynthesize speech from reflection coefficients by first converting them to filter coefficients and using the synthesis method described above. However, it is natural to seek a single-stage procedure which can regenerate speech directly from reflection coefficients. .pp Such a procedure does exist, and is called a .ul lattice filter. Figure 6.8 shows one form of lattice for speech synthesis. 
.FC "Figure 6.8" The error signal (whether transmitted or synthesized) enters at the upper left-hand corner, passes along the top forward signal path, being modified on the way, to give the output signal at the right-hand side. Then it passes back through a chain of delays along the bottom, backward, path, and is used to modify subsequent forward signals. Finally it is discarded at the lower left-hand corner. .pp There are $p$ stages in the lattice structure of Figure 6.8, where $p$ is the order of the linear predictive filter. Each stage involves two multiplications by the appropriate reflection coefficients, one by the backward signal \(em the result of which is added into the forward path \(em and the other by the forward signal \(em the result of which is subtracted from the backward path. Thus the number of multiplications is twice the order of the filter, and hence twice as many as for the realization using coefficients $a sub k$. If the labour necessary to turn the reflection coefficients into $a sub k$'s is included, the computational load becomes the same. Moreover, since the reflection coefficients need fewer quantization bits than the $a sub k$'s (for a given speech quality), the word lengths are smaller in the lattice realization. .pp The advantages of the lattice method of synthesis over direct evaluation of the prediction using filter coefficients $a sub k$, then, are: .LB .NP the reflection coefficients are used directly .NP the stability of the filter is obvious from the reflection coefficient values .NP the system is more tolerant to quantization errors in fixed-point implementations. .LE Although it may seem unlikely that an unstable filter would be produced by linear predictive analysis, instability is in fact a real problem in non-lattice implementations. For example, coefficients are often interpolated at the receiver, to allow longer frame times and smooth over sudden transitions, and it is quite likely that an unstable configuration is obtained when interpolating filter coefficients between two stable configurations. This cannot happen with reflection coefficients, however, because a necessary and sufficient condition for stability is that all coefficients lie in the interval $(-1,+1)$. .sh "6.3 Lattice filtering" .pp Lattice filters are an important new method of linear predictive .ul analysis as well as synthesis, and so it is worth considering the theory behind them a little further. .rh "Theory of the lattice synthesis filter." Figure 6.9 shows a single stage of the synthesis lattice given earlier. .FC "Figure 6.9" There are two signals at each side of the lattice, and the $z$-transforms of these have been labelled $X sup +$ and $X sup -$ at the left-hand side and $Y sup +$ and $Y sup -$ at the right-hand side. The direction of signal flow is forwards along the upper ("positive") path and backwards along the lower ("negative") one. .pp The signal flows show that the following two relationships hold: .LB .EQ Y sup + ~=~~ X sup + ~+~ k z sup -1 Y sup - ~~~~~~ .EN for the forward (upper) path .br .EQ X sup - ~ =~ -kY sup + ~+~ z sup -1 Y sup - ~~~~~~~ .EN \h'-\w'\-'u'for the backward (lower) path. 
.LE Re-arranging the first equation yields .LB .EQ X sup + ~ =~~ Y sup + ~-~ k z sup -1 Y sup - , .EN .LE and so we can describe the function of the lattice by a single matrix equation: .LB .ne4 .EQ left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~ . .EN .LE It would be nice to be able to call this an input-output equation, but it is not; for the input signals to the lattice stage are $X sup +$ and $Y sup -$, and the outputs are $X sup -$ and $Y sup +$. We have written it in this form because it allows a multi-stage lattice to be described by cascading these matrix equations. .pp A single-stage lattice filter has $Y sup +$ and $Y sup -$ connected together, forming its output (call this $X sub output$), while the input is $X sup +$ ($X sub input$). Hence the input is related to the output by .LB .EQ left [ matrix {ccol {X sub input above \(sq }} right ] ~~ = ~~ left [ matrix {ccol {1 above -k} ccol {-k z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ , .EN .LE so .LB .EQ X sub input ~ = ~~ (1~-~ k z sup -1 )~X sub output , .EN .LE or .LB .EQ {X sub output} over {X sub input} ~~=~~ 1 over {1~-~ k sub 1 z sup -1} ~ . .EN .LE (The symbol \(sq is used here and elsewhere to indicate an unimportant element of a vector or matrix.) This certainly has the form of a linear predictive synthesis filter, which is .LB .EQ X(z) over E(z) ~~=~~ 1 over {1~-~~ sum from k=1 to p ~a sub k z sup -k}~~=~~ 1 over {1~-~a sub 1 z sup -1 } ~~~~~~ .EN when $p=1$. .LE .pp The behaviour of a second-order lattice filter, shown in Figure 6.10, can be described by .LB .ne4 .EQ left [ matrix {ccol {X sub 3 sup + above X sub 3 sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ] .EN .sp .ne4 .EQ left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub 1 sup + above X sub 1 sup -}} right ] .EN .LE with .LB .ne3 .EQ X sub 3 sup + ~=~X sub input .EN .br .EQ X sub 1 sup + ~=~ X sub 1 sup - ~=~ X sub output . .EN .LE .FC "Figure 6.10" $X sub 2 sup +$ and $X sub 2 sup -$ can be eliminated by substituting the second equation into the first, which yields .LB .EQ left [ matrix {ccol {X sub input above \(sq }} right ] ~~ mark = ~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] .EN .sp .sp .EQ lineup = ~~ left [ matrix {ccol {1+k sub 1 k sub 2 z sup -1 above \(sq } ccol { -k sub 1 z sup -1 -k sub 2 z sup -2 above \(sq }} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ . .EN .LE This leads to an input-output relationship .LB .EQ {X sub output} over {X sub input} ~~ = ~~ 1 over {1~+~k sub 1 (k sub 2 -1)z sup -1 ~-~k sub 2 z sup -2} ~ , .EN .LE which has the required form, namely .LB .EQ 1 over {1~-~~ sum from k=1 to p ~a sub k z sup -k } ~~~~~~ (p=2) .EN .LE when .LB .EQ a sub 1 ~=~-k sub 1 (k sub 2 -1) .EN .br .EQ a sub 2 ~=~k sub 2. 
.EN .LE .pp A third-order filter is described by .LB .EQ left [ matrix {ccol {X sub input above \(sq }} right ] ~~ = ~~ left [ matrix {ccol {1 above -k sub 3 } ccol {-k sub 3 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ , .EN .LE and brave souls can verify that this gives an input-output relationship .LB .EQ {X sub output} over {X sub input} ~~ = ~~ 1 over {1~+~[k sub 2 k sub 3 ~+~ k sub 1 (k sub 2 -1)] z sup -1 ~+~ [k sub 1 k sub 3 (1-k sub 2 ) -k sub 2 ] z sup -2 ~-~ k sub 3 z sup -3 } ~ . .EN .LE It is fairly obvious that a $p$'th order lattice filter will give the required all-pole $p$'th order synthesis form, .LB .EQ 1 over { 1~-~~ sum from k=1 to p ~a sub k z sup -k } ~ . .EN .LE .pp We have not shown that the algorithm given in Procedure 6.3 for producing reflection coefficients from filter coefficients gives those values for $k sub i$ which are necessary to make the lattice filter equivalent to the ordinary synthesis filter. However, this is the case, and it is easy to verify by hand for the first, second, and third-order cases. .rh "Different lattice configurations." The lattice filters of Figures 6.8, 6.9, and 6.10 have two multipliers per section. This is called a "two-multiplier" configuration. However, there are other configurations which achieve the same effect, but require different numbers of multiplications. Figure 6.11 shows one-multiplier and four-multiplier configurations, along with the familiar two-multiplier one. .FC "Figure 6.11" It is easy to verify that the three configurations can be modelled in matrix terms by .LB .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ two-multiplier configuration .sp .sp .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~ left [ {1-k} over {1+k} right ] sup 1/2 ~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ one-multiplier configuration .sp .sp .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~ 1 over {(1-k sup 2) sup 1/2} ~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ four-multiplier configuration. .LE Each of the three has the same frequency-domain response apart from a constant gain factor, which differs from one configuration to another. The effect of this can be annulled by performing a single multiply operation on the output of a complete lattice chain. The multiplier has the form .LB .EQ left [ {1 - k sub p} over {1 + k sub p} ~.~ {1 - k sub p-1} over {1 + k sub p-1} ~.~...~.~ {1 - k sub 1} over {1 + k sub 1} right ] sup 1/2 .EN .sp .LE for single-multiplier lattices, and .LB .EQ left [ 1 over {1 - k sub p sup 2} ~.~ 1 over {1 - k sub p-1 sup 2} ~.~...~.~ 1 over {1 - k sub 1 sup 2} right ] sup 1/2 .EN .LE for four-multiplier lattices, where the reflection coefficients in the lattice are $k sub p$, $k sub p-1$, ..., $k sub 1$. .pp There are important differences between these three configurations. If multiplication is time-consuming, the one-multiplier model has obvious computational advantages over the other two methods.
However, the four-multiplier structure behaves substantially better in finite word-length implementations. It is easy to show that, with this configuration, .LB .EQ (X sup - ) sup 2 ~+~ (Y sup + ) sup 2 ~~ = ~~ (X sup + ) sup 2 ~+~ (z sup -1 Y sup - ) sup 2 , .EN .LE \(em a relationship which suggests that the "energy" in the input signals, namely $X sup +$ and $Y sup -$, is preserved in the output signals, $X sup -$ and $Y sup +$. Notice that care must be taken with the $z$-transforms, since squaring is a non-linear operation. $(z sup -1 Y sup - ) sup 2$ means the square of the previous value of $Y sup -$, which is not the same as $z sup -2 (Y sup - ) sup 2$. .pp It has been shown (Gray and Markel, 1975) that the four-multiplier configuration has some stability properties which are not shared by other digital filter structures. .[ Gray Markel 1975 Normalized digital filter structure .] When a linear predictive filter is used for synthesis, the parameters of the filter \(em the $k$-parameters in the case of lattice filters, and the $a$-parameters in the case of direct ones \(em change with time. It is usually rather difficult to guarantee stability in the case of time-varying filter parameters, but some guarantees can be made for a chain of four-multiplier lattices. Furthermore, if the input is a discrete delta function, the cumulative energies at each stage of the lattice are the same, and so maximum dynamic range will be achieved for the whole filter if each section is implemented with the same word size. .rh "Lattice analysis." It is quite easy to construct a filter which is inverse to a single-stage lattice. The structure of Figure 6.12(a) does the job. (Ignore for a moment the dashed lines connecting Figure 6.12(a) and (b).) Its matrix transfer function is .FC "Figure 6.12" .LB .ne4 $ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~=~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sup + above X sup -}} right ] $ analysis lattice (Figure 6.12(a)). .LE Notice that this is exactly the same as the transfer function of the synthesis lattice of Figure 6.9, which is reproduced in Figure 6.12(b), except that the $X$'s and $Y$'s are reversed: .LB .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ synthesis lattice (Figure 6.12(b)), .LE or, in other words, .LB .ne4 $ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] sup -1 ~ left [ matrix {ccol {X sup + above X sup -}} right ] $ synthesis lattice (Figure 6.12(b)). .LE Hence if the filters of Figures 6.12(a) and (b) were connected together as shown by the dashed lines, they would cancel each other out, and the overall transfer would be unity: .LB .ne4 .EQ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] sup -1 ~~ = ~~ left [ matrix {ccol {1 above 0} ccol {0 above 1}} right ] ~ . .EN .LE Actually, such a connection is not possible in physical terms, for although the upper paths can be joined together the lower ones can not. The right-hand lower point of Figure 6.12(a) is an .ul output terminal, and so is the left-hand lower one of Figure 6.12(b)! However, there is no need to envisage a physical connection of the lower paths.
It is sufficient for cancellation just to assume that the signals at both of the points turn out to be the same. .pp And they do. The general case of a $p$-stage analysis lattice connected to a $p$-stage synthesis lattice is shown in Figure 6.13. .FC "Figure 6.13" Notice that the forward and backward paths are connected together at both of the extreme ends of the system. It is not difficult to show that under these conditions the signal at the lower right-hand terminal of the analysis chain will equal that at the lower left-hand terminal of the synthesis chain, even though they are not connected, provided the upper terminals are connected together as shown by the dashed line. Of course, the reflection coefficients $k sub 1$, $k sub 2$, ..., $k sub p$ in the analysis lattice must equal those in the synthesis lattice, and as Figure 6.13 shows the order is reversed in the synthesis lattice. Successive analysis and synthesis sections pair off, working from the middle outwards. At each stage the sections cancel each other out, giving a unit transfer function as demonstrated above. .rh "Estimating reflection coefficients." As stated earlier in this chapter, the key problem in linear prediction is to determine the values of the predictive coefficients \(em in this case, the reflection coefficients. If this is done correctly, we have shown using Procedure 6.3 that the synthesis part of Figure 6.13 performs the same calculation that a conventional direct-form linear predictive synthesizer would, and hence the signal that excites it \(em that is, the signal represented by the dashed line \(em must be the prediction residual, or error signal, discussed earlier. The system is effectively the same as the high-order adaptive differential pulse code modulation one of Figure 6.1. .pp One of the most interesting features of the lattice structure for analysis filters is that calculation of suitable values for the reflection coefficients can be done locally at each stage of the lattice. For example, consider the $i$'th section of the analysis lattice in Figure 6.13. It is possible to determine a suitable value of $k sub i$ simply by performing a calculation on the inputs to the $i$'th section (ie $X sup +$ and $X sup -$ in Figure 6.12). No longer need the complicated global optimization technique of matrix inversion be used, as in the autocorrelation and covariance methods discussed earlier. .pp A suitable value for $k$ in the single lattice section of Figure 6.12 is .LB .EQ k~ = ~~ {E[ x sup + (n) x sup - (n-1)]} over {( E[ x sup + (n) sup 2 ] E[ x sup - (n-1) sup 2 ] ) sup 1/2} ~~ ; .EN .LE that is, the statistical correlation between $x sup + (n)$ and $x sup - (n-1)$. Here, $x sup + (n)$ and $x sup - (n)$ represent the input signals to the upper and lower paths (recall that $X sup +$ and $X sup -$ are their $z$-transforms). $x sup - (n-1)$ is just $x sup - (n)$ delayed by one time unit, that is, the output of the $z sup -1$ box in the Figure. .pp The criterion of optimality for the autocorrelation and covariance methods was that the prediction error, that is, the signal which emerges from the right-hand end of the upper path of a lattice analysis filter, should be minimized in a mean-square sense. The reflection coefficients obtained from the above formula do not necessarily satisfy any such global minimization criterion. Nevertheless, they do keep the error signal small, and have been used with success in speech analysis systems.
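.pp To make this concrete, here is a minimal Pascal sketch of frame-based lattice analysis using the correlation estimate just given, with time averages over one analysis frame standing in for the expectations. The procedure name, the frame length, and the treatment of the first sample of the frame are assumptions made purely for illustration; they are not taken from any particular published system.
.LB
.nf
const p = 10;      {order of the linear prediction}
      N = 160;     {samples per frame, eg 20 msec at 8 kHz}
type  frame = array [1..N] of real;
      kvec  = array [1..p] of real;

procedure latticeanalysis(x: frame; var k: kvec);
{estimates reflection coefficients k[1..p] for one frame of speech by
 correlating the forward signal with the delayed backward signal at
 each stage of an analysis lattice, then updating both paths}
var f, b: frame;   {forward and backward signals at the current stage}
    num, ff, bb, fn: real;
    m, n: integer;
begin
  for n:=1 to N do begin f[n] := x[n]; b[n] := x[n] end;
  for m:=1 to p do begin
    num := 0; ff := 0; bb := 0;
    for n:=2 to N do begin
      num := num + f[n]*b[n-1];      {estimate of E[x+(n) x-(n-1)]}
      ff  := ff  + f[n]*f[n];        {estimate of E[x+(n) squared]}
      bb  := bb  + b[n-1]*b[n-1]     {estimate of E[x-(n-1) squared]}
    end;
    if ff*bb > 0 then k[m] := num / sqrt(ff*bb) else k[m] := 0;
    {y+(n) = x+(n) - k x-(n-1) and y-(n) = -k x+(n) + x-(n-1);
     work backwards through the frame so b[n-1] is still the old value}
    for n:=N downto 2 do begin
      fn   := f[n];
      f[n] := fn - k[m]*b[n-1];
      b[n] := b[n-1] - k[m]*fn
    end;
    b[1] := -k[m]*f[1]               {treat x-(0) at the frame edge as zero}
  end
end;
.fi
.LE
Because each estimate is a normalized correlation, every coefficient produced in this way has magnitude at most 1, so the corresponding synthesis lattice cannot be unstable.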
.pp It is easy to minimize the output from either the upper or the lower path of the lattice filter at each stage. For example, the $z$-transform of the upper output is given by .LB .EQ Y sup + ~~=~~ X sup + ~-~ k z sup -1 X sup - , .EN .LE or .LB .EQ y sup + (n) ~~=~~ x sup + (n) ~-~ k x sup - (n-1) . .EN .LE Hence .LB .EQ E[y sup + (n) sup 2 ] ~~ = ~~ E[x sup + (n) sup 2 ] ~-~ 2kE[x sup + (n) x sup - (n-1) ] ~+~ k sup 2 E [x sup - (n-1) sup 2 ] , .EN .LE where $E$ stands for expected value, and this reaches a minimum when the derivative with respect to $k$ becomes zero: .LB .EQ -2E[x sup + (n) x sup - (n-1) ] ~+~ 2kE[x sup - (n-1) sup 2 ] ~~=~0 , .EN .LE that is, when .LB .EQ k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup - (n-1) sup 2 ] } ~ . .EN .LE A similar calculation shows that the output of the lower path is minimized when .LB .EQ k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup + (n) sup 2 ] } ~ . .EN .LE Unfortunately, either of these expressions can exceed 1 in magnitude, leading to an unstable filter. The value of $k$ cited earlier is the geometric mean of these two expressions, and since it is a normalized correlation coefficient it cannot exceed 1 in magnitude. .pp Another possibility is to minimize the expected value of the sum of the squares of the upper and lower outputs: .LB .EQ y sup + (n) sup 2 ~+~ y sup - (n) sup 2 ~~ = ~~ (1+k sup 2 )x sup + (n) sup 2 ~-~ 4kx sup + (n) x sup - (n-1) ~+~ (1+k sup 2 )x sup - (n-1) sup 2 . .EN .LE Taking expected values and setting the derivative with respect to $k$ to zero leads to .LB .EQ k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over { half ~ E[x sup + (n) sup 2 ~+~ x sup - (n-1) sup 2 ]} ~. .EN .LE This too is guaranteed not to exceed 1 in magnitude, and has given good results in speech analysis systems. .pp Figure 6.14 shows the implementation of a single section of an analysis lattice. .FC "Figure 6.14" The signals $x sup + (n)$ and $x sup - (n-1)$ are fed to a correlator, which produces a suitable value for $k$. This value is used to calculate the output of the lattice section, and hence the input to the next lattice section. The reflection coefficient needs to be low-pass filtered, because it will only be transmitted to the synthesizer occasionally (say every 20\ msec) and so a short-term average is required. .pp One implementation of the correlator is shown in Figure 6.15 (Kang, 1974). .[ Kang 1974 .] .FC "Figure 6.15" This calculates the value of $k$ given by the last equation above, and does it by summing and differencing the two signals $x sup + (n)$ and $x sup - (n-1)$, squaring the results to give .LB .EQ x sup + (n) sup 2 + 2x sup + (n mark ) x sup - (n-1) +x sup - (n-1) sup 2 ~~~~~~~~ x sup + (n) sup 2 - 2x sup + (n) x sup - (n-1) +x sup - (n-1) sup 2 ~ , .EN .LE and summing and differencing these, to yield .LB .EQ lineup 2x sup + (n) sup 2 + 2x sup - (n-1) sup 2 ~~~~~~~~ 4x sup + (n) x sup - (n-1) ~ . .EN .LE .sp Before these are divided to give the final coefficient $k$, they are individually low-pass filtered. While some rather complex schemes have been proposed, based upon Kalman filter theory (eg Matsui .ul et al, 1972), .[ Matsui Nakajima Suzuki Omura 1972 .] a simple exponential weighted past average has been found to be satisfactory. This has $z$-transform .LB .EQ 1 over {64 - 63 z sup -1} ~ , .EN .LE that is, in the time domain, .LB .EQ y(n)~ = ~~ 63 over 64 ~ y(n-1) ~+~ 1 over 64 ~ x(n) ~ , .EN .LE where $x(n)$ here denotes the input to the smoothing filter. This filter exponentially averages past sample values with a time-constant of 64 sampling intervals \(em that is, 8\ msec at an 8\ kHz sampling rate.
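.pp As an illustration, the following Pascal fragment sketches the sample-by-sample operation of a correlator of this kind: it forms the sum and difference of $x sup + (n)$ and $x sup - (n-1)$, squares them, recovers the cross-product and energy terms, smooths each with the exponential averager just described, and divides. The function name and the guard against division by zero are assumptions made for the example; they are not details of Kang's implementation.
.LB
.nf
var num, den: real;   {smoothing filter states; initialize to zero}

function kestimate(f, bdel: real): real;
{f is x+(n), bdel is x-(n-1); returns the current smoothed
 estimate of the reflection coefficient k for this section}
var s, d: real;
begin
  s := f + bdel;  d := f - bdel;
  {s*s - d*d = 4 f bdel;   s*s + d*d = 2 f*f + 2 bdel*bdel}
  num := (63.0/64.0)*num + (1.0/64.0)*(s*s - d*d);
  den := (63.0/64.0)*den + (1.0/64.0)*(s*s + d*d);
  if den > 0 then kestimate := num/den else kestimate := 0
end;
.fi
.LE
The quotient is the smoothed value of $4x sup + (n) x sup - (n-1)$ divided by the smoothed value of $2x sup + (n) sup 2 + 2x sup - (n-1) sup 2$, which is exactly the last formula for $k$ given above.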
.sh "6.4 Pitch estimation" .pp It is sometimes useful to think of linear prediction as a kind of curve-fitting technique. Figure 6.16 illustrates how four samples of a speech signal can predict the next one. .FC "Figure 6.16" In essence, a curve is drawn through four points to predict the position of the fifth, and only the prediction error is actually transmitted. Now if the order of linear prediction is high enough (at least 10), and if the coefficients are chosen correctly, the prediction will closely model the resonances of the vocal tract. Thus the error will actually be zero, except at pitch pulses. .pp Figure 6.17 shows a segment of voiced speech together with the prediction error (often called the prediction residual). .FC "Figure 6.17" It is apparent that the error is indeed small, except at pitch pulses. This suggests that a good way to determine the pitch period is to examine the error signal, perhaps by looking at its autocorrelation function. As with all pitch detection methods, one must be careful: spurious peaks can occur, especially in nasal sounds when the all-pole model provided by linear prediction fails. Continuity constraints, which use previous values of pitch period when determining which peak to accept as a new pitch impulse, can eliminate many of these spurious peaks. Unvoiced speech should produce an error signal with no prominent peaks, and this needs to be detected. Voiced fricatives are a difficult case: peaks should be present but the general noise level of the error signal will be greater than it is in purely voiced speech. Such considerations have been taken into account in a practical pitch estimation system based upon this technique (Markel, 1972). .[ Markel 1972 SIFT .] .pp This method of pitch detection highlights another advantage of the lattice analysis technique. When using autocorrelation or covariance analysis to determine the filter (or reflection) coefficients, the error signal is not normally produced. It can, of course, be found by taking the speech samples which constitute the current frame and running them through an analysis filter whose parameters are those determined by the analysis, but this is a computationally demanding exercise, for the filter must run at the speech sampling rate (say 8\ kHz) instead of at the frame rate (say 50\ Hz). Usually, pitch is estimated by other methods, like those discussed in Chapter 4, when using autocorrelation or covariance linear prediction. However, we have seen above that with the lattice method, the error signal is produced as a byproduct: it appears at the right-hand end of the upper path of the lattice chain. Thus it is already available for use in determining pitch periods. .sh "6.5 Parameter coding for linear predictive storage or transmission" .pp In this section, the coding requirements of linear predictive parameters will be examined. The parameters that need to be stored or transmitted are: .LB .NP pitch .NP voiced-unvoiced flag .NP overall amplitude level .NP filter coefficients or reflection coefficients. .LE The first three are parameters of the excitation source. They can be derived directly from the error signal as indicated above, if it is generated (as it is in lattice implementations); or by other methods if no error signal is calculated. The filter or reflection coefficients are, of course, the main product of linear predictive analysis. .pp It is generally agreed that around 60 levels, logarithmically spaced, are needed to represent pitch for telephone quality speech. 
The voiced-unvoiced indication requires one bit, but since pitch is irrelevant in unvoiced speech it can be coded as one of the pitch levels. For example, with 6-bit coding of pitch, the value 0 can be reserved to indicate unvoiced speech, with values 1\-63 indicating the pitch of voiced speech. The overall gain has not been discussed above: it is simply the average amplitude of the error signal. Five bits on a logarithmic scale are sufficient to represent it. .pp Filter coefficients are not very amenable to quantization. At least 8\-10\ bits are required for each one. However, reflection coefficients are better behaved, and 5\-6\ bits each seems adequate. The number of coefficients that must be stored or transmitted is the same as the order of the linear prediction: 10 is commonly used for low-quality speech, with as many as 15 for higher qualities. .pp These figures give around 100\ bits/frame for a 10'th order system using filter coefficients, and around 65\ bits/frame for a 10'th order system using reflection coefficients. Frame lengths vary between 10\ msec and 25\ msec, depending on the quality desired. Thus for 20\ msec frames, the data rates work out at around 5000\ bit/s using filter coefficients, and 3250\ bit/s using reflection coefficients. .pp Substantially lower data rates can be achieved by more careful coding of parameters. In 1976, the US Government defined a standard coding scheme for 10-pole linear prediction with a data rate of 2400\ bit/s \(em conveniently chosen as one of the commonly-used rates for serial data transmission. This standard, called LPC-10, tackles the difficult problem of protection against transmission errors (Fussell .ul et al, 1978). .[ Fussell Boudra Abzug Cowing 1978 .] .pp Whenever data rates are reduced, redundancy inherent in the signal is necessarily lost and so the effect of transmission errors becomes greatly magnified. For example, a single corrupted sample in PCM transmission of speech will probably not be noticed, and even a short burst of errors will be perceived as a click which can readily be distinguished from the speech. However, any error in LPC transmission will last for one entire frame \(em say 20\ msec \(em and worse still, it will be integrated into the speech signal and not easily discriminated from it by the listener's brain. A single corruption may, for example, change a voiced frame into an unvoiced one, or vice versa. Even if it affects only a reflection coefficient it will change the resonance characteristics of that frame, and change them in a way that does not simply sound like superimposed noise. .pp Table 6.1 shows the LPC-10 coding scheme. .RF .in+0.1i .ta 2.0i +1.8i +0.6i .nr x1 (\w'voiced sounds'/2) .nr x2 (\w'unvoiced sounds'/2) .ul \h'-\n(x1u'voiced sounds \h'-\n(x2u'unvoiced sounds .sp pitch/voicing 7 7 60 pitch levels, Hamming \h'\w'00 'u'and Gray coded energy 5 5 logarithmically coded $k sub 1$ 5 5 coded by table lookup $k sub 2$ 5 5 coded by table lookup $k sub 3$ 5 5 $k sub 4$ 5 5 $k sub 5$ 4 \- $k sub 6$ 4 \- $k sub 7$ 4 \- $k sub 8$ 4 \- $k sub 9$ 3 \- $k sub 10$ 2 \- synchronization 1 1 alternating 1,0 pattern error detection/ \- \h'-\w'0'u'21 correction \h'-\w'__'u+\w'0'u'__ \h'-\w'__'u+\w'0'u'__ .sp \h'-\w'0'u'54 \h'-\w'0'u'54 .sp .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i frame rate: 44.4\ Hz (22.5\ msec frames) .in 0 .FG "Table 6.1 Bit requirements for each parameter in LPC-10 coding scheme" Different coding is used for voiced and unvoiced frames. 
Only four reflection coefficients are transmitted for unvoiced frames, because it has been determined that no perceptible increase in speech quality occurs when more are used. The bits saved are more fruitfully employed to provide error detection and correction for the other parameters. Seven bits are used for pitch and the voiced-unvoiced flag, and they are redundant in that only 60 possible pitch values are allowed. Most transmission errors in this field will be detected by the receiver, which can then use an estimate of pitch based on previous values and discard the erroneous one. Pitch values are also Gray coded so that even if errors are not detected, there is a good chance that an adjacent pitch value is read instead. Different numbers of bits are allocated to the various reflection coefficients: experience shows that the lower-numbered ones contribute most to intelligibility and so these are quantized most finely. In addition, a table lookup operation is performed on the code generated for the first two, providing a non-linear quantization which is chosen to minimize the error on a statistical basis. .pp With 54\ bits/frame and 22.5\ msec frames, LPC-10 requires a 2400\ bit/s data rate. Even lower rates have been used successfully for lower-quality speech. The Speak 'n Spell toy, described in Chapter 11, has an average data rate of 1200\ bit/s. Rates as low as 600\ bit/s have been achieved (Kang and Coulter, 1976) by pattern recognition techniques operating on the reflection coefficients; however, the speech quality is not good. .[ Kang Coulter 1976 .] .sh "6.6 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "6.7 Further reading" .pp Most recent books on digital signal processing contain some information on linear prediction (see Oppenheim and Schafer, 1975; Rabiner and Gold, 1975; and Rabiner and Schafer, 1978; all referenced at the end of Chapter 4). .LB "nn" .\"Atal-1971-1 .]- .ds [A Atal, B.S. .as [A " and Hanauer, S.L. .ds [D 1971 .ds [T Speech analysis and synthesis by linear prediction of the acoustic wave .ds [J JASA .ds [V 50 .ds [P 637-655 .nr [P 1 .ds [O August .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n This paper is of historical importance because it introduced the idea of linear prediction to the speech processing community. .in-2n .\"Makhoul-1975-2 .]- .ds [A Makhoul, J.I. .ds [D 1975 .ds [K * .ds [T Linear prediction: a tutorial review .ds [J Proc IEEE .ds [V 63 .ds [N 4 .ds [P 561-580 .nr [P 1 .ds [O April .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n An interesting, informative, and readable survey of linear prediction. .in-2n .\"Markel-1976-3 .]- .ds [A Markel, J.D. .as [A " and Gray, A.H. .ds [D 1976 .ds [T Linear prediction of speech .ds [I Springer Verlag .ds [C Berlin .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is the only book which is entirely devoted to linear prediction of speech. It is an essential reference work for those interested in the subject. .in-2n .\"Wiener-1947-4 .]- .ds [A Wiener, N. .ds [D 1947 .ds [T Extrapolation, interpolation and smoothing of stationary time series .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Linear prediction is often thought of as a relatively new technique, but it is only its application to speech processing that is novel. Wiener develops all of the basic mathematics used in linear prediction of speech, except the lattice filter structure.
.in-2n .LE "nn" .EQ delim $$ .EN .CH "7 JOINING SEGMENTS OF SPEECH" .ds RT "Joining segments of speech .ds CX "Principles of computer speech .pp The obvious way to provide speech output from computers is to select the basic acoustic units to be used; record them; and generate utterances by concatenating together appropriate segments from this pre-stored inventory. The crucial question then becomes, what are the basic units? Should they be whole sentences, words, syllables, or phonemes? .pp There are several trade-offs to be considered here. The larger the units, the more utterances have to be stored. It is not so much the length of individual utterances that is of concern, but rather their variety, which tends to increase exponentially instead of linearly with the size of the basic unit. Numbers provide an easy example: there are $10 sup 7$ 7-digit telephone numbers, and it is certainly infeasible to record each one individually. Note that as storage technology improves the limitation is becoming more and more one of recording the utterances in the first place rather than finding somewhere to store them. At a PCM data rate of 50\ Kbit/s, a 100\ Mbyte disk can hold over 4\ hours of continuous speech. With linear predictive coding at 1\ Kbit/s it holds 0.8 of a megasecond \(em well over a week. And this is a 24-hour 7-day week, which corresponds to a working month; and continuous speech \(em without pauses \(em which probably requires another factor of five for production by a person. Setting up a recording session to fill the disk would be a formidable task indeed! Furthermore, the use of videodisks \(em which will be common domestic items by the end of the decade \(em could increase these figures by a factor of 50. .pp The word seems to be a sensibly-sized basic unit. Many applications use a rather limited vocabulary \(em 190 words for the airline reservation system described in Chapter 1. Even at PCM data rates, this will consume less than 0.5\ Mbyte of storage. Unfortunately, coarticulation and prosodic factors now come into play. .pp Real speech is connected \(em there are few gaps between words. Coarticulation, where sounds are affected by those on either side, naturally operates across word boundaries. And the time constants of coarticulation are associated with the mechanics of the vocal tract and hence measure tens or hundreds of msec. Thus the effects straddle several pitch periods (100\ Hz pitch has 10\ msec period) and cannot be simulated by simple interpolation of the speech waveform. .pp Prosodic features \(em notably pitch and rhythm \(em span much longer stretches of speech than single words. As far as most speech output applications are concerned, they operate at the utterance level of a single, sentence-sized, information unit. They cannot be accomodated if speech waveforms of individual words of the utterance are stored, for it is rarely feasible to alter the fundamental frequency or duration of a time waveform without changing all the formant resonances as well. However, both word-to-word coarticulation and the essential features of rhythm and intonation can be incorporated if the stored words are coded in source-filter form. .pp For more general applications of speech output, the limitations of word storage soon become apparent. Although people's daily vocabularies are not large, most words have a variety of inflected forms which need to be treated separately if a strict policy is adopted of word storage. 
For instance, in this book there are 84,000 words, and 6,500 (8%) different ones (counting inflected forms). In Chapter 1 alone, there are 6,800 words and 1,700 (25%) different ones. .pp It seems crazy to treat a simple inflection like "$-s$" or its voiced counterpart, "$-z$" (as in "inflection\c .ul s\c "), as a totally different word from the base form. But once you consider storing roots and endings separately, it becomes apparent that there is a vast number of different endings, and it is difficult to know where to draw the line. It is natural to think instead of simply using the syllable as the basic unit. .pp A generous estimate of the number of different syllables in English is 10,000. At three a second, only about an hour's storage is required for them all. But waveform storage will certainly not do. Although coarticulation effects between words are needed to make speech sound fluent, coarticulation between syllables is necessary for it even to be .ul comprehensible. Adopting a source-filter form of representation is essential, as is some scheme of interpolation between syllables which simulates coarticulation. Unfortunately, a great deal of acoustic action occurs at syllable boundaries \(em stops are exploded, the sound source changes between voicing and frication, and so on. It may be more appropriate to consider inverse syllables, comprising a vowel-consonant-vowel sequence instead of consonant-vowel-consonant. (These have jokingly been dubbed "lisibles"!) .pp There is again some considerable practical difficulty in creating an inventory of syllables, or lisibles. Now it is not so much the recording that is impractical, but the editing needed to ensure that the cuts between syllables are made at exactly the right point. As units get smaller, the exact placement of the boundaries becomes ever more critical; and several thousand sensitive editing jobs is no easy task. .pp Since quite general effects of coarticulation must be accommodated with syllable synthesis, there will not necessarily be significant deterioration if smaller, demisyllable, units are employed. This reduces the segment inventory to an estimated 1000\-2000 entries, and the tedious job of editing each one individually becomes at least feasible, if not enviable. Alternatively, the segment inventory could be created by artificial means involving cut-and-try experiments with resonance parameters. .pp The ultimate in economy of inventory size, of course, is to use phonemes as the basic unit. This makes the most critical part of the task interpolation between units, rather than their construction or recording. With only about 40 phonemes in English, each one can be examined in many different contexts to ascertain the best data to store. There is no need to record them directly from a human voice \(em it would be difficult anyway, for most cannot be produced in isolation. In fact, a phoneme is an abstract unit, not a particular sound (recall the discussion of phonology in Chapter 2), and so it is most appropriate that data be abstracted from several different realizations rather than an exact record made of any one. .pp If information is stored about phonological units of speech \(em phonemes \(em the difficult task of phonological-to-phonetic conversion must necessarily be performed automatically. Allophones are created by altering the transitions between units, and to a lesser extent by modifying the central parts of the units themselves.
The rules for making transitions will have a big effect on the quality of the resulting speech. Instead of trying to perform this task automatically by a computer program, the allophones themselves could be stored. This will ease the job of generating transitions between segments, but will certainly not eliminate it. The total number of allophones will depend on the narrowness of the transcription system: 60\-80 is typical, and it is unlikely to exceed one or two hundred. In any case there will not be a storage problem. However, now the burden of producing an allophonic transcription has been transferred to the person who codes the utterance prior to synthesizing it. If he is skilful and patient, he should be able to coax the system into producing fairly understandable speech, but the effort required for this on a per-utterance basis should not be underestimated. .RF .nr x0 \w'sentences ' .nr x1 \w' ' .nr x2 \w'depends on ' .nr x3 \w'generalized or ' .nr x4 \w'natural speech ' .nr x5 \w'author of segment' .nr x6 \n(x0u+\n(x1u+\n(x2u+\n(x3u+\n(x4u+\n(x5u .nr x7 (\n(.l-\n(x6)/2 .in \n(x7u .ta \n(x0u +\n(x1u +\n(x2u +\n(x3u +\n(x4u | size of storage source of principal | utterance method utterance burden is | inventory inventory placed on |\h'-1.0i'\l'\n(x6u\(ul' | sentences | depends on waveform or natural speech recording artist, | application source-filter storage medium | parameters | words | depends on source-filter natural speech recording artist | application parameters and editor, | storage medium | syllables/ | \0\0\010000 source-filter natural speech recording editor lisibles | parameters | demi- | \0\0\0\01000 source-filter natural speech recording editor syllables | parameters or artificially or inventory | generated compiler | phonemes | \0\0\0\0\0\040 generalized artificially author of segment | parameters generated concatenation | program | allophones | \0\050\-100 generalized or artificially coder of | source-filter generated or synthesized | parameters natural speech utterances |\h'-1.0i'\l'\n(x6u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 7.1 Some issues relevant to choice of basic unit" .pp Table 7.1 summarizes in broad brush-strokes the issues which relate to the choice of basic unit for concatenation. The sections which follow provide more detail about the different methods of joining segments of speech together. Only segmental aspects are considered, for the important problems of prosody will be treated in the next chapter. All of the methods rely to some extent on the acoustic properties of speech, and as smaller basic units are considered the role of speech acoustics becomes more important. It is impossible in a book like this to give a detailed account of acoustic phonetics, for it would take several volumes! What I aim to do in the following pages is to highlight some salient features which are relevant to segment concatenation, without attempting to be complete. .sh "7.1 Word concatenation" .pp For general speech output, word concatenation is an inherently limited technique because of the large number of phonetically different words. Despite this fact, it is at present the most widely-used synthesis method, and is likely to remain so for several years. We have seen that the primary problems are word-to-word coarticulation and prosody; and both can be overcome, at least to a useful approximation, by coding the words in source-filter form. .rh "Time-domain techniques." 
Nevertheless, a surprising number of applications simply store the time waveform, coded, usually, by one of the techniques described in Chapter 3. From an implementation point of view there are many advantages to this. Speech quality can easily be controlled by selecting a suitable sampling rate and coding scheme. A natural-sounding voice is guaranteed; male or female as desired. The equipment required is minimal \(em a digital-to-analogue converter and post-sampling filter will do for synthesis if PCM coding is used, and DPCM, ADPCM, and delta modulation decoders are not much more complicated. .pp From a speech point of view, the resulting utterances can never be made convincingly fluent. We discussed the early experiments of Stowe and Hampton (1961) at the beginning of Chapter 3. .[ Stowe Hampton 1961 .] A major drawback to word concatenation in the analogue domain is the introduction of clicks and other interference between words: it is difficult to prevent the time waveform transitions from adding extraneous sounds. This poses no problem with digital storage, however, for the waveforms can be edited accurately prior to storage so that they start and finish at an exactly zero level. Rather, the lack of fluency stems from the absence of proper control of coarticulation and prosody. .pp But this is not necessarily a serious drawback if the application is a sufficiently limited one. Complete, invariant utterances can be stored as one unit. Often they must contain data-dependent slot-fillers, as in .LB This flight makes \(em stops .LE and .LB Flight number \(em leaves \(em at \(em , arrives in \(em at \(em .LE (taken from the airline reservation system of Chapter 1 (Levinson and Shipley, 1980)). .[ Levinson Shipley 1980 .] Then, each slot-filling word is recorded in an intonation consistent both with its position in the template utterance and with the intonation of that utterance. This could be done by embedding the word in the utterance for recording, and excising it by digital editing before storage. It would be dangerous to try to take into account coarticulation effects, for the coarticulation could not be made consistent with both the several slot-fillers and the single template. This could be overcome if several versions of the template were stored, but then the scheme becomes subject to combinatorial explosion if there is more than one slot in a single utterance. But it is not really necessary, for the lack of fluency will probably be interpreted by a benevolent listener as an attempt to convey the information as clearly as possible. .pp Difficulties will occur if the same slot-filler is used in different contexts. For instance, the first gap in each of the sentences above contains a number; yet the intonation of that number is different. Many systems simply ignore this problem. Then one does notice anomalies, if one is attentive: the words come, as it were, from different mouths, without fluency. However, the problem is not necessarily acute. If it is, two or more versions of each slot-filler can be recorded, one for each context. .pp As an example, consider the synthesis of 7-digit telephone numbers, like 289\-5371. If one version only of each digit is stored, it should be recorded in a level tone of voice. A pause should be inserted after the third digit of the synthetic number, to accord with common elocution. The result will certainly be unnatural, although it should be clear and intelligible. Any pitch errors in the recordings will make certain numbers audibly anomalous. 
At the other extreme, 70 single digits could be stored, one version of each digit for each position in the number. The recording will be tedious and error-prone, and the synthetic utterances will still not be fluent \(em for coarticulation is ignored \(em but instead unnaturally clearly enunciated. A compromise is to record only three versions of each digit, one for any of the five positions .nr x1 \w'\(ul' .nr x2 (8*\n(x1) .nr x3 0.2m \zx\h'\n(x1u'\zx\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\zx\h'\n(x1u'\zx\h'\n(x1u'\c \zx\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' , another one for the third position \h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c \h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' , and the last for the final position \h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c \h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' . The first version will be in a level voice, the second an incomplete, rising tone; and the third a final, dropping pitch. .rh "Joining formant-coded words." The limitations of the time-domain method are lack of fluency caused by unnatural transitions between words, and the combinatorial explosion created by recording slot-fillers several times in different contexts. Both of these problems can be alleviated by storing formant tracks, concatenating them with suitable interpolation, and applying a complete pitch contour suitable for the whole utterance. But one can still not generate conversational speech, for natural speech rhythms cause non-linear warpings of the time axis which cannot reasonably be imitated by this method. .pp Solving problems often creates others. As we saw in Chapter 4, it is not easy to obtain reliable formant tracks automatically. Yet hand-editing of formant parameters adds a whole new dimension to the problem of vocabulary construction, for it is an exceedingly tiresome and time-consuming task. Even after such tweaking, resynthesized utterances will be degraded considerably from the original, for the source-filter model is by no means a perfect one. A hardware or real-time software formant synthesizer must be added to the system, presenting design problems and creating extra cost. Should a serial or parallel synthesizer be used? \(em the latter offers potentially better speech (especially in nasal sounds), but requires additional parameters, namely formant amplitudes, to be estimated. Finally, as we will see in the next chapter, it is not an easy matter to generate a suitable pitch contour and apply it to the utterance. .pp Strangely enough, the interpolation itself does not present any great difficulty, for there is not enough information in the formant-coded words to make possible sophisticated coarticulation. The need for interpolation is most pressing when one word ends with a voiced sound and the next begins with one. If either the end of the first or the beginning of the second word (or both) is unvoiced, unnatural formant transitions do not matter for they will not be heard. Actually, this is only strictly true for fricative transitions: if the juncture is aspirated then formants will be perceived in the aspiration. However, .ul h is the only fully aspirated sound in English, and it is relatively uncommon. It is not absolutely necessary to interpolate the fricative filter resonance, because smooth transitions from one fricative sound to another are rare in natural speech. 
.pp Hence unless both sides of the junction are voiced, no interpolation is needed: simple abuttal of the stored parameter tracks will do. Note that this is .ul not the same as joining time waveforms, for the synthesizer will automatically ensure a relatively smooth transition from one segment to another because of energy storage in the filters. A new set of resonance parameters for the formant-coded words will be stored every 10 or 20 msec (see Chapter 5), and so the transition will automatically be smoothed over this time period. .pp For voiced-to-voiced transitions, some interpolation is needed. An overlap period of duration, say, 50\ msec, is established, and the resonance parameters in the final 50\ msec of the first word are averaged with those in the first 50\ msec of the second. The average is weighted, with the first word's formants dominating at the beginning and their effect progressively dying out in favour of the second word. .pp More sophisticated than a simple average is to weight the components according to how rapidly they are changing. If the spectral change in one word is much greater than that in the other, we might expect that this will dominate the transition. A simple measure of spectral derivative at any given time can be found by adding the magnitude of the discrepancies in each formant frequency between one sample and the next. The spectral change in the transition region can be obtained by summing the spectral derivatives at each sample in the region. Such a measure can perhaps be made more accurate by taking into account the relative importance of the formants, but will probably never be more than a rough and ready yardstick. At any rate, it can be used to load the average in favour of the dominant side of the junction. .pp Much more important for naturalness of the speech are the effects of rhythm and intonation, discussed in the next chapter. .pp Such a scheme has been implemented and tested on \(em guess what! \(em 7-digit telephone numbers (Rabiner .ul et al, 1971). .[ Rabiner Schafer Flanagan 1971 .] Significant improvement (at the 5% level of statistical significance) in people's ability to recall numbers was found for this method over direct abuttal of either natural or synthetic versions of the digits. Although the method seemed, on balance, to produce utterances that were recalled less accurately than completely natural spoken telephone numbers, the difference was not significant (at the 5% level). The system was also used to generate wiring instructions by computer directly from the connection list, as described in Chapter 1. As noted there, synthetic speech was actually preferred to natural speech in the noisy environment of the production line. .rh "Joining linear predictive coded words." Because obtaining accurate formant tracks for natural utterances by Fourier transform methods is difficult, it is worth considering the use of linear prediction as the source-filter model. Actually, formant resonances can be extracted from linear predictive coefficients quite easily, but there is no need to do this because the reflection coefficients themselves are quite suitable for interpolation. .pp A slightly different interpolation scheme from that described in the previous section has been reported (Olive, 1975). .[ Olive 1975 .] The reflection coefficients were spliced during an overlap region of only 20\ msec. 
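In outline, such a splice is just a frame-by-frame weighted average of the two stored coefficient tracks across the overlap. The following Pascal sketch shows the idea, with the weight moving linearly from the first word to the second; the procedure and variable names, the 10'th order filter, and the frame layout are assumptions for illustration, not Olive's actual implementation.
.LB
.nf
const p = 10; maxframes = 500;
type  kvec   = array [1..p] of real;
      ktrack = array [1..maxframes] of kvec;

procedure splice(a, b: ktrack; enda, overlap: integer; var out: ktrack);
{copies track a up to frame enda, cross-fading its last overlap
 frames with the first overlap frames of track b, coefficient by
 coefficient; the remaining frames of b then follow unchanged}
var i, j: integer; w: real;
begin
  for i:=1 to enda-overlap do out[i] := a[i];
  for i:=1 to overlap do begin
    w := i/(overlap+1);         {near 0 at the word-a end, near 1 at word b}
    for j:=1 to p do
      out[enda-overlap+i][j] :=
        (1-w)*a[enda-overlap+i][j] + w*b[i][j]
  end
end;
.fi
.LE
Because both sets of reflection coefficients lie between \-1 and +1, every interpolated frame does too, so the spliced filter remains stable, as noted earlier in Chapter 6.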
More interestingly, attempts were made to suppress the plosive bursts of stop sounds in cases where they were followed by another stop at the beginning of the next word. This is a common coarticulation, occurring, for instance, in the phrase "stop burst". In running speech, the plosion on the .ul p of "stop" is normally suppressed because it is followed by another stop. This is a particularly striking case because the place of articulation of the two stops .ul p and .ul b is the same: complete suppression is not as likely to happen in "stop gap", for example (although it may occur). Here is an instance of how extra information could improve the quality of the synthetic transitions considerably. However, automatically identifying the place of articulation of stops is a difficult job, of a complexity far above what is appropriate for simply joining words stored in source-filter form. .pp Another innovation was introduced into the transition between two vowel sounds, when the second word began with an accented syllable. A glottal stop was placed at the juncture. Although the glottal stop was not described in Chapter 2, it is a sound used in many dialects of English. It frequently occurs in the utterance "uh-uh", meaning "no". Here it .ul is used to separate two vowel sounds, but in fact this is not particularly common in most dialects. One could say "the apple", "the orange", "the onion" with a neutral vowel in "the" (to rhyme with "\c .ul a\c bove") and a glottal stop as separator, but it is much more usual to rhyme "the" with "he" and introduce a .ul y between the words. Similarly, even speakers who do not normally pronounce an .ul r at the end of words will introduce one in "bigger apple", rather than using a glottal stop. Note that it would be wrong to put an .ul r in "the apple", even for speakers who usually terminate "the" and "bigger" with the same sound. Such effects occur at a high level of processing, and are practically impossible to simulate with word-interpolation rules. Hence the expedient of introducing a glottal stop is a good one, although it is certainly unnatural. .sh "7.2 Concatenating whole or partial syllables" .pp The use of segments larger than a single phoneme or allophone but smaller than a word as the basic unit for speech synthesis has an interesting history. It has long been realized that transitions between phonemes are extremely sensitive and critical components of speech, and thus are essential for successful synthesis. Consider the unvoiced stop sounds .ul p, t, and .ul k. Their central portion is actually silence! (Try saying a word like "butter" with a very long .ul t.\c ) Hence in this case it is .ul only the transitional information which can distinguish these sounds from each other. .pp Sound segments which comprise the transition from the centre of one phoneme to the centre of the next are called .ul dyads or .ul diphones. The possibility of using them as the basic units for concatenation was first mooted in the mid 1950's. The idea is attractive because there is relatively little spectral movement in the central, so-called "steady-state", portion of many phonemes \(em in the extreme case of unvoiced stops there is not only no spectral movement, but no spectrum at all in the steady state! At that time the resonance synthesizer was in its infancy, and so recorded segments of live speech were used. 
The early experiments met with little success because of the technical difficulties of joining analogue waveforms and inevitable discrepancies between the steady-state parts of a phoneme recorded in different contexts \(em not to mention the problems of coarticulation and prosody which effectively preclude the use of waveform concatenation at such a low level. .pp In the mid 1960's, with the growing use of resonance synthesizers, it became possible to generate diphones by copying resonance parameters manually from a spectrogram, and improving the result by trial and error. It was not feasible to extract formant frequencies automatically from real speech, though, because the fast Fourier transform was not yet widely known and the computational burden of slow Fourier transformation was prohibitive. For example, a project at IBM stored manually-derived parameter tracks for diphones, identified by pairs of phoneme names (Dixon and Maxey, 1968). .[ Dixon Maxey 1968 .] To generate a synthetic utterance it was coded in phonetic form and used to access the diphone table to give a set of parameter tracks for the complete utterance. Note that this is the first system we have encountered whose input is a phonetic transcription which relates to an inventory of truly synthetic character: all previous schemes used recordings of live speech, albeit processed in some form. Since the inventory was synthetic, there was no difficulty in ensuring that discontinuities did not arise between segments beginning and ending with the same phoneme. Thus interpolation was irrelevant, and the synthesis procedure concentrated on prosodic questions. The resulting speech was reported to be quite impressive. .pp Strictly speaking, diphones are not demisyllables but phoneme pairs. In the simplest case they happen to be similar, for two primary diphones characterize a consonant-vowel-consonant syllable. There is an advantage to using demisyllables rather than diphones as the basic unit, for many syllables begin or end with complicated consonant clusters which are not easy to produce convincingly by diphone concatenation. But they are not easy to produce by hand-editing resonance parameters either! Now that speech analysis methods have been developed and refined, resonance parameters or linear predictive coefficients can be extracted automatically from natural utterances, and there has been a resurgence of interest in syllabic and demisyllabic synthesis methods. The wheel has turned full circle, from segments of natural speech to hand-tailored parameters and back again! .pp The advantage of storing demisyllables over syllables (or lisibles) from the point of view of storage capacity has already been pointed out (perhaps 1,000\-2,000 demisyllables as opposed to 4,000\-10,000 syllables). But it is probably not too significant with the continuing decline of storage costs. The requirements are of the order of 25\ Kbyte versus 0.5\ Mbyte for 1200\ bit/s linear predictive coding, and the latter could almost be accommodated today \(em 1981 \(em on a state-of-the-art read-only memory chip. A bigger advantage comes from rhythmic considerations. As we will see in the next chapter, the rhythms of fluent speech cause dramatic variations in syllable duration, but these seem to affect the vowel and closing consonant cluster much more than the initial consonant cluster.
Thus if a demisyllable is deemed to begin shortly (say 60\ msec) after onset of the vowel, when the formant structure has settled down, the bulk of the vowel and the closing consonant cluster will form a single demisyllable. The opening cluster of the next syllable will lie in the next demisyllable. Then differential lengthening can be applied to that part of the syllable which tends to be stretched in live speech. .pp One system for demisyllable concatenation has produced excellent results for monosyllabic English words (Lovins and Fujimura, 1976). .[ Lovins Fujimura 1976 .] Complex word-final consonant clusters are excluded from the inventory by using syllable affixes .ul s, z, t, and .ul d; these are attached to the syllabic core as a separate exercise (Macchi and Nigro, 1977). .[ Macchi Nigro 1977 .] Prosodic rather than segmental considerations are likely to prove the major limiting factor when this scheme is extended to running speech. .pp Monosyllabic words spoken in isolation are coded as linear predictive reflection coefficients, and segmented by digital editing into the initial consonant cluster and the vocalic nucleus plus final cluster. The cut is made 60\ msec into the vowel, as suggested above. This minimizes the difficulty of interpolation when concatenating segments, for there is ample voicing on either side of the juncture. The reflection coefficients should not differ radically because the vowel is the same in each demisyllable. A 40\ msec overlap is used, with the usual linear interpolation. An alternative smoothing rule applies when the second segment has a nasal or glide after the vowel. In this case anticipatory coarticulation occurs, affecting even the early part of the vowel. For example, a vowel is frequently nasalized when followed by a nasal sound \(em even in English where nasalization is not a distinctive feature in vowels (see Chapter 2). Under these circumstances the overlap area is moved forward in time so that the colouration applies throughout almost the whole vowel. .sh "7.3 Phoneme synthesis" .pp Acoustic phonetics is the study of how the acoustic signal relates to the phonetic sequence which was spoken or heard. People \(em especially engineers \(em often ask, how could phonetics not be acoustic? In fact it can be articulatory, auditory, or linguistic (phonological), for example, and we have touched on the first and last in Chapter 2. The invention of the sound spectrograph in the late 1940's was an event of colossal significance for acoustic phonetics, for it somehow seemed to make the intricacies of speech visible. (This was thought to be a greater advance than it actually turned out to be: historically-minded readers should refer to Potter .ul et al, 1947, for an enthusiastic contemporary appraisal of the invention.) A .[ Potter Kopp Green 1947 .] result of several years of research at Haskins Laboratories in New York during the 1950's was a set of "minimal rules for synthesizing speech", which showed how stylized formant patterns could generate cues for identifying vowels and, particularly, consonants (Liberman, 1957; Liberman .ul et al, 1959). .[ Liberman 1957 Some results of research on speech perception .] .[ Liberman Ingemann Lisker Delattre Cooper 1959 .] .pp These were to form the basis of many speech synthesis-by-rule computer programs in the ensuing decades. Such programs take as input a phonetic transcription of the utterance and generate a spoken version of it. The transcription may be broad or narrow, depending on the system.
Experience has shown that the Haskins rules really are minimal, and the success of a synthesis-by-rule program depends on a vast collection of minutiae, each seemingly insignificant in isolation but whose effects combine to influence the speech quality dramatically. The best current systems produce clearly understandable speech which is nevertheless something of a strain to listen to for long periods. However, many are not good; and some are execrable. In recent times commercial influences have unfortunately restricted the free exchange of results and programs between academic researchers, thus slowing down progress. Research attention has turned to prosodic factors, which are certainly less well understood than segmental ones, and to synthesis from plain English text rather than from phonetic transcriptions. .pp The remainder of this chapter describes the techniques of segmental synthesis. First it is necessary to introduce some elements of acoustic phonetics. It may be worth re-reading Chapter 2 at this point, to refresh your memory about the classification of speech sounds. .sh "7.4 Acoustic characterization of phonemes" .pp Shortly after the invention of the sound spectrograph an inverse instrument was developed, called the "pattern playback" synthesizer. This took as input a spectrogram, either in its original form or painted by hand. An optical arrangement was used to modulate the amplitude of some fifty harmonically-related oscillators by the lightness or darkness of each point on the frequency axis of the spectrogram. As it was drawn past the playing head, sound was produced which had approximately the frequency components shown on the spectrogram, although the fundamental frequency was constant. .pp This device allowed the complicated acoustic effects seen on a spectrogram (see for example Figures 2.3 and 2.4) to be replayed in either original or simplified form. Hence the features which are important for perception of the different sounds could be isolated. The procedure was to copy from an actual spectrogram the features which were most prominent visually, and then to make further changes by trial and error until the result was judged to have reasonable intelligibility when replayed. .pp For the purpose of acoustic characterization of particular phonemes, it is useful to consider the central, steady-state part separately from transitions into and out of the segment. The steady-state part is that sound which is heard when the phoneme is prolonged. The term "phoneme" is being used in a rather loose sense here: it is more appropriate to think of a "sound segment" rather than the abstract unit which forms the basis of phonological classification, and this is the terminology I will adopt. .pp The essential auditory characteristics of some sound segments are inherent in their steady states. If a vowel, for example, is spoken and prolonged, it can readily be identified by listening to any part of the utterance. This is not true for diphthongs: if you say "I" very slowly and freeze your vocal tract posture at any time, the resulting steady-state sound will not be sufficient to identify the diphthong. Rather, it will be a vowel somewhere between .ul aa (in "had") or .ul ar (in "hard") and .ul ee (in "heed"). Neither is it true for glides, for prolonging .ul w (in "want") or .ul y (in "you") results in vowels resembling respectively .ul u ("hood") or .ul ee ("heed").
Fricatives, voiced or unvoiced, can be identified from the steady state; but stops cannot, for theirs is silent (or \(em in the case of voiced stops \(em something close to it). .pp Segments which are identifiable from their steady state are easy to synthesize. The difficulty lies with the others, for it must be the transitions which carry the information. Thus "transitions" are an essential part of speech, and perhaps the term is unfortunate for it calls to mind an unimportant bridge between one segment and the next. It is tempting to use the words "continuant" and "non-continuant" to distinguish the two categories; unfortunately they are used by phoneticians in a different sense. We will call them "steady-state" and "transient" segments. The latter term is not particularly appropriate, for even sounds in this class .ul can be prolonged: the point is that the identifying information is in the transitions rather than the steady state. .RF .nr x1 (\w'excitation'/2) .nr x2 (\w'formant resonance'/2) .nr x3 (\w'fricative'/2) .nr x4 (\w'frequencies (Hz)'/2) .nr x5 (\w'resonance (Hz)'/2) .nr x0 4n+1.7i+0.8i+0.6i+0.6i+1.0i+\w'00'+\n(x5 .nr x6 (\n(.l-\n(x0)/2 .in \n(x6u .ta 4n +1.7i +0.8i +0.6i +0.6i +1.0i \h'-\n(x1u'excitation \0\0\h'-\n(x2u'formant resonance \0\0\h'-\n(x3u'fricative \0\0\h'-\n(x4u'frequencies (Hz) \0\0\c \h'-\n(x5u'resonance (Hz) \l'\n(x0u\(ul' .sp .nr x1 (\w'voicing'/2) \fIuh\fR (the) \h'-\n(x1u'voicing \0500 1500 2500 \fIa\fR (bud) \h'-\n(x1u'voicing \0700 1250 2550 \fIe\fR (head) \h'-\n(x1u'voicing \0550 1950 2650 \fIi\fR (hid) \h'-\n(x1u'voicing \0350 2100 2700 \fIo\fR (hod) \h'-\n(x1u'voicing \0600 \0900 2600 \fIu\fR (hood) \h'-\n(x1u'voicing \0400 \0950 2450 \fIaa\fR (had) \h'-\n(x1u'voicing \0750 1750 2600 \fIee\fR (heed) \h'-\n(x1u'voicing \0300 2250 3100 \fIer\fR (heard) \h'-\n(x1u'voicing \0600 1400 2450 \fIar\fR (hard) \h'-\n(x1u'voicing \0700 1100 2550 \fIaw\fR (hoard) \h'-\n(x1u'voicing \0450 \0750 2650 \fIuu\fR (food) \h'-\n(x1u'voicing \0300 \0950 2300 .nr x1 (\w'aspiration'/2) \fIh\fR (he) \h'-\n(x1u'aspiration .nr x1 (\w'frication'/2) .nr x2 (\w'frication and voicing'/2) \fIs\fR (sin) \h'-\n(x1u'frication 6000 \fIz\fR (zed) \h'-\n(x2u'frication and voicing 6000 \fIsh\fR (shin) \h'-\n(x1u'frication 2300 \fIzh\fR (vision) \h'-\n(x2u'frication and voicing 2300 \fIf\fR (fin) \h'-\n(x1u'frication 4000 \fIv\fR (vat) \h'-\n(x2u'frication and voicing 4000 \fIth\fR (thin) \h'-\n(x1u'frication 5000 \fIdh\fR (that) \h'-\n(x2u'frication and voicing 5000 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.2 Resonance synthesizer parameters for steady-state sounds" .rh "Steady-state segments." Table 7.2 shows appropriate values for the resonance parameters and excitation sources of a resonance synthesizer, for steady-state segments only. There are several points to note about it. Firstly, all the frequencies involved obviously depend upon the speaker \(em the size of his vocal tract, his accent and speaking habits. The values given are nominal ones for a male speaker with a dialect of British English called "received pronunciation" (RP) \(em for it is what used to be "received" on the wireless in the old days before the British Broadcasting Corporation adopted a policy of more informal, more regional, speech. Female speakers have formant frequencies approximately 15% higher than male ones.
Secondly, the third formant is relatively unimportant for vowel identification; it is the first and second that give the vowels their character. Thirdly, formant values for .ul h are not given, for they would be meaningless. Although it is certainly a steady-state sound, .ul h changes radically in context. If you say "had", "heed", "hud", and so on, and freeze your vocal tract posture on the initial .ul h, you will find it already configured for the following vowel \(em an excellent example of anticipatory coarticulation. Fourthly, amplitude values do play some part in identification, particularly for fricatives. .ul th is the weakest sound, closely followed by .ul f, with .ul s and .ul sh the strongest. It is necessary to get a reasonable mix of excitation in the voiced fricatives; the voicing amplitude is considerably less than in vowels. Finally, there are other sounds that might be considered steady state ones. You can probably identify .ul m, n, and .ul ng just by their steady states. However, the difference is not particularly strong; it is the transitional parts which discriminate most effectively between these sounds. The steady state of .ul r is quite distinctive, too, for most speakers, because the top of the tongue is curled back in a so-called "retroflex" action and this causes a radical change in the third formant resonance. .rh "Transient segments." Transient sounds include diphthongs, glides, nasals, voiced and unvoiced stops, and affricates. The first two are relatively easy to characterize, for they are basically continuous, gradual transitions from one vocal tract posture to another \(em sort of dynamic vowels. Diphthongs and glides are similar to each other. In fact "you" could be transcribed as a triphthong, .ul i e uu, except that in the initial posture the tongue is even higher, and the vocal tract correspondingly more constricted, than in .ul i ("hid") \(em though not as constricted as in .ul sh. Both categories can be represented in terms of target formant values, on the understanding that these are not to be interpreted as steady state configurations but strictly as extreme values at the beginning or end of the formant motion (for transitions out of and into the segment, respectively). .pp Nasals have a steady-state portion comprising a strong nasal formant at a fairly low frequency, on account of the large size of the combined nasal and oral cavity which is resonating. Higher formants are relatively weak, because of attenuation effects. Transitions into and out of nasals are strongly nasalized, as indeed are adjacent vocalic segments, with the oral and nasal tract operating in parallel. As discussed in Chapter 5, this cannot be simulated on a series synthesizer. However, extremely fast motions of the formants occur on account of the binary switching action of the velum, and it turns out that fast formant transitions are sufficient to simulate nasals because the speech perception mechanism is accustomed to hearing them only in that context! Contrast this with the extremely slow transitions in diphthongs and glides. .pp Stops form the most interesting category, and research using the pattern playback synthesizer was instrumental in providing adequate acoustic characterizations for them. Consider unvoiced stops. They each have three phases: transition in, silent central portion, and transition out. There is a lot of action on the transition out (and many phoneticians would divide this part alone into several "phases"). 
First, as the release occurs, there is a small burst of fricative noise. Say "t\ t\ t\ ..." as in "tut-tut", without producing any voicing. Actually, when used as an admonishment this is accompanied by an ingressive, inhaling air-stream instead of the normal egressive, exhaling one used in English speech (although some languages do have ingressive sounds). In any case, a short fricative somewhat resembling a tiny .ul s can be heard as the tongue leaves the roof of the mouth. Frication is produced when the gap is very narrow, and ceases rapidly as it becomes wider. Next, a significant amount of aspiration follows the release of an unvoiced stop. Say "pot", "tot", "cot" with force and you will hear the .ul h\c -like aspiration quite clearly. It doesn't always occur, though; for example you will hear little aspiration when a fricative like .ul s precedes the stop in the same syllable, as in "spot", "scot". The aspiration is a distinguishing feature between "white spot" and the rather unlikely "White's pot". It tends to increase as the emphasis on the syllable increases, and this is an example of a prosodic feature influencing segmental characteristics. Finally, at the end of the segment, the aspiration \(em if any \(em will turn to voicing. .pp What has been described applies to .ul all unvoiced stops. What distinguishes one from another? The tiny fricative burst will be different because the noise is produced at different places in the vocal tract \(em at the lips for .ul p, tongue and front of palate for .ul t, and tongue and back of palate for .ul k. The most important difference, however, is the formant motion illuminated by the last vestiges of voicing at closure and by both aspiration and the onset of voicing at opening. Each stop has target formant values which, although they cannot be heard during the stopped portion (for there is no sound there), do affect the transitions in and out. An added complexity is that the target positions themselves vary to some extent depending on the adjacent segments. If the stop is heavily aspirated, the vocal posture will have almost attained that for the following vowel before voicing begins, but the formant transitions will be perceived because they affect the sound quality of the aspiration. .pp The voiced stops .ul b, d, and .ul g are quite similar to their unvoiced analogues .ul p, t, and .ul k. What distinguishes them from each other are the formant transitions to target positions, heard during closure and opening. They are distinguished from their unvoiced counterparts by the fact that more voicing is present: it lingers on longer at closure and begins earlier on opening. Thus little or no aspiration appears during the opening phase. If an unvoiced stop is uttered in a context where aspiration is suppressed, as in "spot", it is almost identical to the corresponding voiced stop, "sbot". Luckily no words in English require us to make a distinction in such contexts. Voicing sometimes pervades the entire stopped portion of a voiced stop, especially when it is surrounded by other voiced segments. When saying a word like "baby" slowly you can choose whether or not to prolong voicing throughout the second .ul b. If you do, creating what is called a "voice bar" in spectrograms, the sound escapes through the cheeks, for the lips are closed \(em try doing it for a very long time and your cheeks will fill up with air! This severely attenuates high-frequency components, and can be simulated with a weak first formant at a low resonant frequency.
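.pp
The voice bar is simple to imitate on a resonance synthesizer. The following fragment is a sketch only, assuming a frame-oriented control interface; the parameter names and figures are illustrative and do not belong to any particular device.
.sp
.nf
# Sketch: synthesizer control frames approximating a "voice bar" during the
# closure of a voiced stop. Names and values are illustrative only.

def voice_bar_frames(n_frames, f0=110.0):
    """Weak, low first formant with voicing on and everything else silent."""
    frame = {
        "f0": f0,                   # voicing continues throughout the closure
        "av": 0.15,                 # voicing amplitude well below that of a vowel
        "f1": 200.0, "a1": 0.2,     # low, weak first-formant resonance (Hz)
        "f2": 1200.0, "a2": 0.0,    # higher formants silenced: the closed lips
        "f3": 2500.0, "a3": 0.0,    #   attenuate the high-frequency components
        "af": 0.0,                  # no frication
    }
    return [dict(frame) for _ in range(n_frames)]

if __name__ == "__main__":
    print(voice_bar_frames(3)[0])
.fi
.sp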
.RF .nr x0 \w'unvoiced stops: 'u .nr x1 4n .nr x2 \n(x0+\n(x1+\w'aspiration burst (context- and emphasis-dependent)'u .nr x3 (\n(.l-\n(x2)/2 .in \n(x3u .ta \n(x0u +\n(x1u unvoiced stops: closure (early cessation of voicing) silent steady state opening, comprising short fricative burst aspiration burst (context- and emphasis-dependent) onset of voicing .sp voiced stops: closure (late cessation of voicing) steady state (possibility of voice bar) opening, comprising pre-voicing short fricative burst .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.3 Acoustic phases of stop consonants" .pp Table 7.3 summarizes some of the acoustic phases of voiced and unvoiced stops. There are many variations that have not been mentioned. Nasal plosion ("good news") occurs (at the word boundary, in this case) when the nasal formant pervades the opening phase. Stop bursts are suppressed when the next sound is a stop too (the burst on the .ul p of "apt", for example). It is difficult to distinguish a voiced stop from an unvoiced one at the end of a word ("cab" and "cap"); if the speaker is trying to make himself particularly clear he will put a short neutral vowel after the voiced stop to emphasize its early onset of voicing. (If he is Italian he will probably do this anyway, for it is the norm in his own language.) .pp Finally, we turn to affricates, of which there are only two in English: .ul ch ("chin") and .ul j ("djinn"). They are very similar to the stops .ul t and .ul d followed by the fricatives .ul sh and .ul zh respectively, and their acoustic characterization is similar to that of the phoneme pair. .ul ch has a closing phase, a stopped phase, and a long fricative burst. There is no aspiration, for the vocal cords are not involved. .ul j is the same except that voicing extends further into the stopped portion, and the terminating fricative is also voiced. It may be pronounced with a voice bar if the preceding segment is voiced ("adjunct"). .sh "7.5 Speech synthesis by rule" .pp Generation of speech by rules acting upon a phonetic transcription was first investigated in the early 1960's (Kelly and Gerstman, 1961). .[ Kelly Gerstman 1961 .] Most systems employ a hardware resonance synthesizer, analogue or digital, series or parallel, to reduce the load on the computer which operates the rules. The speech-by-rule program, rather than the synthesizer, inevitably contributes by far the greater part of the degradation in the resulting speech. Although parallel synthesizers offer greater potential control over the spectrum, it is not clear to what extent a synthesis program can take advantage of this. Parameter tracks for a series synthesizer can easily be converted into linear predictive coefficients, and systems which use a linear predictive synthesizer will probably become popular in the near future. .pp The phrase "synthesis by rule", which is in common use, does not make it clear just what sort of features the rules are supposed to accommodate, and what information must be included explicitly in the input transcription. Early systems made no attempt to simulate prosodics. Pitch and rhythm could be controlled, but only by inserting pitch specifiers and duration markers in the input. Some kind of prosodic control was often incorporated later, but usually as a completely separate phase from segmental synthesis. This does not allow interaction effects (such as the extra aspiration for voiceless stops in accented syllables) to be taken into account easily.
Even systems which perform prosodic operations invariably need to have prosodic specifications embedded explicitly in the input. .pp Generating parameter tracks for a synthesizer from a phonetic transcription is a process of data .ul expansion. Six bits are ample to specify a phoneme, and a speaking rate of 12 phonemes/sec leads to an input data rate of 72 bit/s. The data rate required to control the synthesizer will depend upon the number of parameters and the rate at which they are sampled, but a typical figure is 6 Kbit/s (Chapter 5). Hence there is something like a hundredfold data expansion. .pp Figure 7.1 shows the parameter tracks for a series synthesizer's rendering of the utterance .ul s i k s. .FC "Figure 7.1" There are eight parameters. You can see the onset of frication at the beginning and end (parameter 5), and the amplitude of voicing (parameter 1) come on for the .ul i and off again before the .ul k. The pitch (parameter 0) is falling slowly throughout the utterance. These tracks are stylized: they come from a computer synthesis-by-rule program and not from a human utterance. With a parameter update rate of 10 msec, the graphs can be represented by 90 sets of eight parameter values, a total of 720 values or 4320 bits if a 6-bit representation is used for each value. Contrast this with the input of only four phoneme segments, or say 24 bits. .rh "A segment-by-segment system." A seminal paper appearing in 1964 was the first comprehensive description of a computer-based synthesis-by-rule system (Holmes .ul et al, 1964). .[ Holmes Mattingly Shearme 1964 .] The same system is still in use and has been reimplemented in a more portable form (Wright, 1976). .[ Wright 1976 .] The inventory of sound segments includes the phonemes listed in Table 2.1, as well as diphthongs and a second allophone of .ul l. (Many British speakers use quite a different vocal posture for pre- and post-vocalic .ul l\c \&'s, called clear and dark .ul l\c \&'s respectively.) Some phonemes are expanded into sub-phonemic "phases" by the program. Stops have three phases, corresponding to the closure, silent steady state, and opening. Diphthongs have two phases. We will call individual phases and single-phase phonemes "segments", for they are subject to exactly the same transition rules. .pp Parameter tracks are constructed out of linear pieces. Consider a pair of adjacent segments in an utterance to be synthesized. Each one has a steady-state portion and an internal transition. The internal transition of one phoneme is dubbed "external" as far as the other is concerned. This is important because instead of each segment being responsible for its own internal transition, one of the pair is identified as "dominant" and it controls the duration of both transitions \(em its internal one and its external (the other's internal) one. For example, in Figure 7.2 the segment .ul sh dominates .ul ee and so it governs the duration of both transitions shown. .FC "Figure 7.2" Note that each segment contributes as many as three linear pieces to the parameter track. .pp The notion of domination is similar to that discussed earlier for word concatenation. The difference is that for word concatenation the dominant segment was determined by computing the spectral derivative over the transition region, whereas for synthesis-by-rule segments are ranked according to a static precedence, and the higher-ranking segment dominates. 
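.pp
The flavour of the dominance rule can be conveyed by a small sketch. The class ranks, durations and names below are invented for illustration; the actual system stores such quantities in a per-segment table, as described below.
.sp
.nf
# Sketch of the dominance rule for a pair of adjacent segments.
# Ranks and transition durations here are invented, not taken from the table.

RANK = {"stop": 4, "fricative": 3, "nasal": 2, "glide": 1, "vowel": 0}

def plan_transitions(seg_a, seg_b):
    """The higher-ranking segment dominates the boundary and fixes the
    duration of both transitions: its own internal one and the external
    (i.e. the neighbour's internal) one."""
    if RANK[seg_a["class"]] >= RANK[seg_b["class"]]:
        dom = seg_a
    else:
        dom = seg_b
    return {"dominant": dom["name"],
            "internal_ms": dom["internal_ms"],
            "external_ms": dom["external_ms"]}

sh = {"name": "sh", "class": "fricative", "internal_ms": 40, "external_ms": 70}
ee = {"name": "ee", "class": "vowel", "internal_ms": 60, "external_ms": 60}
print(plan_transitions(sh, ee))    # sh dominates ee, as in Figure 7.2
.fi
.sp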
Segments of stop consonants have the highest rank (and also the greatest spectral derivative), while fricatives, nasals, glides, and vowels follow in that order. .pp The concatenation procedure is controlled by a table which associates 25 quantities with each segment. They are .LB .NI rank .NI 2\ \ overall durations (for stressed and unstressed occurrences) .NI 4\ \ transition durations (for internal and external transitions of formant frequencies and amplitudes) .NI 8\ \ target parameter values (amplitudes and frequencies of three formant resonances, plus fricative information) .NI 5\ \ quantities which specify how to calculate boundary values for formant frequencies (two for each formant except the third, which has only one) .NI 5\ \ quantities which specify how to calculate boundary values for amplitudes. .LE This table is rather large. There are 80 segments in all (remember that many phonemes are represented by more than one segment), and so it has 2000 entries. The system was an offline one which ran on what was then \(em 1964 \(em a large computer. .pp The advantage of such a large table of "rules" is the flexibility it affords. Notice that transition durations are specified independently for formant frequency and amplitude parameters \(em this permits fine control which is particularly useful for stops. For each parameter the boundary value between segments is calculated using a fixed contribution from the dominant one and a proportion of the steady state value of the other. .pp It is possible that the two transition durations which are calculated for a segment actually exceed the overall duration specified for it. In this case, the steady-state target values will be approached but not actually attained, simulating a situation where coarticulation effects prevent a target value from being reached. .rh "An event-based system." The synthesis system described above, in common with many others, takes an uncompromisingly segment-by-segment view of speech. The next phoneme is read, perhaps split into a few segments, and these are synthesized one by one with due attention being paid to transitions between them. Some later work has taken a more syllabic view. Mattingly (1976) urges a return to syllables for both practical and theoretical reasons. .[ Mattingly 1976 Syllable synthesis .] Transitional effects are particularly strong within a syllable and comparatively weak (but by no means negligible) from one syllable to the next. From a theoretical viewpoint, there are much stronger phonetic restrictions on phoneme sequences than there are on syllable sequences: pretty well any syllable can follow another (although whether the pair makes sense is a different matter), but the linguistically acceptable phoneme sequences are only a fraction of those formed by combining phonemes in all possible ways. Hill (1978) argues against what he calls the "segmental assumption" that progress through the utterance should be made one segment at a time, and recommends a description of speech based upon perceptually relevant "events". .[ Hill 1978 A program structure for event-based speech synthesis by rules .] This framework is interesting because it provides an opportunity for prosodic considerations to be treated as an integral part of the synthesis process. .pp The phonetic segments and other information that specify an utterance can be regarded as a list of events which describes it at a relatively high level.
Synthesis-by-rule is the act of taking this list and elaborating on it to produce lower-level events which are realized by the vocal tract, or acoustically simulated by a resonance synthesizer, to give a speech waveform. In articulatory terms, an event might be "begin tongue motion towards upper teeth with a given effort", while in resonance terms it could be "begin second formant transition towards 1500\ Hz at a given rate". (These two examples are .ul not intended to describe the same event: a tongue motion causes much more than the transition of a single formant.) Coarticulation issues such as stop burst suppression and nasal plosion should be easier to imitate within an event-based scheme than a segment-to-segment one. .pp The ISP system (Witten and Abbess, 1979) is event-based. .[ Witten Abbess 1979 .] The key to its operation is the .ul synthesis list. To prepare an utterance for synthesis, the lexical items which specify it are joined into a linked list. Figure 7.3 shows the start of the list created for .LB 1 .ul dh i z i z /*d zh aa k s /h aa u s .LE (this is Jack's house); the "1\ ...\ /*\ ...\ /\ ..." are prosodic markers which will be discussed in the next chapter. .FC "Figure 7.3" Next, the rhythm and pitch assignment routines augment the list with syllable boundaries, phoneme cluster identifiers, and duration and pitch specifications. Then it is passed to the segmental synthesis routine which chains events into the appropriate places and, as it proceeds, removes the no longer useful elements (phoneme names, pitch specifiers, etc) which originally constituted the synthesis list. Finally, an interrupt-driven speech synthesizer handler removes events from the list as they become due and uses them to control the hardware synthesizer. .pp By adopting the synthesis list as a uniform data structure for holding utterances at every stage of processing, the problems of storage allocation and garbage collection are minimized. Each list element has a forward pointer and five data words, the first indicating what type of element it is. Lexical items which may appear in the input are .LB .NI end of utterance (".", "!", ",", ";") .NI intonation indicator ("1", ...) .NI rhythm indicator ("/", "/*") .NI word boundary (" ") .NI syllable boundary ("'") .NI phoneme segment (\c .ul ar, b, ng, ...\c ) .NI explicit duration or pitch information. .LE Several of these have to do with prosodic features \(em a prime advantage of the structure is that it does not create an artificial division between segmentals and prosody. Syllable boundaries and duration and pitch information are optional. They will normally be computed by ISP, but the user can override them in the input in a natural way. The actual characters which identify lexical items are not fixed but are taken from the rule table. .pp As synthesis proceeds, new elements are chained in to the synthesis list. For segmental purposes, three types of event are defined \(em target events, increment events, and aspiration events. With each event is associated a time at which the event becomes due. For a target event, a parameter number, target parameter value, and time-increment are specified. When it becomes due, motion of the parameter towards the target is begun. If no other event for that parameter intervenes, the target value will be reached after the given time-increment. However, another target event for the parameter may change its motion before the target has been attained. 
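.pp
The way a target event drives a parameter track can be sketched as follows. The record layout and the names are illustrative and are not the actual ISP data structures; the point is simply that a due time, a target value, and a time-increment define a linear motion which a later event for the same parameter may cut short.
.sp
.nf
# Sketch of target-event handling. When an event becomes due, the parameter
# starts a linear motion towards the target, arriving after time_increment
# unless a later event for the same parameter intervenes first.
# Names and units are illustrative, not those of the ISP implementation.

from dataclasses import dataclass

@dataclass
class TargetEvent:
    due: int              # time at which the event becomes due (msec)
    param: int            # parameter number, e.g. second-formant frequency
    target: float         # value to move towards
    time_increment: int   # msec allowed to reach the target

def parameter_value(events, param, t):
    """Piecewise-linear value of one parameter at time t. The events are
    assumed sorted by due time, and the track starts at the first target."""
    events = [e for e in events if e.param == param]
    value = events[0].target
    for e, nxt in zip(events, events[1:] + [None]):
        if t < e.due:
            break
        end = e.due + e.time_increment
        if nxt is not None:
            end = min(end, nxt.due)    # a later event may cut the motion short
        frac = min(1.0, (min(t, end) - e.due) / e.time_increment)
        value = value + frac * (e.target - value)
    return value

f2 = [TargetEvent(0, 2, 1800.0, 10), TargetEvent(100, 2, 1500.0, 80)]
print(parameter_value(f2, 2, 140))     # half-way from 1800 Hz towards 1500 Hz
.fi
.sp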
Increment events contain a parameter number, a parameter increment, and a time-increment. The fixed increment is added to the parameter value throughout the time specified. This provides an easy way to make a fricative burst during the opening phase of a stop consonant. Aspiration events switch the mode of excitation from voicing to aspiration for a given period of time. Thus the aspirated part of unvoiced stops can be accomodated in a natural manner, by changing the mode of excitation for the duration of the aspiration. .RF .nr x1 (\w'excitation'/2) .nr x2 (\w'formant resonance'/2) .nr x3 (\w'fricative'/2) .nr x4 (\w'type'/2) .nr x5 (\w'frequencies (Hz)'/2) .nr x6 (\w'resonance (Hz)'/2) .nr x0 1.0i+0.7i+0.6i+0.6i+1.0i+1.2i+(\w'long vowel'/2) .nr x7 (\n(.l-\n(x0)/2 .in \n(x7u .ta 1.0i +0.7i +0.6i +0.6i +1.0i +1.2i \h'-\n(x1u'excitation \0\0\h'-\n(x2u'formant resonance \0\0\h'-\n(x3u'fricative \h'-\n(x4u'type \0\0\h'-\n(x5u'frequencies (Hz) \0\0\h'-\n(x6u'resonance (Hz) \l'\n(x0u\(ul' .sp .nr x1 (\w'voicing'/2) .nr x2 (\w'vowel'/2) \fIuh\fR \h'-\n(x1u'voicing \0490 1480 2500 \c \h'-\n(x2u'vowel \fIa\fR \h'-\n(x1u'voicing \0720 1240 2540 \h'-\n(x2u'vowel \fIe\fR \h'-\n(x1u'voicing \0560 1970 2640 \h'-\n(x2u'vowel \fIi\fR \h'-\n(x1u'voicing \0360 2100 2700 \h'-\n(x2u'vowel \fIo\fR \h'-\n(x1u'voicing \0600 \0890 2600 \h'-\n(x2u'vowel \fIu\fR \h'-\n(x1u'voicing \0380 \0950 2440 \h'-\n(x2u'vowel \fIaa\fR \h'-\n(x1u'voicing \0750 1750 2600 \h'-\n(x2u'vowel .nr x2 (\w'long vowel'/2) \fIee\fR \h'-\n(x1u'voicing \0290 2270 3090 \h'-\n(x2u'long vowel \fIer\fR \h'-\n(x1u'voicing \0580 1380 2440 \h'-\n(x2u'long vowel \fIar\fR \h'-\n(x1u'voicing \0680 1080 2540 \h'-\n(x2u'long vowel \fIaw\fR \h'-\n(x1u'voicing \0450 \0740 2640 \h'-\n(x2u'long vowel \fIuu\fR \h'-\n(x1u'voicing \0310 \0940 2320 \h'-\n(x2u'long vowel .nr x1 (\w'aspiration'/2) .nr x2 (\w'h'/2) \fIh\fR \h'-\n(x1u'aspiration \h'-\n(x2u'h .nr x1 (\w'voicing'/2) .nr x2 (\w'glide'/2) \fIr\fR \h'-\n(x1u'voicing \0240 1190 1550 \h'-\n(x2u'glide \fIw\fR \h'-\n(x1u'voicing \0240 \0650 \h'-\n(x2u'glide \fIl\fR \h'-\n(x1u'voicing \0380 1190 \h'-\n(x2u'glide \fIy\fR \h'-\n(x1u'voicing \0240 2270 \h'-\n(x2u'glide .nr x2 (\w'nasal'/2) \fIm\fR \h'-\n(x1u'voicing \0190 \0690 2000 \h'-\n(x2u'nasal .nr x1 (\w'none'/2) .nr x2 (\w'stop'/2) \fIb\fR \h'-\n(x1u'none \0100 \0690 2000 \h'-\n(x2u'stop \fIp\fR \h'-\n(x1u'none \0100 \0690 2000 \h'-\n(x2u'stop .nr x1 (\w'voicing'/2) .nr x2 (\w'nasal'/2) \fIn\fR \h'-\n(x1u'voicing \0190 1780 3300 \h'-\n(x2u'nasal .nr x1 (\w'none'/2) .nr x2 (\w'stop'/2) \fId\fR \h'-\n(x1u'none \0100 1780 3300 \h'-\n(x2u'stop \fIt\fR \h'-\n(x1u'none \0100 1780 3300 \h'-\n(x2u'stop .nr x1 (\w'voicing'/2) .nr x2 (\w'nasal'/2) \fIng\fR \h'-\n(x1u'voicing \0190 2300 2500 \h'-\n(x2u'nasal .nr x1 (\w'none'/2) .nr x2 (\w'stop'/2) \fIg\fR \h'-\n(x1u'none \0100 2300 2500 \h'-\n(x2u'stop \fIk\fR \h'-\n(x1u'none \0100 2300 2500 \h'-\n(x2u'stop .nr x1 (\w'frication'/2) .nr x2 (\w'voice + fric'/2) .nr x3 (\w'fricative'/2) \fIs\fR \h'-\n(x1u'frication 6000 \h'-\n(x3u'fricative \fIz\fR \h'-\n(x2u'voice + fric \0190 1780 3300 6000 \h'-\n(x3u'fricative \fIsh\fR \h'-\n(x1u'frication 2300 \h'-\n(x3u'fricative \fIzh\fR \h'-\n(x2u'voice + fric \0190 2120 2700 2300 \h'-\n(x3u'fricative \fIf\fR \h'-\n(x1u'frication 4000 \h'-\n(x3u'fricative \fIv\fR \h'-\n(x2u'voice + fric \0190 \0690 3300 4000 \h'-\n(x3u'fricative \fIth\fR \h'-\n(x1u'frication 5000 \h'-\n(x3u'fricative \fIdh\fR \h'-\n(x2u'voice + fric \0190 1780 3300 5000 \h'-\n(x3u'fricative \l'\n(x0u\(ul' .ta 
0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.4 Rule table for an event-based synthesis-by-rule program" .pp Now the rule table, which is shown in Table 7.4, holds simple target positions for each phoneme segment, as well as the segment type. The latter is used to trigger events by computer procedures which have access to the context of the segment. In principle, this allows considerably more sophistication to be introduced than does a simple segment-by-segment approach. .RF .nr x1 0.5i+0.5i+\w'preceding consonant in this syllable (suppress burst if fricative)'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 0.5i +0.5i fricative bursts on stops aspiration bursts on unvoiced stops, affected by preceding consonant in this syllable (suppress burst if fricative) following consonant (suppress burst if another stop; introduce nasal plosion if a nasal) prosodics (increase burst if syllable is stressed) voice bar on voiced stops (in intervocalic position) post-voicing on terminating voiced stops, if syllable is stressed anticipatory coarticulation for \fIh\fR vowel colouring when a nasal or glide follows .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.5 Some coarticulation effects" .pp For example, Table 7.5 summarizes some of the subtleties of the speech production process which have been mentioned earlier in this chapter. Most of them are context-dependent, with the prosodic context (whether two segments are in the same syllable; whether a syllable is stressed) playing a significant role. A scheme where data-dependent "demons" fire on particular patterns in a linked list seems to be a sensible approach towards incorporating such rules. .rh "Discussion." There are two opposing trends in speech synthesis by rule. On the one hand larger and larger segment inventories can be used, containing more and more allophones explicitly. This is the approach of the Votrax sound-segment synthesizer, discussed in Chapter 11. It puts an increasing burden on the person who codes the utterances for synthesis, although, as we shall see, computer programs can assist with this task. On the other hand the segment inventory can be kept small, perhaps comprising just the logical phonemes as in the ISP system. This places the onus on the computer program to accommodate allophonic variations, and to do so it must take account of the segmental and prosodic context of each phoneme. An event-based approach seems to give the best chance of incorporating contextual modification whilst avoiding undesired interactions. .pp The second trend brings synthesis closer to the articulatory process of speech production. In fact an event-based system would be an ideal way of implementing an articulatory model for speech synthesis by rule. It would be much more satisfying to have the rule table contain articulatory target positions instead of resonance ones, with events like "begin tongue motion towards upper teeth with a given effort". The problem is that hard data on articulatory postures and constraints is much more difficult to gather than resonance information. .pp An interesting question that relates to articulation is whether formant motion can be simulated adequately by a small number of linear pieces. The segment-by-segment system described above had as many as nine pieces for a single phoneme, for some phonemes had three phases and each one contributes up to three pieces (transition in, steady state, and transition out).
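.pp
To make the question concrete, here is how one parameter might be rendered as three linear pieces \(em steady state, transition, steady state. It is a sketch only; the frequencies, durations and frame rate are invented.
.sp
.nf
# Sketch: one formant track built from three linear pieces.
# All figures are invented for illustration.

def linear_pieces(start_hz, end_hz, steady_ms, trans_ms, frame_ms=10):
    """Hold start_hz, move linearly to end_hz, then hold end_hz."""
    track = [start_hz] * (steady_ms // frame_ms)
    steps = trans_ms // frame_ms
    for i in range(1, steps + 1):
        track.append(start_hz + (end_hz - start_hz) * i / steps)
    track += [end_hz] * (steady_ms // frame_ms)
    return track

# Second formant moving from a vowel steady state towards a consonant locus.
print(linear_pieces(1800.0, 1200.0, steady_ms=40, trans_ms=60))
.fi
.sp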
Another system used curves of decaying exponential form which ensured that all transitions started rapidly towards the target position but slowed down as it was approached (Rabiner, 1968, 1969). .[ Rabiner 1968 Speech synthesis by rule Bell System Technical J .] .[ Rabiner 1969 A model for synthesizing speech by rule .] The time-constant of decay was stored with each segment in the rule table. The rhythm of the synthetic speech was controlled at this level, for the next segment was begun when all the formants had attained values sufficiently close to the current targets. This is a poor model of the human speech production process, where rhythm is dictated at a relatively high level and the next phoneme is not simply started when the current one happens to end. Nevertheless, the algorithm produced smooth, continuous formant motions not unlike those found in spectrograms. .pp There is, however, by no means universal agreement on decaying exponential formant motions. Lawrence (1974) divided segments into "checked" and "free" categories, corresponding roughly to consonants and vowels; and postulated .ul increasing exponential transitions into checked segments, and decaying transitions into free ones. .[ Lawrence 1974 .] This is a reasonable supposition if you consider the mechanics of articulation. The speed of movement of the tongue (for example) is likely to increase until it is physically stopped by reaching the roof of the mouth. When moving away from a checked posture into a free one the transition will be rapid at first but slow down to approach the target asymptotically, governed by proprioceptive feedback. .pp The only thing that seems to be agreed is that the formant tracks should certainly .ul not be piecewise linear. However, in the face of conflicting opinions as to whether exponentials should be decaying or increasing, piecewise linear motions seem to be a reasonable compromise! It is likely that the precise shape of formant tracks is unimportant so long as the gross features are imitated correctly. Nevertheless, this is a question which an articulatory model could help to answer. .sh "7.6 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "7.7 Further reading" .pp There are unfortunately few books to recommend on the subject of joining segments of speech. The references form a representative and moderately comprehensive bibliography. Here is some relevant background reading in linguistics. .LB "nn" .\"Fry-1976-1 .]- .ds [A Fry, D.B.(Editor) .ds [D 1976 .ds [T Acoustic phonetics .ds [I Cambridge Univ Press .ds [C Cambridge, England .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n This book of readings contains many classic papers on acoustic phonetics published from 1922\-1965. It covers much of the history of the subject, and is intended primarily for students of linguistics. .in-2n .\"Lehiste-1967-2 .]- .ds [A Lehiste, I.(Editor) .ds [D 1967 .ds [T Readings in acoustic phonetics .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n Another basic collection of references which covers much the same ground as Fry (1976), above. .in-2n .\"Sivertsen-1961-3 .]- .ds [A Sivertsen, E. .ds [D 1961 .ds [K * .ds [T Segment inventories for speech synthesis .ds [J Language and Speech .ds [V 4 .ds [P 27-89 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n This is a careful early study of the quantitative implications of using phonemes, demisyllables, syllables, and words as the basic building blocks for speech synthesis. 
.in-2n .LE "nn" .EQ delim $$ .EN .CH "8 PROSODIC FEATURES IN SPEECH SYNTHESIS" .ds RT "Prosodic features .ds CX "Principles of computer speech .pp Prosodic features are those which characterize an utterance as a whole, rather than having a local influence on individual sound segments. For speech output from computers, an "utterance" usually comprises a single unit of information which stretches over several words \(em a clause or sentence. In natural speech an utterance can be very much longer, but it will be broken into prosodic units which are again roughly the size of a clause or sentence. These prosodic units are certainly closely related to each other. For example, the pitch contour used when introducing a new topic is usually different from those employed to develop it subsequently. However, for the purposes of synthesis the successive prosodic units can be treated independently, and information about pitch contours to be used will have to be specified in the input for each one. The independence between them is not complete, though, and lower-level contextual effects, such as interpolation of pitch between the end of one prosodic unit and the start of the next, must still be imitated. .pp Prosodic features were introduced briefly in Chapter 2. Variations in voice dynamics occur in three dimensions: pitch of the voice, time, and amplitude. These dimensions are inextricably intertwined in living speech. Variations in voice quality are much less important for the factual kind of speech usually sought in voice response applications, although they can play a considerable part in conveying emotions (for a discussion of the acoustic manifestations of emotion in speech, see Williams and Stevens, 1972). .[ Williams Stevens 1972 .] .pp The distinction between prosodic and segmental effects is a traditional one, but it becomes rather fuzzy when examined in detail. It is analogous to the distinction between hardware and software in computer science: although useful from some points of view the borderline becomes blurred as one gets closer to actual systems \(em with microcode, interrupts, memory management, and the like. At a trivial level, prosodics cannot exist without segmentals, for there must be some vehicle to carry the prosodic contrasts. Timing \(em a prosodic feature \(em is actually realized by the durations of individual segments. Pauses are tantamount to silent segments. .pp While pitch may seem to be relatively independent of segmentals \(em and this view is reinforced by the success of the source-filter model which separates the frequency of the excitation source from the filter characteristics \(em there are some subtle phonetic effects of pitch. It has been observed that pitch drops on the transition into certain consonants, and rises again on the transition out (Haggard .ul et al, 1970). .[ Haggard Ambler Callow 1970 .] This can be explained in terms of variations in pressure from the lungs on the vocal cords (Ladefoged, 1967). .[ Ladefoged 1967 .] Briefly, the increase in mouth pressure which occurs during some consonants causes a reduction in the pressure difference across the vocal cords and in the rate of flow of air between them. This results in a decrease in their frequency of vibration. When the constriction is released, there is a temporary increase in the air flow which increases the pitch again. The phenomenon is called "microintonation". It is particularly noticeable in voiced stops, but also occurs in voiced fricatives and unvoiced stops.
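.pp
As a rough indication of how a synthesis program might superimpose the effect on an otherwise smooth contour, consider the following sketch. The dip of 10\ Hz, the ramp length, and the frame-oriented representation are all illustrative assumptions.
.sp
.nf
# Sketch: superimpose a microintonation dip on a pitch contour around the
# closure and release of a consonant. Depth and timing are illustrative.

def add_microintonation(f0_track, closure_frame, release_frame,
                        dip_hz=10.0, ramp_frames=4):
    """Lower F0 going into the consonant and let it recover after release."""
    track = list(f0_track)
    for i in range(len(track)):
        if closure_frame <= i <= release_frame:
            track[i] -= dip_hz                    # reduced airflow: lower F0
        elif closure_frame - ramp_frames <= i < closure_frame:
            frac = (i - (closure_frame - ramp_frames)) / ramp_frames
            track[i] -= dip_hz * frac             # fall into the closure
        elif release_frame < i <= release_frame + ramp_frames:
            frac = (release_frame + ramp_frames - i) / ramp_frames
            track[i] -= dip_hz * frac             # rise again after the release
    return track

flat = [120.0] * 20                               # 10 msec frames, say
print(add_microintonation(flat, closure_frame=8, release_frame=12))
.fi
.sp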
Simulation of the effect in synthesis-by-rule has often been found to give noticeable improvements in the speech quality. .pp Loudness also has a segmental role. For example, we noted in the last chapter that amplitude values play a small part in identification of fricatives. In fact loudness is a very .ul weak prosodic feature. It contributes little to the perception of stress. Even for shouting the distinction from normal speech is as much in the voice quality as in amplitude .ul per se. It is not necessary to consider varying loudness on a prosodic basis in most speech synthesis systems. .pp The above examples show how prosodic features have segmental influences as well. The converse is also true: some segmental features have a prosodic effect. The last chapter described how stress is associated with increased aspiration of syllable-initial unvoiced stops. Furthermore, stressed syllables are articulated with greater effort than unstressed ones, and hence the formant transitions are more likely to attain their target values under circumstances which would otherwise cause them to fall short. In unstressed syllables, extreme vowels (like .ul ee, aa, uu\c ) tend to more centralized sounds (like .ul i, uh, u respectively). Although all British English vowels .ul can appear in unstressed syllables, they often become "reduced" into a centralized form. Consider the following examples. .LB .NI diplomat \ .ul d i p l uh m aa t .NI diplomacy \ .ul d i p l uh u m uh s i .NI diplomatic \ .ul d i p l uh m aa t i k. .LE The vowel of the second syllable is reduced to .ul uh in "diplomat" and "diplomatic", whereas the root form "diploma", and also "diplomacy", has a diphthong (\c .ul uh u\c ) there. The third syllable has an .ul aa in "diplomat" and "diplomatic" which is reduced to .ul uh in "diplomacy". In these cases the reduction is shown explicitly in the phonetic transcription; but in more marginal examples where it is less extreme it will not be. .pp I have tried to emphasize in previous chapters that prosodic features are important in speech synthesis. There is something very basic about them. Rhythm is an essential part of all bodily activity \(em of breathing, walking, working and playing \(em and so it pervades speech too. Mothers and babies communicate effectively using intonation alone. Some experiments have indicated that the language environment of an infant affects his babbling at an early age, before he has effective segmental control. There is no doubt that "tone of voice" plays a large part in human communication. .pp However, early attempts at synthesis did not pay too much attention to prosodics, perhaps because it was thought sufficient to get the meaning across by providing clear segmentals. As artificial speech grows more widespread, however, it is becoming apparent that its acceptability to users, and hence its ultimate success, depends to a large extent on incorporating natural-sounding prosodics. Flat, arhythmic speech may be comprehensible in short stretches, but it strains the concentration in significant discourse and people are not usually prepared to listen to it. Unfortunately, current commercial speech output systems do not really tackle prosodic questions, which indicates our present rather inadequate state of knowledge. .pp The importance of prosodics for automatic speech .ul recognition is beginning to be appreciated too. 
Some research projects have attended to the automatic identification of points of stress, in the hope that the clear articulation of stressed syllables can be used to provide anchor points in an unknown utterance (for example, see Lea .ul et al, 1975). .[ Lea Medress Skinner 1975 .] .pp But prosodics and segmentals are closely intertwined. I have chosen to treat them in separate chapters in order to split the material up into manageable chunks rather than to enforce a deep division between them. It is also true that synthesis of prosodic features is an uncharted and controversial area, which gives this chapter rather a different flavour from the last. It is hard to be as definite about alternative strategies and methods as you can for segment concatenation. In order to make the treatment as concrete and down-to-earth as possible, I will describe in some detail two example projects in prosodic synthesis. The first treats the problem of transferring pitch from one utterance to another, while the second considers how artificial timing and pitch can be assigned to synthetic speech. These examples illustrate quite different problems, and are reasonably representative of current research activity. (Other systems are described by Mattingly, 1966; Rabiner .ul et al, 1969.) Before .[ Mattingly 1966 .] .[ Rabiner Levitt Rosenberg 1969 .] looking at the two examples, we will discuss a feature which is certainly prosodic but does not appear in the list given earlier \(em stress. .sh "8.1 Stress" .pp Stress is an everyday notion, and when listening to natural speech people can usually agree on which syllables are stressed. But it is difficult to characterize in acoustic terms. From the speaker's point of view, a stressed syllable is produced by pushing more air out of the lungs. For a listener, the points of stress are "obvious". You may think that stressed syllables are louder than the others: however, instrumental studies show that this is not necessarily (nor even usually) so (eg Lehiste and Peterson, 1959). .[ Lehiste Peterson 1959 .] Stressed syllables frequently have a longer vowel than unstressed ones, but this is by no means universally true \(em if you say "little" or "bigger" you will find that the vowel in the first, stressed, syllable is short and shows little sign of lengthening as you increase the emphasis. Moreover, experiments using bisyllabic nonsense words have indicated that some people consistently judge the .ul shorter syllable to be stressed in the absence of other clues (Morton and Jassem, 1965). .[ Morton Jassem 1965 .] Pitch often helps to indicate stress. It is not that stressed syllables are always higher- or lower-pitched than neighbouring ones, or even that they are uttered with a rising or falling pitch. It is the .ul rate of change of pitch that tends to be greater for stressed syllables: a sharp rise or fall, or a reversal of direction, helps to give emphasis. .pp Stress is acoustically manifested in timing and pitch, and to a much lesser extent in loudness. However it is a rather subtle feature and does .ul not correspond simply to duration increases or pitch rises. It seems that listeners unconsciously put together all the clues that are present in an utterance in order to deduce which syllables are stressed. It may be that speech is perceived by a listener with reference to how he would have produced it himself, and that this is how he detects which syllables were given greater vocal effort. 
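.pp
To give a feel for how such clues might be weighed automatically, the sketch below scores syllables by the rate of change of pitch across them. It is illustrative only: the figures are invented, and a serious system would fold in duration, and perhaps amplitude, as well.
.sp
.nf
# Sketch: rank syllables by rate of change of pitch, one of several possible
# clues to salience. Frame rate and example figures are invented.

def pitch_change_rate(f0_track, start, end, frame_ms=10):
    """Mean absolute frame-to-frame F0 change (Hz/sec) over one syllable."""
    pairs = zip(f0_track[start:end - 1], f0_track[start + 1:end])
    deltas = [abs(b - a) for a, b in pairs]
    if not deltas:
        return 0.0
    return sum(deltas) / len(deltas) * (1000.0 / frame_ms)

def most_salient(f0_track, syllable_bounds):
    """Index of the syllable with the sharpest pitch movement."""
    scores = [pitch_change_rate(f0_track, s, e) for s, e in syllable_bounds]
    return max(range(len(scores)), key=scores.__getitem__)

f0 = [120, 121, 122, 150, 170, 160, 140, 130, 128, 127]
print(most_salient(f0, [(0, 3), (3, 7), (7, 10)]))   # the middle syllable
.fi
.sp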
.pp The situation is confused by the fact that certain syllables in words are often said in ordinary language to be "stressed" on account of their position in the word. For example, the words "diplomat", "diplomacy", and "diplomatic" have stress on the first, second, and third syllables respectively. But here we are talking about the word itself rather than any particular utterance of it. The "stress" is really .ul latent in the indicated syllables and only made manifest upon uttering them, and then to a greater or lesser degree depending on exactly how they are uttered. .pp Some linguists draw a careful distinction between salient syllables, accented syllables, and stressed syllables, although the words are sometimes used differently by different authorities. I will not adopt a precise terminology here, but it is as well to be aware of the subtle distinctions involved. The term "salience" is applied to actual utterances, and salient syllables are those that are perceived as being more prominent than their neighbours. "Accent" is the potential for salience, as marked, for example, in a dictionary or lexicon. Thus the discussion of the "diplo-" words above is about accent. Stress is an articulatory phenomenon associated with increased muscular activity. Usually, syllables which are perceived as salient were produced with stress, but in shouting, for example, all syllables can be stressed \(em even non-salient ones. Furthermore, accented syllables may not be salient. For instance, the first syllable of the word "very" is accented, that is, potentially salient, but in a sentence as uttered it may or may not be salient. One can say .LB "\c .ul he's very good" .LE with salience on "he" and possibly "good", or .LB "he's .ul very good" .LE with salience on the first syllable of "very", and possibly "good". .pp Non-standard stress patterns are frequently used to bring out contrasts. Words like "a" and "the" are normally unstressed, but can be stressed in contexts where ambiguity has arisen. Thus factors which operate at a much higher level than the phonetic structure of the utterance must be taken into account when deciding where stress should be assigned. These include syntactic and semantic considerations, as well as the attitude of the speaker and the likely attitude of the listener to the material being spoken. For example, I might say .LB "Anna .ul and Nikki should go", .LE with emphasis on the "and" purely because I was aware that my listener might quibble about the expense of sending them both. Clearly some notation is needed to communicate to the synthesis process how the utterance is supposed to be rendered. .sh "8.2 Transferring pitch from one utterance to another" .pp For speech stored in source-filter form and concatenated on a slot-filling basis, it would be useful to have stored typical pitch contours which can be applied to the synthetic utterances. From a practical point of view it is important to be able to generate natural-sounding pitch for high-quality artificial speech. Although several algorithms for creating completely synthetic contours have been proposed \(em and we will examine one later in this chapter \(em they are unsuitable for high-quality speech. They are generally designed for use with synthesis-by-rule from phonetics, and the rather poor quality of articulation does not encourage the development of excellent pitch assignment procedures. 
With speech synthesized by rule there is generally an emphasis on keeping the data storage requirements to a minimum, and so it is not appropriate to store complete contours. Moreover, if speech is entered in textual form as phoneme strings, it is natural to attach pitch information as markers in the text rather than by entering a complete and detailed contour. .pp The picture is rather different for concatenated segments of natural speech. In the airline reservation system, with utterances formed from templates like .LB Flight number \(em leaves \(em at \(em , arrives in \(em at \(em , .LE it is attractive to store the pitch contour of one complete instance of the utterance and apply it to all synthetic versions. .pp There is an enormous literature on the anatomy of intonation, and much of it rests upon the notion of a pitch contour as a descriptive aid to analysis. Underlying this is the assumption, usually unstated, that a contour can be discussed independently of the particular stream of words that manifests it; that a single contour can somehow be bound to any sentence (or phrase, or clause) to produce an acceptable utterance. But the contour, and its binding, are generally described only at the grossest level, the details being left unspecified. .pp There are phonetic influences on pitch \(em the characteristic lowering during certain consonants was mentioned above \(em and these are not normally considered as part of intonation. Such effects will certainly spoil attempts to store contours extracted from living speech and apply them to different utterances, but the impairment may not be too great, for pitch is only one of many segmental clues to consonant identification. .pp In the system mentioned earlier which generated 7-digit telephone numbers by concatenating formant-coded words, a single natural pitch contour was applied to all utterances. It was taken to match as well as possible the general shape of the contours measured in naturally-spoken telephone numbers. However, this is a very restricted environment, for telephone numbers exhibit almost no variety in the configuration of stressed and unstressed syllables \(em the only digit which is not a monosyllable is "seven". Significant problems arise when more general utterances are considered. .pp Suppose the pitch contour of one utterance (the "source") is to be transferred to another (the "target"). Assume that the utterances are encoded in source-filter form, either as parameter tracks for a formant synthesizer or as linear predictive coefficients. Then there are no technical obstacles to combining pitch and segmentals. The source must be available as a complete utterance, while the target may be formed by concatenating smaller units such as words. .pp For definiteness, we will consider utterances of the form .LB The price is \(em dollars and \(em cents, .LE where the slots are filled by numbers less than 100; and of the form .LB The price is \(em cents. .LE The domain of prices encompasses a wide range of syllable configurations. There are between one and five syllables in each variable part, if the numbers are restricted to be less than 100. The sentences have a constant pragmatic, semantic, and syntactic structure. As in the vast majority of real-life situations, minimal phonetic distinctions between utterances do not occur. .pp Pitch transfer is complicated by the fact that values of the source pitch are only known during the voiced parts of the utterance. 
Although it would certainly be possible to extrapolate pitch over unvoiced parts, this would introduce some artificiality into the otherwise completely natural contours. Let us assume, therefore, that the pitch contour of the voiced nucleus of each syllable in the source is applied to the corresponding syllable nucleus in the target. .pp The primary factors which might tend to inhibit successful transfer are .LB .NP different numbers of syllables in the utterances; .NP variations in the pattern of stressed and unstressed syllables; .NP different syllable durations; .NP pitch discontinuities; .NP phonetic differences between the utterances. .LE .rh "Syllabification." It is essential to take into account the syllable structures of the utterances, so that pitch is transferred between corresponding syllables rather than over the utterance as a whole. Fortunately, syllable boundaries can be detected automatically with a fair degree of accuracy, especially if the speech is carefully enunciated. It is worth considering briefly how this can be done, even though it takes us off the main topic of synthesis and into speech analysis. .pp A procedure developed by Mermelstein (1975) involves integrating the spectral energy at each point in the utterance. .[ Mermelstein 1975 Automatic segmentation of speech into syllabic units .] First the low (<500\ Hz) and high (>4000\ Hz) ends are filtered out with 12\ dB/octave cutoffs. The resulting energy signal is smoothed by a 40\ Hz lowpass filter, giving a so-called "loudness" function. All this can be accomplished with simple recursive digital filters. .pp Then, the loudness function is compared with its convex hull. The convex hull is the shape a piece of elastic would assume if stretched over the top of the loudness function and anchored down at both ends, as illustrated in Figure 8.1. .FC "Figure 8.1" The point of maximum difference between the hull and loudness function is taken to be a tentative syllable boundary. The hull is recomputed, but anchored to the actual loudness function at the tentative boundary, and the points of maximum hull-loudness difference in each of the two halves are selected as further tentative boundaries. The procedure continues recursively until the maximum hull-loudness difference, with the hull anchored at each tentative boundary, falls below a certain minimum (say 4\ dB). .pp At this stage, the number of tentative boundaries will greatly exceed the actual number of syllables (by a factor of around 5). Many of the extraneous boundaries are eliminated by the following constraints: .LB .NP if two boundaries lie within a certain time of each other (say 120\ msec), one of them is discarded; .NP if the maximum loudness within a tentative syllable falls too far short of the overall maximum for the utterance (more than 20\ dB), one boundary is discarded. .LE The question of which boundary to discard can be decided by examining the voicing continuity of the utterance. If possible, voicing across a syllable boundary should be avoided. Otherwise, the boundary with the smallest hull-loudness difference should be rejected. 
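.pp
The recursive part of the procedure is easily expressed in program form. The sketch below takes a smoothed loudness function (in dB, one value per frame) as given, and returns tentative boundaries using the 4\ dB threshold mentioned above; the filtering and the pruning constraints are omitted. It is an illustration of the idea rather than a reconstruction of Mermelstein's implementation.
.sp
.nf
# Sketch of the convex-hull boundary finder. Place a tentative syllable
# boundary at the point of greatest hull-loudness difference, then recurse
# on each half until the difference falls below a threshold.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    """Upper convex hull of (x, y) points given in order of increasing x:
    the shape of elastic stretched over the curve and pinned at both ends."""
    hull = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def hull_value(hull, x):
    """Height of the hull at x, by linear interpolation between vertices."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return hull[-1][1]

def boundaries(loudness, lo, hi, min_dip_db=4.0):
    """Tentative syllable boundaries between the anchor frames lo and hi."""
    hull = upper_hull([(i, loudness[i]) for i in range(lo, hi + 1)])
    dip, at = max((hull_value(hull, i) - loudness[i], i)
                  for i in range(lo, hi + 1))
    if dip < min_dip_db or hi - lo < 2:
        return []
    return (boundaries(loudness, lo, at, min_dip_db) + [at] +
            boundaries(loudness, at, hi, min_dip_db))

loud = [0, 30, 38, 35, 20, 33, 36, 10, 28, 31, 0]      # invented loudness, dB
print(boundaries(loud, 0, len(loud) - 1))              # dips at frames 4 and 7
.fi
.sp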
.RF .nr x0 \w'boundaries moved slightly to correspond better with voicing:' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 3.4i +0.5i \l'\n(x0u\(ul' .sp total syllable count: 332 boundaries missed by algorithm: \0\09 (3%) extra boundaries inserted by algorithm: \029 (9%) boundaries moved slightly to correspond better with voicing: \0\03 (1%) .sp total errors: \041 (12%) \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 8.1 Success of the syllable segmentation procedure" .pp Table 8.1 illustrates the success of this syllabification procedure, in a particular example. Segmentation is performed with less than 10% of extraneous boundaries being inserted, and much less than 10% of actual boundaries being missed. These figures are rather sensitive to the values of the three thresholds. The values were chosen to err on the side of over-zealous syllabification, because all the boundaries need to be checked by ear and eye and it is easier to delete a boundary by hand than to insert one at an appropriate place. It may well be that with careful optimization of thresholds, better figures could be achieved. .rh "Stressed and unstressed syllables." If the source and target utterances have the same number of syllables, and the same pattern of stressed and unstressed syllables, pitch can simply be transferred from a syllable in the source to the corresponding one in the target. But if the pattern differs \(em even though the number of syllables may be the same, as in "eleven" and "seventeen" \(em then a one-to-one mapping will conflict with the stress points, and certainly sound unnatural. Hence an attempt should be made to ensure that the pitch is mapped in a plausible way. .pp The syllables of each utterance can be classified as "stressed" and "unstressed". This distinction could be made automatically by inspection of the pitch contour, within the domain of utterances used, and possibly even in general (Lea .ul et al, 1975). .[ Lea Medress Skinner 1975 .] However, in many cases it is expedient to perform the job by hand. In our example, the sentences have fixed "carrier" parts and variable "number" parts. The stressed carrier syllables, namely .LB "... price ... dol\- ... cents", .LE can be marked as such, by hand, to facilitate proper alignment between the source and target. This marking would be difficult to do automatically because it would be hard to distinguish the carrier from the numbers. .pp Even after classifying the syllables as "carrier stressed", "stressed", and "unstressed", alignment still presents problems, because the configuration of syllables in the variable parts of the utterances may differ. Syllables in the source which have no correspondence in the target can be ignored. The pitch track of the source syllable can be replicated for each additional syllable in corresponding position in the target. Of course, a stressed syllable should be selected for copying if the unmatched target syllable is stressed, and similarly for unstressed ones. It is rather dangerous to copy exactly a part of a pitch contour, for the ear is very sensitive to the juxtaposition of identically intoned segments of speech \(em especially when the segment is stressed. To avoid this, whenever a stressed syllable is replicated the pitch values should be decreased by, say, 20%, on the second copy. 
It sometimes happens that a single stressed syllable in the source needs to cover a stressed-unstressed pair in the target: in this case the first part of the source pitch track can be used for the stressed syllable, and the remainder for the unstressed one. .pp The example of Figure 8.2 will help to make these rules clear. .FC "Figure 8.2" Note that the marking alone is done by hand. The detailed mapping decisions can be left to the computer. The rules were derived intuitively, and do not have any sound theoretical basis. They are intended to give reasonable results in the majority of cases. .pp Figure 8.3 shows the result of transferring the pitch from "the price is ten cents" to "the price is seventy-seven cents". .FC "Figure 8.3" The syllable boundaries which are marked were determined automatically. The use of the last 30% of the "ten" contour to cover the first "-en" syllable, and its replication to serve the "-ty" syllable, can be seen. However, the 70%\(em30% proportion is applied to the source contour, and the linear distortion (described next) upsets the proportion in the target utterance. The contour of the second "seven" can be seen to be a replication of that of the first one, lowered by 20%. Notice that the pitch extraction procedure has introduced an artifact into the final part of one of the "cents" contours by doubling the pitch. .rh "Stretching and squashing." The pitch contour over a source syllable nucleus must be stretched or squashed to match the duration of the target nucleus. It is difficult to see how anything other than linear stretching and squashing could be done without considerably increasing the complexity of the procedure. The gross non-linearities will have been accounted for by the syllable alignment process, and so simple linear time-distortion should not cause too much degradation. .rh "Pitch discontinuities." Sudden jumps in pitch during voiced speech sound peculiar, although they can in fact be produced naturally (by yodelling). People frequently burst into laughter on hearing them in synthetic speech. It is particularly important to avoid this diverting effect in voice response applications, for the listener's attention is instantly directed away from what is said to the voice that speaks. .pp Discontinuities can arise in the pitch-transfer procedure either by a voiced-unvoiced-voiced transition between syllables mapping on to a voiced-voiced transition in the target, or by voicing continuity being broken when the syllable alignment procedure drops or replicates a syllable. There are several ways in which at least some of the possibilities can be avoided. For example, one could hold unstressed syllables at a constant pitch whose value coincides with either the end of the previous syllable's contour or the beginning of the next syllable's contour, depending on which transition is voiced. Alternatively, the policy of reserving the trailing part of a stressed syllable in the source to cover an unmatched following unstressed syllable in the target could be generalized to allow use of the leading 30% of the next stressed syllable's contour instead, if that maintained voicing continuity. A third solution is simply to merge the pitch contours at a discontinuity by mixing the average pitch value at the break with the pitch contour on either side of it in a proportion which increases linearly from the edges of the domain of influence to the discontinuity. 
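.pp
The third solution is simple enough to show in a few lines of Python. The fragment below is merely a sketch of the idea, not the routine actually used: it takes the voiced stretch on either side of the break as an array of pitch values, and the width of the domain of influence (here a fixed number of frames, chosen arbitrarily) is an assumption of the sketch.
.nf
import numpy as np

def merge_at_break(left, right, span=5):
    """Mix the average of the two pitch values at the break into the contour
    on either side, with a weight that rises linearly from almost nothing at
    the edge of the domain of influence to one at the discontinuity itself."""
    left, right = np.array(left, dtype=float), np.array(right, dtype=float)
    avg = 0.5 * (left[-1] + right[0])
    n = min(span, len(left), len(right))
    w = np.linspace(0.0, 1.0, n + 1)[1:]               # 1/n, 2/n, ..., 1
    left[-n:] = (1 - w) * left[-n:] + w * avg           # weight 1 at the break
    right[:n] = (1 - w[::-1]) * right[:n] + w[::-1] * avg
    return left, right
.fi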
Figure 8.4 shows the effect of this merging, when the pitch contour of "the price is seven cents" is transferred to "the price is eleven cents". .FC "Figure 8.4" Of course, the interpolated part will not necessarily be linear. .rh "Results of an experiment on pitch transfer." Some experiments have been conducted to evaluate the performance of this pitch transfer method on the kind of utterances discussed above (Witten, 1979). .[ Witten 1979 On transferring pitch from one utterance to another .] First, the source and target sentences were chosen to be lexically identical, that is, the same words were spoken. For this experiment alone, expert judges were employed. Each sentence was recorded twice (by the same person), and pitch was transferred from copy A to copy B and vice versa. Also, the originals were resynthesized from their linear predictive coefficients with their own pitch contours. Although all four often sounded extremely similar, sometimes the pitch contours of originals A and B were quite different, and in these cases it was immediately obvious to the ear that two of the four utterances shared the same intonation, which was different to that shared by the other two. .pp Experienced researchers in speech analysis-synthesis served as judges. In order to make the test as stringent as possible it was explained to them exactly what had been done, except that the order of the utterances in each quadruple was kept secret. They were asked to identify which two of the four sentences did not have their original contours, and were allowed to listen to each quadruple as often as they liked. On occasion they were prepared to identify only one, or even none, of the sentences as artificial. .pp The result was that an utterance with pitch transferred from another, lexically identical, one is indistinguishable from a resynthesized version of the original, even to a skilled ear. (To be more precise, this hypothesis could not be rejected even at the 1% level of statistical significance.) This gave confidence in the transfer procedure. However, one particular judge was quite successful at identifying the bogus contours, and he attributed his success to the fact that on occasion the segmental durations did not accord with the pitch contour. This casts a shadow of suspicion on the linear stretching and squashing mechanism. .pp The second experiment examined pitch transfers between utterances having only one variable part each ("the price is ... cents") to test the transfer method under relatively controlled conditions. Ten sentences of the form .LB "The price is \(em cents" .LE were selected to cover a wide range of syllable structures. Each one was regenerated with pitch transferred from each of the other nine, and these nine versions were paired with the original resynthesized with its natural pitch. The $10 times 9=90$ resulting pairs were recorded on tape in random order. .pp Five males and five females, with widely differing occupations (secretaries, teachers, academics, and students), served as judges. Written instructions explained that the tape contained pairs of sentences which were lexically identical but had a slight difference in "tone of voice", and that the subjects were to judge which of each pair sounded "most natural and intelligible". The response form gave the price associated with each pair \(em a preliminary experiment had shown that there was never any difficulty in identifying this \(em and a column for decision. 
With each decision, the subjects recorded their confidence in the decision. Subjects could rest at any time during the test, which lasted for about 30 minutes, but they were not permitted to hear any pair a second time. .pp Defining a "success" to be a choice of the utterance with natural pitch as the best of a pair, the overall success rate was about 60%. If choices were random, one would of course expect only a 50% success rate, and the figure obtained was significantly different from this. Almost half the choices were correct and made with high confidence; high-confidence but incorrect choices accounted for a quarter of the judgements. .pp To investigate structural effects in the pitch transfer process, low confidence decisions were ignored to eliminate noise, and the others lumped together and tabulated by source and target utterance. The number of stressed and unstressed syllables does not appear to play an important part in determining whether a particular utterance is an easy target. For example, it proved to be particularly difficult to tell .EQ delim @@ .EN natural from transferred contours with utterances $0.37 and $0.77. .EQ delim $$ .EN In fact, the results showed no better than random discrimination for them, even though the decisions in which listeners expressed little confidence had been discarded. Hence it seems that the syllable alignment procedure and the policy of replication were successful. .pp .EQ delim @@ .EN The worst target scores were for utterances $0.11 and $0.79. .EQ delim $$ .EN Both of these contained large unbroken voiced periods in the "variable" part \(em almost twice as long as the next longest voiced period. The first has an unstressed syllable followed by a stressed one with no break in voicing, involving, in a natural contour, a fast but continuous climb in pitch over the juncture, and it is not surprising that it proved to be the most difficult target. A more sophisticated "smoothing" algorithm than the one used may be worth investigating. .pp In a third experiment, sentences with two variable parts were used to check that the results of the second experiment extended to more complex utterances. The overall success rate was 75%, significantly different from chance. However, a breakdown of the results by source and target utterance showed that there was one contour (for the utterance "the price is 19 dollars and 8 cents") which exhibited very successful transfer, subjects identifying the transferred-pitch utterances at only a chance level. .pp Finally, transfers of pitch from utterances with two variable parts to those with one variable part were tested. Pitch contours were transferred to sentences with the same "cents" figure but no "dollars" part; for example, "the price is five dollars and thirteen cents" to "the price is thirteen cents". The contour was simply copied between the corresponding syllables, so that no adjustment needed to be made for different syllable structures. The overall score was 60 successes in 100 judgements \(em the same percentage as in the second experiment. 
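.pp
The claims of statistical significance made above amount to asking how likely such a score would be if the listeners were merely guessing. For the last experiment the counts are stated explicitly (60 successes in 100 judgements), so the exact binomial tail probability can be computed directly; the few lines of Python below are purely illustrative and are not the analysis used in the original study.
.nf
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """Probability of at least this many successes under pure guessing."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

print(binomial_tail(60, 100))    # roughly 0.028
.fi
A tail probability of about 0.028 means that a score this high would arise from pure guessing fewer than three times in a hundred.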
.pp To summarize the results of these four experiments, .LB .NP even accomplished linguists cannot distinguish an utterance from one with pitch transferred from a different recording of it; .NP when the utterance contained only one variable part embedded in a carrier sentence, lay listeners identified the original correctly in 60% of cases, over a wide variety of syllable structures: this figure differs significantly from the chance value of 50%; .NP lay listeners identified the original confidently and correctly in 50% of cases; confidently but incorrectly in 25% of cases; .NP the greatest hindrance to successful transfer was the presence of a long uninterrupted period of voicing in the target utterance; .NP the performance of the method deteriorates as the number of variable parts in the utterances increases; .NP some utterances seemed to serve better than others as the pitch source for transfer, although this was not correlated with complexity of syllable structure; .NP even when the utterance contained two variable parts, there was one source utterance whose pitch contour was transferred to all the others so successfully that listeners could not identify the original. .LE .pp The fact that only 60% of originals in the second experiment were spotted by lay listeners in a stringent paired-comparison test \(em many of them being identified without confidence \(em does encourage the use of the procedure for generating stereotyped, but different, utterances of high quality in voice-response systems. The experiments indicate that although different syllable patterns can be handled satisfactorily by this procedure, long voiced periods should be avoided if possible when designing the message set, and that if individual utterances must contain multiple variable parts the source utterance should be chosen with the aid of listening tests. .sh "8.3 Assigning timing and pitch to synthetic speech" .pp The pitch transfer method can give good results within a fairly narrow domain of application. But like any speech output technique which treats complete utterances as a single unit, with provision for a small number of slot-fillers to accommodate data-dependent messages, it becomes unmanageable in more general situations with a large variety of utterances. As with segmental synthesis it becomes necessary to consider methods which use a textual rather than an acoustically-based representation of the prosodic features. .pp This raises a problem with prosodics that was not there for segmentals: how .ul can prosodic features be written in text form? The standard phonetic transcription method does not give much help with notation for prosodics. It does provide a diacritical mark to indicate stress, but this is by no means enough information for synthesis. Furthermore, text-to-speech procedures (described in the next chapter) promise to allow segmentals to be specified by an ordinary orthographic representation of the utterance; but we have seen that considerable intelligence is required to derive prosodic features from text. (More than mere intelligence may be needed: this is underlined by a paper (Bolinger, 1972) delightfully entitled "Accent is predictable \(em if you're a mind reader"!) .[ Bolinger 1972 Accent is predictable \(em if you're a mind reader .] .pp If synthetic speech is to be used as a computer output medium rather than as an experimental tool for linguistic research, it is important that the method of specifying utterances is natural and easy to learn.
Prosodic features must be communicated to the computer in a manner considerably simpler than individual duration and pitch specifications for each phoneme, as was required in early synthesis-by-rule systems. Fortunately, a notation has been developed for conveying some of the prosodic features of utterances, as a by-product of the linguistically important task of classifying the intonation contours used in conversational English (Halliday, 1967). .[ Halliday 1967 .] This system has even been used to help foreigners speak English (Halliday, 1970) \(em which emphasizes the fact that it was designed for use by laymen, not just linguists! .[ Halliday 1970 Course in spoken English: Intonation .] .pp Here are examples of the way utterances can be conveyed to the ISP speech synthesis system which was described in the previous chapter. The notation is based upon Halliday's. .LB .NI 3 .ul ^ aw\ t\ uh/m\ aa\ t\ i\ k /s\ i\ n\ th\ uh\ s\ i\ s uh\ v /*s\ p\ ee\ t\ sh, .NI 1 .ul ^ f\ r\ uh\ m uh f\ uh/*n\ e\ t\ i\ k /r\ e\ p\ r\ uh\ z\ e\ n/t\ e\ i\ sh\ uh\ n. .LE (Automatic synthesis of speech, from a phonetic representation.) Three levels of stress are distinguished: tonic or "sentence" stress, marked by "*" before the syllable; foot stress (marked by "/"); and unstressed syllables. The notion of a "foot" controls the rhythm of the speech in a way that will be described shortly. A fourth level of stress is indicated on a segmental basis when a syllable contains a reduced vowel. .pp Utterances are divided by punctuation into .ul tone groups, which are the basic prosodic unit \(em there are two in the example. The shape of the pitch contour is governed by a numeral at the start of each tone group. Crude control over pauses is achieved by punctuation marks: full stop, for example, signals a pause while comma does not. (Longer pauses can be obtained by several full stops as in "...".) The "^" character stands for a so-called "silent stress" or breath point. Word boundaries are marked by two spaces between phonemes. As mentioned in the previous chapter, syllable boundaries and explicit pitch and duration specifiers can also be included in the input. If they are not, the ISP system will attempt to compute them. .rh "Rhythm." Our understanding of speech rhythm knows many laws but little order. In the mid 1970's there was a spate of publications reporting new data on segmental duration in various contexts, and there is a growing awareness that segmental duration is influenced by a great many factors, ranging from the structure of a discourse, through semantic and syntactic attributes of the utterances, their phonemic and phonetic make-up, right down to physiological constraints (these multifarious influences are ably documented and reviewed by Klatt, 1976). .[ Klatt 1976 Linguistic uses of segment duration in English .] What seems to be lacking in this work is a conceptual framework on to which new information about segmental duration can be nailed. .pp One starting-point for imitating the rhythm of English speech is the hypothesis of regularly recurring stresses. These stresses are primarily .ul rhythmic ones, and should be distinguished from the tonic stress mentioned above which is primarily an .ul intonational one. Rhythmic stresses are marked in the transcription by a "/". The stretch between one and the next is called a "foot", and the hypothesis above is often referred to as that of isochronous feet ("isochronous" means "of equal time"). There is considerable controversy about this hypothesis. 
It is most popular among British linguists and, it must be admitted, amongst those who work by introspection and intuition and do not actually .ul measure things. Although the question of isochrony of feet has long been debated, there seems to be general agreement \(em even amongst American linguists \(em that there is at least a tendency towards equal spacing of foot boundaries. However, little is known about the strength of this tendency and the extent of deviations from it (see Hill .ul et al, 1979, for an attempt to quantify it) \(em and there is even evidence to suggest that it may in part be a .ul perceptual phenomenon (Lehiste, 1973). .[ Hill Jassem Witten 1979 .] .[ Lehiste 1973 .] On this basic point, as on many others, the designer of a prosodic synthesis strategy must needs make assumptions which cannot be properly justified. .pp From a pragmatic point of view there are two advantages to basing a synthesis strategy on this hypothesis. Firstly, it provides a way to represent the many influences of higher-level processes (like syntax and semantics) on rhythm using a simple notation which fits naturally into the phonetic utterance representation, and which people find quite easy to understand and generate. Secondly, it tends to produce a heavily accentuated, but not unnatural, speech rhythm which can easily be moderated into a more acceptable rhythm by departing from isochrony in a controlled manner. .pp The ISP procedure does not make feet exactly isochronous. It starts with a standard foot time and attempts to fit the syllables of the foot into this time. If doing so would result in certain syllables having less than a preset minimum duration, the isochrony constraint is relaxed and the foot is expanded. There is no preset .ul maximum syllable length. However, when the durations of individual phoneme postures are adjusted to realize the calculated syllable durations, limits are imposed on the amount by which individual phonemes can be expanded or contracted. Thus a hierarchy of limits exists. .pp The rate of talking is determined by the standard foot time. If this time is short, many feet will be forced to have durations longer than the standard, and the speech will be "less isochronous". This seems to accord with common human experience. If the standard time is longer, however, the minimum syllable limit will always be exceeded and the speech will be completely isochronous. If it is too long, the above-mentioned limits to phoneme expansion will come into play and again partially destroy the isochrony. .pp It has often been observed that the final foot of an utterance tends to be longer than others; as does the tonic foot \(em that which bears the major stress. This is easy to accommodate, simply by making the target duration longer for these feet. .rh "From feet to syllables." A foot is a succession of syllables, one or more. And it is obvious that since there are more syllables in some feet than in others, some syllables must occupy less time than others in order to preserve the tendency towards isochrony of feet. .pp However, the duration of a foot is not divided evenly between its constituent syllables. The syllables have a definite rhythm of their own, which seems to be governed by .LB .NP the nature of the salient (that is, the first) syllable of the foot .NP the presence of word boundaries within the foot.
.LE A salient syllable tends to be long either if it contains one of a class of so-called "long" vowels, or if there is a cluster of two or more consonants following the vowel. The pattern of syllables and word boundaries governs the rhythm of the foot, and Table 8.2 shows the possibilities for one-, two-, and three-syllable feet. This theory of speech rhythm is due to Abercrombie (1964). .[ Abercrombie 1964 Syllable quantity and enclitics in English .] .RF .nr x2 \w'three-syllable feet 'u .nr x3 \w'sal-short 'u .nr x4 \w'weak [#] 'u .nr x5 \w'weak 'u .nr x6 \w'/\fIit s incon\fR/ceivable 'u .nr x1 (\w'syllable rhythm'/2) .nr x7 \n(x2+\n(x3+\n(x4+\n(x5+\n(x6+\n(x1+\n(x1 .nr x7 (\n(.l-\n(x7)/2 .in \n(x7u .ta \n(x2u +\n(x3u +\n(x4u +\n(x5u +\n(x6u .ul syllable pattern example \0\0\h'-\n(x1u'syllable rhythm .sp one-syllable feet salient /\fIgood\fR /show 1 ^ weak /\fI^ good\fR/bye 2:1 .sp two-syllable feet sal-long weak /\fIcentre\fR /forward 1:1 sal-short weak /\fIatom\fR /bomb 1:2 salient # weak /\fItea for\fR /two 2:1 .sp three-syllable feet salient # weak [#] weak /\fIone for the\fR /road 2:1:1 /\fIit's incon\fR/ceivable sal-long weak # weak /\fIafter the\fR /war 2:3:1 sal-short weak # weak /\fImiddle to\fR /top 1:3:2 sal-long weak weak /\fInobody\fR /knows 3:1:2 sal-short weak weak /\fIanything\fR /more 1:1:1 .sp # denotes a word boundary; [#] is an optional word boundary .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 8.2 Syllable patterns and rhythms" .pp A foot may have the rhythmical characteristics of a two-syllable foot while having only one syllable, if the first place in it is filled by a silent stress (marked by "^"). This is shown in the second one-syllable example of Table 8.2. A similar effect may occur with two- and three-syllable feet, although examples are not given in the table. Feet of four and five syllables \(em with or without a silent stress \(em are considerably rarer. .pp Syllabification \(em splitting an utterance into syllables \(em is a job which had to be done for the pitch-transfer procedure described earlier, and the nature of syllable rhythms calls for it here too. Even though the utterance is now specified phonetically instead of acoustically, the same basic principle applies. Syllables normally coincide with peaks of sonority, where "sonority" measures the inherent loudness of a sound relative to other sounds of the same duration and pitch. However, difficult cases exist where it seems to be unclear how many syllables there are in a word. (Ladefoged, 1975, discusses this problem with examples such as "real", "realistic", and "reality".) Furthermore, .[ Ladefoged 1975 .] care must be taken to avoid counting two syllables in a word like "sky" because of its two peaks of sonority \(em for the stop .ul k has lower sonority than the fricative .ul s. .pp Three levels of notional sonority are enough for syllabification. Dividing phoneme segments into .ul sonorants (glides and nasals), .ul obstruents (stops and fricatives), and vowels; a general syllable has the form .LB .EQ obstruent sup * ~ sonorant sup * ~ vowel sup * ~ sonorant sup * ~ obstruent sup * ~ , .EN .LE where "*" means repetition, that is, occurrence zero or more times. This sidesteps the "sky" problem by giving fricatives the same sonority as stops. It is easy to use the above structure to count the number of syllables in a given utterance by counting the sonority peaks. .pp However, what is required is an indication of syllable .ul boundaries as well as a syllable count.
For slow conversational speech, these can be approximated as follows. Word divisions obviously form syllable boundaries, as should foot markers \(em but it may be wise not to assume that the latter do if the utterance has been prepared by someone with little knowledge of linguistics. Syllable boundaries should be made to coincide with sonority minima. As an .ul ad hoc pragmatic rule, if only one segment has the minimum sonority the boundary is placed before it. If there are two segments, each with the minimum sonority, it is placed between them, while for three or more it is placed after the first two. .pp These rules produce obviously acceptable divisions in many cases (to'day, ash'tray, tax'free), with perhaps unexpected positioning of the boundary in others (ins'pire, de'par'tment). Actually, people do differ in placement of syllable boundaries (Abercrombie, 1967). .[ Abercrombie 1967 .] .rh "From syllables to segments." The theory of isochronous feet (with the caveats noted earlier) and that of syllable rhythms provide a way of producing durations for individual syllables. But where are these durations supposed to be measured? There is a beat point, or tapping point, near the beginning of each syllable. This is the place where a listener will tap if asked to give one tap to each syllable; it has been investigated experimentally by Allen (1972). .[ Allen 1972 Location of rhythmic stress beats in English One .] It is not necessarily at the very beginning of the syllable. For example, in "straight", the tapping point is certainly after the .ul s and the stopped part of the .ul t. .pp Another factor which relates to the division of the syllable duration amongst phonetic segments is the often-observed fact that the length of the vocalic nucleus is a strong clue to the degree of voicing of the terminating cluster (Lehiste, 1970). .[ Lehiste 1970 Suprasegmentals .] If you say in pairs words like "cap", "cab"; "cat", "cad"; "tack", "tag" you will find that the vowel in the first word of each pair is significantly shorter than that in the second. In fact, the major difference between such pairs is the vowel length, not the final consonant. .pp Such effects can be taken into account by considering a syllable to comprise an initial consonant cluster, followed by a vocalic nucleus and a final consonant cluster. Any of these elements can be missing \(em the most unusual case where the nucleus is absent occurs, for example, in so-called syllabic .ul n\c \&'s (as in renderings of "button", "pudding" which might be written "butt'n", "pudd'n"). However, it is convenient to modify the definition of the nucleus so as to rule out the possibility of it being empty. Using the characterization of the syllable given above, the clusters can be defined as .LB .NI initial cluster = obstruent\u*\d sonorant\u*\d .NI nucleus = vowel\u*\d sonorant\u*\d .NI final cluster = obstruent\u*\d. .LE Sonorants are included in the nucleus so that it is always present, even in the case of a syllabic consonant. .pp Then, rules can be used to divide the syllable duration between the initial cluster, nucleus, and final cluster. These must distinguish between situations where the terminating cluster is voiced or unvoiced so that the characteristic differences in vowel lengths can be accommodated. .pp Finally, the cluster durations must be apportioned amongst their constituent phonetic segments. There is little published data on which to base this. Two simple schemes which have been used in ISP are described in Witten (1977) and Witten & Smith (1977).
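.pp
The boundary-placement rule itself is simple enough to state as a program. The short Python sketch below is not the ISP code: it assumes that each segment in the stretch between two sonority peaks has already been classified as obstruent, sonorant or vowel, and it merely applies the ad hoc rule just given.
.nf
SONORITY = {"obstruent": 0, "sonorant": 1, "vowel": 2}

def boundary_in_dip(kinds):
    """kinds: the classes of the segments lying between two sonority peaks.
    Returns the index of the segment before which the boundary is placed."""
    son = [SONORITY[k] for k in kinds]
    low = min(son)
    start = son.index(low)                 # first segment of minimum sonority
    run = 1
    while start + run < len(son) and son[start + run] == low:
        run += 1
    if run == 1:
        return start                       # boundary before the single segment
    if run == 2:
        return start + 1                   # boundary between the two
    return start + 2                       # boundary after the first two

# the dip between the two vowels of "ashtray" is sh, t, r:
print(boundary_in_dip(["obstruent", "obstruent", "sonorant"]))   # 1, i.e. ash'tray
.fi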
.[ Witten 1977 A flexible scheme for assigning timing and pitch to synthetic speech .] .[ Witten Smith 1977 Synthesizing British English rhythm .] .rh "Pitch." There are two basically different ways of looking at the pitch of an utterance. One is to imagine pitch .ul levels attached to individual syllables. This has been popular amongst American linguists, and some people have even gone so far as to associate pitch levels with levels of stress. The second approach is to consider pitch .ul contours, as we did earlier when examining how to transfer pitch from one utterance to another. This seems to be easier for the person who transcribes the utterances to produce, for the information required is much less detailed than levels attached to each syllable. Some indication needs to be given of how the contour is to be bound to the utterance, and in the notation introduced above the most prominent, or "tonic", syllable is indicated in the transcription. .pp Halliday's (1970) classification identifies five different primary intonation contours, each hinging on the tonic syllable. .[ Halliday 1970 Course in spoken English: Intonation .] These are sketched in Figure 8.5, in the style of Halliday. .FC "Figure 8.5" Several secondary contours, which are variations on the primary ones, are defined as well. However, this classification scheme is intended for consumption by people, who bring to the problem a wealth of prior knowledge of speech and years of experience with it! It captures only the gross features of the infinite variety of pitch contours found in living speech. In a sense, the classification is .ul phonological rather than .ul phonetic, for it attempts to distinguish the features which make a logical difference to the listener instead of the acoustic details of the pitch contours. .pp It is necessary to take these contours and subject them to a sort of phonological-to-phonetic embellishment before applying them in synthetic speech. For example, the stretches with constant pitch which precede the tonic syllable in tone groups 1, 2, and 3 sound most unnatural when synthesized \(em for pitch is hardly ever exactly constant in living speech. Some pretonic pitch variation is necessary, and this can be made to emphasize the salient syllable of each foot. A "lilting" effect which reaches a peak at each foot boundary, and drops rather faster at the beginning of the foot than it rises at the end, sounds more natural. The magnitude of this inflection can be altered slightly to add interest, but a considerable increase in it produces a semantic change by making the utterance sound more emphatic. It is a major problem to pin down exactly the turning points of pitch in the falling-rising and rising-falling contours (4 and 5 in Figure 8.5). And even deciding on precise values for the pitch frequencies involved is not always easy. .pp The aim of the pitch assignment method of ISP is to allow the person (or program) which originates a spoken message to exercise a great deal of control over its intonation, without having to concern himself with foot or syllable structure. The message to be spoken must be broken down into tone groups, which correspond roughly to Halliday's tone groups. Each one comprises a .ul tonic of one or more feet, which is optionally preceded by a .ul pretonic, also with a number of feet. It is advantageous to allow a tone group boundary to occur in the middle of a foot (whereas Halliday's scheme insists that it occurs at a foot boundary). 
The first foot of the tonic, the .ul tonic foot, is marked by an asterisk at the beginning. It is on the first syllable of this foot \(em the "tonic" or "nuclear" syllable \(em that the major stress of the tone group occurs. If there is no asterisk in a tone group, ISP takes the final foot as the tonic (since this is the most common case). .pp The pitch contour on a tone group is specified by an array of ten numbers. Of course, the system cannot generate all conceivable contours for a tone group, but the definitions of the ten specifiable quantities have been chosen to give a useful range of contours. If necessary, more precise control over the pitch of an utterance can be achieved by making the tone groups smaller. .pp The overall pitch movement is controlled by specifying the pitch at three places: the beginning of the tone group, the beginning of the tonic syllable, and the end of the tone group. Provision is made for an abrupt pitch break at the start of the tonic syllable in order to simulate tone groups 2 and 3, and, to a lesser extent, tone groups 4 and 5. The pitch is interpolated linearly over the first part of the tone group (up to the tonic syllable) and over the last part (from there to the end), except that it is possible to specify a non-linearity on the tonic syllable, for emphasis, as shown in Figure 8.6. .FC "Figure 8.6" .pp On this basic shape are superimposed two finer pitch patterns. One of these is an initialization-continuation option which allows the pitch to rise (or fall) independently on the initial and final feet to specified values, without affecting the contour on the rest of the tone group (Figure 8.7). .FC "Figure 8.7" The other is a foot pattern which is superimposed on each pretonic foot, to give the stressed syllables of the pretonic added prominence and avoid the monotony of constant pitch. This is specified by a .ul non-linearity parameter which distorts the contour on the foot at a pre-determined point along it. Figure 8.8 shows the effect. .FC "Figure 8.8" .pp The ten quantities that define a pitch contour are summarized in Table 8.3, and shown diagrammatically in Figure 8.9. .FC "Figure 8.9" .RF .nr x0 \w'H: 'u .nr x1 \n(x0+\w'fraction along foot of the non-linearity position, for the tonic foot'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta \n(x0u +4n A: continuation from previous tone group zero gives no continuation non-zero gives pitch at start of tone group B: notional pitch at start C: pitch range on whole of pretonic D: departure from linearity on each foot of pretonic E: pitch change at start of tonic F: pitch range on tonic G: departure from linearity on tonic H: continuation to next tone group zero gives no continuation non-zero gives pitch at end of tone group I: fraction along foot of the non-linearity position, for pretonic feet J: fraction along foot of the non-linearity position, for the tonic foot .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 8.3 The quantities that define a pitch contour" .pp The intention of this parametric method of specifying contours is that the parameters should be easily derivable from semantic variables like emphasis, novelty of idea, surprise, uncertainty, incompleteness. Here we really are getting into controversial, unresearched areas. Roughly speaking, parameters D and G control emphasis, G by itself controls novelty and surprise, and H and the relative sizes of E and F control uncertainty and incompleteness. 
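.pp
To make the scheme more concrete, here is a much-simplified sketch in Python of how such a ten-number specification might be turned into a frame-by-frame contour. It is emphatically not the interpolation performed by ISP: the continuation options A and H are ignored, each departure from linearity is treated as a simple triangular excursion, and the non-linearity G is spread over the whole of the tonic rather than just the tonic foot.
.nf
import numpy as np

def tone_group_pitch(n_frames, tonic_frame, pretonic_feet,
                     B, C, D, E, F, G, I=0.33, J=0.5):
    """pretonic_feet is a list of (start, end) frame pairs before the tonic.
    Pitch runs linearly from B to B+C over the pretonic, jumps by E at the
    tonic syllable, then runs linearly to B+C+E+F at the end of the tone
    group; D and G add triangular departures from linearity peaking at
    fraction I along each pretonic foot and fraction J along the tonic."""
    t = np.arange(n_frames, dtype=float)
    pitch = np.where(
        t < tonic_frame,
        B + C * t / max(tonic_frame, 1),
        B + C + E + F * (t - tonic_frame) / max(n_frames - tonic_frame, 1))

    def bend(start, end, size, frac):
        peak = start + frac * (end - start)
        idx = np.arange(start, end)
        shape = np.where(idx <= peak,
                         (idx - start) / max(peak - start, 1e-6),
                         (end - idx) / max(end - peak, 1e-6))
        pitch[start:end] += size * shape

    for start, end in pretonic_feet:        # the "lilt" on each pretonic foot
        bend(start, end, D, I)
    bend(tonic_frame, n_frames, G, J)       # emphasis on the tonic
    return pitch
.fi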
Certain parameters (notably I and J) are defined because although they do not appear to correspond to semantic distinctions, we do not yet know how to generate them automatically. .RF .nr x0 0.6i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+\w'0000' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 0.6i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i Halliday's tone group \0\0A \0\0B \0\0C \0\0D \0\0E \0\0F \0\0G \0\0H \0\0I \0\0J \l'\n(x0u\(ul' .sp 1 \0\0\00 \0175 \0\0\00 \0\-40 \0\0\00 \-100 \0\-40 \0\0\00 0.33 \00.5 2 \0\0\00 \0280 \0\0\00 \0\-40 \-190 \0100 \0\0\00 \0\0\00 0.33 \00.5 3 \0\0\00 \0175 \0\0\00 \0\-40 \0\-70 \0\045 \0\-10 \0\0\00 0.33 \00.5 4 \0\0\00 \0280 \-100 \0\-40 \0\020 \0\045 \0\-45 \0\0\00 0.33 \00.5 5 \0\0\00 \0175 \0\060 \0\-40 \0\-20 \0\-45 \0\045 \0\0\00 0.33 \00.5 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 8.4 Pitch contour table for Halliday's primary tone groups" .pp One basic requirement of the pitch assignment scheme was the ability to generate contours which approximate Halliday's five primary tone groups. Values of the ten specifiable quantities are given in Table 8.4, for each tone group. All pitches are given in\ Hz. A distinctly dipping pitch movement has been given to each pretonic foot (parameter D), to lend prominence to the salient syllables. .sh "8.4 Evaluating prosodic synthesis" .pp It is extraordinarily difficult to evaluate schemes for prosodic synthesis, and this is surely a large part of the reason why prosodics are among the least advanced aspects of artificial speech. Segmental synthesis can be tested by playing people minimal pairs of words which differ in just one feature that is being investigated. For example, one might experiment with "pit", "bit"; "tot", "dot"; "cot", "got" to test the rules which discriminate unvoiced from voiced stops. There are standard word-lists for intelligibility tests which can be used to compare systems, too. No equivalent of such micro-level evaluation exists for prosodics, for they by definition have a holistic effect on utterances. They are most noticeable, and most important, in longish stretches of speech. Even monotonous, arhythmic speech will be intelligible in sufficiently short samples provided the segmentals are good enough; but it is quite impossible to concentrate on such speech in quantity. Some attempts at evaluation appear in Ainsworth (1974) and McHugh (1976), but these are primarily directed at assessing the success of pronunciation rules, which are discussed in the next chapter. .[ Ainsworth 1974 Performance of a speech synthesis system .] .[ McHugh 1976 Listener preference and comprehension tests .] .pp One evaluation technique is to compare synthetic with natural versions of utterances, as was done in the pitch transfer experiment. The method described earlier used a sensitive paired-comparison test, where subjects heard both versions in quick succession and were asked to judge which was "most natural and intelligible". This is quite a stringent test, and one that may not be so useful for inferior, completely synthetic, contours. It is essential to degrade the "natural" utterance so that it is comparable segmentally to the synthetic one: this was done in the experiment described by extracting its pitch and resynthesizing it from linear predictive coefficients. .pp Several other experiments could be undertaken to evaluate artificial prosody. 
For example, one could compare .LB .NP natural and artificial rhythms, using artificial segmental synthesis in both cases; .NP natural and artificial pitch contours, using artificial segmental synthesis in both cases; .NP natural and artificial pitch contours, using segmentals extracted from natural utterances. .LE There are many other topics which have not yet been fully investigated. It would be interesting, for example, to define rules for generating speech at different tempos. Elisions, where phonemes or even whole syllables are suppressed, occur in fast speech; these have been analyzed by linguists but not yet incorporated into synthetic models. It should be possible to simulate emotion by altering parameters such as pitch range and mean pitch level; but this seems exceptionally difficult to evaluate. One situation where it would perhaps be possible to measure emotion is in the reading of sports results \(em in fact a study has already been made of intonation in soccer results (Bonnet, 1980)! .[ Bonnet 1980 .] Even the synthesis of voices with different pitch ranges requires investigation, for, as noted earlier, it is difficult to place precise frequency specifications on phonological contours such as those sketched in Figure 8.5. Clearly the topic of prosodic synthesis is a rich and potentially rewarding area of research. .sh "8.5 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "8.6 Further reading" .pp There are quite a lot of books in the field of linguistics which describe prosodic features. Here is a small but representative sample from both sides of the Atlantic. .LB "nn" .\"Abercrombie-1965-1 .]- .ds [A Abercrombie, D. .ds [D 1965 .ds [T Studies in phonetics and linguistics .ds [I Oxford Univ Press .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Abercrombie is one of the leading English authorities on phonetics, and this is a collection of essays which he has written over the years. Some of them treat prosodics explicitly, and others show the influence of verse structure on Abercrombie's thinking. .in-2n .\"Bolinger-1972-2 .]- .ds [A Bolinger, D.(Editor) .ds [D 1972 .ds [T Intonation .ds [I Penguin .ds [C Middlesex, England .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n A collection of papers that treat a wide variety of different aspects of intonation in living speech. .in-2n .\"Crystal-1969-3 .]- .ds [A Crystal, D. .ds [D 1969 .ds [T Prosodic systems and intonation in English .ds [I Cambridge Univ Press .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This book attempts to develop a theoretical basis for the study of British English intonation. .in-2n .\"Gimson-1966-3 .]- .ds [A Gimson, A.C. .ds [D 1966 .ds [T The linguistic relevance of stress in English .ds [B Phonetics and linguistics .ds [E W.E.Jones and J.Laver .ds [P 94-102 .nr [P 1 .ds [I Longmans .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 3 article-in-book .in+2n Here is a careful discussion of what is meant by "stress", with much more detail than has been possible in this chapter. .in-2n .\"Lehiste-1970-4 .]- .ds [A Lehiste, I. .ds [D 1970 .ds [T Suprasegmentals .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is a comprehensive study of suprasegmental phenomena in natural speech. It is divided into three major sections: quantity (timing), tonal features (pitch), and stress. .in-2n .\"Pike-1945-5 .]- .ds [A Pike, K.L. 
.ds [D 1945 .ds [T The intonation of American English .ds [I Univ of Michigan Press .ds [C Ann Arbor, Michigan .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n A classic, although somewhat dated, study. Notice that it deals specifically with American English. .in-2n .LE "nn" .EQ delim $$ .EN .CH "9 GENERATING SPEECH FROM TEXT" .ds RT "Generating speech from text .ds CX "Principles of computer speech .pp In the preceding two chapters I have described how artificial speech can be produced from a written phonetic representation with additional markers indicating intonation contours, points of major stress, rhythm, and pauses. This representation is substantially the same as that used by linguists when recording natural utterances. What we will discuss now are techniques for generating this information, or at least some of it, from text. .pp Figure 9.1 shows various levels of the speech synthesis process. .FC "Figure 9.1" Starting from the top with plain text, the first box splits it into intonation units (tone groups), decides where the major emphases (tonic stresses) should be placed, and further subdivides the tone group into rhythmic units (feet). For intonation analysis it is necessary to decide on an "interpretation" of the text, which in turn, as was emphasized at the beginning of the previous chapter, depends both on the semantics of what is being said and on the attitude of the speaker to his material. The resulting representation will be at the level of Halliday's notation for utterances, with the words still in English rather than phonetics. Table 9.1 illustrates the utterance representation at the various levels of the Figure. .RF .nr x0 \w'pitch and duration '+\w'at 8 kHz sampling rate a 4-second utterance' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'pitch and duration 'u +\w'pause 'u +\w'00 msec 'u representation example \l'\n(x0u\(ul' .sp plain text Automatic synthesis of speech, from a phonetic representation. .sp text adorned with 3\0^ auto/matic /synthesis of /*speech, prosodic markers 1\0^ from a pho/*netic /represen/tation. .sp phonetic text with 3\0\fI^ aw t uh/m aa t i k /s i n th uh s i s\fR prosodic markers \0\0\fIuh v /*s p ee t sh\fR , 1\0\fI^ f r uh m uh f uh/*n e t i k\fR \0\0\fI/r e p r uh z e n/t e i sh uh n\fR . .sp phonemes with pause 80 msec pitch and duration \fIaw\fR 70 msec 105 Hz \fIt\fR 40 msec 136 Hz \fIuh\fR 50 msec 148 Hz \fIm\fR 70 msec 175 Hz \fIaa\fR 90 msec 140 Hz ... ... ... .sp parameters for 10 parameters, each updated at a frame formant or linear rate of 10 msec predictive (4 second utterance gives 400 frames, synthesizer or 4,000 data values) .sp acoustic wave at 8 kHz sampling rate a 4-second utterance has 32,000 samples \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.1 Utterance representations at various levels in speech synthesis" .pp The next job is to translate the plain text into a broad phonetic transcription. This requires knowledge of letter-to-sound pronunciation rules for the language under consideration. But much more is needed. The structure of each word must be examined for prefixes and suffixes, because they \(em especially the latter \(em have a strong influence on pronunciation. This is called "morphological" analysis. Actually it is also required for rhythmical purposes, because prefixes are frequently unstressed (note that the word "prefix" is itself an exception to this!). 
Thus the appealing segmentation of the overall problem shown in Figure 9.1 is not very accurate, for the individual processes cannot be rigidly separated as it implies. In fact, we saw earlier how this intermixing of levels occurs with prosodic and segmental features. Nevertheless, it is helpful to structure discussion of the problem by separating levels as a first approximation. Further influences on pronunciation come from the semantics and syntax of the utterance \(em and both also play a part in intonation and rhythm analysis. The result of this second process is a phonetic representation, still adorned with prosodic markers. .pp Now we move down from higher-level intonation and rhythm considerations to the details of the pitch contour and segment durations. This process was the subject of the previous chapter. The problems are twofold: to map an appropriate acoustic pitch contour on to the utterance, using tonic stress point and foot boundaries as anchor points; and to assign durations to segments using the foot\(emsyllable\(emcluster\(emsegment hierarchy. If it is accepted that the overall rhythm can be captured adequately by foot markers, this process does not interact with earlier ones. However, many researchers do not, believing instead that rhythm is syntactically determined at a very detailed level. This will, of course, introduce strong interaction between the duration assignment process and the levels above. (Klatt, 1975, puts it into his title \(em "Vowel lengthening is syntactically determined in a connected discourse". .[ Klatt 1975 Vowel lengthening is syntactically determined .] Contrast this with the paper cited earlier (Bolinger, 1972) entitled "Accent is predictable \(em if you're a mind reader". .[ Bolinger 1972 Accent is predictable \(em if you're a mind reader .] No-one would disagree that "accent" is an influential factor in vowel length!) .pp Notice incidentally that the representation of the result of the pitch and duration assignment process in Table 9.1 is inadequate, for each segment is shown as having just one pitch. In practice the pitch varies considerably throughout every segment, and can easily rise and fall on a single one. For example, .LB "he's .ul very good" .LE may have a rise-fall on the vowel of "very". The linked event-list data-structure of ISP is much more suitable than a textual string for utterance representation at this level. .pp The fourth and fifth processes of Figure 9.1 have little interaction with the first two, which are the subject of this chapter. Segmental concatenation, which was treated in Chapter 7, is affected by prosodic features like stress; but a notation which indicates stressed syllables (like Halliday's) is sufficient to capture this influence. Contextual modification of segments, by which I mean the coarticulation effects which govern allophones of phonemes, is included explicitly in the fourth process to emphasize that the upper levels need only provide a broad phonemic transcription rather than a detailed phonetic one. Signal synthesis can be performed by either a formant synthesizer or a linear predictive one (discussed in Chapters 5 and 6). This will affect the details of the segmental concatenation process but should have no impact at all on the upper levels. .pp Figure 9.1 performs a useful function by summarizing where we have been in earlier chapters \(em the lower three boxes \(em and introducing the remaining problems that must be faced by a full text-to-speech system. 
It also serves to illustrate an important point: that a speech output system can demand that its utterances be entered in any of a wide range of representations. Thus one can enter at a low level with a digitized waveform or linear predictive parameters; or higher up with a phonetic representation that includes detailed pitch and duration specification at the phoneme level; or with a phonetic text or plain text adorned with prosodic markers; or at the very top with plain text as it would appear in a book. A heavy price in naturalness and intelligibility is paid by moving up .ul any of these levels \(em and this is just as true at the top of the Figure as at the bottom. .sh "9.1 Deriving prosodic features" .pp If you really need to start with plain text, some very difficult problems present themselves. The text should be understood, first of all, and then decisions need to be made about how it is to be interpreted. For an excellent speaker \(em like an actor \(em these decisions will be artistic, at least in part. They should certainly depend upon the opinion and attitude of the speaker, and his perception of the structure and progress of the dialogue. Very little is known about this upper level of speech synthesis from text. In practice it is almost completely ignored \(em and the speech is at most barely intelligible, and certainly uncomfortable to listen to. Hence anybody contemplating building or using a speech output system which starts from something close to plain text should consider carefully whether some extra semantic information can be coded into the initial utterances to help with prosodic interpretation. Only rarely is this impossible \(em and reading machines for the blind are a prime example of a situation where arbitrary, unannotated, texts must be read. .rh "Intonation analysis." One distinction which a program can usefully try to make is between basically rising and basically falling pitch contours. It is often said that pitch rises on a question and falls on a statement, but if you listen to speech you will find this to be a gross oversimplification. It normally falls on statements, certainly; but it falls as often as it rises on questions. It is more accurate to say that pitch rises on "yes-no" questions and falls on other utterances, although this rule is still only a rough guide. A simple test which operates lexically on the input text is to determine whether a sentence is a question by looking at the punctuation mark at its end, and then to examine the first word. If it is a "wh"-word like "what", "which", "when", "why" (and also "how") a falling contour is likely to fit. If not, the question is probably a yes-no one, and the contour should rise. Such a crude rule will certainly not be very accurate (it fails, for example, when the "wh"-word is embedded in a phrase as in "at what time are you going?"), but at least it provides a starting-point. .pp An air of finality is given to an utterance when it bears a definite fall in pitch, dropping to a rather low value at the end. This should accompany the last intonation unit in an utterance (unless it is a yes-no question). However, a rise-fall contour such as Halliday's tone group 5 (Figure 8.5) can easily be used in utterance-final position by one person in a conversation \(em although it would be unlikely to terminate the dialogue altogether. A new topic is frequently introduced by a fall-rise contour \(em such as Halliday's tone group 4 \(em and this often begins a paragraph. 
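.pp
The lexical question test described above is trivial to implement. The Python fragment below is a sketch of it (the list of question words is illustrative rather than definitive), and it inherits all the weaknesses just noted.
.nf
WH_WORDS = {"what", "which", "when", "where", "why", "who", "how"}

def falling_contour(sentence):
    """Crude lexical rule: statements and "wh"-questions fall;
    anything else ending in a question mark is taken to be yes-no and rises."""
    words = sentence.strip().rstrip(".?!").lower().split()
    if not sentence.strip().endswith("?"):
        return True                        # a statement: falling contour
    return words[0] in WH_WORDS            # "wh"-question falls, otherwise rise

print(falling_contour("At what time are you going?"))   # False: the rule wrongly predicts a rise here, as noted above
.fi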
.pp Determining the type of pitch contour is only one part of intonation assignment. There are really three separate problems: .LB .NP dividing the utterance into tone groups .NP choosing the tonic syllable, or major stress point, of each one .NP assigning a pitch contour to each tone group. .LE Let us continue to use the Halliday notation for intonation, which was introduced in simplified form in the previous chapter. Moreover, assume that the foot boundaries can be placed correctly \(em this problem will be discussed in the next subsection. Then a scheme which considers only the lexical form of the utterance and does not attempt to "understand" it (whatever that means) is as follows: .LB .NP place a tone group boundary at every punctuation mark .NP place the tonic at the first syllable of the last foot in a tone group .NP use contour 4 for the first tone group in a paragraph and contour 1 elsewhere, except for a yes-no question which receives contour 2. .LE .RF .nr x0 \w'From Scarborough to Whitby\0\0\0\0'+\w'4 ^ from /Scarborough to /*Whitby is a' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'From Scarborough to Whitby\0\0\0\0\0\0'u plain text text adorned with prosodic markers \l'\n(x0u\(ul' .sp From Scarborough to Whitby is a 4 ^ from /Scarborough to /*Whitby is a very pleasant journey, with 1\- very /pleasant /*journey with very beautiful countryside. 1\- very /beautiful /*countryside ... In fact the Yorkshire coast is 1+ ^ in /fact the /Yorkshire /coast is \0\0\0\0lovely, \0\0\0\0/*lovely all along, ex- 1+ all a/*long ex cept the parts that are covered _4 cept the /parts that are /covered \0\0\0\0in caravans of course; and \0\0\0\0in /*caravans of /course and if you go in spring, 4 if you /go in /*spring when the gorse is out, 4 ^ when the /*gorse is /out or in summer, 4 ^ or in /*summer when the heather's out, 4 ^ when the /*heather's /out it's really one of the most 13 ^ it's /really /one of the /most \0\0\0\0delightful areas in the \0\0\0\0de/*lightful /*areas in the whole country. 1 whole /*country .sp The moorland is 4 ^ the /*moorland is rather high up, and 1 rather /high /*up and fairly flat \(em a 1 fairly /*flat a sort of plateau. 1 sort of /*plateau ... At least, 1 ^ at /*least it isn't really flat, 13 ^ it /*isn't /really /*flat when you get up on the top; \-3 ^ when you /get up on the /*top it's rolling moorland 1 ^ it's /rolling /*moorland cut across by steep valleys. But 1 cut across by /steep /*valleys but seen from the coast it's 4 seen from the /*coast it's ... "up there on the moors", and you 1 up there on the /*moors and you always think of it as a _4 always /*think of it as a kind of tableland. 1 kind of /*tableland \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.2 Example of intonation and rhythm analysis (from Halliday, 1970)" .[ Halliday 1970 Course in spoken English: Intonation .] .pp These extremely crude and simplistic rules are really the most that one can do without subjecting the utterance to a complicated semantic analysis. In statistical terms, they are actually remarkably effective. Table 9.2 shows part of a spontaneous monologue which was transcribed by Halliday and appears in his teaching text on intonation (Halliday, 1970, p 133). .[ Halliday 1970 Course in Spoken English: Intonation .] Among the prosodic markers are some that were not introduced in Chapter 8. Firstly, each tone group has secondary contours which are identified by "1+", "1\-" (for tone group 1), and so on. 
Secondly, the mark "..." is used to indicate a pause which disrupts the speech rhythm. Notice that its positioning belies the advice of the old elocutionists: .br .ev2 .in 0 .LB .fi A Comma stops the Voice while we may privately tell .NI .ul one, a Semi-colon .ul two; a Colon .ul three:\c and a Period .ul four. .br .nr x0 \w'\fIone,\fR a Semi-colon \fItwo;\fR a Colon \fIthree:\fR and a Period \fIfour.'-\w'(Mason,\fR 1748)' .NI \h'\n(x0u'(Mason, 1748) .nf .LE .br .ev Thirdly, compound tone groups such as "13" appear which contain .ul two tonic syllables. This differs from a simple concatenation of tone groups (with contours 1 and 3 in this case) because the second is in some sense subsidiary to the first. Typically it forms an adjunct clause, while the first clause gives the main information. Halliday provides many examples, such as .LB .NI /Jane goes /shopping in /*town /every /*Friday .NI /^ I /met /*Arthur on the /*train. .LE But he does not comment on the .ul acoustic difference between a compound tone group and a concatenation of simple ones \(em which is, after all, the information needed for synthesis. A final, minor, difference between Halliday's scheme and that outlined earlier is that he compels tone group boundaries to occur at the beginning of a foot. .RF .nr x0 3.3i+1.3i+\w'complete' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 3.3i +1.3i excerpt in complete Table 9.2 passage \l'\n(x0u\(ul' .sp number of tone groups 25 74 .sp number of boundaries correctly 19 (76%) 47 (64%) placed .sp number of boundaries incorrectly \00 \01 (\01%) placed .sp number of tone groups having a 22 (88%) 60 (81%) tonic syllable at the beginning of the final foot .sp number of tone groups whose 17 (68%) 51 (69%) contours are correctly assigned \l'\n(x0u\(ul' .sp number of compound tone groups \02 (\08%) \06 (\08%) .sp number of secondary intonation \07 (28%) 13 (17%) contours \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.3 Success of simple intonation assignment rules" .pp Applying the simple rules given above to the text of Table 9.2 leads to the results in the first column of Table 9.3. Three-quarters of the tone group boundaries are flagged by punctuation marks, with no extraneous ones being included. 88% of tone groups have a tonic syllable at the start of the final foot. However, the compound tone groups each have two tonic syllables, and of course only the second one is predicted by the final-foot rule. Assigning intonation contours on the extremely simple basis of using contour 4 for the first tone group in a paragraph, and contour 1 thereafter, also seems to work quite well. Secondary contours such as "1+" and "1\-" have been mapped into the appropriate primary contour (1, in this case) for the present purpose, and compound tone groups have been assigned the first contour of the pair. The result is that 68% of contours are given correctly. .pp In order to give some idea of the reliability of these figures, the results for the whole passage transcribed by Halliday \(em of which Table 9.2 is an excerpt \(em are shown in the second column of Table 9.3. Although it looks as though the rules may have been slightly lucky with the excerpt, the general trends are the same, with 65% to 80% of features being assigned correctly. It could be argued, though, that the complete text is punctuated fairly liberally by present-day standards, so that the tone-group boundary rule is unusually successful.
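.pp
For concreteness, the three lexical rules can be strung together in a few lines of Python. The sketch below assumes that foot boundaries have already been marked with "/" in the manner of Table 9.2; the function and its output format are my own inventions, not any published program, and the tiny example is not meant to reproduce Halliday's transcription exactly.
.LB
import re

WH_WORDS = {"what", "which", "when", "where", "why", "who", "how"}

def assign_intonation(paragraph):
    # Rule 1: a tone group boundary at every punctuation mark.
    pieces = re.findall(r"[^,.;:?!]+[,.;:?!]?", paragraph)
    annotated = []
    for i, piece in enumerate(pieces):
        is_question = piece.rstrip().endswith("?")
        feet = [f.strip() for f in piece.strip(" ,.;:?!").split("/") if f.strip()]
        if not feet:
            continue
        first_word = feet[0].split()[0].lower()
        # Rule 3: contour 2 for a yes-no question, contour 4 for the first
        # tone group of a paragraph, contour 1 elsewhere.
        if is_question and first_word not in WH_WORDS:
            contour = "2"
        elif i == 0:
            contour = "4"
        else:
            contour = "1"
        # Rule 2: the tonic goes on the last foot of the tone group.
        feet[-1] = "*" + feet[-1]
        annotated.append(contour + " " + " /".join(feet))
    return annotated

for group in assign_intonation("the /moorland is /rather /high /up, and /fairly /flat"):
    print(group)
# prints:   4 the /moorland is /rather /high /*up
#           1 and /fairly /*flat
.LE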
.pp These results are really astonishingly good, considering the crudeness of the rules. However, they should be interpreted with caution. What is missed by the rules, although appearing to comprise only 20% to 35% of the features, is certain to include the important, information-bearing, and variety-producing features that give the utterance its liveliness and interest. It would be rash to assume that all tone-group boundaries, all tonic positions, and all intonation contours, are equally important for intelligibility and naturalness. It is much more likely that the rules predict a default pattern, while most information is borne by deviations from them. To give an engineering analogy, it may be as though the carrier waveform of a modulated transmission is being simulated, instead of the information-bearing signal! Certainly the utterance will, if synthesized with intonation given by these rules, sound extremely dull and repetitive, mainly because of the overwhelming predominance of tone group 1 and the universal placement of tonic stress on the final foot. .pp There are certainly many different ways to orate any particular text, and that given by Halliday and reproduced in Table 9.2 is only one possible version. However, it is fair to say that the default intonation discussed above could only occur naturally under very unusual circumstances \(em such as a petulant child, unwilling and sulky, having been forced to read aloud. This is hardly how we want our computers to speak! .rh "Rhythm analysis." Consider now how to decide where foot boundaries should be placed in English text. Clearly semantic considerations sometimes play a part in this \(em one could say .LB /^ is /this /train /going /*to /London .LE instead of the more usual .LB /^ is /this /train /going to /*London .LE in circumstances where the train might be going .ul to or .ul from London. Such effects are ignored here, although it is worth noting in passing that the rogue words will often be marked by underscoring or italicizing (as in the previous sentence). If the text is liberally underlined, semantic analysis may be unnecessary for the purposes of rhythm. .pp A rough and ready rule for placing foot boundaries is to insert one before each word which is not in a small closed set of "function words". The set includes, for example, "a", "and", "but", "for", "is", "the", "to". If a verb or adjective begins with a prefix, the boundary should be moved between it and the root \(em but not for a noun. This will give the distinction between .ul con\c vert (noun) and con\c .ul vert (verb), .ul ex\c tract and ex\c .ul tract, and for many North American speakers, will help to distinguish .ul in\c quiry from in\c .ul quire. However, detecting prefixes by a simple splitting algorithm is dangerous. For example, "predate" is a verb with stress on what appears to be a prefix, contrary to the rule; while the "pre" in "predator" is not a prefix \(em at least, it is not pronounced as the prefix "pre" normally is. Moreover, polysyllabic words like "/diplomat", "dip/lomacy", "diplo/matic"; or "/telegraph", "te/legraphy", "tele/graphic" cannot be handled on such a simple basis. .pp In 1968, a remarkable work on English sound structure was published (Chomsky and Halle, 1968) which proposes a system of rules to transform English text into a phonetic representation in terms of distinctive features, with the aid of a lexicon. .[ Chomsky Halle 1968 .] 
A great deal of attention is paid to stress, and rules are given which perform well in many tricky cases. .pp It uses the American system of levels of stress, marking so-called primary stress with a superscript 1, secondary stress with a superscript 2, and so on. The superscripts are written on the vowel of the stressed syllable: completely unstressed syllables receive no annotation. For example, the sentence "take John's blackboard eraser" is written .LB ta\u2\dke Jo\u3\dhn's bla\u1\dckboa\u5\drd era\u4\dser. .LE In foot notation this utterance is .LB /take /John's /*blackboard e/raser. .LE It undoubtedly contains less information than the stress-level version. For example, the second syllable of "blackboard" and the first one of "erase" are both unstressed, although the rhythm rules given in Chapter 8 will cause them to be treated differently because they occupy different places in the syllable pattern of the foot. "Take", "John's", and the second syllable of "erase" are all non-tonic foot-initial syllables and hence are not distinguished in the notation; although the pitch contours schematized in Figure 8.9 will give them different intonations. .pp An indefinite number of levels of stress can be used. For example, according to the rules given by Chomsky and Halle, the word "sad" in .LB my friend can't help being shocked at anyone who would fail to consider his sad plight .LE has level-8 stress, the final two words being annotated as "sa\u8\dd pli\u1\dght". However, only the first few levels are used regularly, and it is doubtful whether acoustic distinctions are made in speech between the weaker ones. .pp Chomsky and Halle are concerned to distinguish between such utterances as .LB .NI bla\u2\dck boa\u1\drd-era\u3\dser ("board eraser that is black") .NI bla\u1\dckboa\u3\drd era\u2\dser ("eraser for a blackboard") .NI bla\u3\dck boa\u1\drd era\u2\dser ("eraser of a black board"), .LE and their stress assignment rules do indeed produce each version when appropriate. In foot notation the distinctions can still be made: .LB .NI /black /*board-eraser/ .NI /*blackboard e/raser/ .NI /black /*board e/raser/ .LE .pp The rules operate on a grammatical derivation tree of the text. For instance, input for the three examples would be written .LB .NI [\dNP\u[\dA\u black ]\dA\u [\dN\u[\dN\u board]\dN\u [\dN\u eraser ]\dN\u]\dN\u]\dNP\u .NI [\dN\u[\dN\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dN\u [\dN\u eraser ]\dN\u]\dN\u .NI [\dN\u[\dNP\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dNP\u [\dN\u eraser ]\dN\u]\dN\u, .LE representing the trees shown in Figure 9.2. .FC "Figure 9.2" Here, N stands for a noun, NP for a noun phrase, and A for an adjective. These categories appear explicitly as nodes in the tree. In the linearized textual representation they are used to label brackets which represent the tree structure. An additional piece of information which is needed is the lexical entry for "eraser", which would show that it has only one accented (that is, potentially stressed) syllable, namely, the second. .pp Consider now how to account for stress in prefixed and suffixed words, and those polysyllabic ones with more than one potential stress point. For these, the morphological structure must appear in the input. .pp Now .ul morphemes are well-defined minimal units of grammatical analysis from which a word may be composed. For example, [went]\ =\ [go]\ +\ [ed] is a morphemic decomposition, where "[ed]" denotes the past-tense morpheme. 
This representation is not particularly suitable for speech synthesis for the obvious reason that the result bears no phonetic resemblance to the input. What is needed is a decomposition into .ul morphs, which occur only when the lexical or phonetic representation of a word may easily be segmented into parts. Thus [wanting]\ =\ [want]\ +\ [ing] and [bigger]\ =\ [big]\ +\ [er] are simultaneously morphic and morphemic decompositions. Notice that in the second example, a rule about final consonant doubling has been applied at the lexical level (although it is not needed in a phonetic representation): this comes into the sphere of "easy" segmentation. Contrast this with [went]\ =\ [go]\ +\ [ed] which is certainly not an easy segmentation and hence a morphemic but not a morphic decomposition. But between these extremes there are some difficult cases: [specific]\ =\ [specify]\ +\ [ic] is probably morphic as well as morphemic, but it is not clear that [galactic]\ =\ [galaxy]\ +\ [ic] is. .pp Assuming that the input is given as a derivation tree with morphological structure made explicit, Chomsky and Halle present rules which assign stress correctly in nearly all cases. For example, their rules give .LB .NI [\dA\u[\dN\u incident ]\dN\u + al]\dA\u \(em> i\u2\dncide\u1\dntal; .LE and if the stem is marked by [\dS\u\ ...\ ]\dS\u in prefixed words, they can deduce .LB .NI [\dN\u tele [\dS\u graph ]\dS\u]\dN\u \(em> te\u1\dlegra\u3\dph .NI [\dN\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u y ]\dN\u \(em> tele\u1\dgraphy .NI [\dA\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u ic ]\dA\u \(em> te\u3\dlegra\u1\dphi\u2\dc. .LE .pp There are two rules which account for the word-level stress on such examples: the "main stress" rule and the "alternating stress" rule. In essence, the main stress rule emphasizes the last strong syllable of a stem. A syllable is "strong" either if it contains one of a class of so-called "long" vowels, or if there is a cluster of two or more consonants following the vowel; otherwise it is "weak". (If you are exceptionally observant you will notice that this strong\(emweak distinction has been used before, when discussing the rhythm of syllables within a foot.) Thus the verb "torment" receives stress on the second syllable, for it is a strong one. A noun like "torment" is treated as being derived from the corresponding verb, and the rule assigns stress to the verb first and then modifies it for the noun. The second, "alternating stress", rule gives some stress to alternate syllables of polysyllabic words like "form\c .ul al\c de\c .ul hyde\c ". .pp It is quite easy to incorporate the word-level rules into a computer program which uses feet rather than stress levels as the basis for prosodic description. A foot boundary is simply placed before the primary-stressed (level-1) syllable, except for function words, which do not begin a foot. The other stress levels should be ignored, except that for slow, deliberate speech, secondary (level-2) stress is mapped into a foot boundary too, if it precedes the primary stress. There is also a rule which reduces vowels in unstressed syllables. .pp The stress assignment rules can work on phonemic script, as well as on English text. For example, starting from the phonetic form [\d\V\u\ \c .ul aa\ s\ t\ o\ n\ i\ sh\ \c ]\dV\u, the stress assignment rules produce \c .ul aa\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c the vowel reduction rule generates \c .ul uh\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c and the foot conversion process gives \c .ul uh\ s/t\ o\ n\ i\ sh.
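.pp
A minimal sketch in Python of this final conversion step is given below. The function, its calling conventions, and the syllable spellings in the examples are assumptions made for illustration only; the stress levels themselves are taken as given, as though already supplied by the Chomsky-Halle rules or by a lexicon.
.LB
FUNCTION_WORDS = {"a", "an", "and", "but", "for", "is", "the", "to"}

def to_feet(syllables, stresses, word=None, deliberate=False):
    # Place "/" before the primary-stressed (level-1) syllable; in slow,
    # deliberate speech, also before a secondary (level-2) stress that
    # precedes it.  Function words begin no foot.
    if word and word.lower() in FUNCTION_WORDS:
        return " ".join(syllables)
    primary = stresses.index(1)
    marked = []
    for i, syllable in enumerate(syllables):
        starts_foot = (stresses[i] == 1 or
                       (deliberate and stresses[i] == 2 and i < primary))
        marked.append(("/" if starts_foot else "") + syllable)
    return " ".join(marked)

# the "astonish" example, after vowel reduction
print(to_feet(["uh s", "t o", "n i sh"], [0, 1, 0]))    # uh s /t o n i sh
# a function word never begins a foot
print(to_feet(["dh uh"], [0], word="the"))              # dh uh
.LE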
This appears to provide a fairly reliable algorithm for foot boundary placement. .rh "Speech synthesis from concept." I argued earlier that in order to derive prosodic features of an utterance from text it is necessary to understand its role in the dialogue, its semantics, its syntax, and \(em as we have just seen \(em its morphological structure. This is a very tall order, and the problem of natural language comprehension by machine is a vast research area in its own right. However, in many applications requiring speech output, utterances are generated by the computer from internally stored data rather than being read aloud from pre-prepared text. Then the problem of comprehending text may be evaded, for presumably the language-generation module can provide a semantic, syntactic, and even morphological decomposition of the utterance, as well as some indication of its role in the dialogue (that is, why it is necessary to say it). .pp This forms the basis of the appealing notion of "speech synthesis from concept". It has some advantages over speech generation from text, and in principle should provide more natural-sounding speech. Every word produced by the system can have a complete lexical entry which shows its morphological decomposition and potential stress points. The full syntactic history of each utterance is known. The Chomsky-Halle rules described above can therefore be used to place foot boundaries accurately, without the need for a complex parsing program and without the risk of having to make guesses about unknown words. .pp However, it is not clear how to take advantage of any semantic information which is available. Ideally, it should be possible to place tone group boundaries and tonic stress points, and assign intonation contours, in a natural-sounding way. But look again at the example text of Table 9.2 and imagine that you have at your disposal as much semantic information as is needed. It is .ul still far from obvious how the intonation features could be assigned! It is, in the ultimate analysis, interpretive and stylistic .ul choices that add variety and interest to speech. .pp Take the problem of determining pitch contours, for instance. Some of them may be explicable. Contour 4 on .LB .NI except the parts that are covered in caravans of course .LE is due to its being a contrastive clause, for it presents essentially new information. Similarly, the succession .LB .NI if you go in spring .NI when the gorse is out .NI or in summer .NI when the heather's out .LE could be considered contrastive, being in the subjunctive mood, and this could explain why contour 4's were used. But this is all conjecture, and it is difficult to apply throughout the passage. Halliday (1970) explains the contexts in which each tone group is typically used, but in an extremely high-level manner which would be impossible to embody directly in a computer program. .[ Halliday 1970 Course in spoken English: Intonation .] At the other end of the spectrum, computer systems for written discourse production do not seem to provide the subtle information needed to make intonation decisions (see, for example, Davey, 1978, for a fairly complete description of such a system). .[ Davey 1978 .] .pp One project which uses such a method for generating speech has been described (Young and Fallside, 1980). .[ Young Fallside 1980 .] Although some attention is paid to rhythm, the intonation contours which are generated are disappointingly repetitive and lacking in richness.
In fact, very little semantic information is used to assign contours; really just that inferred by the crude punctuation-driven method described earlier. .pp The higher-level semantic problems associated with speech output were studied some years ago under the title "synthetic elocution" (Vanderslice, 1968). .[ Vanderslice 1968 .] A set of rules was generated and tested by hand on a sample passage, the first part of which is shown in Table 9.4. However, no attempt was made to formalize the rules in a computer program, and indeed it was recognized that a number of important questions, such as the form of the semantic information assumed at the input, had been left unanswered. .RF .nr x0 \w'\0\0 psychologist '+\w'emphasis assigned because of antithesis with ' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'\0\0 psychologist 'u \l'\n(x0u\(ul' .sp Human experience and human behaviour are accessible to observation by everyone. The psychologist tries to bring them under systematic study. What he perceives, however, anyone can perceive; for his task he requires no microscope or electronic gear. .sp2 \0\0 word comments \l'\n(x0u\(ul' .sp \01 Human special treatment because paragraph-initial \04 human accent deleted because it echoes word 1 13 psychologist emphasis assigned because of antithesis with "everyone" 17 them anaphoric to "Human experience and human behaviour" 19 systematic emphasis assigned because of contrast with "observation" 20 study emphasis? \(em text is ambiguous whether "observation" is a kind of study that is nonsystematic, or an activity contrasting with the entire concept of "systematic study" 21 What increase in pitch for "What he perceives" because it is not the subject 22 he accented although anaphoric to word 13 because of antithesis with word 25 24 however decrease in pitch because it is parenthetical 25 anyone emphasized by antithesis with word 22 27 perceive unaccented because it echoes word 23, "perceives" \0\0 ; semicolon assigns falling intonation 30 task unaccented because it is anaphoric with "tries to bring them under systematic study" \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.4 Sample passage and comments pertinent to synthetic elocution" .pp The comments in the table, which are selected and slightly edited versions of those appearing in the original work (Vanderslice, 1968), are intended as examples of the nature and subtlety of the prosodic influences which were examined. .[ Vanderslice 1968 .] The concepts of "accent" and "emphasis" are used; these relate to stress but are not easy to define precisely in our tone-group terminology. Fortunately we do not need an exact characterization of them for the present purpose. Roughly speaking, "accent" encompasses both foot-initial stress and tonic stress, whereas "emphasis" is something more than this, typically being realized by the fall-rise or rise-fall contours of Halliday's tone groups 4 and 5 (Figure 8.5). .pp Particular attention is paid to anaphora and antithesis (amongst other things). The first term means the repetition of a word or phrase in the text, and is often applied to pronoun references. In the example, the word "human" is repeated in the first few words; "them" in the second sentence refers to "human experience and human behaviour"; "he" in the third sentence is the previously-mentioned psychologist; and "task" is anaphoric with "tries to bring them under systematic study". Other things being equal, anaphoric references are unaccented.
In our terms this means that they certainly do not receive tonic stress and may not even receive foot stress. .pp Antithesis is defined as the contrast of ideas expressed by parallelism of strongly contrasting words or phrases; and the second element taking part in it is generally emphasized. "Psychologist" in the passage is an antithesis of "everyone"; "systematic" and possibly "study" of "observation". Thus .LB .NI /^ the psy/*chologist .LE would probably receive intonation contour 4, since it is also introducing a new actor; while .LB .NI /tries to /bring them /under /syste/*matic /study .LE could receive contour 5. "He" and "anyone" are antithetical; not only does the latter receive emphasis but the former has its accent restored \(em for otherwise it would have been removed because of anaphora with "psychologist". Hence it will certainly begin a foot, possibly a tonic foot. .pp A factor that does not affect the sample passage is the accentuation of unusual syllables of similar words to bring out a contrast. For example, .LB .NI he went .ul out\c side, not .ul in\c side. .LE Although this may seem to be just another facet of antithesis, Vanderslice points out that it is phonetic rather than structural similarity that is contrasted: .LB .NI I said .ul de\c plane, not .ul com\c plain. .LE This introduces an interesting interplay between the phonetic and prosodic levels. .pp Anaphora and antithesis provide an ideal domain for speech synthesis from concept. Determining them from plain text is a very difficult problem, requiring a great deal of real-world knowledge. The first has received some attention in the field of natural language understanding. Finding pronoun referents is an important problem for language translation, for their gender is frequently distinguished in, say, French where it is not in English. Examples such as .LB .NI I bought the wine, sat on a table, and drank it .NI I bought the wine, sat on a table, and broke it .LE have been closely studied (Wilks, 1975); for if they were to be translated into French the pronoun "it" would be rendered differently in each case (\c .ul le vin, .ul la table). .[ Wilks 1975 An intelligent analyzer and understander of English .] .pp In spoken language, emphasis is used to indicate the referent of a pronoun when it would not otherwise be obvious. Vanderslice gives the example .LB .NI Bill saw John across the room and he ran over to him .NI Bill saw John across the room and .ul he ran over to .ul him, .LE where the emphasis reverses the pronoun referents (so that John did the running). He suggests accenting a personal pronoun whenever the true antecedent is not the same as the "unmarked" or default one. Unfortunately he does not elaborate on what is meant by "unmarked". Does it mean that the referent cannot be predicted from knowledge of the words alone \(em as in the second example above? If so, this is a clear candidate for speech synthesis from concept, for the distinction cannot be made from text! .sh "9.2 Pronunciation" .pp English pronunciation is notoriously irregular. A poem by Charivarius, the pseudonym of the Dutch high school teacher and linguist G.N. Trenite (1870\-1946), surveys the problems in an amusing way and is worth quoting in full. .br .ev2 .in 0 .LB "nnnnnnnnnnnnnnnn" .ul The Chaos .sp2 .ne4 Dearest creature in Creation Studying English pronunciation, .in +5n I will teach you in my verse Sounds like corpse, corps, horse and worse.
.ne4 .in -5n It will keep you, Susy, busy, Make your head with heat grow dizzy; .in +5n Tear in eye your dress you'll tear. So shall I! Oh, hear my prayer: .ne4 .in -5n Pray, console your loving poet, Make my coat look new, dear, sew it. .in +5n Just compare heart, beard and heard, Dies and diet, lord and word. .ne4 .in -5n Sword and sward, retain and Britain, (Mind the latter, how it's written). .in +5n Made has not the sound of bade, Say \(em said, pay \(em paid, laid, but plaid. .ne4 .in -5n Now I surely will not plague you With such words as vague and ague, .in +5n But be careful how you speak: Say break, steak, but bleak and streak, .ne4 .in -5n Previous, precious; fuchsia, via; Pipe, shipe, recipe and choir; .in +5n Cloven, oven; how and low; Script, receipt; shoe, poem, toe. .ne4 .in -5n Hear me say, devoid of trickery; Daughter, laughter and Terpsichore; .in +5n Typhoid, measles, topsails, aisles; Exiles, similes, reviles; .ne4 .in -5n Wholly, holly; signal, signing; Thames, examining, combining; .in +5n Scholar, vicar and cigar, Solar, mica, war and far. .ne4 .in -5n Desire \(em desirable, admirable \(em admire; Lumber, plumber; bier but brier; .in +5n Chatham, brougham; renown but known, Knowledge; done, but gone and tone, .ne4 .in -5n One, anemone; Balmoral, Kitchen, lichen; laundry, laurel; .in +5n Gertrude, German; wind and mind; Scene, Melpemone, mankind; .ne4 .in -5n Tortoise, turquoise, chamois-leather, Reading, Reading; heathen, heather. .in +5n This phonetic labyrinth Gives: moss, gross; brook, brooch; ninth, plinth. .ne4 .in -5n Billet does not end like ballet; Bouquet, wallet, mallet, chalet; .in +5n Blood and flood are not like food, Nor is mould like should and would. .ne4 .in -5n Banquet is not nearly parquet, Which is said to rime with darky .in +5n Viscous, viscount; load and broad; Toward, to forward, to reward. .ne4 .in -5n And your pronunciation's O.K. When you say correctly: croquet; .in +5n Rounded, wounded; grieve and sieve; Friend and fiend, alive and live .ne4 .in -5n Liberty, library; heave and heaven; Rachel, ache, moustache; eleven. We say hallowed, but allowed; People, leopard; towed, but vowed. .in +5n Mark the difference moreover Between mover, plover, Dover; .ne4 .in -5n Leeches, breeches; wise, precise; Chalice, but police and lice. .in +5n Camel, constable, unstable, Principle, discipline, label; .ne4 .in -5n Petal, penal and canal; Wait, surmise, plait, promise; pal. .in +5n Suit, suite, ruin; circuit, conduit, Rime with: "shirk it" and "beyond it"; .ne4 .in -5n But it is not hard to tell Why it's pall, mall, but Pall Mall. .in +5n Muscle, muscular; goal and iron; Timber, climber; bullion, lion; .ne4 .in -5n Worm and storm; chaise, chaos, chair; Senator, spectator, mayor. .in +5n Ivy, privy; famous, clamour and enamour rime with "hammer". .ne4 .in -5n Pussy, hussy and possess, Desert, but dessert, address. .in +5n Golf, wolf; countenants; lieutenants Hoist, in lieu of flags, left pennants. .ne4 .in -5n River, rival; tomb, bomb, comb; Doll and roll, and some and home. .in +5n Stranger does not rime with anger, Neither does devour with clangour. .ne4 .in -5n Soul, but foul; and gaunt, but aunt; Font, front, won't; want, grand and grant; .in +5n Shoes, goes, does. Now first say: finger, And then; singer, ginger, linger. .ne4 .in -5n Real, zeal; mauve, gauze and gauge; Marriage, foliage, mirage, age. .in +5n Query does not rime with very, Nor does fury sound like bury. 
.ne4 .in -5n Dost, lost, post; and doth, cloth, loth; Job, Job; blossom, bosom, oath. .in +5n Though the difference seems little We say actual, but victual; .ne4 .in -5n Seat, sweat; chaste, caste; Leigh, eight, height; Put, nut; granite but unite. .in +5n Reefer does not rime with deafer, Feoffer does, and zephyr, heifer. .ne4 .in -5n Dull, bull; Geoffrey, George; ate, late; Hint, pint; senate, but sedate. .in +5n Scenic, Arabic, Pacific; Science, conscience, scientific. .ne4 .in -5n Tour, but our, and succour, four; Gas, alas and Arkansas! .in +5n Sea, idea, guinea, area, Psalm, Maria, but malaria. .ne4 .in -5n Youth, south, southern; cleanse and clean; Doctrine, turpentine, marine. .in +5n Compare alien with Italian. Dandelion with battalion, .ne4 .in -5n Sally with ally, Yea, Ye, Eye, I, ay, aye, whey, key, quay. Say aver, but ever, fever, Neither, leisure, skein, receiver. .in +5n Never guess \(em it is not safe; We say calves, valves, half, but Ralf. .ne4 .in -5n Heron, granary, canary; Crevice and device and eyrie; .in +5n Face, preface, but efface, Phlegm, phlegmatic; ass, glass, bass; .ne4 .in -5n Large, but target, gin, give, verging; Ought, out, joust and scour, but scourging; .in +5n Ear, but earn; and wear and tear Do not rime with "here", but "ere". .ne4 .in -5n Seven is right, but so is even; Hyphen, roughen, nephew, Stephen; .in +5n Monkey, donkey; clerk and jerk; Asp, grasp, wasp; and cork and work. .ne4 .in -5n Pronunciation \(em think of psyche - Is a paling, stout and spikey; .in +5n Won't it make you lose your wits, Writing groats and saying "groats"? .ne4 .in -5n It's a dark abyss or tunnel, Strewn with stones, like rowlock, gunwale, .in +5n Islington and Isle of Wight, Housewife, verdict and indict. .ne4 .in -5n Don't you think so, reader, rather Saying lather, bather, father? .in +5n Finally: which rimes with "enough", Though, through, plough, cough, hough or tough? .ne4 .in -5n Hiccough has the sound of "cup", My advice is ... give it up! .LE "nnnnnnnnnnnnnnnn" .br .ev .rh "Letter-to-sound rules." Despite such irregularities, it is surprising how much can be done with simple letter-to-sound rules. These specify phonetic equivalents of word fragments and single letters. The longest stored fragment which matches the current word is translated, and then the same strategy is adopted on the remainder of the word. Table 9.5 shows some English fragments and their pronunciations. .RF .nr x0 1.5i+\w'pronunciation ' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 1.5i fragment pronunciation \l'\n(x0u\(ul' .sp -p- \fIp\fR -ph- \fIf\fR -phe| \fIf ee\fR -phe|s \fIf ee z\fR -phot- \fIf uh u t\fR -place|- \fIp l e i s\fR -plac|i- \fIp l e i s i\fR -ple|ment- \fIp l i m e n t\fR -plie|- \fIp l aa i y\fR -post \fIp uh u s t\fR -pp- \fIp\fR -pp|ly- \fIp l ee\fR -preciou- \fIp r e s uh\fR -proce|d- \fIp r uh u s ee d\fR -prope|r- \fIp r o p uh r\fR -prov- \fIp r uu v\fR -purpose- \fIp er p uh s\fR -push- \fIp u sh\fR -put \fIp u t\fR -puts \fIp u t s\fR \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.5 Word fragments and their pronunciations" .pp It is sometimes important to specify that a rule applies only when the fragment is matched at the beginning or end of a word. In the Table "-" means that other fragments can precede or follow this one. The "|" sign is used to separate suffixes from a word stem, as will be explained shortly. 
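.pp
The longest-fragment strategy itself takes only a few lines of code. Here is a minimal sketch in Python; the handful of fragments merely imitates the style of Table 9.5, the phoneme spellings follow the notation of this book, and the begin- and end-of-word markers are ignored, so none of this should be read as a published rule set.
.LB
FRAGMENTS = {
    "p": "p", "ph": "f", "preciou": "p r e s uh", "push": "p u sh",
    "put": "p u t", "s": "s", "r": "r", "e": "e", "c": "k",
    "i": "i", "o": "o", "u": "uh", "t": "t",
}

def transcribe(word):
    # Translate the longest matching fragment at the left-hand end,
    # then repeat the process on the remainder of the word.
    phonemes = []
    while word:
        for length in range(len(word), 0, -1):
            fragment = word[:length]
            if fragment in FRAGMENTS:
                phonemes.append(FRAGMENTS[fragment])
                word = word[length:]
                break
        else:
            word = word[1:]      # no rule matches: drop the letter
    return " ".join(phonemes)

print(transcribe("precious"))    # p r e s uh s
print(transcribe("put"))         # p u t
.LE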
.pp An advantage of the longest-string search strategy is that it is easy to account for exceptions simply by incorporating them into the fragment table. If they occur in the input, the complete word will automatically be matched first, before any fragment of it is translated. The exception list of complete words can be surprisingly small for quite respectable performance. Table 9.6 shows the entire dictionary for an excellent early pronunciation system written at Bell Laboratories (McIlroy, 1974). .[ McIlroy 1974 .] Some of the words are notorious exceptions in English, while others are included simply because the rules would run amok on them. Notice that the exceptions are all quite short, with only a few of them having more than two syllables. .RF .nr x1 0.9i+0.9i+0.9i+0.9i+0.9i+0.9i .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 0.9i +0.9i +0.9i +0.9i +0.9i a doesn't guest meant reader those alkali doing has moreover refer to always done have mr says today any dr having mrs seven tomorrow april early heard nature shall tuesday are earn his none someone two as eleven imply nothing something upon because enable into nowhere than very been engine is nuisance that water being etc island of the wednesday below evening john on their were body every july once them who both everyone live one there whom busy february lived only thereby whose copy finally living over these woman do friday many people they women does gas maybe read this yes .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.6 Exception table for a simple pronunciation program" .pp Special action has to be taken with final "e"'s. These lengthen and alter the quality of the preceding vowel, so that "bit" becomes "bite" and so on. Unfortunately, if the word has a suffix the "e" must be detected even though it is no longer final, as in "lonely", and it is even dropped sometimes ("biting") \(em otherwise these would be pronounced "lonelly", "bitting". To make matters worse the suffix may be another word: we do not want "kiteflying" to have an extra syllable which rhymes with "deaf"! Although simple procedures can be developed to take care of common word endings like "-ly", "-ness", "-d", it is difficult to decompose compound words like "wisecrack" and "bumblebee" reliably \(em but this must be done if they are not to be articulated with three syllables instead of two. Of course, there are exceptions to the final "e" rule. Many common words ("some", "done", "[live]\dV\u") disobey the rule by not lengthening the main vowel, while in other, rarer, ones ("anemone", "catastrophe", "epitome") the final "e" is actually pronounced. There are also some complete anomalies ("fete"). .pp McIlroy's (1974) system is a superb example of a robust program which takes a pragmatic approach to these problems, accepting that they will never be fully solved, and which is careful to degrade gracefully when stumped. .[ McIlroy 1974 .] 
The pronunciation of each word is found by a succession of increasingly desperate trials: .LB .NP replace upper- by lower-case letters, strip punctuation, and try again; .NP remove final "-s", replace final "ie" by "y", and try again; .NP reject a word without a vowel; .NP repeatedly mark any suffixes with "|"; .NP mark with "|" probable morph divisions in compound words; .NP mark potential long vowels indicated by "e|", and long vowels elsewhere in the word; .NP mark voiced medial "s" as in "busy", "usual"; replace final "-s" if stripped; .NP scanning the word from left to right, apply letter-to-sound rules to word fragments; .NP when all else fails spell the word, punctuation and all (burp on letters for which no spelling rule exists). .LE .RF .nr x0 \w'| ment\0\0\0'+\w'replace final ie by y\0\0\0'+\w'except when no vowel would remain in ' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'| ment\0\0\0'u +\w'replace final ie by y\0\0\0'u suffix action notes and exceptions \l'\n(x0u\(ul' .sp s strip off final s except in context us \&' strip off final ' ie replace final ie by y e replace final e by E when it is the only vowel in a word (long "e") | able place suffix mark as except when no vowel would remain in | ably shown the rest of the word e | d e | n e | r e | ry e | st e | y | ful | ing | less | ly | ment | ness | or | ic place suffix mark as | ical shown and terminate e | final e processing \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.7 Rules for detecting suffixes for final 'e' processing" .pp Table 9.7 shows the suffixes which the program recognizes, with some comments on their processing. Multiple suffixes are detected and marked in words like "force|ful|ly" and "spite|ful|ness". This allows silent "e"'s to be spotted even when they occur far back in a word. Notice that the suffix marks are available to the word-fragment rules of Table 9.5, and are frequently used by them. .pp The program has some .ul ad hoc rules for dealing with compound words like "race|track", "house|boat"; these are applied as well as normal suffix splitting so that multiple decompositions like "pace|make|r" can be accomplished. The rules look for short letter sequences which do not usually appear in monomorphemic words. It is impossible, however, to detect every morph boundary by such rules, and the program inevitably makes mistakes. Examples of boundaries which go undetected are "edge|ways", "fence|post", "horse|back", "large|mouth", "where|in"; while boundaries are incorrectly inserted into "comple|mentary", "male|volent", "prole|tariat", "Pame|la". .pp We now seem to have presented two opposing points of view on the pronunciation problem. Charivarius, the Dutch poet, shows that an enormous number of exceptional words exist; whereas McIlroy's program makes do with a tiny exception dictionary. These views can be reconciled by noting that most of Charivarius' words are relatively uncommon. McIlroy tested his program against the 2000 most frequent words in a large corpus (Kucera and Francis, 1967), and found that 97% were pronounced correctly if word frequencies were taken into account. .[ Kucera Francis 1967 .] (The notion of "correctness" is of course a rather subjective one.) However, he estimated that on the remaining words the success rate was only 88%. 
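.pp
Returning for a moment to the suffix-marking step of this sequence, here is a minimal sketch of it in Python. The suffix list is a small illustrative subset of Table 9.7, and the real program also adjusts spelling as it goes (stripping a final "-s", replacing a final "ie" by "y", and so on); the code is mine, not McIlroy's.
.LB
SUFFIXES = ["ness", "ment", "less", "ful", "ing", "ly", "er", "ed", "y"]

def mark_suffixes(word):
    # Peel suffixes off the right-hand end, inserting "|" before each,
    # but never leave a stem with no vowel in it (cf. Table 9.7).
    marked = ""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            stem = word[:len(word) - len(suffix)]
            if word.endswith(suffix) and any(v in stem for v in "aeiouy"):
                marked = "|" + suffix + marked
                word = stem
                stripped = True
                break
    return word + marked

print(mark_suffixes("spitefulness"))    # spite|ful|ness
print(mark_suffixes("forcefully"))      # force|ful|ly
.LE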
.pp The system is particularly impressive in that it is prepared to say anything: if used, for example, on source programs in a high-level computer language it will say the keywords and pronouncable identifiers, spell the other identifiers, and even give the names of special symbols (like +, <, =) correctly! .rh "Morphological analysis." The use of letter-to-sound rules provides a cheap and fast technique for pronunciation \(em the fragment table and exception dictionary for the program described above occupy only 11 Kbyte of storage, and can easily be kept in solid-state read-only memory. It produces reasonable results if careful attention is paid to rules for suffix-splitting. However, it is inherently limited because it is not possible in general to detect compound words by simple rules which operate on the lexical structure of the word. .pp Compounds can only be found reliably by using a morph dictionary. This gives the added advantage that syntactic information can be stored with the morphs to assist with rhythm assignment according to the Chomsky-Halle theory. However, it was noted earlier that morphs, unlike the grammatically-determined morphemes, are not very well defined from a linguistic point of view. Some morphemic decompositions are obviously not morphic because the constituents do not in any way resemble the final word; while others, where the word is simply a concatenation of its components, are clearly morphic. Between these extremes lies a hazy region where what one considers to be a morph depends upon how complex one is prepared to make the concatenation rules. The following description draws on techniques used in a project at MIT in which a morph-based pronunciation system has been implemented (Lee, 1969; Allen, 1976). .[ Lee 1969 .] .[ Allen 1976 Synthesis of speech from unrestricted text .] .pp Estimates of the number of morphs in English vary from 10,000 to 30,000. Although these seem to be very large numbers, they are considerably less than the number of words in the language. For example, Webster's .ul New Collegiate Dictionary (7'th edition) contains about 100,000 entries. If all forms of the words were included, this number would probably double. .pp There are several classes of morphs, with restrictions on the combinations that occur. A general word has prefixes, a root, and suffixes, as shown in Figure 9.3; only the root is mandatory. .FC "Figure 9.3" Suffixes usually perform a grammatical role, affecting the conjugation of a verb or declension of a noun; or transforming one part of speech into another ("-al" can make a noun into an adjective, while "-ness" performs the reverse transformation.) Other suffixes, such as "-dom" or "-ship", only apply to certain parts of speech (nouns, in this case), but do not change the grammatical role of the word. Such suffixes, and all prefixes, alter the meaning of a word. .pp Some root morphs cannot combine with other morphs but always stand alone \(em for instance, "this". Others, called free morphs, can either occur on their own or combine with further morphs to form a word. Thus the root "house" can be joined on either side by another root, such as "boat", or by a suffix such as "ing". A third type of root morph is one which .ul must combine with another morph, like "crimin-", "-ceive". .pp Even with a morph dictionary, decomposing a word into a sequence of morphs is not a trivial operation. The process of lexical concatenation often results in a minor change in the constituents. 
How big this change is allowed to be governs the morph system being used. For example, Allen (1976) gives three concatenation rules: a final "e" can be omitted, as in .ta 1.1i .LB .NI give + ing \(em> giving; .LE the last consonant of the root can be doubled, as in .LB .NI bid + ing \(em> bidding; .LE or a final "y" can change to an "i", as in .LB .NI handy + cap \(em> handicap. .[ Allen 1976 Synthesis of speech from unrestricted text .] .LE If these are the only rules permitted, the morph dictionary will have to include multiple versions of some suffixes. For example, the plural morpheme [-s] needs to be represented both by "-s" and "-es", to account for .LB .NI pea + s \(em> peas .LE and .LB .NI baby + es \(em> babies (using the "y" \(em> "i" rule). .LE This would not be necessary if a "y" \(em> "ie" rule were included too. Similarly, the morpheme [-ic] will include morphs "-ic" and "-c"; the latter to cope with .LB .NI specify + c \(em> specific (using the "y" \(em> "i" rule). .LE Furthermore, non-morphemic roots such as "galact" need to be included because the concatenation rules do not capture the transformation .LB .NI galaxy + ic \(em> galactic. .LE There is clearly a trade-off between the size of the morph dictionary and the complexity of the concatenation rules. .pp Since a text-to-speech system is presented with already-concatenated morphs, it must be prepared to reverse the effects of the concatenation rules to deduce the constituents of a word. When two morphs combine with any of the three rules given above, the changes in spelling occur only in the lefthand one. Therefore the word is best scanned in a right-to-left direction to split off the morphs starting with suffixes, as McIlroy's program does. If the procedure fails at any point, one of the three rules is hypothesized, its effect is undone, and splitting continues. For example, consider the word .LB .NI grasshoppers <\(em grass + hop + er + s .LE (Lee, 1969). .[ Lee 1969 .] The "-s" is detected first, then "-er"; these are both stored in the dictionary as suffixes. The remainder, "grasshopp", cannot be decomposed and does not appear in the dictionary. So each of the rules above is hypothesized in turn, and the result investigated. (The "y" \(em> "i" rule is obviously not applicable.) When the final-consonant-doubling rule is considered, the sequence "grasshop" is investigated. "Shop" could be split off this, but then the unknown morph "gras" would result. The alternative, to remove "hop", leaves a remainder "grass" which .ul is a free morph, as desired. Thus a unique and correct decomposition is obtained. Notice that the procedure would fail if, for example, "grass" had been inadvertently omitted from the dictionary. .pp Sometimes, several seemingly valid decompositions present themselves (Allen, 1976). .[ Allen 1976 Synthesis of speech from unrestricted text .] For example: .LB .NI scarcity <\(em scar + city .NI <\(em scarce + ity (using final-"e" deletion) .NI <\(em scar + cite + y (using final-"e" deletion) .NI resting <\(em rest + ing .NI <\(em re + sting .NI biding <\(em bide + ing (using final-"e" deletion) .NI <\(em bid + ing .NI unionized <\(em un + ion + ize + d .NI <\(em union + ize + d .NI winding <\(em [wind]\dN\u + ing .NI <\(em [wind]\dV\u + ing. .LE The last distinction is important because the pronunciation of "wind" depends on whether it is a noun or a verb. 
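.pp
The backtracking search just described can be sketched in a dozen lines of Python. The tiny dictionary and the rule-undoing transformations below are purely illustrative \(em they are not the MIT programs \(em but they do reproduce the "grasshoppers" analysis above.
.LB
MORPHS = {"grass", "hop", "shop", "rest", "sting", "bid", "bide",
          "er", "ing", "s"}

def undo_rules(left):
    # Candidate spellings of the left-hand constituent: as written, with a
    # doubled final consonant removed, with a final "e" restored, or with
    # a final "i" turned back into "y".
    yield left
    if len(left) > 2 and left[-1] == left[-2]:
        yield left[:-1]                 # hopp -> hop
    yield left + "e"                    # bid  -> bide
    if left.endswith("i"):
        yield left[:-1] + "y"           # babi -> baby

def decompose(word):
    # Split morphs off the right-hand end; when the remainder is unknown,
    # hypothesize one of the concatenation rules, undo it, and continue.
    if word in MORPHS:
        return [word]
    for split in range(len(word) - 1, 0, -1):
        right = word[split:]
        if right not in MORPHS:
            continue
        for left in undo_rules(word[:split]):
            rest = decompose(left)
            if rest is not None:
                return rest + [right]
    return None

print(decompose("grasshoppers"))   # ['grass', 'hop', 'er', 's']
print(decompose("resting"))        # ['rest', 'ing']  ('re' + 'sting' needs 're' in the dictionary)
.LE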
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .pp Several sources of information can be used to resolve these ambiguities. The word structure of Figure 9.3, together with the division of root morphs into bound and free ones, may eliminate some possibilities. Certain letter sequences (such as "rp") do not appear at the beginning of a word or morph, and others never occur at the end. Knowledge of these sequences can reject some unacceptable decompositions \(em or perhaps more importantly, can enable intelligent guesses to be made in cases where a constituent morph has been omitted from the dictionary. The grammatical function of suffixes allows suffix sequences to be checked for compatibility. The syntax of the sentence, together with suffix knowledge, can rule out other combinations. Semantic knowledge will occasionally be necessary (as in the "unionized" and "winding" examples above \(em compare a "winding road" with a "winding blow"). Finally, Allen (1976) suggests that a preference structure on composition rules can be used to resolve ambiguity. .[ Allen 1976 Synthesis of speech from unrestricted text .] .pp Once the morphological structure has been determined, the rest of the pronunciation process is relatively easy. A phonetic transcription of each morph may be stored in the morph dictionary, or else letter-to-sound rules can be used on individual morphs. These are likely to be quite successful because final-"e" processing can now be done with confidence: there are no hidden final "e"'s in the middle of morphs. In either case the resulting phonetic transcriptions of the individual morphs must be concatenated to give the transcription of the complete word. Although some contextual modification has to be accounted for, it is relatively straightforward and easy to predict. For example, the plural morphs "-s" and "-es" can be realized phonetically by .ul uh\ z, .ul s, or .ul z depending on context. Similarly the past-tense suffix "-ed" may be rendered as .ul uh\ d, .ul t, or .ul d. The suffixes "-ion" and "-ure" sometimes cause modification of the previous morph: for example .LB .NI act + ion \(em> \c .ul a k t\c + ion \(em> \c .ul a k sh uh n. .LE .pp The morph dictionary does not remove the need for a lexicon of exceptional words. The irregular final-"e" words mentioned earlier ("done", "anemone", "fete") need to be treated on an individual basis, as do words such as "quadruped" which have misleading endings (it should not be decomposed as "quadrup|ed"). .rh "Pronunciation of languages other than English." Text-to-speech systems for other languages have been reported in the literature. (For example, French, Esperanto, Italian, Russian, Spanish, and German are covered by Lesmo .ul et al, 1978; O'Shaughnessy .ul et al, 1981; Sherwood, 1978; Mangold and Stall, 1978). .[ Lesmo 1978 .] .[ O'Shaughnessy Lennig Mermelstein Divay 1981 .] .[ Sherwood 1978 .] .[ Mangold Stall 1978 .] Generally speaking, these present fewer difficulties than does English. Esperanto is particularly easy because each letter in its orthography has only one sound, making the pronunciation problem trivial. Moreover, stress in polysyllabic words always occurs on the penultimate syllable. .pp It is tempting and often sensible when designing a synthesis system for English to use an utterance representation somewhere between phonetics and ordinary spelling.
This may happen in practice even if it is not intended: a user, finding that a given word is pronounced incorrectly, will alter the spelling to make it work. The World English Spelling alphabet (Dewey, 1971), amongst others (Haas, 1966), is a simplified and apparently natural scheme which was developed by the spelling reform movement. .[ Dewey 1971 .] .[ Haas 1966 .] It maps very simply on to a phonetic representation, just like Esperanto. However, it can provide little help with the crucial problem of stress assignment, except perhaps by explicitly indicating reduced vowels. .sh "9.3 Discussion" .pp This chapter has really only touched the tip of a linguistic iceberg. I have given some examples of representations, rules, algorithms, and exceptions, to make the concepts more tangible; but a whole mass of detail has been swept under the carpet. .pp There are two important messages that are worth reiterating once more. The first is that the representation of the input \(em that is, whether it be a "concept" in some semantic domain, a syntactic description of an utterance, a decomposition into morphs, plain text or some contrived re-spelling of it \(em is crucial to the quality of the output. Almost any extra information about the utterance can be taken into account and used to improve the speech. It is difficult to derive such information if it is not provided explicitly, for the process of climbing the tree from text to semantic representation is at least as hard as descending it to a phonetic transcription. .pp Secondly, simple algorithms perform remarkably well \(em witness the punctuation-driven intonation assignment scheme, and word fragment rules for pronunciation. However, the combined degradation contributed by several imperfect processes is likely to impair speech quality very seriously. And great complexity is introduced when these simple algorithms are discarded in favour of more sophisticated ones. There is, for example, a world of difference between a pronunciation program that copes with 97% of common words and one that deals correctly with 99% of a random sample from a dictionary. .pp Some of the options that face the system designer are recapitulated in Figure 9.4. .FC "Figure 9.4" Starting from text, one can take the simple approach of lexically-based suffix-splitting, letter-to-sound rules, and prosodics derived from punctuation, to generate a phonetic transcription. This will provide a cheap system which is relatively easy to implement but whose speech quality will probably not be acceptable to any but the most dedicated listener (such as a blind person with no other access to reading material). .pp The biggest improvement in speech quality from such a system would almost certainly come from more intelligent prosodic control \(em particularly of intonation. This, unfortunately, is also by far the most difficult improvement to make unless intonation contours, tonic stresses, and tone-group boundaries are hand-coded into the input. To generate the appropriate information from text one has to climb to the upper levels in Figure 9.4 \(em and even when these are reached, the problems are by no means over. Still, let us climb the tree. .pp For syntax analysis, part-of-speech information is needed; and for this the grammatical roles of individual words in the text must be ascertained. A morph dictionary is the most reliable way to do this. A linguist may prefer to go from morphs to syntax by way of morphemes; but this is not necessary for the present purpose.
Just the information that the morph "went" is a verb can be stored in the dictionary, instead of its decomposition [went]\ =\ [go]\ +\ [ed]. .pp Now that we have the morphological structure of the text, stress assignment rules can be applied to produce more accurate speech rhythms. The morph decomposition will also allow improvements to be made to the pronunciation, particularly in the case of silent "e"'s in compound words. But the ability to assign intonation has hardly been improved at all. .pp Let us proceed upwards. Now the problems become really difficult. A semantic representation of the text is needed; but what exactly does this mean? We certainly must have .ul morphemic knowledge, for now the fact that "went" is a derivative of "go" (rather than any other verb) becomes crucial. Very well, let us augment the morph dictionary with morphemic information. But this does not attack the problem of semantic representation. We may wish to resolve pronoun references to help assign stress. Parts of the problem are solved in principle and reported in the artificial intelligence literature, but if such an ability is incorporated into the speech synthesis system it will become enormously complicated. In addition, we have seen that knowledge of antitheses in the text will greatly assist intonation assignment, but procedures for extracting this information constitute a research topic in their own right. .pp Now step back and take a top-down approach. What could we do with this semantic understanding and knowledge of the structure of the discourse if we had it? Suppose the input were a "concept" in some as yet undetermined representation. What are the .ul acoustic manifestations of such high-level features as anaphoric references or antithetical comparisons, of parenthetical or satirical remarks, of emotions: warmth, sarcasm, sadness and despair? Can we program the art of elocution? These are good questions. .sh "9.4 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "9.5 Further reading" .pp Books on pronunciation give surprisingly little help in designing a text-to-speech procedure. The best aid is a good on-line dictionary and flexible software to search it and record rules, examples, and exceptions. Here are some papers that describe existing systems. .LB "nn" .\"Ainsworth-1974-1 .]- .ds [A Ainsworth, W.A. .ds [D 1974 .ds [T A system for converting text into speech .ds [J IEEE Trans Audio and Electroacoustics .ds [V AU-21 .ds [P 288-290 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Colby-1978-2 .]- .ds [A Colby, K.M. .as [A ", Christinaz, D. .as [A ", and Graham, S. .ds [D 1978 .ds [K * .ds [T A computer-driven, personal, portable, and intelligent speech prosthesis .ds [J Computers and Biomedical Research .ds [V 11 .ds [P 337-343 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Elovitz-1976-3 .]- .ds [A Elovitz, H.S. .as [A ", Johnson, R.W. .as [A ", McHugh, A. .as [A ", and Shore, J.E. .ds [D 1976 .ds [K * .ds [T Letter-to-sound rules for automatic translation of English text to phonetics .ds [J IEEE Trans Acoustics, Speech and Signal Processing .ds [V ASSP-24 .ds [N 6 .ds [P 446-459 .nr [P 1 .ds [O December .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Kooi-1978-4 .]- .ds [A Kooi, R. .as [A " and Lim, W.C. 
.ds [D 1978 .ds [T An on-line minicomputer-based system for reading printed text aloud .ds [J IEEE Trans Systems, Man and Cybernetics .ds [V SMC-8 .ds [P 57-62 .nr [P 1 .ds [O January .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Umeda-1975-5 .]- .ds [A Umeda, N. .as [A " and Teranishi, R. .ds [D 1975 .ds [K * .ds [T The parsing program for automatic text-to-speech synthesis developed at the Electrotechnical Laboratory in 1968 .ds [J IEEE Trans Acoustics, Speech and Signal Processing .ds [V ASSP-23 .ds [N 2 .ds [P 183-188 .nr [P 1 .ds [O April .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Umeda-1976-6 .]- .ds [A Umeda, N. .ds [D 1976 .ds [K * .ds [T Linguistic rules for text-to-speech synthesis .ds [J Proc IEEE .ds [V 64 .ds [N 4 .ds [P 443-451 .nr [P 1 .ds [O April .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .LE "nn" .EQ delim $$ .EN .CH "10 DESIGNING THE MAN-COMPUTER DIALOGUE" .ds RT "The man-computer dialogue .ds CX "Principles of computer speech .pp Interactive computers are being used more and more by non-specialist people without much previous computer experience. As processing costs continue to decline, the overall expense of providing highly interactive systems becomes increasingly dominated by terminal and communications equipment. Taken together, these two factors highlight the need for easy-to-use, low-bandwidth interactive terminals that make maximum use of the existing telephone network for remote access. .pp Speech output can provide versatile feedback from a computer at very low cost in distribution and terminal equipment. It is attractive from several points of view. Terminals \(em telephones \(em are invariably in place already. People without experience of computers are accustomed to their use, and are not intimidated by them. The telephone network is cheap to use and extends all over the world. The touch-tone keypad (or a portable tone generator) provides a complementary data input device which will do for many purposes until the technology of speech recognition becomes better developed and more widespread. Indeed, many applications \(em especially information retrieval ones \(em need a much smaller bandwidth from user to computer than in the reverse direction, and voice output combined with restricted keypad entry provides a good match to their requirements. .pp There are, however, severe problems in implementing natural and useful interactive systems using speech output. The eye can absorb information at a far greater rate than can the ear. You can scan a page of text in a way which has no analogy in auditory terms. Even so, it is difficult to design a dialogue which allows you to search computer output visually at high speed. In practice, scanning a new report is often better done at your desk with a printed copy than at a computer terminal with a viewing program (although this is likely to change in the near future). .pp With speech, the problem of organizing output becomes even harder. Most of the information we learn using our ears is presented in a conversational way, either in face-to-face discussions or over the telephone. Verbal but non-conversational presentations, as in the university lecture theatre, are known to be a rather inefficient way of transmitting information. 
The degree of interaction is extremely high even in a telephone conversation, and communication relies heavily on speech gestures such as hesitations, grunts, and pauses; on prosodic features such as intonation, pitch range, tempo, and voice quality; and on conversational gambits such as interruption and long silence. I emphasized in the last two chapters the rudimentary state of knowledge about how to synthesize prosodic features, and the situation is even worse for the other, paralinguistic, phenomena. .pp There is also a very special problem with voice output, namely, the transient nature of the speech signal. If you miss an utterance, it's gone. With a visual display unit, at least the last few interactions usually remain available. Even then, it is not uncommon to look up beyond the top of the screen and wish that more of the history was still visible! This obviously places a premium on a voice response system's ability to repeat utterances. Moreover, the dialogue designer must do his utmost to ensure that the user is always aware of the current state of the interaction, for there is no opportunity to refresh the memory by glancing at earlier entries and responses. .pp There are two separate aspects to the man-computer interface in a voice response system. The first is the relationship between the system and the end user, that is, the "consumer" of the synthesized dialogue. The second is the relationship between the system and the applications programmer who creates the dialogue. These are treated separately in the next two sections. We will have more to say about the former aspect, for it is ultimately more important to more people. But the applications programmer's view is important, too; for without him no systems would exist! The technical difficulties in creating synthetic dialogues for the majority of voice systems probably explain why speech output technology is still greatly under-used. Finally we look at techniques for using small keypads such as those on touch-tone telephones, for they are an essential part of many voice response systems. .sh "10.1 Programming principles for natural interaction" .pp Special attention must be paid to the details of the man-machine interface in speech-output systems. This section summarizes experience of human factors considerations gained in developing the remote telephone enquiry service described in Chapter 1 (Witten and Madams, 1977), which employs an ordinary touch-tone keypad for input in conjunction with synthetic voice response. .[ Witten Madams 1977 Telephone Enquiry Service .] Most of the principles which emerged were the result of natural evolution of the system, and were not clear at the outset. Basically, they stem from the fact that speech is both more intrusive and more ephemeral than writing, and so they are applicable in general to speech output information retrieval systems with keyboard or even voice input. Be warned, however, that they are based upon casual observation and speculation rather than empirical research. There is a desperate need for proper studies of user psychology in speech systems. .rh "Echoing." Most alphanumeric input peripherals echo on a character-by-character basis. Although one can expect quite a high proportion of mistakes with unconventional keyboards, especially when entering alphabetic data on a basically numeric keypad, audio character echoing is distracting and annoying. If you type "123" and the computer echoes .LB .NI "one ... two ...
three" .LE after the individual key-presses, it is liable to divert your attention, for voice output is much more intrusive than a purely visual "echo". .pp Instead, an immediate response to a completed input line is preferable. This response can take the form or a reply to a query, or, if successive data items are being typed, confirmation of the data entered. In the latter case, it is helpful if the information can be generated in the same way that the user himself would be likely to verbalize it. Thus, for example, when entering numbers: .LB .nr x0 \w'COMPUTER:' .nr x1 \w'USER:' .NI USER:\h'\n(x0u-\n(x1u' "123#" (# is the end-of-line character) .NI COMPUTER: "One hundred and twenty-three." .LE For a query which requires lengthy processing, the input should be repeated in a neat, meaningful format to give the user a chance to abort the request. .rh "Retracting actions." Because commands are entered directly without explicit confirmation, it must always be easy for the user to revoke his actions. The utility of an "undo" command is now commonly recognized for any interactive system, and it becomes even more important in speech systems because it is easier for the user to lose his place in the dialogue and so make errors. .rh "Interrupting." A command which interrupts output and returns to a known state should be recognized at every level of the system. It is essential that voice output be terminated immediately, rather than at the end of the utterance. We do not want the user to live in fear of the system embarking on a long, boring monologue that is impossible to interrupt! Again, the same is true of interactive dialogues which do not use speech, but becomes particularly important with voice response because it takes longer to transmit information. .rh "Forestalling prompts." Computer-generated prompts must be explicit and frequent enough to allow new users to understand what they are expected to do. Experienced users will "type ahead" quite naturally, and the system should suppress unnecessary prompts under these conditions by inspecting the input buffer before prompting. This allows the user to concatenate frequently-used commands into chunks whose size is entirely at his own discretion. .pp With the above-mentioned telephone enquiry service, for example, it was found that people often took advantage of the prompt-suppression feature to enter their user number, password, and required service number as a single keying sequence. As you becomes familiar with a service you quickly and easily learn to forestall expected prompts by typing ahead. This provides a very natural way for the system to adapt itself automatically to the experience of the user. New users will naturally wait to be prompted, and proceed through the dialogue at a slower and more relaxed pace. .pp Suppressing unnecessary prompts is a good idea in any interactive system, whether or not it uses the medium of speech \(em although it is hardly ever done in conventional systems. It is particularly important with speech, however, because an unexpected or unwanted prompt is quite distracting, and it is not so easy to ignore it as it is with a visual display. Furthermore, speech messages usually take longer to present than displayed ones, so that the user is distracted for more time. .rh "Information units." Lengthy computer voice responses are inappropriate for conveying information, because attention wanders if one is not actively involved in the conversation. 
A sequential exchange of terse messages, each designed to dispense one small unit of information, forces the user to take a meaningful part in the dialogue. It has other advantages, too, allowing a higher degree of input-dependent branching, and permitting rapid recovery from errors. .pp The following extract from the "Acidosis program", an audio response system designed to help physicians to diagnose acidosis, is a good example of what .ul not to do. .LB "(Chime) A VALUE OF SIX-POINT-ZERO-ZERO HAS BEEN ENTERED FOR PH. THIS VALUE IS IMPOSSIBLE. TO CONTINUE THE PROGRAM, ENTER A NEW VALUE FOR PH IN THE RANGE BETWEEN SIX-POINT-SIX AND EIGHT-POINT-ZERO (beep dah beep-beep)" (Smith and Goodwin, 1970). .[ Smith Goodwin 1970 .] .LE The use of extraneous noises (for example, a "chime" heralds an error message, and a "beep dah beep-beep" requests data input) was thought necessary in the Acidosis program to keep the user awake and help him with the format of the interaction. Rather than a long monologue like this, it seems much better to design a sequential interchange of terse messages, so that the caller can be guided into a state where he can rectify his error. For example, .LB .nf .ne11 .nr x0 \w'COMPUTER:' .nr x1 \w'CALLER:' CALLER:\h'\n(x0u-\n(x1u' "6*00#" COMPUTER: "Entry out of range" CALLER:\h'\n(x0u-\n(x1u' "6*00#" (persists) COMPUTER: "The minimum acceptable pH value is 6.6" CALLER:\h'\n(x0u-\n(x1u' "9*03#" COMPUTER: "The maximum acceptable pH value is 8.0" .fi .LE This dialogue allows a rapid exit from the error situation in the likely event that the entry has simply been mis-typed. If the error persists, the caller is given just one piece of information at a time, and forced to continue to play an active role in the interaction. .rh "Input timeouts." In general, input timeouts are dangerous, because they introduce apparent acausality in the system as seen by the user. A case has been reported where a user became "highly agitated and refused to go near the terminal again after her first timed-out prompt. She had been quietly thinking what to do and the terminal suddenly interjecting and making its own suggestions was just too much for her" (Gaines and Facey, 1975). .[ Gaines Facey 1975 .] .pp However, voice response systems lack the satisfying visual feedback of end-of-line on termination of an entry. Hence a timed-out reminder is appropriate if a delay occurs after some characters have been entered. This requires the operating system to support a character-by-character mode of input, rather than the usual line-by-line mode. .rh "Repeat requests." Any voice response system must support a universal "repeat last utterance" command, because old output does not remain visible. A fairly sophisticated facility is desirable, as repeat requests are very frequent in practice. They may be due to a simple inability to understand a response, to forgetting what was said, or to distraction of attention \(em which is especially common with office terminals. .pp In the telephone enquiry service two distinct commands were employed, one to repeat the last utterance in case of misrecognition, and the other to summarize the current state of the interaction in case of distraction. For the former, it is essential to avoid simply regenerating an utterance identical with the last. Some variation of intonation and rhythm is needed to prevent an annoying, stereotyped response. A second consecutive repeat request should trigger a paraphrased reply.
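.pp
The bookkeeping behind these two commands is simple enough to sketch. The fragment below is an illustration only, and not code from the telephone enquiry service; the speak routine and its rate and pitch_range arguments are hypothetical stand-ins for whatever prosodic controls the synthesizer actually provides.
.LB
.nf
import random

def speak(text, rate=1.0, pitch_range=1.0):
    # Stand-in for the real synthesis interface; "rate" and "pitch_range"
    # are hypothetical prosody controls, used here only to vary successive
    # renderings of the same words.
    print("[rate %.2f, range %.2f] %s" % (rate, pitch_range, text))

class Dialogue:
    def __init__(self):
        self.last = None          # words of the most recent utterance
        self.paraphrase = None    # alternative wording, if the designer gave one
        self.repeats = 0          # consecutive repeat requests so far
        self.history = []         # last few transactions, for the summary command

    def say(self, text, paraphrase=None):
        self.last, self.paraphrase, self.repeats = text, paraphrase, 0
        self.history = (self.history + [text])[-3:]
        speak(text)

    def repeat_last(self):
        # First repeat: the same words with slightly varied prosody.
        # Second consecutive repeat: a paraphrase, if one is available.
        self.repeats = self.repeats + 1
        if self.repeats >= 2 and self.paraphrase:
            speak(self.paraphrase)
        else:
            speak(self.last, rate=random.uniform(0.95, 1.05),
                  pitch_range=random.uniform(0.9, 1.1))

    def where_am_i(self):
        # The second command: summarize the current state of the interaction
        # from a (crude) model of the user's recent transactions.
        for item in self.history:
            speak(item)

d = Dialogue()
d.say("Entry out of range",
      paraphrase="The value you keyed is outside the acceptable range")
d.repeat_last()   # same words, varied prosody
d.repeat_last()   # paraphrased reply
.fi
.LE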
An error recovery sequence could be used which presented the misunderstood information in a different way with more interaction, but experience indicates that this is of minor importance, especially if information units are kept small anyway. To summarize the current state of the interaction in response to the second type of repeat command necessitates the system maintaining a model of the user. Even a poor model, like a record of his last few transactions and their results, is well worth having. .rh "Varied speech." Synthetic speech is usually rather dreary to listen to. Successive utterances with identical intonations should be carefully avoided. Small changes in speaking rate, pitch range, and mean pitch level, all serve to add variety. Unfortunately, little is known at present about the role of intonation in interactive dialogue, although this is an active research area and new developments can be expected (for a detailed report of a recent research project relevant to this topic see Brown .ul et al, 1980). .[ Brown Currie Kenworthy 1980 .] However, even random variations in certain parameters of the pitch contour are useful to relieve the tedium of repetitive intonation patterns. .sh "10.2 The applications programming environment" .pp The comments in the last section are aimed at the applications programmer who is designing the dialogue and constructing the interactive system. But what kind of environment should .ul he be given to assist with this work? .pp The best help the applications programmer can have is a speech generation method which makes it easy for him to enter new utterances and modify them on-line in cut-and-try attempts to render the man-machine dialogue as natural as possible. This is perhaps the most important advantage of synthesizing speech by rule from a textual representation. If encoded versions of natural utterances are stored, it becomes quite difficult to make minor modifications to the dialogue in the light of experience with it, for a recording session must be set up to acquire new utterances. This is especially true if more than one voice is used, or if the voice belongs to a person who cannot be recalled quickly by the programmer to augment the utterance library. Even if it is his own voice there will still be delays, for recording speech is a real-time job which usually needs a stand-alone processor, and if data compression is used a substantial amount of computation will be needed before the utterance is in a useable form. .pp The broad phonetic input required by segmental speech synthesis-by-rule systems is quite suitable for utterance representation. Utterances can be entered quickly from a standard computer terminal, and edited as text files. Programmers must acquire skill in phonetic transcription, but this is a small inconvenience. The art is easily learned in an interactive situation where the effect of modifications to the transcription can be heard immediately. If allophones must be represented explicitly in the input then the programmer's task becomes considerably more complicated because of the combinatorial explosion in trial-and-error modifications. .pp Plain text input is also quite suitable. A significant rate of error is tolerable if immediate audio feedback of the result is available, so that the operator can adjust his text to suit the pronunciation idiosyncrasies of the program. 
But it is acceptable, and indeed preferable, if prosodic features are represented explicitly in the input rather than being assigned automatically by a computer program. .pp The application of voice response to interactive computer dialogue is quite different to the problem of reading aloud from text. We have seen that a major concern with reading machines is how to glean information about intonation, rhythm, emphasis, tone of voice, and so on, from an input of ordinary English text. The significant problems of semantic processing, utilization of pragmatic knowledge, and syntactic analysis do not, fortunately, arise in interactive information retrieval systems. In these, the end user is communicating with a program which has been created by a person who knows what he wants it to say. Thus the major difficulty is in .ul describing the prosodic features rather than .ul deriving them from text. .pp Speech synthesis by rule is a subsidiary process to the main interactive procedure. It would be unwise to allow the updating of resonance parameter tracks to be interrupted by other calls on the system, and so the synthesis process needs to be executed in real time. If a stand-alone processor is used for the interactive dialogue, it may be able to handle the synthesis rules as well. In this case the speech-by-rule program could be a library procedure, if the system is implemented in a compiled language. An interesting alternative with an interpretive-language implementation, such as Basic, is to alter the language interpreter to add a new command, "speak", which simply transfers a string representing an utterance to an asynchronous process which synthesizes it. However, there must be some way for an interpreted program to abort the current synthesis in the event of an interrupt signal from the user. .pp If the main computer system is time-shared, the synthesis-by-rule procedure is best executed by an independent processor. For example, a 16-bit microcomputer controlling a hardware formant synthesizer has been used to run the ISP system in real time without too much difficulty (Witten and Abbess, 1979). .[ Witten Abbess 1979 .] An important task is to define an interface between the two which allows the main process to control relevant aspects of the prosody of the speech in a way which is appropriate to the state of the interaction, without having to bother about such things as matching the intonation contour to the utterance and the details of syllable rhythm. Halliday's notation appears to be quite suitable for this purpose. .pp If there is only one synthesizer on the system, there will be no difficulty in addressing it. One way of dealing with multiple synthesizers is to treat them as assignable devices in the same way that non-spooling peripherals are in many operating systems. Notice that the data rate to the synthesizer is quite low if the utterance is represented as text with prosodic markers, and can easily be handled by a low-speed asynchronous serial line. .pp The Votrax ML-I synthesizer which is discussed in the next chapter has an interface which interposes it between a visual display unit and the serial port that connects it to the computer. The VDU terminal can be used quite normally, except that a special sequence of two control characters will cause Votrax to intercept the following message up to another control character, and interpret it as speech.
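.pp
The programming convention this interface invites is easily sketched. In the fragment below the control sequences are placeholders, not the actual ML-I codes, and the two routines merely illustrate the idea of sending ordinary text and bracketed speech down the same serial line.
.LB
.nf
import io

# Hypothetical control sequences: the actual ML-I control characters are not
# reproduced here, and these placeholders merely illustrate the idea.
SPEECH_ON = bytes([1, 2])    # two control characters that open a spoken message
SPEECH_OFF = bytes([3])      # a further control character that closes it

def display(line, text):
    # Ordinary characters pass straight through to the VDU screen.
    line.write(text.encode("ascii"))

def speak(line, segments):
    # Characters bracketed by the control sequences are intercepted by the
    # synthesizer and interpreted as sound segments instead of being displayed.
    line.write(SPEECH_ON + segments.encode("ascii") + SPEECH_OFF)

# Usage with a dummy serial line standing in for the real one.
line = io.BytesIO()
display(line, "Entry out of range")
speak(line, "...")   # the ASCII sound-segment codes would go here
.fi
.LE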
The fact that the characters which specify the spoken message do not appear on the VDU screen means that the operation is invisible to the user. However, this transparency can be inhibited by a switch on the synthesizer to allow visual checking of the sound-segment character sequence. .pp Votrax buffers up to 64 sound segments, which is sufficient to generate isolated spoken messages. For longer passages, it can be synchronized with the constant-rate serial output using the modem control lines of the serial interface, together with appropriate device-driving software. .pp This is a particularly convenient interfacing technique in cases when the synthesizer should always be associated with a certain terminal. As an example of how it can be used, one can arrange files each of whose lines contain a printed message, together with its Votrax equivalent bracketed by the appropriate control characters. When such a file is listed, or examined with an editor program, the lines appear simultaneously in spoken and typed English. .pp If a phonetic representation is used for utterances, with real-time synthesis using a separate process (or processor), it is easy for the programmer to fiddle about with the interactive dialogue to get it feeling right. For him, each utterance is just a textual string which can be stored as a string constant within his program just as a VDU prompt would be. He can edit it as part of his program, and "print" it to the speech synthesis device to hear it. There are no more technical problems to developing an interactive dialogue with speech output than there are for a conventional interactive program. Of course, there are more human problems, and the points discussed in the last section should always be borne in mind. .sh "10.3 Using the keypad" .pp One of the greatest advantages of speech output from computers is the ubiquity of the telephone network and the possibility of using it without the need for special equipment at the terminal. The requirement for input as well as output obviously presents something of a problem because of the restricted nature of the telephone keypad. .pp Figure 10.1 shows the layout of the keypad. .FC "Figure 10.1" Signalling is achieved by dual-frequency tones. For example, if key 7 is pressed, sinusoidal components at 852\ Hz and 1209\ Hz are transmitted down the line. During the process of dialling these are received by the telephone exchange equipment, which assembles the digits that form a number and attempts to route the call appropriately. Once a connection is made, either party is free to press keys if desired and the signals will be transmitted to the other end, where they can be decoded by simple electronic circuits. .pp Dial telephones signal with closely-spaced dial pulses. One pulse is generated for a "1", two for a "2", and so on. (Obviously, ten pulses are generated for a "0", rather than none!) Unfortunately, once the connection is made it is difficult to signal with dial pulses. They cannot be decoded reliably at the other end because the telephone network is not designed to transmit such low frequencies. However, hand-held tone generators can be purchased for use with dial telephones. Although these are undeniably extra equipment, and one purpose of using speech output is to avoid this, they are very cheap and portable compared with other computer terminal equipment. .pp The small number of keys on the telephone pad makes it rather difficult to use for communicating with computers. 
Provision is made for 16 keys, but only 12 are implemented \(em the others may be used for some military purposes. Of course, if a separate tone generator is used then advantage can be taken of the extra keys, but this will introduce incompatibility with those who use unmodified touch-tone phones. More sophisticated terminals are available which extend the keypad \(em such as the Displayphone of Northern Telecommunications. However, they are designed as a complete communications terminal and contain their own visual display as well. .rh "Keying alphabetic data." Figure 10.2 shows the near-universal scheme for overlaying alphabetic letters on to the telephone keypad. .FC "Figure 10.2" Since more than one symbol occupies each key, it is obviously necessary to have multiple keystrokes per character if the input sequence is to be decodable as a string of letters. One way of doing this is to depress the appropriate button the number of times corresponding to the position of the letter on it. For example, to enter the letter "L" the user would key the "5" button three times in rapid succession. Keying rhythm must be used to distinguish the four entries "J\ J\ J", "J\ K", "K\ J", and "L", unless one of the bottom three buttons is used as a separator. A different method is to use "*", "0", and "#" as shift keys to indicate whether the first, second, or third letter on a key is intended. Then "#5" would represent "L". Alternatively, the shift could follow the key instead of preceding it, so that "5#" represented "L". .pp If numeric as well as alphabetic information may be entered, a mode-shift operation is commonly used to switch between numeric and alphabetic modes. .pp The relative merits of these three methods, multiple depressions, shift key prefix, and shift key suffix, have been investigated experimentally (Kramer, 1970). .[ Kramer 1970 .] The results were rather inconclusive. The first method seemed to be slightly inferior in terms of user accuracy. It seemed that preceding rather than following shifts gave higher accuracy, although this is perhaps rather counter-intuitive and may have been fortuitous. The most useful result from the experiments was that users exhibited significant learning behaviour, and a training period of at least two hours was recommended. Operators were found able to key at rates of at least three to four characters per second, and faster with practice. .pp If a greater range of characters must be represented then the coding problem becomes more complex. Figure 10.3 shows a keypad which can be used for entry of the full 64-character standard upper-case ASCII alphabet (Shew, 1975). .[ Shew 1975 .] .FC "Figure 10.3" The system is intended for remote vocabulary updating in a phonetically-based speech synthesis system. There are three modes of operation: numeric, alphabetic, and symbolic. These are entered by "##", "**", and "*0" respectively. Two function modes, signalled by "#0" and "#*", allow some rudimentary line-editing and monitor facilities to be incorporated. Line-editing commands include character and line delete, and two kinds of read-back commands \(em one tries to pronounce the words in a line and the other spells out the characters. The monitor commands allow the user to repeat the effect of the last input line as though he had entered it again, to order the system to read back the last complete output line, and to query time and system status. .rh "Incomplete keying of alphanumeric data." 
It is obviously going to be rather difficult for the operator to key alphanumeric information unambiguously on a 12-key pad. In the description of the telephone enquiry service in Chapter 1, it was mentioned that single-key entry can be useful for alphanumeric data if the ambiguity can be resolved by the computer. If a multiple-character entry is known to refer to an item on a given list, the characters can be keyed directly according to the coding scheme of Figure 10.2. .pp Under most circumstances no ambiguity will arise. For example, Table 10.1 shows the keystrokes that would be entered for the first 50 5-letter words in an English dictionary. Only two clashes occur \(em between " adore" and "afore", and "agate" and "agave". .RF .nr x2 \w'abeam 'u .nr x3 \w'00000# 'u .nr x0 \n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\w'00000#'u .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u \l'\n(x0u\(ul' .sp aback 22225# abide 22433# adage 23243# adore 23673# after 23837# abaft 22238# abode 22633# adapt 23278# adorn 23676# again 24246# abase 22273# abort 22678# adder 23337# adult 23858# agape 24273# abash 22274# about 22688# addle 23353# adust 23878# agate 24283# abate 22283# above 22683# adept 23378# aeger 23437# agave 24283# abbey 22239# abuse 22873# adieu 23438# aegis 23447# agent 24368# abbot 22268# abyss 22977# admit 23648# aerie 23743# agile 24453# abeam 22326# acorn 22676# admix 23649# affix 23349# aglet 24538# abele 22353# acrid 22743# adobe 23623# afoot 23668# agony 24669# abhor 22467# actor 22867# adopt 23678# afore 23673# agree 24733# \l'\n(x0u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 10.1 Keying equivalents of some words" As a more extensive example, in a dictionary of 24,500 words, just under 2,000 ambiguities (8% of words) were discovered. Such ambiguities would have to be resolved interactively by the system explaining its dilemma, and asking the user for a choice. Notice incidentally that although the keyed sequences do not have the same lexicographic order as the words, no extra cost will be associated with the table-searching operation if the dictionary is stored in inverted form, with each legal number pointing to its English equivalent or equivalents. .pp A command language syntax is also a powerful way of disambiguating keystrokes entered. Figure 10.4 shows the keypad layout for a telephone voice calculator (Newhouse and Sibley, 1969). .[ Newhouse Sibley 1969 .] .FC "Figure 10.4" This calculator provides the standard arithmetic operators, ten numeric registers, a range of pre-defined mathematical functions, and even the ability for a user to enter his own functions over the telephone. The number representation is fixed-point, with user control (through a system function) over the precision. Input of numbers is free format. .pp Despite the power of the calculator language, the dialogue is defined so that each keystroke is unique in context and never has to be disambiguated explicitly by the user. Table 10.2 summarizes the command language syntax in an informal and rather heterogeneous notation. 
.RF .nr x0 1.3i+1.7i+\w'some functions do not need the part'u .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 1.3i +1.7i \l'\n(x0u\(ul' construct definition explanation \l'\n(x0u\(ul' .sp a sequence of s followed by a call to the system function \fIE X I T\fR .sp OR OR OR OR OR OR OR OR OR OR OR .sp + # OR + # .sp similar to .sp OR \fIregister\fR .sp a sequence of keystrokes like 1 . 2 3 4 or 1 2 3 . 4 or 1 2 3 4 .sp \fIfunction\fR # # some functions do not need the part .sp a sequence of keystrokes like \fIS I N\fR or \fIE X I T\fR or \fIM Y F U N C\fR .sp \fIclear register\fR # clears one of the 10 registers .sp \fIerase\fR # undoes the effect of the last operation .sp \fIanswer register\fR # reads the contents of a register .sp these provide "repeat" facilities .sp aborts the current utterance \l'\n(x0u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 10.2 Syntax for a telephone calculator" A calculation is a sequence of operations followed by an EXIT function call. There are twelve different operations, one for each button on the keypad. Actually, two of them \(em .ul cancel and .ul function \(em share the same key so that "#" can be reserved for use as a separator; but the context ensures that they cannot be confused by the system. .pp Six of the operations give control over the dialogue. There are three different "repeat" commands; a command (called .ul erase\c ) which undoes the effect of the last operation; one which reads out the value of a register; and one which aborts the current utterance. Four more commands provide the basic arithmetic operations of add, subtract, multiply, and divide. The operands of these may be keyed literal numbers, or register values, or function calls. A further command clears a register. .pp It is through functions that the extensibility of the language is achieved. A function has a name (like SIN, EXIT, MYFUNC) which is keyed with an appropriate single-key-per-character sequence (namely 746, 3948, 693862 respectively). One function, DEFINE, allows new ones to be entered. Another, LOOP, repeats sequences of operations. TEST incorporates arithmetic testing. The details of these are not important: what is interesting is the evident power of the calculator. .pp For example, the keying sequence .LB .NI 5 # 1 1 2 3 # 2 1 . 2 # 9 # 6 # 2 1 . 4 # .LE would be decoded as .LB .NI .ul clear\c + 123 \- 1.2 \c .ul display erase\c \- 1.4. .LE One of the difficulties with such a tight syntax is that almost any sequence will be intepreted as a valid calculation \(em syntax errors are nearly impossible. Thus a small mistake by the user can have a catastrophic effect on the calculation. Here, however, speech output gives an advantage over conventional character-by-character echoing on visual displays. It is quite adequate to echo syntactic units as they are decoded, instead of echoing keys as they are entered. It was suggested earlier in this chapter that confirmation of entry should be generated in the same way that the user would be likely to verbalize it himself. Thus the synthetic voice could respond to the above keying sequence as shown in the second line, except that the .ul display command would also state the result (and possibly summarize the calculation so far). Numbers could be verbalized as "one hundred and twenty-three" instead of as "one ... two ... three". (Note, however, that this will make it necessary to await the "#" terminator after numbers and function names before they can be echoed.) 
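.pp
To round off this discussion of keypad input, here is a small sketch of the single-key-per-character coding of Figure 10.2 and the kind of clash detection that Table 10.1 illustrates. It assumes the usual letter assignments, with Q and Z absent from the pad, and it is an illustration rather than part of any of the systems cited above.
.LB
.nf
from collections import defaultdict

# Letter-to-key assignments of the usual overlay (Figure 10.2); Q and Z do
# not appear on the classic touch-tone pad.
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "prs", "8": "tuv", "9": "wxy"}
LETTER = {c: k for k, letters in KEYS.items() for c in letters}

def encode(word):
    # Digit sequence keyed for a word, e.g. "adore" -> "23673" (plus the
    # "#" terminator when actually entered).
    return "".join(LETTER[c] for c in word.lower())

def clashes(vocabulary):
    # Invert the dictionary: group words by keying sequence; any group with
    # more than one member must be resolved interactively.
    inverted = defaultdict(list)
    for word in vocabulary:
        inverted[encode(word)].append(word)
    return {k: words for k, words in inverted.items() if len(words) > 1}

print(encode("aback"))                                    # 22225
print(clashes(["adore", "afore", "agate", "agave", "agent"]))
# {'23673': ['adore', 'afore'], '24283': ['agate', 'agave']}
.fi
.LE
Storing the word list in this inverted form, keyed by digit sequence, is what makes the dictionary look-up itself cost nothing extra.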
.sh "10.4 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "10.5 Further reading" .pp There are no books which relate techniques of man-computer dialogue to speech interaction. The best I can do is to guide you to some of the standard works on interactive techniques. .LB "nn" .\"Gilb-1977-1 .]- .ds [A Gilb, T. .as [A " and Weinberg, G.M. .ds [D 1977 .ds [T Humanized input .ds [I Winthrop .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This book is subtitled "techniques for reliable keyed input", and considers most aspects of the problem of data entry by professional key operators. .in-2n .\"Martin-1973-2 .]- .ds [A Martin, J. .ds [D 1973 .ds [T Design of man-computer dialogues .ds [I Prentice-Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Martin concerns himself with all aspects of man-computer dialogue, and the book even contains a short chapter on the use of voice response systems. .in-2n .\"Smith-1980-3 .]- .ds [A Smith, H.T. .as [A " and Green, T.R.G.(Editors) .ds [D 1980 .ds [T Human interaction with computers .ds [I Academic Press .ds [C London .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n A recent collection of contributions on man-computer systems and programming research. .in-2n .LE "nn" .EQ delim $$ .EN .CH "11 COMMERCIAL SPEECH OUTPUT DEVICES" .ds RT "Commercial speech output devices .ds CX "Principles of computer speech .pp This chapter takes a look at four speech output peripherals that are available today. It is risky in a book of this nature to descend so close to the technology as to discuss particular examples of commercial products, for such information becomes dated very quickly. Nevertheless, having covered the principles of various types of speech synthesizer, and the methods of driving them from widely differing utterance representations, it seems worthwhile to see how these principles are embodied in a few products actually on the market. .pp Developments in electronic speech devices are moving so fast that it is hard to keep up with them, and the newest technology today will undoubtedly be superseded next year. Hence I have not tried to choose examples from the very latest technology. Instead, this chapter discusses synthesizers which exemplify rather different principles and architectures, in order to give an idea of the range of options which face the system designer. .pp Three of the devices are landmarks in the commercial adoption of speech technology, and have stood the test of time. Votrax was introduced in the early 1970's, and has been re-implemented several times since in an attempt to cover different market sectors. The Computalker appeared in 1976. It was aimed primarily at the burgeoning computer hobbies market. One of its most far-reaching effects was to stimulate the interest of hobbyists, always eager for new low-cost peripherals, in speech synthesis; and so provide a useful new source of experimentation and expertise which will undoubtedly help this heretofore rather esoteric discipline to mature. Computalker is certainly the longest-lived and probably still the most popular hobbyist's speech synthesizer. The Texas Instruments speech synthesis chip brought speech output technology to the consumer. It was the first single-chip speech synthesizer, and is still the biggest seller. It forms the heart of the "Speak 'n Spell" talking toy which appeared in toyshops in the summer of 1978. 
Although talking calculators had existed several years before, they were exotic gadgets rather than household toys. .sh "11.1 Formant synthesizer" .pp The Computalker is a straightforward implementation of a serial formant synthesizer. A block diagram of it is shown in Figure 11.1. .FC "Figure 11.1" In the centre is the main vocal tract path, with three formant filters whose resonant frequencies can be controlled individually. A separate nasal branch in parallel with the oral one is provided, with a nasal formant of fixed frequency. It is less important to allow for variation of the nasal formant frequency than it is for the oral ones, because the size and shape of the nasal tract is relatively fixed. However, it is essential to control the nasal amplitude, in particular to turn it off during non-nasal sounds. Computalker provides independent oral and nasal amplitude parameters. .pp Unvoiced excitation can be passed through the main vocal tract through the aspiration amplitude control AH. In practice, the voicing amplitudes AV and AN will probably always be zero when AH is non-zero, for physiological constraints prohibit simultaneous voicing and aspiration. A second unvoiced excitation path passes through a fricative formant filter whose resonant frequency can be varied, and has its amplitude independently controlled by AF. .rh "Control parameters." Table 11.1 summarizes the nine parameters which drive Computalker. .RF .nr x0 \w'address0'+\w'fundamental frequency of voicing00'+\w'0 bits0'+\w'logarithmic00'+\w'0000\-00000 Hz' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'000'u \w'address0'u +\w'fundamental frequency of voicing00'u +\w'0 bits0'u +\w'logarithmic00'u address meaning width \0\0\0range \l'\n(x0u\(ul' .sp \00 AV amplitude of voicing 8 bits \01 AN nasal amplitude 8 bits \02 AH amplitude of aspiration 8 bits \03 AF amplitude of frication 8 bits \04 FV fundamental frequency of voicing 8 bits logarithmic \0\075\-\0\0470 Hz \05 F1 formant 1 resonant frequency 8 bits logarithmic \0170\-\01450 Hz \06 F2 formant 2 resonant frequency 8 bits logarithmic \0520\-\04400 Hz \07 F3 formant 3 resonant frequency 8 bits logarithmic 1700\-\05500 Hz \08 FF fricative resonant frequency 8 bits logarithmic 1700\-14000 Hz \09 not used 10 not used 11 not used 12 not used 13 not used 14 not used 15 SW audio on-off switch 1 bit \l'\n(x0u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 11.1 Computalker control parameters" Four of them control amplitudes, while the others control frequencies. In the latter case the parameter value is logarithmically related to the actual frequency of the excitation (FV) or resonance (F1, F2, F3, FF). The ranges over which each frequency can be controlled is shown in the Table. An independent calibration of one particular Computalker has shown that the logarithmic specifications are met remarkably well. .pp Each parameter is specified to Computalker as an 8-bit number. Parameters are addressed by a 4-bit code, and so a total of 12 bits is transferred in parallel to Computalker from the computer for each parameter update. Parameters 9 to 14 are unassigned ("reserved for future expansion" is the official phrase), and the last parameter, SW, governs the position of an audio on-off switch. .pp Computalker does not contain a clock that is accessible to the user, and so the timing of parameter updates is entirely up to the host computer. Typically, a 10\ msec interval between frames is used, with interrupts generated by a separate timer. 
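.pp
As a concrete sketch of this update cycle, the following fragment assembles one frame of parameter transfers. The 4-bit addresses follow Table 11.1, but the placement of the address in the high four bits of the 12-bit word, the out routine, and the particular parameter values are all assumptions made purely for illustration.
.LB
.nf
# One 10 msec Computalker frame, as a sketch.  Each transfer is a 12-bit
# word: a 4-bit parameter address (Table 11.1) plus an 8-bit value.
ADDRESS = {"AV": 0, "AN": 1, "AH": 2, "AF": 3, "FV": 4,
           "F1": 5, "F2": 6, "F3": 7, "FF": 8, "SW": 15}

def out(word):
    print(format(word, "012b"))      # stand-in for the parallel transfer

def send_frame(frame):
    # Send an 8-bit value for each of the nine control parameters.
    for name in ("AV", "AN", "AH", "AF", "FV", "F1", "F2", "F3", "FF"):
        out((ADDRESS[name] << 8) | (frame[name] & 0xFF))

# An illustrative voiced frame: voicing on, formants set, noise sources off.
send_frame({"AV": 200, "AN": 0, "AH": 0, "AF": 0,
            "FV": 128, "F1": 90, "F2": 150, "F3": 200, "FF": 0})
.fi
.LE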
In fact the frame interval can be anywhere between 2\ msec and 50\ msec, and can be changed to alter the rate of speaking. However, it is rather naive to view fast speech as slow speech speeded up by a linear time compression, for in human speech production the rhythm changes and elisions occur in a rather more subtle way. Thus it is not particularly useful to be able to alter the frame rate. .pp At each interrupt, the host computer transfers values for all of the nine parameters to Computalker, a total of 108 data bits. In theory, perhaps, it is only necessary to transmit those parameters whose values have changed; but in practice all of them should be updated regardless. This is because the parameters are stored for the duration of the frame in analogue sample-and-hold devices. Essentially, the parameter value is represented as the charge on a capacitor. In time \(em and it takes only a short time \(em the values drift. Although the drift over 10\ msec is insignificant, it becomes very noticeable over longer time periods. If parameters are not updated at all, the result is a "whooosh" sound up to maximum amplitude, in a period of a second or two. Hence it is essential that Computalker be serviced by the computer regularly, to update all its parameters. The audio on-off switch is provided so that the computer can turn off the sound directly if another program, which does not use the device, is to be run. .rh "Filter implementation." It is hard to get definite information on the implementation of Computalker. Because it is a commercial device, circuit diagrams are not published. It is certainly an analogue rather than a digital implementation. The designer suggests that a configuration like that of Figure 11.2 is used for the formant filters (Rice, 1976). .[ Rice 1976 Byte .] .FC "Figure 11.2" Control is obtained over the resonant frequency by varying the resistance at the bottom in sympathy with the parameter value. The middle two operational amplifiers can be modelled by a resistance $-R/k$ in the forward path, where k is the digital control value. This gives the circuit in Figure 11.3, which can be analysed to obtain the transfer function .LB .EQ - ~ k over {R~R sub 1 C sub 2 C sub 3} ~ . ~ {R sub 2 C sub 2 ~s ~+~1} over { s sup 2 ~+~~ ( 1 over {R sub 3 C sub 3} ~+~ {k R sub 2} over {R~R sub 1 C sub 3})~s ~~+~ k over {R~R sub 1 C sub 2 C sub 3}} ~ . .EN .LE .FC "Figure 11.3" .pp This expression has a DC gain of \-1, and the denominator is similar to those of the analogue formant resonators discussed in Chapter 5. However, unlike them the transfer function has a numerator which creates a zero at .LB .EQ s~~=~~-~ 1 over {R sub 2 C sub 2} ~ . .EN .LE If $R sub 2 C sub 2$ is sufficiently small, this zero will have negligible effect at audio frequencies, and the filter has the following parameters: .LB centre frequency: $~ mark 1 over {2 pi}~~( k over {R~R sub 1 C sub 2 C sub 3} ~ ) sup 1/2$ Hz .sp bandwidth:$lineup 1 over {2 pi}~~( 1 over {R sub 3 C sub 3}~+~ {k R sub 2} over {R~R sub 1 C sub 3} ~ )$ Hz. .LE .pp Note first that the centre frequency is proportional to the square root of the control value $k$. Hence a non-linear transformation must be implemented on the control signal, after D/A conversion, to achieve the required logarithmic relationship between parameter value and resonant frequency. The formant bandwidth is not constant, as it should be (see Chapter 5), but depends upon the control value $k$. 
This dependency can be minimized by selecting component values such that .LB .EQ {k R sub 2} over {R~R sub 1 C sub 3}~~<<~~1 over {R sub 3 C sub 3} .EN .LE for the largest value of $k$ which can occur. Then the bandwidth is solely determined by the time constant $R sub 3 C sub 3$. .pp The existence of the zero can be exploited for the fricative resonance. This should have zero DC gain, and so the component values for the fricative filter should make the time-constant $R sub 2 C sub 2$ large enough to place the zero sufficiently near the frequency origin. .rh "Market orientation." As mentioned above, Computalker is designed for the computer hobbies market. Figure 11.4 shows a photograph of the device. .FC "Figure 11.4" It plugs into the S\-100 bus which has been a .ul de facto standard for hobbyists for several years, and has recently been adopted as a standard by the Institute of Electrical and Electronic Engineers. This makes it immediately accessible to many microcomputer systems. .pp An inexpensive synthesis-by-rule program, which runs on the popular 8080 microprocessor, is available to drive Computalker. The input is coded in a machine-readable version of the standard phonetic alphabet, similar to that which was introduced in Chapter 2 (Table 2.1). Stress digits may appear in the transcription, and the program caters for five levels of stress. The punctuation mark at the end of an utterance has some effect on pitch. The program is perhaps remarkable in that it occupies only 6\ Kbyte of storage (including phoneme tables), and runs on an 8-bit microprocessor (but not in real time). It is, however, .ul un\c remarkable in that it produces rather poor speech. According to a demonstration cassette, "most people find the speech to be readily intelligible, especially after a little practice listening to it," but this seems extremely optimistic. It also cunningly insinuates that if you don't understand it, you yourself may share the blame with the synthesizer \(em after all, .ul most people do! Nevertheless, Computalker has made synthetic speech accessible to a large number of home computer users. .sh "11.2 Sound-segment synthesizer" .pp Votrax was the first fully commercial speech synthesizer, and at the time of writing is still the only off-the-shelf speech output peripheral (as distinct from reading machine) which is aimed specifically at synthesis-by-rule rather than storage of parameter tracks extracted from natural utterances. Figure 11.5 shows a photograph of the Votrax ML-I. .FC "Figure 11.5" .pp Votrax accepts as input a string of codes representing sound segments, each with additional bits to control the duration and pitch of the segment. In the earlier versions (eg model VS-6) there are 63 sound segments, specified by a 6-bit code, and two further bits accompany each segment to provide a 4-level control over pitch. Four pitch levels are quite inadequate to generate acceptable intonation contours for anything but isolated words spoken in citation form. However, a later model (ML-I) uses an 8-level pitch specification, as well as a 4-level duration qualifier, associated with each sound segment. It provides a vocabulary of 80 sound segments, together with an additional code which allows local amplitude modifications and extra duration alterations to following segments. A further, low-cost model (VS-K) is now available which plugs in to the S\-100 bus, and is aimed primarily at computer hobbyists. 
It provides no pitch control at all and is therefore quite unsuited to serious voice response applications. The device has recently been packaged as an LSI circuit (model SC\-01), using analogue switched-capacitor filter technology. .pp One point where the ML-I scores favourably over other speech synthesis peripherals is the remarkably convenient engineering of its computer interface, which was outlined in the previous chapter. .pp The internal workings of Votrax are not divulged by the manufacturer. Figure 11.6 shows a block diagram at the level of detail that they supply. .FC "Figure 11.6" It seems to be essentially a formant synthesizer with analogue function generators and parameter smoothing circuits that provide transitions between sound segments. .rh "Sound segments." The 80 segments of the high-range ML-I model are summarized in Table 11.2. .FC "Table 11.2" They are divided into phoneme classes according to the classification discussed in Chapter 2. The segments break down into the following categories. (Numbers in parentheses are the corresponding figures for VS-6.) .LB "00 (00) " .NI "00 (00) " 11 (11) vowel sounds which are representative of the phonological vowel classes for English .NI "00 (00) " \09 \0(7) vowel allophones, with slightly different sound qualities from the above .NI "00 (00) " 20 (15) segments whose sound qualities are identical to the segments above, but with different durations .NI "00 (00) " 22 (22) consonant sounds which are representative of the phonological consonant classes for English .NI "00 (00) " 11 \0(6) consonant allophones .NI "00 (00) " \04 \0(0) segments to be used in conjunction with unvoiced plosives to increase their aspiration .NI "00 (00) " \02 \0(2) silent segments, with different pause durations .NI "00 (00) " \01 \0(0) very short silent segment (about 5\ msec). .LE "00 (00) " Somewhat under half of the 80 elements can be put into one-to-one correspondence with the phonemes of English; the rest are either allophonic variations or additional sounds which can sensibly be combined with certain phonemes in certain contexts. The Votrax literature, and consequently Votrax users, persists in calling all elements "phonemes", and this can cause considerable confusion. I prefer to use the term "sound segment" instead, reserving "phoneme" for its proper linguistic use. .pp The rules which Votrax uses for transitions between sound segments are not made public by the manufacturer, and are embedded in encapsulated circuits in the hardware. They are clearly very crude. The key to successful encoding of utterances is to use the many non-phonemic segments in an appropriate way as transitions between the main segments which represent phonetic classes. This is a tricky process, and I have heard of one commercial establishment giving up in despair at the extreme difficulty of generating the utterances it wanted. It probably explains the proliferation of letter-to-sound rules for Votrax which have been developed in research laboratories (Colby .ul et al, 1978; Elovitz .ul et al, 1976; McIlroy, 1974; Sherwood, 1978). .[ Colby Christinaz Graham 1978 .] .[ Elovitz 1976 IEEE Trans Acoustics Speech and Signal Processing .] .[ McIlroy 1974 .] .[ Sherwood 1978 .] Nevertheless, with luck, skill, and especially persistence, excellent results can be obtained. The ML-I manual (Votrax, 1976) contains a list of about 625 words and short phrases, and they are usually clearly recognizable. .[ Votrax 1976 .] .rh "Duration and pitch qualifiers." 
Each sound segment has a different duration. Table 11.2 shows the measured duration of the segments, although no calibration data is given by Votrax. As mentioned earlier, a 2-bit number accompanies each segment to modify its duration, and this was set to 3 (least duration) for the measurements. The qualifier has a multiplicative effect, shown in Table 11.3. .RF .nr x1 (\w'rate qualifier'/2) .nr x2 (\w'in Table 11.2 by'/2) .nr x0 \n(x1+2i+\w'00'+\n(x2 .nr x3 (\n(.l-\n(x0)/2 .in \n(x3u .ta \n(x1u +2i \l'\n(x0u\(ul' .sp .nr x2 (\w'multiply duration'/2) rate qualifier \0\0\h'-\n(x2u'multiply duration .nr x2 (\w'in Table 11.2 by'/2) \0\0\h'-\n(x2u'in Table 11.2 by \l'\n(x0u\(ul' .sp 3 1.00 2 1.11 1 1.22 0 1.35 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 11.3 Effect of the 2-bit per-segment rate qualifier" .pp As well as the 2-bit rate qualifier, each sound segment is accompanied by a 3-bit pitch specification. This provides a linear control over fundamental frequency, and Table 11.4 shows the measured values. .RF .nr x1 (\w'pitch specifier'/2) .nr x2 (\w'pitch (Hz)'/2) .nr x0 \n(x1+1.5i+\n(x2 .nr x3 (\n(.l-\n(x0)/2 .in \n(x3u .ta \n(x1u +1.5i \l'\n(x0u\(ul' .sp pitch specifier \h'-\n(x2u'pitch (Hz) \l'\n(x0u\(ul' .sp 0 \057.5 1 \064.1 2 \069.4 3 \075.8 4 \080.6 5 \087.7 6 \094.3 7 100.0 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 11.4 Effect of the 3-bit per-segment pitch specifier" The quantization interval varies from one to two semitones. Votrax interpolates pitch from phoneme to phoneme in a highly satisfactory manner, and this permits surprisingly sophisticated intonation contours to be generated considering the crude 8-level quantization. .pp The notation in which the Votrax manual defines utterances gives duration qualifiers and pitch specifications as digits preceding the sound segment, and separated from it by a slash (/). Thus, for example, .LB 14/THV .LE defines the sound segment THV with duration qualifier 1 (multiplies the 70\ msec duration of Table 11.2 by 1.22 \(em from Table 11.3 \(em to give 85\ msec) and pitch specification 4 (81 Hz). This representation of a segment is transformed into two ASCII characters before transmission to the synthesizer. .rh "Converting a phonetic transcription to sound segments." It would be useful to have a computer procedure to produce a specification for an utterance in terms of Votrax sound segments from a standard phonetic transcription. This could remove much of the tedium from utterance preparation by incorporating the contextual rules given in the Votrax manual. Starting with a phonetic transcription, each phoneme should be converted to its default Votrax representative. The resulting "wide" Votrax transcription must be transformed into a "narrow" one by application of contextual rules. Separate rules are needed for .LB .NP vowel clusters (diphthongs) .NP vowel transitions (ie consonant-vowel and vowel-consonant, where the vowel segment is altered) .NP intervocalic consonants .NP consonant transitions (ie consonant-vowel and vowel-consonant, where the consonant segment is altered) .NP consonant clusters .NP stressed-syllable effects .NP utterance-final effects. .LE Stressed-syllable effects (which include extra aspiration for unvoiced stops beginning stressed syllables) can be applied only if stress markers are included in the phonetic transcription. 
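.pp
Before turning to the contextual rules themselves, the qualifier arithmetic just described can be made concrete with a short sketch. It simply picks the rate qualifier and pitch code nearest to a target duration and pitch, reproducing the 14/THV example; choosing the nearest values in this way is my own simplification, not necessarily what any particular driving program does.
.LB
.nf
# Per-segment qualifier arithmetic from Tables 11.3 and 11.4.  Only THV's
# nominal duration is shown; the full set would come from Table 11.2.
RATE_FACTOR = {3: 1.00, 2: 1.11, 1: 1.22, 0: 1.35}              # Table 11.3
PITCH_HZ = [57.5, 64.1, 69.4, 75.8, 80.6, 87.7, 94.3, 100.0]    # Table 11.4
NOMINAL_MSEC = {"THV": 70}                                      # from Table 11.2

def qualify(segment, target_msec, target_hz):
    # Choose the rate qualifier and pitch code nearest to the targets and
    # write them in the manual's "dp/SEGMENT" notation.
    base = NOMINAL_MSEC[segment]
    rate = min(RATE_FACTOR, key=lambda q: abs(base * RATE_FACTOR[q] - target_msec))
    pitch = min(range(8), key=lambda p: abs(PITCH_HZ[p] - target_hz))
    return "%d%d/%s" % (rate, pitch, segment)

print(qualify("THV", 85, 81))    # 14/THV: 70 msec x 1.22 = 85 msec, at 80.6 Hz
.fi
.LE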
.pp To specify a rule, it is necessary to give a .ul matching part and a .ul context, which define at what points in an utterance it is applicable, and a .ul replacement part which is used to replace the matching part. The context can be specified in mathematical set notation using curly brackets. For example, .LB {G SH W K} OO IU OO .LE states that the matching part OO is replaced by IU OO, after a G, SH, W, or K. In fact, allophonic variations of each sound segment should also be accepted as valid context, so this rule will also replace OO after .G, CH, .W, .K, or .X1 (Table 11.2 gives allophones of each segment). .pp Table 11.5 gives some rules that have been used for this purpose. .FC "Table 11.5" They were derived from careful study of the hints given in the ML-I manual (Votrax, 1976). .[ Votrax 1976 .] Classes such as "voiced" and "stop-consonant" in the context specify sets of sound segments in the obvious way. The beginning of a stressed syllable is marked in the input by ".syll". Parentheses in the replacement part have a significance which is explained in the next section. .rh "Handling prosodic features." We know from Chapter 8 the vital importance of prosodic features in synthesizing lifelike speech. To allow them to be assigned to Votrax utterances, an intermediate output from a prosodic analysis program like ISP can be used. For example, .LB 1 \c .ul dh i s i z /*d zh aa k s /h aa u s; .LE which specifies "this is Jack's house" in a declarative intonation with emphasis on the "Jack's", can be intercepted in the following form: .LB \&.syll .ul dh\c \ 50\ (0\ 110) .ul i\c \ 60 .ul s\c \ 90\ (0\ 99) .ul i\c \ 60 .ul z\c \ 60\ (50\ 110) \&.syll .ul d\c \ 50\ (0\ 110) .ul zh\c \ 50 .ul aa\c \ 90 .ul k\c \ 120\ (10\ 90) .ul s\c \ 90 \&.syll .ul h\c \ 60 .ul aa\c \ 140 .ul u\c \ 60 .ul s\c \ 140 ^\ 50\ (40\ 70) . .LE Syllable boundaries, pitches, and durations have been assigned by the procedures given earlier (Chapter 8). A number always follows each phoneme to specify its duration (in msec). Pairs of numbers in parentheses define a pitch specification at some point during the preceding phoneme: the first number of the pair defines the time offset of the specification from the beginning of the phoneme, while the second gives the pitch itself (in Hz). This form of utterance specification can then be passed to a Votrax conversion procedure. .pp The phonetic transcription is converted to Votrax sound segments using the method described above. The "wide" Votrax transcription is .LB \&.syll THV I S I Z .syll D ZH AE K S .syll H AE OO S PA0 ; .LE which is transformed to the following "narrow" one according to the rules of Table 11.5: .LB \&.syll THV I S I Z .syll D J (AE EH3) K S .syll H1 (AH1 .UH2) (O U) S PA0 . .LE The duration and pitch specifications are preserved by the transformation in their original positions in the string, although they are not shown above. The next stage uses them to expand the transcription by adjusting the segments to have durations as close as possible to the specifications, and computing pitch numbers to be associated with each phoneme. .pp Correct duration-expansion can, in general, require a great amount of computation. Associated with each sound segment is a set of elements with the same sound quality but different durations, formed by attaching each of the four duration qualifiers of Table 11.3 to the segment and any others which are sound-equivalents to it. 
For example, the segment Z has the duration-set .LB {3/Z 2/Z 1/Z 0/Z} .LE with durations .LB { 70 78 85 95} .LE msec respectively, where the initial numerals denote the duration qualifier. The segment I has the much larger duration-set .LB {3/I2 2/I2 1/I2 0/I2 3/I1 2/I1 1/I1 0/I1 3/I 2/I 1/I 0/I} .LE with durations .LB { 58 64 71 78 83 92 101 112 118 131 144 159}, .LE because segments I1 and I2 are sound-equivalents to it. Duration assignment is a matter of selecting elements from the duration-set whose total duration is as close as possible to that desired for the segment. It happens that Votrax deals sensibly with concatenations of more than one identical plosive, suppressing the stop burst on all but the last. Although the general problem of approximating durations in this way is computationally demanding, a simple recursive exhaustive search works in a reasonable amount of time because the desired duration is usually not very much greater than the longest member of the duration-set, and so the search terminates quite quickly. .pp At this point, the role of the parentheses which appear on the right-hand side of Table 11.5 becomes apparent. Because durations are only associated with the input phonemes, which may each be expanded into several Votrax segments, it is necessary to keep track of the segments which have descended from a single phoneme. Target durations are simply spread equally across any parenthesized groups to which they apply. .pp Having expanded durations, mapping pitches on to the sound segments is a simple matter. The ISP system for formant synthesizers (Chapters 7 and 8) uses linear interpolation between pitch specifications, and the frequency which results for each sound segment needs to be converted to a Votrax specification using the information in Table 11.4. .pp After applying these procedures to the example utterance, it becomes .LB 14/THV 14/I1 03/S 14/I1 04/Z 04/D 04/J 33/AE 33/EH3 \c 02/K 02/K 02/S 02/H1 01/AH2 01/.UH2 31/O2 31/U1 01/S \c 10/S 30/PA0 30/PA0 . .LE In several places, shorter sound-equivalents have been substituted (I1 for I, AH2 for AH1, O2 for O, and U1 for U), while doubling-up also occurs (in the K, S, and PA0 segments). .pp The speech which results from the use of these procedures with the Votrax synthesizer sounds remarkably similar to that generated by the ISP system which uses parametrically-controlled synthesizers. Formal evaluation experiments have not been undertaken, but it seems clear from careful listening that it would be rather difficult, and probably pointless, to evaluate the Votrax conversion algorithm, for the outcome would be completely dominated by the success of the original pitch and rhythm assignment procedures. .sh "11.3 Linear predictive synthesizer" .pp The first single-chip speech synthesizer was introduced by Texas Instruments (TI) in the summer of 1978 (Wiggins and Brantingham, 1978). .[ Wiggins Brantingham 1978 .] It was a remarkable development, combining recent advances in signal processing with the very latest in VLSI technology. Packaged in the Speak 'n Spell toy (Figure 11.7), it was a striking demonstration of imagination and prowess in integrated electronics. .FC "Figure 11.7" It gave TI a long lead over its competitors and surprised many experts in the speech field. .EQ delim @@ .EN Overnight, it seemed, digital speech technology had descended from research laboratories with their expensive and specialized equipment into a $50.00 consumer item. 
.pp At this point, the role of the parentheses which appear on the right-hand side of Table 11.5 becomes apparent. Because durations are only associated with the input phonemes, which may each be expanded into several Votrax segments, it is necessary to keep track of the segments which have descended from a single phoneme. Target durations are simply spread equally across any parenthesized groups to which they apply. .pp Once durations have been expanded, mapping pitches onto the sound segments is a simple matter. The ISP system for formant synthesizers (Chapters 7 and 8) uses linear interpolation between pitch specifications, and the frequency which results for each sound segment needs to be converted to a Votrax specification using the information in Table 11.4. .pp When these procedures are applied to the example utterance, it becomes .LB 14/THV 14/I1 03/S 14/I1 04/Z 04/D 04/J 33/AE 33/EH3 \c 02/K 02/K 02/S 02/H1 01/AH2 01/.UH2 31/O2 31/U1 01/S \c 10/S 30/PA0 30/PA0 . .LE In several places, shorter sound-equivalents have been substituted (I1 for I, AH2 for AH1, O2 for O, and U1 for U), while doubling-up also occurs (in the K, S, and PA0 segments). .pp The speech which results from the use of these procedures with the Votrax synthesizer sounds remarkably similar to that generated by the ISP system which uses parametrically-controlled synthesizers. Formal evaluation experiments have not been undertaken, but it seems clear from careful listening that it would be rather difficult, and probably pointless, to evaluate the Votrax conversion algorithm, for the outcome would be completely dominated by the success of the original pitch and rhythm assignment procedures. .sh "11.3 Linear predictive synthesizer" .pp The first single-chip speech synthesizer was introduced by Texas Instruments (TI) in the summer of 1978 (Wiggins and Brantingham, 1978). .[ Wiggins Brantingham 1978 .] It was a remarkable development, combining recent advances in signal processing with the very latest in VLSI technology. Packaged in the Speak 'n Spell toy (Figure 11.7), it was a striking demonstration of imagination and prowess in integrated electronics. .FC "Figure 11.7" It gave TI a long lead over its competitors and surprised many experts in the speech field. .EQ delim @@ .EN Overnight, it seemed, digital speech technology had descended from research laboratories with their expensive and specialized equipment into a $50.00 consumer item. .EQ delim $$ .EN Naturally TI did not sell the chip separately but only as part of their mass-market product; nor would they make available information on how to drive it directly. Only recently, when other similar devices appeared on the market, did they unbundle the package and sell the chip. .rh "The Speak 'n Spell toy." The TI chip (TMC0280) uses the linear predictive method of synthesis, primarily because of the ease of the speech analysis procedure and the known high quality at low data rates. Speech researchers, incidentally, sometimes scoff at what they perceive to be the poor quality of the toy's speech; but considering the data rate used (which averages 1200 bits per second of speech) it is remarkably good. Anyway, I have never heard a child complain! \(em although it is not uncommon to misunderstand a word. Two 128\ Kbit read-only memories are used in the toy to hold data for about 330 words and phrases \(em lasting between 3 and 4 minutes \(em of speech. At the time (mid-1978) these memories were the largest that were available in the industry. The data flow and user dialogue are handled by a microprocessor, which is the fourth LSI circuit in the photograph of Figure 11.8. .FC "Figure 11.8" .pp A schematic diagram of the toy is given in Figure 11.9. .FC "Figure 11.9" It has a small display which shows upper-case letters. (Some teachers of spelling hold that the lack of lower case destroys any educational value that the toy may have.) It has a full 26-key alphanumeric keyboard with 14 additional control keys. (This is the toy's Achilles' heel, for the keys fall out after extended use. More recent toys from TI use an improved keyboard.) The keyboard is laid out alphabetically instead of in QWERTY order, possibly missing an opportunity to teach kids to type as well as spell. An internal connector permits vocabulary expansion with up to 14 more read-only memory chips. Controlling the toy is a 4-bit microprocessor (a modified TMS1000). However, the synthesizer chip does not receive data from the processor. During speech, it accesses the memory directly and only returns control to the processor when an end-of-phrase marker is found in the data stream. Meanwhile the processor is idle, and cannot even be interrupted from the keyboard. Moreover, in one operational mode ("say-it") the toy embarks upon a long monologue and remains deaf to the keyboard \(em it cannot even be turned off. Any three-year-old will quickly discover that a sharp slap solves the problem! A useful feature is that the device switches itself off if unused for more than a few minutes. A fascinating account of the development of the toy from the point of view of product design and market assessment has been published (Frantz and Wiggins, 1981). .[ Frantz Wiggins 1981 .] .rh "Control parameters." The lattice filtering method of linear predictive synthesis (see Chapter 6) was selected because of its good stability properties and guaranteed performance with small word sizes. The lattice has 10 stages. All the control parameters are represented as 10-bit fixed-point numbers, and the lattice operates with an internal precision of 14 bits (including sign). .pp There are twelve parameters for the device: ten reflection coefficients, energy, and pitch. These are updated every 20\ msec. However, if 10-bit values were stored for each, a data rate of 120 bits every 20\ msec, or 6\ Kbit/s, would be needed.
This would reduce the capacity of the two read-only memory chips to well under a minute of speech \(em perhaps 65 words and phrases. But one of the desirable properties of the reflection coefficients which drive the lattice filter is that they are amenable to quantization. A non-linear quantization scheme is used, with the parameter data addressing an on-chip quantization table to yield a 10-bit coefficient. .pp Table 11.6 shows the number of bits devoted to each parameter. .RF .in+0.3i .ta \w'repeat flag00'u +1.3i +0.8i .nr x0 \w'repeat flag00'+1.3i+\w'00'+(\w'size (10-bit words)'/2) \l'\n(x0u\(ul' .nr x1 (\w'bits'/2) .nr x2 (\w'quantization table'/2) .nr x3 0.2m parameter \0\h'-\n(x1u'bits \0\0\h'-\n(x2u'quantization table .nr x2 (\w'size (10-bit words)'/2) \0\0\h'-\n(x2u'size (10-bit words) \l'\n(x0u\(ul' .sp energy \04 \016 \v'\n(x3u'_\v'-\n(x3u'\z4\v'\n(x3u'_\v'-\n(x3u' energy=0 means 4-bit frame pitch \05 \032 repeat flag \01 \0\(em \z1\v'\n(x3u'_\v'-\n(x3u'\z0\v'\n(x3u'_\v'-\n(x3u' repeat flag =1 means 10-bit frame k1 \05 \032 k2 \05 \032 k3 \04 \016 k4 \04 \016 \z2\v'\n(x3u'_\v'-\n(x3u'\z8\v'\n(x3u'_\v'-\n(x3u' pitch=0 (unvoiced) means 28-bit frame k5 \04 \016 k6 \04 \016 k7 \04 \016 k8 \03 \0\08 k9 \03 \0\08 k10 \03 \0\08 \z4\v'\n(x3u'_\v'-\n(x3u'\z9\v'\n(x3u'_\v'-\n(x3u' otherwise 49-bit frame __ ___ .sp 49 bits 216 words \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in-0.3i .FG "Table 11.6 Bit allocation for Speak 'n Spell chip" There are 4 bits for energy, and 5 bits for pitch and the first two reflection coefficients. Thereafter the number of bits allocated to reflection coefficients decreases steadily, for higher coefficients are less important for intelligibility than lower ones. (Note that using a 10-stage filter is tantamount to allocating .ul no bits to coefficients higher than the tenth.) With a 1-bit "repeat" flag, whose role is discussed shortly, the frame size becomes 49 bits. Updated every 20\ msec, this gives a data rate of just under 2.5\ Kbit/s. .pp The parameters are expanded into 10-bit numbers by a separate quantization table for each one. For example, the five pitch bits address a 32-word look-up table which returns a 10-bit value. The transformation is logarithmic in this case, the lowest pitch being around 50 Hz and the highest 190 Hz. As shown in Table 11.6, a total of 216 10-bit words suffices to hold all twelve quantization tables; and they are implemented on the synthesizer chip. To provide further smoothing of the control parameters, they are interpolated linearly from one frame to the next at eight points within the frame. .pp The raw data rate of 2.5\ Kbit/s is reduced to an average of 1200\ bit/s by further coding techniques. Firstly, if the energy parameter is zero the frame is silent, and no more parameters are transmitted (4-bit frame). Secondly, if the "repeat" flag is 1 all reflection coefficients are held over from the previous frame, giving a constant filter but with the ability to vary amplitude and pitch (10-bit frame). Finally, if the frame is unvoiced (signalled by the pitch value being zero) only four reflection coefficients are transmitted, because the ear is relatively insensitive to spectral detail in unvoiced speech (28-bit frame). The end of the utterance is signalled by the energy bits all being 1. .rh "Chip organization." The configuration of the lattice filter is shown in Figure 11.10. 
.FC "Figure 11.10" The "two-multiplier" structure (Chapter 6) is used, so the 10-stage filter requires 19 multiplications and 19 additions per speech sample. (The last operation in the reverse path at the bottom is not needed.) Since a 10\ kHz sample rate is used, just 100\ $mu$sec are available for each speech sample. A single 5\ $mu$sec adder and a pipelined multiplier are implemented on the chip, and multiplexed among the 19 operations. The latter begins a new multiplication every 5\ $mu$sec, and finishes it 40\ $mu$sec later. These times are within the capability of p-channel MOS technology, allowing the chip to be produced at low cost. The time slot for the 20'th, unnecessary, filter multiplication is used for an overall gain adjustment. .pp The final analogue signal is produced by an 8-bit on-chip D/A converter which drives a 200 milliwatt speaker through an impedance-matching transformer. These constitute the necessary analogue low-pass desampling filter. .pp Figure 11.11 summarizes the organization of the synthesis chip. .FC "Figure 11.11" Serial data enters directly from the read-only memories, although a control signal from the processor begins synthesis and another signal is returned to it upon termination. The data is decoded into individual parameters, which are used to address the quantization tables to generate the full 10-bit parameter values. These are interpolated from one frame to the next. The lower part of the Figure shows the speech generation subsystem. An excitation waveform for voiced speech is stored in read-only memory and read out repeatedly at a rate determined by the pitch. The source for unvoiced sounds is hard-limited noise provided by a digital pseudo-random bit generator. The sound source that is used depends on whether the pitch value is zero or not: notice that this precludes mixed excitation for voiced fricatives (and the sound is noticeably poor in words like "zee"). A gain multiplication is performed before the signal is passed through the lattice synthesis filter, described earlier. .sh "11.4 Programmable signal processors" .pp The TI chip has a fixed architecture, and is destined forever to implement the same vocal tract model \(em a 10'th order lattice filter. A more recent device, the Programmable Digital Signal Processor (Caldwell, 1980) from Telesensory Systems allows more flexibility in the type of model. .[ Caldwell 1980 .] It can serve as a digital formant synthesizer or a linear predictive synthesizer, and the order of model (number of formants, in the former case) can be changed. .pp Before describing the PDSP, it is worth looking at an earlier microprocessor which was designed for digital signal processing. Some industry observers have said that this processor, the Intel 2920, is to the analogue design engineer what the first microprocessor was to the random logic engineer way back in the mists of time (early 1970's). .rh "The 'analogue microprocessor'." The 2920 is a digital microprocessor. However, it contains an on-chip D/A converter, which can be used in successive approximation fashion for A/D conversion under program control, and its architecture is designed to aid digital signal processing calculations. Although the precision of conversion is 9 bits, internal arithmetic is done with 25 bits to accomodate the accumulation of round-off errors in arithmetic operations. An on-chip programmable read-only memory holds a 192-instruction program, which is executed in sequence with no program jumps allowed. 
.sh "11.4 Programmable signal processors" .pp The TI chip has a fixed architecture, and is destined forever to implement the same vocal tract model \(em a 10'th order lattice filter. A more recent device, the Programmable Digital Signal Processor (Caldwell, 1980) from Telesensory Systems allows more flexibility in the type of model. .[ Caldwell 1980 .] It can serve as a digital formant synthesizer or a linear predictive synthesizer, and the order of model (number of formants, in the former case) can be changed. .pp Before describing the PDSP, it is worth looking at an earlier microprocessor which was designed for digital signal processing. Some industry observers have said that this processor, the Intel 2920, is to the analogue design engineer what the first microprocessor was to the random logic engineer way back in the mists of time (early 1970's). .rh "The 'analogue microprocessor'." The 2920 is a digital microprocessor. However, it contains an on-chip D/A converter, which can be used in successive approximation fashion for A/D conversion under program control, and its architecture is designed to aid digital signal processing calculations. Although the precision of conversion is 9 bits, internal arithmetic is done with 25 bits to accommodate the accumulation of round-off errors in arithmetic operations. An on-chip programmable read-only memory holds a 192-instruction program, which is executed in sequence with no program jumps allowed. This ensures that each pass through the program takes the same time, so that the analogue waveform is regularly sampled and processed. .pp The device is implemented in n-channel MOS technology, which makes it slightly faster than the pMOS Speak 'n Spell chip. At its fastest operating speed each instruction takes 400 nsec. The 192-instruction program therefore executes in 76.8\ $mu$sec, corresponding to a sampling rate of about 13\ kHz. Thus the processor can handle signals with a bandwidth of 6.5\ kHz \(em ample for high-quality speech. However, a special EOP (end of program) instruction is provided which causes an immediate jump back to the beginning. Hence if the program occupies less than 192 instructions, faster sampling rates can be used. For example, a single second-order formant resonance requires only 14 instructions and so can be executed at over 150\ kHz. .pp Despite this speed, the 2920 is only marginally capable of synthesizing speech. Table 11.7 gives approximate numbers of instructions needed to do some subtasks for speech generation (Hoff and Li, 1980). .[ Hoff Li 1980 Software makes a big talker .] .RF .nr x0 \w'parameter entry and data distribution0000'+\w'00000' .nr x1 \w'instructions' .nr x2 (\n(.l-\n(x0)/2 .in \n(x2u .ta \w'parameter entry and data distribution0000'u \l'\n(x0u\(ul' .sp task \0\0\0\0\0\h'-\n(x1u'instructions \l'\n(x0u\(ul' .sp parameter entry and data distribution 35\-40 glottal pulse generation \0\0\08 noise generation \0\0\011 lattice section \0\0\020 formant filter \0\0\014 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 11.7 2920 instruction counts for typical speech subsystems" The parameter entry and data distribution procedure collects 10 8-bit parameters from a serial input stream, at a frame rate of 100 frames/s. The parameter data rate is 8\ Kbit/s, and the routine assumes that the 2920 performs each complete cycle in 125\ $mu$sec to generate sampled speech at 8\ kHz. Therefore one bit of parameter data is accepted on every cycle. The glottal pulse program generates an asymmetrical triangular waveform (Chapter 5), while the noise generator uses a 17-bit pseudo-random feedback shift register. About 30% of the 192-instruction program memory is consumed by these essential tasks. A two-multiplier lattice section takes 20 instructions, and so only six sections can fit into the remaining program space. It may be possible to use two 2920's to implement a complete 10 or 12'th order lattice, but the results of the first stage must be passed to the second by transmitting analogue or digital data between the analogue ports of the two devices \(em not a terribly satisfactory method. .pp Since a formant filter occupies only 14 instructions, up to nine of them would fit in the program space left after the above-mentioned essential subsystems. Although other necessary house-keeping tasks may reduce this number substantially, it does seem possible to implement a formant synthesizer on a single 2920.
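.pp
For concreteness, here is what the formant filter entry of Table 11.7 amounts to when written out in C: a second-order resonator computed once per sample, with coefficients derived from a centre frequency and bandwidth in the standard way.
The 500\ Hz formant, 60\ Hz bandwidth and impulse excitation are illustrative values only, and the 14 instructions on the 2920 buy roughly this much arithmetic together with the associated data movement.
.LB
.nf
/*
 * A second-order digital formant resonator of the kind counted in
 * Table 11.7: y[n] = A*x[n] + B*y[n-1] + C*y[n-2].  The 500 Hz centre
 * frequency, 60 Hz bandwidth and impulse input are illustrative values,
 * not taken from any particular synthesizer.
 */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979

int main(void)
{
    double fs = 8000.0;            /* sampling rate, Hz                */
    double freq = 500.0;           /* formant centre frequency, Hz     */
    double bw = 60.0;              /* formant bandwidth, Hz            */
    double r = exp(-PI * bw / fs); /* pole radius                      */
    double C = -r * r;
    double B = 2.0 * r * cos(2.0 * PI * freq / fs);
    double A = 1.0 - B - C;        /* unity gain at zero frequency     */
    double y1 = 0.0, y2 = 0.0;     /* y[n-1] and y[n-2]                */
    int n;

    for (n = 0; n < 40; n++) {
        double x = (n == 0) ? 1.0 : 0.0;   /* unit impulse excitation  */
        double y = A * x + B * y1 + C * y2;
        y2 = y1;
        y1 = y;
        printf("%8.5f", y);
        puts("");
    }
    return 0;
}
.fi
.LE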
.FC "Figure 11.12" .pp The control unit accepts parameter data from a host computer, one byte at a time. The data is temporarily held in buffer memory before being serialized and passed to the arithmetic unit. Notice that for the 2920 we assumed that parameters were presented to the chip already serialized and precisely timed: the PDSP control unit effectively releases the host from this high-speed real-time operation. But it does more. It generates both a voiced and an unvoiced excitation source and passes them to the arithmetic unit, to relieve the latter of the general-purpose programming required for both these tasks and allow its instruction set to be highly specialized for digital filtering. .pp The arithmetic unit has rather a peculiar structure. It accomodates only 16 program steps and can execute the full 16-instruction program at a rate of 10\ kHz. The internal word-length is 18 bits, but coefficients and the digital output are only 10 bits. Each instruction can accomplish quite a lot of work. Figure 11.13 shows that there are four separate blocks of store in addition to the program memory. .FC "Figure 11.13" One location of each block is automatically associated with each program step. Thus on instruction 2, for example, two 18-bit scratchpad registers MA(2) and MB(2), and two 10-bit coefficient registers A1(2) and A2(2), are accessible. In addition five general registers, curiously numbered R1, R2, R5, R6, R7, are available to every program step. .pp Each instruction has five fields. A single instruction loads all the general registers and simultaneously performs two multiplications and up to three additions. The fields specify exactly which operands are involved in these operations. .pp The instructions of the PDSP arithmetic unit are really very powerful. For example, a second-order digital formant resonator requires only two program steps. A two-multiplier lattice stage needs only one step, and a complete 12-stage lattice filter can be implemented in the 16 steps available. An important feature of the architecture is that it is quite easy to incorporate more than one arithmetic unit into a system, with a single control unit. Intermediate data can be transferred digitally between arithmetic units since the D/A converter is off-chip. A four-multiplier normalized lattice (Chapter 6) with 12 stages can be implemented on two arithmetic units, as can a lattice filter which incorporates zeros as well as poles, and a complex series/parallel formant synthesizer with a total of 12 resonators whose centre frequencies and bandwidths can be controlled independently (Klatt, 1980). .[ Klatt 1980 .] .pp How this device will fare in actual commercial products is yet to be seen. It is certainly much more sophisticated than the TI Speak 'n Spell chip, and a complete system will necessitate a much higher chip count and consequently more expense. Telesensory Systems are committed to producing a text-to-speech system based upon it for use both in a reading machine for the blind and as a text-input speech-output computer peripheral. .sh "11.5 References" .LB "nnnn" .[ $LIST$ .] 
.LE "nnnn" .bp .ev2 .ta \w'\fIsilence\fR 'u +\w'.EH100'u +\w'(used to change amplitude and duration)00'u +\w'00000000000test word'u .nr x0 \w'\fIsilence\fR '+\w'.EH100'+\w'(used to change amplitude and duration)00'+\w'00000000000test word' \l'\n(x0u\(ul' .sp .nr x1 (\w'Votrax'/2) .nr x2 (\w'duration (msec)'/2) .nr x3 \w'test word' \h'-\n(x1u'Votrax \0\h'-\n(x2u'duration (msec) \h'-\n(x3u'test word \l'\n(x0u\(ul' .sp .nr x3 \w'hid' \fIi\fR I 118 \h'-\n(x3u'hid I1 (sound equivalent of I) \083 I2 (sound equivalent of I) \058 I3 (allophone of I) \058 .I3 (sound equivalent of I3) \083 AY (allophone of I) \065 .nr x3 \w'head' \fIe\fR EH 118 \h'-\n(x3u'head EH1 (sound equivalent of EH) \070 EH2 (sound equivalent of EH) \060 EH3 (allophone of EH) \060 .EH2 (sound equivalent of EH3) \070 A1 (allophone of EH) 100 A2 (sound equivalent of A1) \095 .nr x3 \w'had' \fIaa\fR AE 100 \h'-\n(x3u'had AE1 (sound equivalent of AE) 100 .nr x3 \w'hod' \fIo\fR AW 235 \h'-\n(x3u'hod AW2 (sound equivalent of AW) \090 AW1 (allophone of AW) 143 .nr x3 \w'hood' \fIu\fR OO 178 \h'-\n(x3u'hood OO1 (sound equivalent of OO) 103 IU (allophone of OO) \063 .nr x3 \w'hud' \fIa\fR UH 103 \h'-\n(x3u'hud UH1 (sound equivalent of UH) \095 UH2 (sound equivalent of UH) \050 UH3 (allophone of UH) \070 .UH3 (sound equivalent of UH3) 103 .UH2 (allophone of UH) \060 .nr x3 \w'hard' \fIar\fR AH1 143 \h'-\n(x3u'hard AH2 (sound equivalent of AH1) \070 .nr x3 \w'hawed' \fIaw\fR O 178 \h'-\n(x3u'hawed O1 (sound equivalent of O) 118 O2 (sound equivalent of O) \083 .O (allophone of O) 178 .O1 (sound equivalent of .O) 123 .O2 (sound equivalent of .O) \090 .nr x3 \w'who d' \fIuu\fR U 178 \h'-\n(x3u'who'd U1 (sound equivalent of U) \090 .nr x3 \w'heard' \fIer\fR ER 143 \h'-\n(x3u'heard .nr x3 \w'heed' \fIee\fR E 178 \h'-\n(x3u'heed E1 (sound equivalent of E) 118 \fIr\fR R \090 .R (allophone of R) \050 \fIw\fR W \083 .W (allophone of W) \083 \l'\n(x0u\(ul' .sp3 .ce Table 11.2 Votrax sound segments and their durations .bp \l'\n(x0u\(ul' .sp .nr x1 (\w'Votrax'/2) .nr x2 (\w'duration (msec)'/2) .nr x3 \w'test word' \h'-\n(1u'Votrax \0\h'-\n(x2u'duration (msec) \h'-\n(x3u'test word \l'\n(x0u\(ul' .sp \fIl\fR L 105 L1 (allophone of L) 105 \fIy\fR Y 103 Y1 (allophone of Y) \083 \fIm\fR M 105 \fIb\fR B \070 \fIp\fR P 100 .PH (aspiration burst for use with P) \088 \fIn\fR N \083 \fId\fR D \050 .D (allophone of D) \053 \fIt\fR T \090 DT (allophone of T) \050 .S (aspiration burst for use with T) \070 \fIng\fR NG 120 \fIg\fR G \075 .G (allophone of G) \075 \fIk\fR K \075 .K (allophone of K) \080 .X1 (aspiration burst for use with K) \068 \fIs\fR S \090 \fIz\fR Z \070 \fIsh\fR SH 118 CH (allophone of SH) \055 \fIzh\fR ZH \090 J (allophone of ZH) \050 \fIf\fR F 100 \fIv\fR V \070 \fIth\fR TH \070 \fIdh\fR THV \070 \fIh\fR H \070 H1 (allophone of H) \070 .H1 (allophone of H) \048 \fIsilence\fR PA0 \045 PA1 175 .PA1 \0\05 .PA2 (used to change amplitude and duration) \0\0\- \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .sp3 .ce Table 11.2 (continued) .bp .ta 0.8i +2.6i +\w'(AH1 .UH2) (O U)000'u .nr x0 0.8i+2.6i+\w'(AH1 .UH2) (O U)000'+\w'; i uh \- here' \l'\n(x0u\(ul' .sp vowel clusters EH I A1 AY ; e i \- hey UH OO O U ; uh u \- ho AE I (AH1 EH3) I ; aa i \- hi AE OO (AH1 .UH2) (O U) ; aa u \- how AW I (O UH) E ; o i \- hoi I UH E I ; i uh \- here EH UH (EH A1) EH ; e uh \- hair OO UH OO UH ; u uh \- poor Y U Y1 (IU U) .sp vowel transitions {F M B P} O (.O1 O) {L R} EH (EH3 EH) {B K T D R} UH (UH3 UH) {T D} A1 (EH3 A1) 
{T D} AW (AH1 AW) {W} I (I3 I) {G SH W K} OO (IU OO) AY {K G T D} (AY Y) E {M T} (E Y) I {M T} (I Y) E {L} (I3 UH) EH {R N S D T} (EH EH3) I {R T} (I I3) AE {S N} (AE EH) AE {K} (AE EH3) A1 {R} (A1 EH1) AH1 {R P K} (AH1 UH) AH1 {ZH} (AH1 EH3) .sp intervocalics {voiced} T {voiced} DT .sp consonant transitions L {EH} L1 H {U OO IU} H1 \l'\n(x0u\(ul' .sp3 .ce Table 11.5 Contextual rules for Votrax sound segments .bp \l'\n(x0u\(ul' .sp consonant clusters B {stop-consonant} (B PA0) P {stop-consonant} (P PA0) D {stop-consonant} (D PA0) T {stop-consonant} (T PA0) DT {stop-consonant} (T PA0) G {stop-consonant} (G PA0) K {stop-consonant} (K PA0) {D T} R (.X1 R) K R .K (.X1 R) {consonant} R .R {consonant} L L1 K W .K .W D ZH D J T SH T CH .sp initial effects {.syll} P {vowel} (P .PH) {.syll} K {vowel} (K .H1) {.syll} T {vowel} (T .S) {.syll} L L1 {.syll} H {U OO O AW AH1} H1 .sp terminal effects E {PA0} (E Y) \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .sp3 .ce Table 11.5 (continued) .ev