.EQ delim $$ .EN .CH "1 WHY SPEECH OUTPUT?" .ds RT "Why speech output? .ds CX "Principles of computer speech .pp Speech is our everyday, informal, communication medium. But although we use it a lot, we probably don't assimilate as much information through our ears as we do through our eyes, by reading or looking at pictures and diagrams. You go to a technical lecture to get the feel of a subject \(em the overall arrangement of ideas and the motivation behind them \(em and fill in the details, if you still want to know them, from a book. You probably find out more about the news from ten minutes with a newspaper than from a ten-minute news broadcast. So it should be emphasized from the start that speech output from computers is not a panacea. It doesn't solve the problems of communicating with computers; it simply enriches the possibilities for communication. .pp What, then, are the advantages of speech output? One good reason for listening to a radio news broadcast instead of spending the time with a newspaper is that you can listen while shaving, doing the housework, or driving the car. Speech leaves hands and eyes free for other tasks. Moreover, it is omnidirectional, and does not require a free line of sight. Related to this is the use of speech as a secondary medium for status reports and warning messages. Occasional interruptions by voice do not interfere with other activities, unless they demand unusual concentration, and people can assimilate spoken messages and queue them for later action quite easily and naturally. .pp The second key feature of speech communication stems from the telephone. It is the universality of the telephone receiver itself that is important here, rather than the existence of a world-wide distribution network; for with special equipment (a modem and a VDU) one does not need speech to take advantage of the telephone network for information transfer. But speech needs no tools other than the telephone, and this gives it a substantial advantage. You can go into a phone booth anywhere in the world, carrying no special equipment, and have access to your computer within seconds. The problem of data input is still there: perhaps your computer system has a limited word recognizer, or you use the touchtone telephone keypad (or a portable calculator-sized tone generator). Easy remote access without special equipment is a great, and unique, asset to speech communication. .pp The third big advantage of speech output is that it is potentially very cheap. Being all-electronic, except for the loudspeaker, speech systems are well suited to high-volume, low-cost, LSI manufacture. Other computer output devices are at present tied either to mechanical moving parts or to the CRT. This was realized quickly by the computer hobbies market, where speech output peripherals have been selling like hot cakes since the mid 1970's. .pp A further point in favour of speech is that it is natural-seeming and somehow cuddly when compared with printers or VDU's. It would have been much more difficult to make this point before the advent of talking toys like Texas Instruments' "Speak 'n Spell" in 1978, but now it is an accepted fact that friendly computer-based gadgets can speak \(em there are talking pocket-watches that really do "tell" the time, talking microwave ovens, talking pinball machines, and, of course, talking calculators. 
It is, however, difficult to assess whether the appeal stems from mechanical speech's novelty \(em it is still a gimmick \(em and also to what extent it is tied up with economic factors. After all, most of the population don't use high-quality VDU's, and their major experience of real-time interactive computing is through the very limited displays and keypads provided on video games and teletext systems. .pp Articles on speech communication with computers often list many more advantages of voice output (see Hill 1971, Turn 1974, Lea 1980). .[ Hill 1971 Man-machine interaction using speech .] .[ Lea 1980 .] .[ Turn 1974 Speech as a man-computer communication channel .] For example, speech .LB .NP can be used in the dark .NP can be varied from a (confidential) whisper to a (loud) shout .NP requires very little energy .NP is not appreciably affected by weightlessness or vibration. .LE However, these either derive from the three advantages we have discussed above, or relate mainly to exotic applications in space modules and divers' helmets. .pp Useful as it is at present, speech output would be even more attractive if it could be coupled with speech input. In many ways, speech input is its "big brother". Many of the benefits of speech output are even more striking for speech input. Although people can assimilate information faster through the eyes than the ears, the majority of us can generate information faster with the mouth than with the hands. Rapid typing is a relatively uncommon skill, and even high typing rates are much slower than speaking rates (although whether we can originate ideas quickly enough to keep up with fast speech is another matter!) To take full advantage of the telephone for interaction with machines, machine recognition of speech is obviously necessary. A microwave oven, calculator, pinball machine, or alarm clock that responds to spoken commands is certainly more attractive than one that just generates spoken status messages. A book that told you how to recognize speech by machine would undoubtedly be more useful than one like this that just discusses how to synthesize it! But the technology of speech recognition is nowhere near as advanced as that of synthesis \(em it's a much more difficult problem. However, because speech input is obviously complementary to speech output, and even very limited input capabilities will greatly enhance many speech output systems, it is worth summarizing the present state of the art of speech recognition. .pp Commercial speech recognizers do exist. Almost invariably, they accept words spoken in isolation, with gaps of silence between them, rather than connected utterances. It is not difficult to discriminate with high accuracy up to a hundred different words spoken by the same speaker, especially if the vocabulary is carefully selected to avoid words which sound similar. If several different speakers are to be comprehended, performance can be greatly improved if the machine is given an opportunity to calibrate their voices in a training session, and is informed at recognition time which one is to speak. With a large population of unknown speakers, accurate recognition is difficult for vocabularies of more than a few carefully-chosen words. .pp A half-way house between isolated word discrimination and recognition of connected speech is the problem of spotting known words in continuous speech. This allows much more natural input, if the dialogue is structured as keywords which may be interspersed by unimportant "noise words". 
To speak in truly isolated words requires a great deal of self-discipline and concentration \(em it is surprising how much of ordinary speech is accounted for by vague sounds like um's and aah's, and false starts. Word spotting disregards these and so permits a more relaxed style of speech. Some progress has been made on it in research laboratories, but the vocabularies that can be accommodated are still very small. .pp The difficulty of recognizing connected speech depends crucially on what is known in advance about the dialogue: its pragmatic, semantic, and syntactic constraints. Highly structured dialogues constrain very heavily the choice of the next word. Recognizers which can deal with vocabularies of over 1000 words have been built in research laboratories, but the structure of the input has been such that the average "branching factor" \(em the size of the set out of which the next word must be selected \(em is only around 10 (Lea, 1980). .[ Lea 1980 .] Whether such highly constrained languages would be acceptable in many practical applications is a moot point. One commercial recognizer, developed in 1978, can cope with up to five words spoken continuously from a basic 120-word vocabulary. .pp There has been much debate about whether it will ever be possible for a speech recognizer to step outside rigid constraints imposed on the utterances it can understand, and act, say, as an automatic dictation machine. Certainly the most advanced recognizers to date depend very strongly on a tight context being available. Informed opinion seems to accept that in ten years' time, voice data entry in the office will be an important and economically feasible prospect, but that it would be rash to predict the appearance of unconstrained automatic dictation by then. .pp Let's return now to speech output and take a look at some systems which use it, to illustrate the advantages and disadvantages of speech in practical applications. .sh "1.1 Talking calculator" .pp Figure 1.1 shows a calculator that speaks. .FC "Figure 1.1" Whenever a key is pressed, the device confirms the action by saying the key's name. The result of any computation is also spoken aloud. For most people, the addition of speech output to a calculator is simply a gimmick. (Note incidentally that speech .ul input is a different matter altogether. The ability to dictate lists of numbers and commands to a calculator, without lifting one's eyes from the page, would have very great advantages over keypad input.) Used-car salesmen find that speech output sometimes helps to clinch a deal: they key in the basic car price and their bargain-basement deductions, and the customer is so bemused by the resulting price being spoken aloud to him by a machine that he signs the cheque without thinking! More seriously, there may be some small advantage to be gained when keying a list of figures by touch from having their values read back for confirmation. For blind people, however, such devices are a boon \(em and there are many other applications, like talking elevators and talking clocks, which benefit from even very restricted voice output. Much more sophisticated is a typewriter with audio feedback, designed by IBM for the blind. Although blind typists can remember where the keys on a typewriter are without difficulty, they rely on sighted proof-readers to help check their work. This device could make them more useful as office typists and secretaries.
As well as verbalizing the material (including punctuation) that has been typed, either by attempting to pronounce the words or by spelling them out as individual letters, it prompts the user through the more complex action sequences that are possible on the typewriter. .pp The vocabulary of the talking calculator comprises the 24 words of Table 1.1. .RF .nr x1 2.0i+\w'percent'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 2.0i zero percent one low two over three root four em (m) five times six point seven overflow eight minus nine plus times-minus clear equals swap .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 1.1 Vocabulary of a talking calculator" This represents a total of about 13 seconds of speech. It is stored electronically in read-only memory (ROM), and Figure 1.2 shows the circuitry of the speech module inside the calculator. .FC "Figure 1.2" There are three large integrated circuits. Two of them are ROMs, and the other is a special synthesis chip which decodes the highly compressed stored data into an audio waveform. Although the mechanisms used for storing speech by commercial devices are not widely advertised by the manufacturers, the talking calculator almost certainly uses linear predictive coding \(em a technique that we will examine in Chapter 6. The speech quality is very poor because of the highly compressed storage, and words are spoken in a grating monotone. However, because of the very small vocabulary, the quality is certainly good enough for reliable identification. .sh "1.2 Computer-generated wiring instructions" .pp I mentioned earlier that one big advantage of speech over visual output is that it leaves the eyes free for other tasks. When wiring telephone equipment during manufacture, the operator needs to use his hands as well as eyes to keep his place in the task. For some time tape-recorded instructions have been used for this in certain manufacturing plants. For example, the instruction .LB .NI Red 2.5 11A terminal strip 7A tube socket .LE directs the operator to cut 2.5" of red wire, attach one end to a specified point on the terminal strip, and attach the other to a pin of the tube socket. The tape recorder is fitted with a pedal switch to allow a sequence of such instructions to be executed by the operator at his own pace. .pp The usual way of recording the instruction tape is to have a human reader dictate them from a printed list. The tape is then checked against the list by another listener to ensure that the instructions are correct. Since wiring lists are usually stored and maintained in machine-readable form, it is natural to consider whether speech synthesis techniques could be used to generate the acoustic tape directly by a computer (Flanagan .ul et al, 1972). .[ Flanagan Rabiner Schafer Denman 1972 .] .pp Table 1.2 shows the vocabulary needed for this application. 
.RF .nr x1 2.0i+2.0i+\w'tube socket'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 2.0i +2.0i A green seventeen black left six bottom lower sixteen break make strip C nine ten capacitor nineteen terminal eight one thirteen eighteen P thirty eleven point three fifteen R top fifty red tube socket five repeat coil twelve forty resistor twenty four right two fourteen seven upper .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 1.2 Vocabulary needed for computer-generated wiring instructions" It is rather larger than that of the talking calculator \(em about 25 seconds of speech \(em but well within the limits of single-chip storage in ROM, compressed by the linear predictive technique. However, at the time that the scheme was investigated (1970\-71) the method of linear predictive coding had not been fully developed, and the technology for low-cost microcircuit implementation was not available. But this is not important for this particular application, for there is no need to perform the synthesis on a miniature low-cost computer system, nor need it be accomplished in real time. In fact a technique of concatenating spectrally-encoded words was used (described in Chapter 7), and it was implemented on a minicomputer. Operating much slower than real-time, the system calculated the speech waveform and wrote it to disk storage. A subsequent phase read the pre-computed messages and recorded them on a computer-controlled analogue tape recorder. .pp Informal evaluation showed the scheme to be quite successful. Indeed, the synthetic speech, whose quality was not high, was actually preferred to natural speech in the noisy environment of the production line, for each instruction was spoken in the same format, with the same programmed pause between the items. A list of 58 instructions of the form shown above was recorded and used to wire several pieces of apparatus without errors. .sh "1.3 Telephone enquiry service" .pp The computer-generated wiring scheme illustrates how speech can be used to give instructions without diverting visual attention from the task at hand. The next system we examine shows how speech output can make the telephone receiver into a remote computer terminal for a variety of purposes (Witten and Madams, 1977). .[ Witten Madams 1977 Telephone Enquiry Service .] The caller employs the touch-tone keypad shown in Figure 1.3 for input, and the computer generates a synthetic voice response. .FC "Figure 1.3" Table 1.3 shows the process of making contact with the system. .RF .fi .nh .na .in 0.3i .nr x0 \w'COMPUTER: ' .nr x1 \w'CALLER: ' .in+\n(x0u .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Dials the service. .ti-\n(x0u COMPUTER: Answers telephone. "Hello, Telephone Enquiry Service. Please enter your user number". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters user number. .ti-\n(x0u COMPUTER: "Please enter your password". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters password. .ti-\n(x0u COMPUTER: Checks validity of password. If invalid, the user is asked to re-enter his user number. Otherwise, "Which service do you require?" .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters service number. .in 0 .nf .FG "Table 1.3 Making contact with the telephone enquiry system" .pp Advantage is taken of the disparate speeds of input (keyboard) and output (speech) to hasten the dialogue by imposing a question-answer structure on it, with the computer taking the initiative. 
The machine can afford to be slightly verbose if by so doing it makes the caller's response easier, and therefore more rapid. Moreover, operators who are experienced enough with the system to anticipate questions can easily forestall them just by typing ahead, for the computer is programmed to examine its input buffer before issuing prompts and to suppress them if input has already been provided. .pp An important aim of the system is to allow application programmers with no special knowledge of speech to write independent services for it. Table 1.4 shows an example of the use of one such application program, .RF .fi .nh .na .in 0.3i .nr x0 \w'COMPUTER: ' .nr x1 \w'CALLER: ' .in+\n(x0u .ti-\n(x0u COMPUTER: "Stores Information Service. Please enter component name". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "SN7406#". .ti-\n(x0u COMPUTER: "The component name is SN7406. Is this correct?" .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*1#" (system convention for "yes"). .ti-\n(x0u COMPUTER: "This component is in stores". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*7#" (command for "price"). .ti-\n(x0u COMPUTER: "The component price is 35 pence". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*8#" (command for "minimum number"). .ti-\n(x0u COMPUTER: "The minimum number of this component kept in stores is 10". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "SN7417#". .ti-\n(x0u COMPUTER: "The component name is SN7417. Is this correct?" .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*1#". .ti-\n(x0u COMPUTER: "This component is not in stores". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*9#" (command for "delivery time"). .ti-\n(x0u COMPUTER: "The expected delivery time is 14 days". .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Enters "*0#". .ti-\n(x0u COMPUTER: "Which service do you require?" .in 0 .nf .FG "Table 1.4 The Stores Information Service" the Stores Information Service, which permits enquiries to be made of a database holding information on electronic components kept in stock. This subsystem is driven by .ul alphanumeric data entered on the touch-tone keypad. Two or three letters are associated with each digit, in a manner which is fairly standard in touch-tone telephone applications. These are printed on a card overlay that fits the keypad (see Figure 1.3). Although true alphanumeric data entry would require a multiple key press for each character, the ambiguity inherent in a single-key-per-character convention can usually be resolved by the computer, if it has a list of permissible entries. For example, the component names SN7406 and ZTX300 are read by the machine as "767406" and "189300", respectively. Confusion rarely occurs if the machine is expecting a valid component code. The same holds true of people's names, and file names \(em although with these one must take care not to identify a series of files by similar names, like TX38A, TX38B, TX38C. It is easy for the machine to detect the rare cases where ambiguity occurs, and respond by requesting further information: "The component name is SN7406. Is this correct?" (In fact, the Stores Information Service illustrated in Table 1.4 is defective in that it .ul always requests confirmation of an entry, even when no ambiguity exists.) The use of a telephone keypad for data entry will be taken up again in Chapter 10. .pp A distinction is drawn throughout the system between data entries and commands, the latter being prefixed by a "*". 
In this example, the programmer chose to define a command for each possible question about a component, so that a new component name can be entered at any time without ambiguity. The price paid for the resulting brevity of dialogue is the burden of memorizing the meaning of the commands. This is an inherent disadvantage of a one-dimensional auditory display over the more conventional graphical output: presenting menus by speech is tedious and long-winded. In practice, however, for a simple task such as the Stores Information Service it is quite convenient for the caller to search for the appropriate command by trying out all possibilities \(em there are only a few. .pp The problem of memorizing commands is alleviated by establishing some system-wide conventions. Each input is terminated by a "#", and the meaning of standard commands is given in Table 1.5. .RF .fi .nh .na .in 0.3i .nr x0 \w'# alone ' .nr x1 \w'\(em ' .ta \n(x0u +\n(x1u .nr x2 \n(x0+\n(x1 .in+\n(x2u .ti-\n(x2u *# \(em Erase this input line, regardless of what has been typed before the "*". .ti-\n(x2u *0# \(em Stop. Used to exit from any service. .ti-\n(x2u *1# \(em Yes. .ti-\n(x2u *2# \(em No. .ti-\n(x2u *3# \(em Repeat question or summarize state of current transaction. .ti-\n(x2u # alone \(em Short form of repeat. Repeats or summarizes in an abbreviated fashion. .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .nf .FG "Table 1.5 System-wide conventions for the service" .pp A summary of services available on the system is given in Table 1.6. .RF .fi .na .in 0.3i .nr x0 \w'000 ' .nr x1 \w'\(em ' .nr x2 \n(x0+\n(x1 .in+\n(x2u .ta \n(x0u +\n(x1u .ti-\n(x2u \0\01 \(em tells the time .ti-\n(x2u \0\02 \(em Biffo (a game of NIM) .ti-\n(x2u \0\03 \(em MOO (a game similar to that marketed under the name "Mastermind") .ti-\n(x2u \0\04 \(em error demonstration .ti-\n(x2u \0\05 \(em speak a file in phonetic format .ti-\n(x2u \0\06 \(em listening test .ti-\n(x2u \0\07 \(em music (allows you to enter a tune and play it) .ti-\n(x2u \0\08 \(em gives the date .sp .ti-\n(x2u 100 \(em squash ladder .ti-\n(x2u 101 \(em stores information service .ti-\n(x2u 102 \(em computes means and standard deviations .ti-\n(x2u 103 \(em telephone directory .sp .ti-\n(x2u 411 \(em user information .ti-\n(x2u 412 \(em change password .ti-\n(x2u 413 \(em gripe (permits feedback on services from caller) .sp .ti-\n(x2u 600 \(em first year laboratory marks entering service .sp .ti-\n(x2u 910 \(em repeat utterance (allows testing of system) .ti-\n(x2u 911 \(em speak utterance (allows testing of system) .ti-\n(x2u 912 \(em enable/disable user 100 (a no-password guest user number) .ti-\n(x2u 913 \(em mount a magnetic tape on the computer .ti-\n(x2u 914 \(em set/reset demonstration mode (prohibits access by low-priority users) .ti-\n(x2u 915 \(em inhibit games .ti-\n(x2u 916 \(em inhibit the MOO game .ti-\n(x2u 917 \(em disable password checking when users log in .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .nf .FG "Table 1.6 Summary of services on a telephone enquiry system" They range from simple games and demonstrations, through serious database services, to system maintenance facilities. A priority structure is imposed upon them, with higher service numbers being available only to higher priority users. Services in the lowest range (1\-99) can be obtained by all, while those in the highest range (900\-999) are maintenance services, available only to the system designers. 
Access to the lower-numbered "games" services can be inhibited by a priority user \(em this was found necessary to prevent over-use of the system! Another advantage of telephone access to an information retrieval system is that some day-to-day maintenance can be done remotely, from the office telephone. .pp This telephone enquiry service, which was built in 1974, demonstrated that speech synthesis had moved from a specialist phonetic discipline into the province of engineering practicability. The speech was generated "by rule" from a phonetic input (the method is covered in Chapters 7 and 8), which has very low data storage requirements of around 75\ bit/s of speech. Thus an enormous vocabulary and range of services could be accommodated on a small computer system. Despite the fairly low quality of the speech, the response from callers was most encouraging. Admittedly the user population was a self-selected body of University staff, which one might suppose to have high tolerance to new ideas, and a system designed for the general public would require more effort to be spent on developing speech of greater intelligibility. Although it was observed that some callers failed to understand parts of the responses, even after repetition, communication was largely unhindered in most cases, with users driven by a high motivation to help the system help them. .pp The use of speech output in conjunction with a simple input device requires careful thought for interaction to be successful and comfortable. It is necessary that the computer direct the conversation as much as possible, without seeming to take charge. Provision for eliminating prompts which are unwanted by sophisticated users is essential to avoid frustration. We will return to the topic of programming techniques for speech interaction in Chapter 10. .pp Making a computer system available over the telephone results in a sudden vast increase in the user population. Although people's reaction to a new computer terminal in every office was overwhelmingly favourable, careful resource allocation was essential to prevent the service being hogged by a persistent few. As with all multi-access computer systems, it is particularly important that error recovery is effected automatically and gracefully. .sh "1.4 Speech output in the telephone exchange" .pp The telephone enquiry service was an experimental vehicle for research on speech interaction, and was developed in 1974. Since then, speech has begun to be used in real commercial applications. One example is System\ X, the British Post Office's computer-controlled telephone exchange. This incorporates many features not found in conventional telephone exchanges. For example, if a number is found to be busy, the call can be attempted again by a "repeat last call" command, without having to re-dial the full number. Alternatively, the last number can be stored for future re-dialling, freeing the phone for other calls. "Short code dialling" allows a customer to associate short codes with commonly-dialled numbers. Alarm calls can be booked at specified times, and are made automatically without human intervention. Incoming calls can be barred, as can outgoing ones. A diversion service allows all incoming calls to be diverted to another telephone, either immediately, or if a call to the original number remains unanswered for a specified period of time, or if the original number is busy. Three-party calls can be set up automatically, without involving the operator.
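.pp
The diversion service just described is, from the point of view of the exchange software, a small piece of decision logic. The fragment below is a hypothetical sketch (in Python); the mode names, the default time-out, and the function itself are invented for illustration and are not taken from the System\ X implementation.
.nf
.in 0.3i
# Hypothetical sketch of the three diversion modes described above.
# Nothing here is drawn from the actual System X software.
def divert_target(mode, divert_to, busy, seconds_unanswered, timeout=15):
    # Return the number the call should be sent to, or None to ring normally.
    if mode == "immediate":                 # divert every incoming call at once
        return divert_to
    if mode == "on_busy" and busy:          # divert only when the line is engaged
        return divert_to
    if mode == "on_no_reply" and seconds_unanswered >= timeout:
        return divert_to                    # divert after the specified period
    return None

# Example: a busy line whose owner has asked for diversion on busy
print(divert_target("on_busy", "2345", busy=True, seconds_unanswered=0))
.in 0
.fi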
.pp Making use of these facilities presents the caller with something of a problem. With conventional telephone exchanges, feedback is provided on what is happening to a call by the use of four tones \(em the dial tone, the busy tone, the ringing tone, and the number unavailable tone. For the more sophisticated interaction which is expected on the advanced exchange, a much greater variety of status signals is required. The obvious solution is to use computer-generated spoken messages to inform the caller when these services are invoked, and to guide him through the sequences of actions needed to set up facilities like call re-direction. For example, the messages used by the exchange when a user accesses the alarm call service are .LB .NI Alarm call service. Dial the time of your alarm call followed by square\u\(dg\d. .FN 1 \(dg\d"Square" is the term used for the "#" key on the touch-tone telephone.\u .EF .NI You have booked an alarm call for seven thirty hours. .NI Alarm call operator. At the third stroke it will be seven thirty. .LE .pp Because of the rather small vocabulary, the number of messages that can be stored in their entirety rather than being formed by concatenation of smaller units, and the short time which was available for development, System\ X stores speech as a time waveform, slightly compressed by a time-domain encoding operation (such techniques are described in Chapter 3). Utterances which contain variable parts, like the time of alarm in the messages above, are formed by inserting separately-recorded digits in a fixed "carrier" message. No attempt is made to apply uniform intonation contours to the synthetic utterances. The resulting speech is of excellent quality (being a slightly compressed recording of a human voice), but sometimes exhibits somewhat anomalous pitch contours. For example, the digits comprising numbers often sound rather jerky and out-of-context \(em which indeed they are. .pp Even more advanced facilities can be expected on telephone exchanges in the future. A message storage capability is one example. Although automatic call recording machines have been available for years, a centralized facility could time and date a message, collect the caller's identity (using the telephone keypad), and allow the recipient to select messages left for him through an interactive dialogue so that he could control the order in which he listens to them. He could choose to leave certain messages to be dealt with later, or re-route them to a colleague. He may even wish to leave reminders for himself, to be dialled automatically at specified times (like alarm calls with user-defined information attached). The sender of a message could be informed automatically by the system when it is delivered. None of this requires speech recognition, but it does need economical speech .ul storage, and also speech .ul synthesis (for time and date tags). .sh "1.5 Travel consultant" .pp Some current research in speech communication with computers is aimed at embedding the dialogue in a task environment which provides it with an overall contextual framework. This in fact was the motivation for the experimental telephone enquiry service described above, and it was observed that people are able to tolerate rather low quality speech if they feel that the interaction is worthwhile. One step further is the use of speech .ul input as well as output, and here the context of the conversation which is provided by the task environment can be used to improve recognition accuracy significantly. 
A system set up at Bell Telephone Laboratories indicates what we can expect to find operating commercially in the future (Levinson and Shipley, 1980). .[ Levinson Shipley 1980 .] It provides timetable information about airline flights, and allows the user to make reservations. .pp An example dialogue is given in Table 1.7. .RF .fi .nh .na .in 0.3i .nr x0 \w'COMPUTER: ' .nr x1 \w'CALLER: ' .in+\n(x0u .ti-\n(x0u COMPUTER: This is the Bell Laboratories flight information system. May I help you? .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I would like to go to Chicago. .ti-\n(x0u COMPUTER: Please specify the day of your departure. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I want to leave at four p.m.. .ti-\n(x0u COMPUTER: Please specify the day of your departure. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I want to depart on the two five May. .ti-\n(x0u COMPUTER: Flight number nine two leaves New York at four p.m. arrives in Chicago at five twenty five p.m.. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' What plane is on the flight? .ti-\n(x0u COMPUTER: The aircraft is a Boeing seven twenty seven. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' How much is the fare? .ti-\n(x0u COMPUTER: The round trip coach fare is one hundred forty two dollars. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I want a first class seat. .ti-\n(x0u COMPUTER: One seat thank you. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' What is the fare? .ti-\n(x0u COMPUTER: The round trip first class fare is one hundred eighty four dollars. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' Please repeat the arrival time. .ti-\n(x0u COMPUTER: The arrival time is five twenty five p.m.. .ti-\n(x0u CALLER:\h'\n(x0u-\n(x1u' I will take the flight. .ti-\n(x0u COMPUTER: Your reservation on flight number nine two to Chicago on Friday is confirmed. Thank you. .in 0 .nf .MT 2 Table 1.7 A conversation with an airline flight information service (from Levinson and Shipley, 1980) .TE .[ Levinson Shipley 1980 .] The user dials the system from an ordinary telephone. The recognition side must be trained by each user, and accepts isolated words spoken with brief pauses between them. The voice response unit has a vocabulary of around 200 words, and synthesizes its answers by slotting words into "templates" evoked by the speech understanding part in response to a query. For example, .LB .NI This flight makes \(em stops .NI Flight number \(em leaves \(em at \(em , arrives in \(em at \(em .LE are templates which when called with specific slot fillers could produce the utterances .LB .NI This flight makes three stops .NI Flight number nine two leaves New York at four p.m., arrives in Chicago at five twenty-five p.m. .LE The chief research interest of the system is in its speech understanding capabilities, and the method used for speech output is relatively straightforward. The templates and words are recorded, digitized, compressed slightly, and stored on disk files (totalling a few hundred thousand bytes of storage), using techniques similar to those of System\ X. Again, no independent manipulation of pitch is possible, and so the utterances sound intelligible but the transition between templates and slot fillers is not completely fluent. However, the overall context of the interaction means that the communication is not seriously disrupted even if the machine occasionally misunderstands the man or vice versa. The user's attention is drawn away from recognition accuracy and focussed on the exchange of information with the machine. 
The authors conclude that progress in speech recognition can best be made by studying it in the context of communication rather than in a vacuum or as part of a one-way channel, and the same is undoubtedly true of speech synthesis as well. .sh "1.6 Reading machine for the blind" .pp Perhaps the most advanced attempt to provide speech output from a computer is the Kurzweil reading machine for the blind, first marketed in the late 1970's (Figure 1.4). .FC "Figure 1.4" This device reads an ordinary book aloud. Users adjust the reading speed according to the content of the material and their familiarity with it, and the maximum rate has recently been improved to around 225 words per minute \(em perhaps half as fast again as normal human speech rates. .pp As well as generating speech from text, the machine has to scan the document being read and identify the characters presented to it. A scanning camera is used, controlled by a program which searches for and tracks the lines of text. The output of the camera is digitized, and the image is enhanced using signal-processing techniques. Next each individual letter must be isolated, and its geometric features identified and compared with a pre-stored table of letter shapes. Isolation of letters is not at all trivial, for many type fonts have "ligatures" which are combinations of characters joined together (for example, the letters "fi" are often run together.) The machine must cope with many printed type fonts, as well as typewritten ones. The text-recognition side of the Kurzweil reading machine is in fact one of its most advanced features. .pp We will discuss the problem of speech generation from text in Chapter 9. It has many facets. First there is pronunciation, the translation of letters to sounds. It is important to take into account the morphological structure of words, dividing them into "root" and "endings". Many words have concatenated suffixes (like "like-li-ness"). These are important to detect, because a final "e" which appears on a root word is not pronounced itself but affects the pronunciation of the previous vowel. Then there is the difficulty that some words look the same but are pronounced differently, depending on their meaning or on the syntactic part that they play in the sentence. Appropriate intonation is extremely difficult to generate from a plain textual representation, for it depends on the meaning of the text and the way in which emphasis is given to it by the reader. Similarly the rhythmic structure is important, partly for correct pronunciation and partly for purposes of emphasis. Finally the sounds that have been deduced from the text need to be synthesized into acoustic form, taking due account of the many and varied contextual effects that occur in natural speech. This by itself is a challenging problem. .pp The performance of the Kurzweil reading machine is not good. While it seems to be true that some blind people can make use of it, it is far from comprehensible to an untrained listener. For example, it will miss out words and even whole phrases, hesitate in a stuttering manner, blatantly mis-pronounce many words, fail to detect "e"s which should be silent, and give completely wrong rhythms to words, making them impossible to understand. Its intonation is decidedly unnatural, monotonous, and often downright misleading. 
When it reads completely new text to people unfamiliar with its quirks, they invariably fail to understand more than an odd word here and there, and do not improve significantly when the text is repeated more than once. Naturally performance improves if the material is familiar or expected in some way. One useful feature is the machine's ability to spell out difficult words on command from the user. .pp While not wishing to denigrate the Kurzweil machine, which is a remarkable achievement in that it integrates together many different advanced technologies, there is no doubt that the state of the art in speech synthesis directly from unadorned text is extremely primitive, at present. It is vital not to overemphasize the potential usefulness of abysmal speech, which takes a great deal of training on the part of the user before it becomes at all intelligible. To make a rather extreme analogy, Morse code could be used as audio output, requiring a great deal of training, but capable of being understood at quite high rates by an expert. It could be generated very cheaply. But clearly the man in the street would find it quite unacceptable as an audio output medium, because of the excessive effort required to learn to use it. In many applications, very bad synthetic speech is just as useless. However, the issue is complicated by the fact that for people who use synthesizers regularly, synthetic speech becomes quite easily comprehensible. We will return to the problem of evaluating the quality of artificial speech later in the book (Chapter 8). .sh "1.7 System considerations for speech output" .pp Fortunately, very many of the applications of speech output from computers do not need to read unadorned text. In all the example systems described above (except the reading machine), it is enough to be able to store utterances in some representation which can include pre-programmed cues for pronunciation, rhythm, and intonation in a much more explicit way than ordinary text does. .pp Of course, techniques for storing audio information have been in use for decades. For example, a domestic cassette tape recorder stores speech at much better than telephone quality at very low cost. The method of direct recording of an analogue waveform is currently used for announcements in the telephone network to provide information such as the time, weather forecasts, and even bedtime stories. However, it is difficult to provide rapid access to messages stored in analogue form, and although some computer peripherals which use analogue recordings for voice-response applications have been marketed \(em they are discussed briefly at the beginning of Chapter 3 \(em they have been superseded by digital storage techniques. .pp Although direct storage of a digitized audio waveform is used in some voice-response systems, the approach has certain limitations. The most obvious one is the large storage requirement: suitable coding can reduce the data-rate of speech to as little as one hundredth of that needed by direct digitization, and textual representations reduce it by another factor of ten or twenty. (Of course, the speech quality is inevitably compromised somewhat by data-compression techniques.) However, the cost of storage is dropping so fast that this is not necessarily an overriding factor. A more fundamental limitation is that utterances stored directly cannot sensibly be modified in any way to take account of differing contexts. 
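.pp
Before turning to more flexible representations, it is worth putting the reduction factors just mentioned alongside the vocabularies met earlier in the chapter. The short calculation below is a sketch in Python; the 64\ kbit/s starting point assumes telephone-quality digitization at 8000 eight-bit samples per second, a typical figure rather than one quoted in this chapter. The other numbers come from the examples above (13 seconds of calculator speech, 25 seconds of wiring vocabulary, a hundredfold reduction from heavy coding, and 75\ bit/s for a phonetic representation):
.nf
.in 0.3i
# Rough storage arithmetic for the figures quoted in this chapter.
DIRECT = 8000 * 8          # bit/s: assumed telephone-quality digitization
CODED = DIRECT / 100       # "as little as one hundredth" of the direct rate
PHONETIC = 75              # bit/s: synthesis-by-rule input, Section 1.3

def kbits(seconds, rate):
    return seconds * rate / 1000.0

for name, seconds in [("talking calculator", 13), ("wiring vocabulary", 25)]:
    print(name,
          "direct %.0f kbit," % kbits(seconds, DIRECT),
          "coded %.0f kbit," % kbits(seconds, CODED),
          "phonetic %.1f kbit" % kbits(seconds, PHONETIC))
.in 0
.fi
Even the heavily coded waveform occupies several times the storage of the phonetic text, which is why the telephone enquiry service of Section 1.3 could hold an enormous vocabulary on a small computer system.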
.pp If the results of certain kinds of analyses of utterances are stored, instead of simply the digitized waveform, a great deal more flexibility can be gained. It is possible to separate out the features of intonation and amplitude from the articulation of the speech, and this raises the attractive possibility of regenerating utterances with pitch contours different from those with which they were recorded. The primary analysis technique used for this purpose is .ul linear prediction of speech, and this is treated in some detail in Chapter 6. It also reduces drastically the data-rate of speech, by a factor of around 50. It is likely that many voice-response systems in the short- and medium-term future will use linear predictive representations for utterance storage. .pp For maximum flexibility, however, it is preferable to store a textual representation of the utterance. There is an important distinction between speech .ul storage, where an actual human utterance is recorded, perhaps processed to lower the data-rate, and stored for subsequent regeneration when required, and speech .ul synthesis, where the machine produces its own individual utterances which are not based on recordings of a person saying the same thing. The difference is summarized in Figure 1.5. .FC "Figure 1.5" In both cases something is stored: for the first it is a direct representation of an actual human utterance, while for the second it is a typed .ul description of the utterance in terms of the sounds, or phonemes, which constitute it. The accent and tone of voice of the human speaker will be apparent in the stored speech output, while for synthetic speech the accent is the machine's and the tone of voice is determined by the synthesis program. .pp Probably the most attractive representation of utterances in man-machine systems is ordinary English text, as used by the Kurzweil reading machine. But, as noted above, this poses extraordinarily difficult problems for the synthesis procedure, and these inevitably result in severely degraded speech. Although in the very long term these problems may indeed be solved, most speech output systems can adopt as their representation of an utterance a description of it which explicitly conveys the difficult features of intonation, rhythm, and even pronunciation. In the kind of applications described above (barring the reading machine), input will be prepared by a programmer as he builds the software system which supports the interactive dialogue. Although it is important that the method of specifying utterances be easily learned, it is not necessary that plain English is used. It should be simple for the programmer to enter new utterances and modify them on-line in cut-and-try attempts to render the man-machine dialogue as natural as possible. A phonetic input can be quite adequate for this, especially if the system allows the programmer to hear immediately the synthesized version of the message he types. Furthermore, markers which indicate rhythm and intonation can be added to the message so that the system does not have to deduce these features by attempting to "understand" the plain text. .pp This brings us to another disadvantage of speech storage as compared with speech synthesis. To provide utterances for a voice response system using stored human speech, one must assemble together special input hardware, a quiet room, and (probably) a dedicated computer. 
If the speech is to be heavily encoded, either expensive special hardware is required or the encoding process, if performed by software on a general-purpose computer, will take a considerable length of time (perhaps hundreds of times real-time). In either case, time-consuming editing of the speech will be necessary, with follow-up recordings to clarify sections of speech which turn out to be unsuitable or badly recorded. If at a later date the voice response system needs modification, it will be necessary to recall the same speaker, or re-record the entire utterance set. This discourages the application programmer from adjusting his dialogue in the light of experience. Synthesizing from a textual representation, on the other hand, allows him to change a speech prompt as simply as he could a VDU one, and evaluate its effect immediately. .pp We will return to methods of digitizing and compacting speech in Chapters 3 and 4, and carry on to consider speech synthesis in subsequent chapters. Firstly, however, it is necessary to take a look at what speech is and how people produce it. .sh "1.8 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "1.9 Further reading" .pp There are remarkably few general books on speech output, although a substantial specialist literature exists for the subject. In addition to the references listed above, I suggest that you look at the following. .LB "nn" .\"Ainsworth-1976-1 .]- .ds [A Ainsworth, W.A. .ds [D 1976 .ds [T Mechanisms of speech recognition .ds [I Pergamon .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n A nice, easy-going introduction to speech recognition, this book covers the acoustic structure of the speech signal in a way which makes it useful as background reading for speech synthesis as well. It complements Lea, 1980, cited above; which presents more recent results in greater depth. .in-2n .\"Flanagan-1973-2 .]- .ds [A Flanagan, J.L. .as [A " and Rabiner, L.R. (Editors) .ds [D 1973 .ds [T Speech synthesis .ds [I Wiley .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n This is a collection of previously-published research papers on speech synthesis, rather than a unified book. It contains many of the classic papers on the subject from 1940\ -\ 1972, and is a very useful reference work. .in-2n .\"LeBoss-1980-3 .]- .ds [A LeBoss, B. .ds [D 1980 .ds [K * .ds [T Speech I/O is making itself heard .ds [J Electronics .ds [O May\ 22 .ds [P 95-105 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n The magazine .ul Electronics is an excellent source of up-to-the-minute news, product announcements, titbits, and rumours in the commercial speech technology world. This particular article discusses the projected size of the voice output market and gives a brief synopsis of the activities of several interested companies. .in-2n .\"Witten-1980-5 .]- .ds [A Witten, I.H. .ds [D 1980 .ds [T Communicating with microcomputers .ds [I Academic Press .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n A recent book on microcomputer technology, this is unusual in that it contains a major section on speech communication with computers (as well as ones on computer buses, interfaces, and graphics). .in-2n .LE "nn" .EQ delim $$ .EN .CH "2 WHAT IS SPEECH?" .ds RT "What is speech? .ds CX "Principles of computer speech .pp People speak by using their vocal cords as a sound source, and making rapid gestures of the articulatory organs (tongue, lips, jaw, and so on). 
The resulting changes in shape of the vocal tract allow production of the different sounds that we know as the vowels and consonants of ordinary language. .pp What is it necessary to learn about this process for the purposes of speech output from computers? That depends crucially upon how speech is represented in the system. If utterances are stored as time waveforms \(em and this is what we will be discussing in the next chapter \(em the structure of speech is not important. If frequency-related parameters of particular natural utterances are stored, then it is advantageous to take into account some of the acoustic properties of the speech waveform. .pp This point can be brought into focus by contrasting the transmission (or storage) of speech with that of real-life television pictures, as has been proposed for a videophone service. Massive data reductions, of the order of 50:1, can be achieved for speech, using techniques that are described in later chapters. For pictures, data reduction is still an important issue \(em even more so for the videophone than for the telephone, because of the vastly higher information rates involved. Unfortunately, the potential for data reduction is much smaller \(em nothing like the 50:1 figure quoted above. This is because speech sounds have definite characteristics, imparted by the fact that they are produced by a human vocal tract, which can be exploited for data reduction. Television pictures have no equivalent generative structure, for they show just those things that the camera points at. .pp Moving up from frequency-related parameters of .ul particular utterances, it is possible to store such parameters in a .ul general form which characterizes the sound segments that appear in spoken language. This immediately raises the issue of .ul classification of sound segments, to form a basis for storing generalized acoustic information and for retrieval of the information needed to synthesize any particular utterance. Speech is by nature continuous, and any synthesis system based upon discrete classification must come to terms with this by tackling the problems of transition from one segment to another, and local modification of sound segments as a function of their context. .pp This brings us to another level of representation. So far we have talked of the .ul acoustic nature of speech, but when we have to cope with transitions between discrete sound segments it may be fruitful to consider .ul articulatory properties as well. Any model of the speech production process is in effect a model of the articulatory process that generates the speech. Some speech research is concerned with modelling the vocal tract directly, rather than modelling the acoustic output from it. One might specify, for example, position of tongue and posture of jaw and lips for a vowel, instead of giving frequency-related characteristics of it. This is a potent tool in linguistic research, for it brings one closer to human production of speech \(em in particular to the connection between brain and articulators. .pp Articulatory synthesis holds a promise of high-quality speech, for the transitional effects caused by tongue and jaw inertia can be modelled directly. However, this potential has not yet been realized. Speech from current articulatory models is of much poorer quality than that from acoustically-based synthesis methods. 
The major problem is in gaining data about articulatory behaviour during running speech \(em it is much easier to perform acoustic analysis on the resulting sound than it is to examine the vocal organs in action. Because of this, the subject is not treated in this book. We will only look at articulatory properties insofar as they help us to understand, in a qualitative way, the acoustic nature of speech. .pp Speech, however, is much more than mere articulation. Consider \(em admittedly a rather extreme and chauvinistic example \(em the number of ways a girl can say "yes". Breathy voice, slow tempo, low pitch \(em these are all characteristics which affect the utterance as a whole, rather than being classifiable into individual sound segments. Linguists call them "prosodic" or "suprasegmental" features, for they relate to overall aspects of the utterance, and distinguish them from "segmental" ones which concern the articulation of individual segments of syllables. The most important prosodic features are pitch, or fundamental frequency of the voice, and rhythm. .pp This chapter provides a brief introduction to the nature of the speech signal. Depending upon what speech output techniques we use, it may be necessary to understand something of the acoustic nature of the speech signal; the system that generates it (the vocal tract); commonly-used classifications of sound segments; and the prosodic aspects of speech. This material is little used in the early chapters of the book, but becomes increasingly important as the story unfolds. Hence you may skip the remainder of this chapter if you wish, but should return to it later to pick up more background whenever it becomes necessary. .sh "2.1 The anatomy of speech" .pp The so-called "voiced" sounds of speech \(em like the sound you make when you say "aaah" \(em are produced by passing air up from the lungs through the larynx or voicebox, which is situated just behind the Adam's apple. The vocal tract from the larynx to the lips acts as a resonant cavity, amplifying certain frequencies and attenuating others. .pp The waveform generated by the larynx, however, is not simply sinusoidal. (If it were, the vocal tract resonances would merely give a sine wave of the same frequency but amplified or attenuated according to how close it was to the nearest resonance.) The larynx contains two folds of skin \(em the vocal cords \(em which blow apart and flap together again in each cycle of the pitch period. The pitch of a male voice in speech varies from as low as 50\ Hz (cycles per second) to perhaps 250\ Hz, with a typical median value of 100\ Hz. For a female voice the range is higher, up to about 500\ Hz in speech. Singing can go much higher: a top C sung by a soprano has a frequency of just over 1000\ Hz, and some opera singers can reach substantially higher than this. .pp The flapping action of the vocal cords gives a waveform which can be approximated by a triangular pulse (this and other approximations will be discussed in Chapter 5). It has a rich spectrum of harmonics, decaying at around 12\ dB/octave, and each harmonic is affected by the vocal tract resonances. .rh "Vocal tract resonances." A simple model of the vocal tract is an organ-pipe-like cylindrical tube (Figure 2.1), with a sound source at one end (the larynx) and open at the other (the lips). 
.FC "Figure 2.1" This has resonances at wavelengths $4L$, $4L/3$, $4L/5$, ..., where $L$ is the length of the tube; and these correspond to frequencies $c/4L$, $3c/4L$, $5c/4L$, ...\ Hz, $c$ being the speed of sound in air. Calculating these frequencies, using a typical figure for the distance between larynx and lips of 17\ cm, and $c = 340$\ m/s for the speed of sound, leads to resonances at approximately 500\ Hz, 1500\ Hz, 2500\ Hz, ... . .pp When excited by the harmonic-rich waveform of the larynx, the vocal tract resonances produce peaks known as .ul formants in the energy spectrum of the speech wave (Figure 2.2). .FC "Figure 2.2" The lowest formant, called formant one, varies from around 200\ Hz to 1000\ Hz during speech, the exact range depending on the size of the vocal tract. Formant two varies from around 500 to 2500\ Hz, and formant three from around 1500 to 3500\ Hz. .pp You can easily hear the lowest formant by whispering the vowels in the words "heed", "hid", "head", "had", "hod", "hawed", and "who'd". They appear to have a steadily descending pitch, yet since you are whispering there is no fundamental frequency. What you hear is the lowest resonance of the vocal tract \(em formant one. Some masochistic people can play simple tunes with this formant by putting their mouth in successive vowel shapes and knocking the top of their head with their knuckles \(em hard! .pp A difficulty occurs when trying to identify the lower formants for speakers with high-pitched voices. When a formant frequency falls below the fundamental excitation frequency of the voice, its effect is diminished \(em although it is still present. The vibrato used by opera singers provides a very low-frequency excitation (at the vibrato rate) which helps to illuminate the lower formants even when the pitch of the voice is very high. .pp Of course, speech is not a static phenomenon. The organ-pipe model describes the speech spectrum during a continuously held vowel with the mouth in a neutral position such as for "aaah". But in real speech the tongue and lips are in continuous motion, altering the shape of the vocal tract and hence the positions of the resonances. It is as if the organ-pipe were being squeezed and expanded in different places all the time. Say .ul ee as in "heed" and feel how close your tongue is to the roof of your mouth, causing a constriction near the front of the vocal cavity. .pp Linguists and speech engineers use a special frequency analyser called a "sound spectrograph" to make a three-dimensional plot of the variation of the speech energy spectrum with time. Figure 2.3 shows a spectrogram of the utterance "go away". .FC "Figure 2.3" Frequency is given on the vertical axis, and bands are shown at the beginning to indicate the scale. Time is plotted horizontally, and energy is given by the darkness of any particular area. The lower few formants can be seen as dark bands extending horizontally, and they are in continuous motion. In the neutral first vowel of "away", the formant frequencies pass through approximately the 500\ Hz, 1500\ Hz, and 2500\ Hz that we calculated earlier. (In fact, formants two and three are somewhat lower than these values.) .pp The fine vertical striations in the spectrogram correspond to single openings of the vocal cords. Pitch changes continuously throughout an utterance, and this can be seen on the spectrogram by the differences in spacing of the striations. Pitch change, or .ul intonation, is singularly important in lending naturalness to speech. 
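.pp
Before leaving the organ-pipe model, it is worth checking the arithmetic behind the formant figures quoted above. The following fragment (a sketch in Python, using the 17\ cm tract length and 340\ m/s speed of sound given earlier) evaluates the odd quarter-wavelength resonances:
.nf
.in 0.3i
# Resonances of a tube closed at the larynx and open at the lips:
# wavelengths 4L, 4L/3, 4L/5, ..., i.e. frequencies (2n-1)c/4L.
SPEED_OF_SOUND = 340.0   # m/s
TRACT_LENGTH = 0.17      # m, typical larynx-to-lips distance

for n in range(1, 4):
    frequency = (2 * n - 1) * SPEED_OF_SOUND / (4 * TRACT_LENGTH)
    print("resonance", n, "is about", round(frequency), "Hz")
# prints 500, 1500 and 2500 Hz, the values quoted for the neutral vowel
.in 0
.fi
Halving the effective length of the tube roughly doubles all three figures, which is one reason why the formant ranges quoted above depend on the size of the vocal tract.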
.pp On a spectrogram, a continuously held vowel shows up as a static energy spectrum. But beware \(em what we call a vowel in everyday language is not the same thing as a "vowel" in phonetic terms. Say "I" and feel how the tongue moves continuously while you're speaking. Technically, this is a .ul diphthong or slide between two vowel positions, and not a single vowel. If you say .ul ar as in "hard", and change slowly to .ul ee as in "heed", you will obtain a diphthong not unlike that in "I". And there are many more phonetically different vowel sounds than the a, e, i, o, and u that we normally think of. The words "hood" and "mood" have different vowels, for example, as do "head" and "mead". The principal acoustic difference between the various vowel sounds is in the frequencies of the first two formants. .pp A further complication is introduced by the nasal tract. This is a large cavity which is coupled to the oral tract by a passage at the back of the mouth. The passage is guarded by a flap of skin called the "velum". You know about this because inadvertent opening of the velum while swallowing causes food or drink to go up your nose. The nasal cavity is switched in and out of the vocal tract by the velum during speech. It is used for consonants .ul m, .ul n, and the .ul ng sound in the word "singing". Vowels are frequently nasalized too. A very effective demonstration of the amount of nasalization in ordinary speech can be obtained by cutting a nose-shaped hole in a large baffle which divides a room, speaking normally with one's nose in the hole, and having someone listen on the other side. The frequency of occurrence of nasal sounds, and the volume of sound that is emitted through the nose, are both surprisingly large. Interestingly enough, when we say in conversation that someone sounds "nasal", we usually mean "non-nasal". When the nasal passages are blocked by a cold, nasal sounds are missing \(em .ul n\c \&'s turn into .ul d\c \&'s, and .ul m\c \&'s to .ul b\c \&'s. .pp When the nasal cavity is switched in to the vocal tract, it introduces formant resonances, just as the oral cavity does. Although we cannot alter the shape of the nasal tract significantly, the nasal formant pattern is not fixed, because the oral tract does play a part in nasal resonances. If you say .ul m, .ul n, and .ul ng continuously, you can hear the difference and feel how it is produced by altering the combined nasal/oral tract resonances with your tongue position. The nasal cavity operates in parallel with the oral one: this causes the two resonance patterns to be summed together, with resulting complications which will be discussed in Chapter 5. .rh "Sound sources." Speech involves sounds other than those caused by regular vibration of the larynx. When you whisper, the folds of the larynx are held slightly apart so that the air passing between them becomes turbulent, causing a noisy excitation of the resonant cavity. The formant peaks are still present, superimposed on the noise. Such "aspirated" sounds occur in the .ul h of "hello", and for a very short time after the lips are opened at the beginning of "pit". .pp Constrictions made in the mouth produce hissy noises such as .ul ss, .ul sh, and .ul f. For example, in .ul ss the tip of the tongue is high up, very close to the roof of the mouth. Turbulent air passing through this constriction causes a random noise excitation, known as "frication". Actually, the roof of the mouth is quite a complicated object. 
You can feel with your tongue a bony hump or ridge just behind the front teeth, and it is this that forms a constriction with the tongue for .ul s. In .ul sh, the tongue is flattened close to the roof of the mouth slightly farther back, in a position rather similar to that for .ul ee but with a narrower constriction, while .ul f is produced with the upper teeth and lower lip. Because they are made near the front of the mouth, the resonances of the vocal tract have little effect on these fricative sounds. .pp To distinguish them from aspiration and frication, the ordinary speech sounds (like "aaah") which have their source in larynx vibration are known technically as "voiced". Aspirated and fricative sounds are called "unvoiced". Thus the three different sound types can be classified as .LB .NP voiced .NP unvoiced (fricative) .NP unvoiced (aspirated). .LE Can any of these three types occur together? It would seem that voicing and aspiration cannot, for the former requires the larynx to be vibrating regularly, but for the latter it must be generating turbulent noise. However, there is a condition known technically as "breathy voice" which occurs when the vocal cords are slightly apart, still vibrating, but with a large volume of air passing between to create turbulence. Voicing can easily occur in conjunction with frication. Corresponding to .ul s, .ul sh, and .ul f we get the .ul voiced fricatives .ul z, the sound in the middle of words like "vision" which I will call .ul zh, and .ul v. A simple illustration of voicing is to say "ffffvvvvffff\ ...". During the voiced part you can feel the larynx vibrations with a finger on your Adam's apple, and it can be heard quite clearly if you stop up your ears. Technically, there is nothing to prevent frication and aspiration from occurring together \(em they do, for example, when a voiced fricative is whispered \(em but the combination is not an important one. .pp The complicated acoustic effects of noisy excitations in speech can be seen in the spectrogram in Figure 2.4 of "high altitude jets whizz past screaming". .FC "Figure 2.4" .rh "The source-filter model of speech production." We have been talking in terms of a sound source (be it voiced or unvoiced) exciting the resonances of the oral (and possibly the nasal) tract. This model, which is used extensively in speech analysis and synthesis, is known as the source-filter model of speech production. The reason for its success is that the effect of the resonances can be modelled as a frequency-selective filter, operating on an input which is the source excitation. Thus the frequency spectrum of the source is modified by multiplying it by the frequency characteristic of the filter (or adding it, if amplitudes are expressed logarithmically). This can be seen in Figure 2.5, which shows a source spectrum and filter characteristic which combine to give the overall spectrum of Figure 2.2. .FC "Figure 2.5" .pp Although, as mentioned above, the various fricatives are not subjected to the resonances of the vocal tract to the same extent that voiced and aspirated sounds are, they can still be modelled as a noise source followed by a filter to give them their different sound qualities. .pp The source-filter model is an oversimplification of the actual speech production system. There is inevitably some coupling between the vocal tract and the lungs, through the glottis, during the period when it is open. This effectively makes the filter characteristics change during each individual cycle of the excitation.
However, although the effect is of interest to speech researchers, it is probably not of great significance for practical speech output. .pp One very interesting implication of the source-filter model is that the prosodic features of pitch and amplitude are largely properties of the source; while segmental ones are introduced by the filter. This makes it possible to separate some aspects of overall prosody from the actual segmental content of an utterance, so that, for example, a human utterance can be stored initially and then spoken by a machine with a variety of different intonations. .sh "2.2 Classification of speech sounds" .pp The need to classify sound segments as a basis for storing generalized acoustic information and retrieving it was mentioned earlier. There is a real difficulty here because speech is by nature continuous and classifications are discrete. It is important to remember this difficulty because it is all too easy to criticize the complex and often confusing attempts of linguists to tackle the classification task. .pp Linguists call a written representation of the .ul sounds of an utterance a "phonetic transcription" of it. The same utterance can be transcribed at different levels of detail: simple transcriptions are called "broad" and more specific ones are called "narrow". Perhaps the most logically satisfying kind of transcription employs units termed "phonemes". This is the broadest transcription, and is sometimes called a .ul phonemic transcription to emphasize that it is in terms of phonemes. Unfortunately, the word "phoneme" is often used somewhat loosely. In its true sense, a phoneme is a .ul logical unit, rather than a physical, acoustic, one, and is defined in relation to a particular language by reference to its use in discriminating different words. Classifications of sounds which are based on their semantic role as word-discriminators are called .ul phonological classifications: we could ensure that there is no ambiguity in the sense with which we use the term "phoneme" by calling it a phonological unit, and the phonemic transcription could be called a phonological one. .rh "Broad phonetic transcription." A phoneme is an abstract unit representing a set of different sounds. The issue is confused by the fact that the members of the set actually sound very similar, if not identical, to the untrained ear \(em precisely because the difference between them plays no part in distinguishing words from each other in the particular language concerned. .pp Take the words "key" and "caw", for example. Despite the difference in spelling, both of them begin with a .ul k sound that belongs (in English) to the same phoneme set, called .ul k. However, say them two or three times each, concentrating on the position of the tongue during the .ul k. It is quite different in each case. For "key", it is raised, close to the roof of the mouth, in preparation for the .ul ee, whereas in "caw" it is much lower down. The sound of the .ul k is actually quite different in the two cases. Yet they belong to the same phoneme, for there is no pair of words which relies on this difference to distinguish them \(em "key" and "caw" are obviously distinguished by their vowels, not by the initial consonant. You probably cannot hear clearly the difference between the two .ul k\c \&'s, precisely because they belong to the same phoneme and so the difference is not important (for English).
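.pp
The word-discrimination criterion can be made quite concrete. The sketch below (in Python, with a toy lexicon and invented sound labels that are ours, not standard phonetic notation) simply asks whether substituting one sound for another ever turns a word of the lexicon into a different word; if it never does, the two sounds are candidates for membership of the same phoneme.
.LB
.nf
# Toy lexicon: words as tuples of sound labels (invented for illustration).
LEXICON = {
    ("k_front", "ee"): "key",
    ("k_back", "aw"):  "caw",
    ("p", "i", "g"):   "pig",
    ("d", "i", "g"):   "dig",
}

def discriminates(sound_a, sound_b, lexicon):
    """True if replacing sound_a by sound_b in some word gives a different word."""
    for phones, word in lexicon.items():
        swapped = tuple(sound_b if p == sound_a else p for p in phones)
        if swapped != phones and swapped in lexicon and lexicon[swapped] != word:
            return True
    return False

print(discriminates("p", "d", LEXICON))             # True:  "pig" versus "dig"
print(discriminates("k_front", "k_back", LEXICON))  # False: no pair of words relies on it
.fi
.LE
Of course, a real analysis would range over the whole vocabulary of the language and would test substitution in both directions; the sketch only illustrates the logical form of the test.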
.pp The point is sharpened by considering another language where we make a distinction \(em and hence can hear the difference \(em between two sounds that belong, in the language, to the same phoneme. Japanese does not distinguish .ul r from .ul l. Japanese people .ul do not hear the difference between "lice" and "rice", in the same way that you do not hear the difference between the two .ul k\c \&'s above. Cockneys do not hear, except with a special effort, the difference between "has" and "as", or "haitch" and "aitch", for the Cockney dialect does not recognize initial .ul h\c \&'s. .pp So what is a phoneme? It is a set of sounds whose members do not discriminate between any words in the language under consideration. If you are mathematically minded you could think of it as an equivalence class of sounds, determined by the relationship .LB $sound sub 1$ is related to $sound sub 2$ if $sound sub 1$ and $sound sub 2$ do not discriminate any pair of words in the language. .LE The .ul p and .ul d in "pig" and "dig" belong to different phonemes (in English), because they discriminate the two words. .ul b, .ul f, and .ul j belong to different phonemes again. .ul i and .ul a in "hid" and "had" belong to different phonemes too. Proceeding like this, a list of phonemes can be drawn up. .pp Such a list is shown in Table 2.1, for British English. (The layout of the list does have some significance in terms of different categories of phonemes, which will be explained later.) In fact, linguists use an assortment of English letters, foreign letters, and special symbols to represent phonemes. In this book we use one- or two-letter codes, partly because they are more mnemonic, and partly because they are more suitable for communication to computers using standard peripheral devices. They are a direct transliteration of linguists' standard International Phonetic Association symbols. .RF .nr x1 3m+1.0i+0.5i+0.5i+0.5i+\w'y'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 3m +1.0i +0.5i +0.5i +0.5i +0.5i +0.5i \fIuh\fR (the) \fIp\fR \fIt\fR \fIk\fR \fIa\fR (bud) \fIb\fR \fId\fR \fIg\fR \fIe\fR (head) \fIm\fR \fIn\fR \fIng\fR \fIi\fR (hid) \fIo\fR (hod) \fIr\fR \fIw\fR \fIl\fR \fIy\fR \fIu\fR (hood) \fIaa\fR (had) \fIs\fR \fIz\fR \fIee\fR (heed) \fIsh\fR \fIzh\fR \fIer\fR (heard) \fIf\fR \fIv\fR \fIuu\fR (food) \fIth\fR \fIdh\fR \fIar\fR (hard) \fIch\fR \fIj\fR \fIaw\fR (hoard) \fIh\fR .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 2.1 The phonemes of British English" .pp We will discuss the sounds which make up each of these phoneme classes shortly. First, however, it is worthwhile pointing out some rather tricky points in the definition of these phonemes. .rh "Phonological difficulties." There are snags with phonological classification, as there are in any area where attempts are made to make completely logical statements about human activity. Consider .ul h and the .ul ng in "singing". (\c .ul ng is certainly not an .ul n sound followed by a .ul g sound, although it is true that in some English accents "singing" is rendered with the .ul ng followed by a .ul g at each of its two occurrences.) No words end with .ul h, and none begin with .ul ng. (Notice that we are still talking about British English. In Chinese, the sound .ul ng is a word in its own right, and is a common family name. But we must stick with one language for phonological classification.) Hence it follows that there is no pair of words which is distinguished by the difference between .ul h and .ul ng. 
Technically, they belong to the same phoneme. However, technical considerations in this case must take second place to common sense! .pp The .ul j in "jig" is another interesting case. It can be considered to belong to a .ul j phoneme, or to be a sequence of two phonemes, .ul d followed by .ul zh (the sound in "vision"). There is disagreement on this point in phonetics textbooks, and we do not have the time (nor, probably, the inclination!) to consider the pros and cons of this moot point. I have resolved the matter arbitrarily by writing it as a separate phoneme. The .ul ch in "choose" is a similar case (\c .ul t followed by the .ul sh in "shoes"). .pp Another difficulty, this time where Table 2.1 does not show how to distinguish between two sounds which .ul do discriminate words in many people's English, is the .ul w in "witch" and that in "which". The latter is conventionally transcribed as a sequence of two phonemes, .ul h w. .pp The last few difficulties are all to do with deciding whether a sound belongs to a single phoneme class, or comprises a sequence of sounds each of which belongs to a phoneme. Are the .ul j in "jug", the .ul ch in "chug", and the .ul w in "which", single phonemes or not? The definition above of a phoneme as a "set of sounds whose members do not discriminate any words in the language" does not help us to answer this question. As far as this definition is concerned, we could go so far as to call each and every word of the language an individual phoneme! It is clear that some acoustic evidence, and quite a lot of judgement, is being used when phonemes such as those of Table 2.1 are defined. .pp So much for the consonants. This same problem occurs in vowel sounds, particularly in diphthongs, which are sequences of two vowel-like sounds. Do the vowels of "main" and "man" belong to different phonemes? Clearly so, if they are both transcribed as single units, for they distinguish the two words. Notwithstanding the fact that they are sequences of separate sounds, a logically consistent system could be constructed which gave separate, unitary, symbols to each diphthong. However, it is usual to employ a compound symbol which indicates explicitly the character of the two vowel-like sounds involved. We will transcribe the diphthong of "main" as a sequence of two vowels, .ul e (as in "head") and .ul i (as in "hid", not "I"). This is done primarily for economy of symbols, choosing the constituent sounds on the basis of the closest match to existing vowel sounds. (Note that this again violates purely .ul logical criteria for identifying phonemes.) .rh "Categories of speech sounds." A phoneme is defined as a set of sounds whose members do not discriminate between any words in the language under consideration. The phonemes themselves can be classified into groups which reflect similarities between them. This can be done in many different ways, using various criteria for classification. In fact, one branch of linguistic research is concerned with defining a set of "distinctive features" such that a phoneme class is uniquely identified by the values of the features. Distinctive features are binary, and include such things as voiced\(emunvoiced, fricative\(emnot\ fricative, aspirated\(emunaspirated. We will not be concerned here with such detailed classifications, but it is as well to know that they exist. .pp There is an everyday distinction between vowels and consonants. A vowel forms the nucleus of every syllable, and one or more consonants may optionally surround the vowel.
But the distinction sometimes becomes a little ambiguous. Syllables like .ul sh are commonly uttered and certainly do not contain a vowel. Furthermore, when we say "vowel" in everyday language we usually refer to the .ul written vowels a, e, i, o, and u; there are many more vowel sounds. A vowel in orthography is different to a vowel as a phoneme. Is a diphthong a phonetic vowel? \(em certainly, by the syllable-nucleus criterion; but it is a little different from ordinary vowels because it is a changing sound rather than a constant one. .pp Table 2.2 shows one classification of the phonemes of Table 2.1, which will be useful in our later studies of speech synthesis from phonetics. It shows twelve vowels, including the rather peculiar one .ul uh (which corresponds to the first vowel in the word "above"). This is the sound produced by the vocal tract when it is in a relaxed, neutral position; and it never occurs in prominent, stressed, syllables. The vowels later in the list are almost always longer than the earlier ones. In fact, the first six (\c .ul uh, a, e, i, o, u\c ) are often called "short" vowels, and the last five (\c .ul ee, er, uu, ar, aw\c ) "long" ones. The shortness or longness of the one in the middle (\c .ul aa\c ) is rather ambiguous. .RF .nr x0 \w'000unvoiced fricative 'u .nr x1 \n(x0+\w'[not classified as individual phonemes]'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta \n(x0u .fi vowel \c .ul uh a e i o u aa ee er uu ar aw .br diphthong [not classified as individual phonemes] .br glide (or liquid) \c .ul r w l y .br stop .br \0\0\0unvoiced stop \c .ul p t k .br \0\0\0voiced stop \c .ul b d g .br nasal \c .ul m n ng .br fricative .br \0\0\0unvoiced fricative \c .ul s sh f th .br \0\0\0voiced fricative \c .ul z zh v dh .br affricate .br \0\0\0unvoiced affricate \c .ul ch .br \0\0\0voiced affricate \c .ul j .br aspirate \c .ul h .nf .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 2.2 Phoneme categories" .pp Diphthongs pose no problem here because we have not classified them as single phonemes. .pp The remaining categories are consonants. The glides are quite similar to vowels and diphthongs, though; for they are voiced, continuous sounds. You can say them and prolong them. (This is also true of the fricatives.) .ul r is interesting because it can be realized acoustically in very different ways. Some people curl the tip of the tongue back \(em a so-called retroflex action of the tongue. Many people cannot do this, and their .ul r\c \&'s sound like .ul w\c \&'s. The stage Scotsman's .ul r is a trill where the tip of the tongue vibrates against the roof of the mouth. .ul l is also slightly unusual, for it is the only English phoneme which is "lateral" \(em air passes either side of it, in two separate passages. Welsh has another lateral sound, a fricative, which is written "ll" as in "Llandudno". .pp The next category is the stops. These are formed by stopping up the mouth, so that air pressure builds up behind the lips, and releasing this pressure suddenly. The result is a little explosion (and the stops are often called "plosives"), which usually creates a very short burst of fricative noise (and, in some cases, aspiration as well). They are further subdivided into voiced and unvoiced stops, depending upon whether voicing starts as soon as the plosion occurs (sometimes even before) or well after it. 
If you put your hand in front of your mouth when saying "pit" you can easily feel the puff of air that signals the plosion on the .ul p, and probably on the .ul t as well. .pp In a sense, nasals are really stops as well (and they are often called stops), for the oral tract is blocked although the nasal one is not. The peculiar fact that the nasal .ul ng never occurs at the beginning of a word (in English) was mentioned earlier. Notice that for stops and nasals there is a similarity in the .ul vertical direction of Table 2.2, between .ul p, .ul b, and .ul m; .ul t, .ul d, and .ul n; and .ul k, .ul g, and .ul ng. .ul p is an unvoiced version of .ul b (try saying them), and .ul m is a nasalized version (for .ul b is what you get when you have a cold and try to say .ul m\c ). These three sounds are all made at the front of the mouth, while .ul t, .ul d, and .ul n, which bear the same resemblance to each other, are made in the middle; and .ul k, .ul g, and .ul ng are made at the back. This introduces another possible classification, according to .ul place of articulation. .pp The unvoiced fricatives are quite straightforward, except perhaps for .ul th, which is the sound at the beginning of "thigh". They are paired with the voiced fricatives on the basis of place of articulation. The voiced version of .ul th is the .ul dh at the beginning of "thy". .ul zh is a fairly rare phoneme, which is heard in the middle of "vision". Affricates are similar to fricatives but begin with a stopped posture, and we mentioned earlier the controversy as to whether they should be considered to be single phonemes, or sequences of stop phonemes and fricatives. Finally comes the lonely aspirate, .ul h. Aspiration does occur elsewhere in speech, during the plosive burst of unvoiced stops. .rh "Narrow phonetic transcription." The phonological classification outlined above is based upon a clear rationale for distinguishing between sounds according to how they affect meaning \(em although the rationale does become somewhat muddied in difficult cases. Narrower transcriptions are not so systematic. They use units called .ul allophones, which are defined by reference to physical, acoustic, criteria rather than purely logical ones. ("Phone" is a more old-fashioned term for the same thing, and the misused word "phoneme" is often employed where allophone is meant, that is, as a physical rather than a logical unit.) Each phoneme has several allophones, more or less depending on how narrow or broad the transcription is, and the allophones are different acoustic realizations of the same logical unit. For example, the .ul k\c \&'s in "key" and "caw" may be considered as different allophones (in a slightly narrow transcription). Although we will not use symbols for allophones here, they are often indicated by diacritical marks in a text which modify the basic phoneme classes. For example, a tilde (~) over a vowel means that it is nasalized, while a small circle underneath a consonant means that it is devoiced. .pp Allophonic variation in speech is governed by a mechanism called .ul coarticulation, where a sound is affected by those that come either side of it. "Key"\-"caw" is a clear example of this, where the tongue position in the .ul k anticipates that of the following vowel \(em high in the first case, low in the second. Most allophonic variation in English is anticipatory, in that the sound is influenced by the following articulation rather than by preceding ones. 
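.pp
The way such context-dependent variation is written down as rules can be illustrated with a small sketch. The two rules below are toy ones of our own devising, not a serious fragment of English phonology: one fronts .ul k before a front vowel (the "key" case above), the other nasalizes a vowel when a nasal consonant follows.
.LB
.nf
# Toy anticipatory allophone rules, for illustration only.
FRONT_VOWELS = {"ee", "i", "e"}
NASALS = {"m", "n", "ng"}
VOWELS = {"uh", "a", "e", "i", "o", "u", "aa", "ee", "er", "uu", "ar", "aw"}

def to_allophones(phonemes):
    """Realize a phoneme sequence as allophones by looking one segment ahead."""
    out = []
    for i, p in enumerate(phonemes):
        following = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if p == "k" and following in FRONT_VOWELS:
            out.append("k_fronted")        # tongue anticipates the front vowel
        elif p in VOWELS and following in NASALS:
            out.append(p + "_nasalized")   # velum opens early, before the nasal
        else:
            out.append(p)
    return out

print(to_allophones(["k", "ee"]))       # ['k_fronted', 'ee']
print(to_allophones(["k", "aw"]))       # ['k', 'aw']
print(to_allophones(["m", "aa", "n"]))  # ['m', 'aa_nasalized', 'n']
.fi
.LE
Real phonological rule sets are, of course, far larger and far more carefully conditioned than this, but they share the same shape: a segment, a context, and a rewriting.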
.pp Nasalization is a feature which applies to vowels in English through anticipatory coarticulation. In many languages (for example, French) it is a .ul distinctive feature for vowels in that it serves to distinguish one vowel phoneme class from another. That this is not so in English sometimes tempts us to assume, incorrectly, that nasalization does not occur in vowels. It does, typically when the vowel is followed by a nasal consonant, and it is important for synthesis that nasalized vowel allophones are recognized and treated accordingly. .pp Coarticulation can be predicted by phonological rules, which show how a phonemic sequence will be realized by allophones. Such rules have been studied extensively by linguists. .pp The reason for coarticulation, and for the existence of allophones, lies in the physical constraints imposed by the motion of the articulatory organs \(em particularly their acceleration and deceleration. An immensely crude model is that the brain decides what phonemes to say (for it is concerned with semantic things, and the definition of a phoneme is a semantic one). It then takes this sequence and translates it into neural commands which actually move the articulators into target positions. However, other commands may be issued, and executed, before these targets are reached, and this accounts for coarticulation effects. Phonological rules for converting a phonemic sequence to an allophonic one are a sort of discrete model of the process. Particularly for work involving computers, it is possible that this rule-based approach will be overtaken by potentially more accurate methods which attempt to model the continuous articulatory phenomena directly. .sh "2.3 Prosody" .pp The phonetic classification introduced above divides speech into segments and classifies these into phonemes or allophones. Riding on top of this stream of segments are other, more global, attributes that dictate the overall prosody of the utterance. Prosody is defined by the Oxford English Dictionary as the "science of versification, laws of metre," which emphasizes the aspects of stress and rhythm that are central to classical verse. There are, however, many other features which are more or less global. These are collectively called prosodic or, equivalently, suprasegmental, features, for they lie above the level of phoneme or syllable segments. .pp Prosodic features can be split into two basic categories: features of voice quality and features of voice dynamics. Variations in voice quality, which are sometimes called "paralinguistic" phenomena, are accounted for by anatomical differences and long-term muscular idiosyncrasies (like a sore throat), and have little part to play in the kind of applications for speech output that have been sketched in Chapter 1. Variations in voice dynamics occur in three dimensions: pitch or fundamental frequency of the voice, time, and amplitude. Within the first, the pattern of pitch variation, or .ul intonation, can be distinguished from the overall range within which that variation occurs. The time dimension encompasses the rhythm of the speech, pauses, and the overall tempo \(em whether it is uttered quickly or slowly. The third dimension, amplitude, is of relatively minor importance. Intonation and rhythm work together to produce an effect commonly called "stress", and we will elaborate further on the nature of stress and discuss algorithms for synthesizing intonation and rhythm in Chapter 8. 
.pp These features have a very important role to play in communicating meaning. They are not fancy, optional components. It is their neglect which is largely responsible for the layman's stereotype of computer speech, a caricature of living speech \(em abrupt, arrhythmic, and in a grating monotone \(em which was well characterized by Isaac Asimov when he wrote of speaking "all in capital letters". .pp Timing has a syntactic function in that it sometimes helps to distinguish nouns from verbs (\c .ul ex\c tract versus ex\c .ul tract\c ) and adjectives from verbs (app\c .ul rox\c imate versus approxi\c .ul mate\c ) \(em although segmental aspects play a part here too, for the vowel qualities differ in each pair of words. Nevertheless, if you make a mistake when assigning stress to words like these in conversation you are very likely to be queried as to what you actually said. .pp Intonation has a big effect on meaning too. Pitch often \(em but by no means always \(em rises on a question, the extent and abruptness of the rise depending on features like whether a genuine information-bearing reply or merely confirmation is expected. A distinctive pitch pattern accompanies the introduction of a new topic. In conjunction with rhythm, intonation can be used to bring out contrasts as in .LB .NI "He didn't have a .ul red car, he had a .ul black one." .LE In general, the intonation patterns used by a reader depend not only on the text itself, but on his interpretation of it, and also on his expectation of the listener's interpretation of it. For example: .LB .NI "He had a .ul red car" (I think you thought it was black), .NI "He had a red .ul bi\c cycle" (I think you thought it was a car). .LE .pp In natural speech, prosodic features are significantly influenced by whether the utterance is generated spontaneously or read aloud. The variations in spontaneous speech are enormous. There are all sorts of emotions which are plainly audible in everyday speech: sarcasm, excitement, rudeness, disagreement, sadness, fright, love. Variations in voice quality certainly play a part here. Even with "ordinary" cooperative friendly conversation, the need to find words and somehow fit them into an overall utterance produces great diversity of prosodic structures. Applications for speech output from computers do not, however, call for spontaneous conversation, but for a controlled delivery which is like that when reading aloud. Here, the speaker is articulating utterances which have been set out for him, reducing his cognitive load to one of understanding and interpreting the text rather than generating it. Unfortunately for us, linguists are (quite rightly) primarily interested in living, spontaneous speech rather than pre-prepared readings. .pp Nevertheless, the richness of prosody in speech even when reading from a book should not be underestimated. Read aloud to an audience and listen to the contrasts in voice dynamics deliberately introduced for variety's sake. If stories are to be read there is even a case for controlling voice .ul quality to cope with quotations and affective imitations. .pp We saw earlier that the source-filter model is particularly helpful in distinguishing prosodic features, which are largely properties of the source, from segmental ones, which belong to the filter. Pitch and amplitude are primarily source properties. Rhythm and speed of speaking are not, but neither are they filter properties, for they belong to the source-filter system as a whole and not specifically to either part of it.
The difficult notion of stress is, from an acoustic point of view, a combination of pitch, rhythm, and amplitude. Even some features of voice quality can be attributed to the source (like laryngitis), although others \(em cleft palate, badly-fitting dentures \(em affect segmental features as well. .sh "2.4 Further reading" .pp This chapter has been no more than a cursory introduction to some of the difficult problems of linguistics and phonetics. Here are some readable books which discuss these problems further. .LB "nn" .\"Abercrombie-1967-1 .ds [F 1 .]- .ds [A Abercrombie, D. .ds [D 1967 .ds [T Elements of general phonetics .ds [I Edinburgh University Press .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is an excellent book which covers all of the areas of this chapter, in much more detail than has been possible here. .in-2n .\"Brown-1980-2 .ds [F 2 .]- .ds [A Brown, Gill .as [A ", Currie, K.L. .as [A ", and Kenworthy, J. .ds [D 1980 .ds [T Questions of intonation .ds [I Croom Helm .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n An intensive study of the prosodics of colloquial, living speech is presented, with particular reference to intonation. Although not particularly relevant to speech output from computers, this book gives great insight into how conversational speech differs from reading aloud. .in-2n .\"Fry-1979-1 .ds [F 1 .]- .ds [A Fry, D.B. .ds [D 1979 .ds [T The physics of speech .ds [I Cambridge University Press .ds [C Cambridge, England .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is a simple and readable account of speech science, with a good and completely non-mathematical introduction to frequency analysis. .in-2n .\"Ladefoged-1975-4 .ds [F 4 .]- .ds [A Ladefoged, P. .ds [D 1975 .ds [T A course in phonetics .ds [I Harcourt Brace Jovanovich .ds [C New York .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Usually books entitled "A course on ..." are dreadfully dull, but this is a wonderful exception. An exciting, readable, almost racy introduction to phonetics, full of little experiments you can try yourself. .in-2n .\"Lehiste-1970-5 .ds [F 5 .]- .ds [A Lehiste, I. .ds [D 1970 .ds [T Suprasegmentals .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This fairly comprehensive study of the prosodics of speech complements Ladefoged's book, which is mainly concerned with segmental phonetics. .in-2n .\"O'Connor-1973-1 .ds [F 1 .]- .ds [A O'Connor, J.D. .ds [D 1973 .ds [T Phonetics .ds [I Penguin .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is another introductory book on phonetics. It is packed with information on all aspects of the subject. .in-2n .LE "nn" .EQ delim $$ .EN .CH "3 SPEECH STORAGE" .ds RT "Speech storage .ds CX "Principles of computer speech .pp The most familiar device that produces speech output is the ordinary tape recorder, which stores information in analogue form on magnetic tape. However, this is unsuitable for speech output from computers. One reason is that it is difficult to access different utterances quickly. Although random-access tape recorders do exist, they are expensive and subject to mechanical breakdown because of the stresses associated with frequent starting and stopping. .pp Storing speech on a rotating drum instead of tape offers the possibility of access to any track within one revolution time. For example, the IBM 7770 Audio Response Unit employs drums rotating twice a second which are able to store up to 32 words of 500\ msec each.
These can be accessed randomly, within half a second at most. Although one can arrange to store longer words by allowing overflow on to an adjacent track at the end of the rotation period, the discrete time-slots provided by this system make it virtually impossible for it to generate connected utterances by assembling appropriate words from the store. .pp The Cognitronics Speechmaker has a similar structure, but with the analogue speech waveform recorded on photographic film. Storing audio waveforms optically is not an unusual technique, for this is how soundtracks are recorded on ordinary movie films. The original version of the "speaking clock" of the British Post Office used optical storage in concentric tracks on flat glass discs. It is described by Speight and Gill (1937), who include a fascinating account of how the utterances are synchronized. .[ Speight Gill 1937 .] A 4\ Hz signal from a pendulum clock was used to supply current to an electric motor, which drove a shaft equipped with cams and gears that rotated the glass discs containing utterances for seconds, minutes, and hours at appropriate speeds! .pp A second reason for avoiding analogue storage is price. It is difficult to see how a random-access tape recorder could be incorporated into a talking pocket calculator or child's toy without considerably inflating the cost. Solid-state electronics is much cheaper than mechanics. .pp But the best reason is that, in many of the applications we have discussed, it is necessary to form utterances by concatenating separately-recorded parts. It is totally infeasible, for example, to store each and every possible telephone number as an individual recording! And utterances that are formed by concatenating individual words which were recorded in isolation, or in a different context, do not sound completely natural. For example, in an early experiment, Stowe and Hampton (1961) recorded individual words on acoustic tape, spliced the tape with the words in a different order to make sentences, and played the result to subjects who were scored on the number of key words which they identified correctly. .[ Stowe Hampton 1961 .] The overall conclusion was that while embedding a word in normally-spoken sentences .ul increases the probability of recognition (because the extra context gives clues about the word), embedding a word in a constructed sentence, where intonation and rhythm are not properly rendered, .ul decreases the probability of recognition. When the speech was uttered slowly, however, a considerable improvement was noticed, indicating that if the listener has more processing time he can overcome the lack of proper intonation and rhythm. .pp Nevertheless, many present-day voice response systems .ul do store what amounts to a direct recording of the acoustic wave. However, the storage medium is digital rather than analogue. This means that standard computer storage devices can be used, providing rapid access to any segment of the speech at relatively low cost \(em for the economics of mass-production ensures a low price for random-access digital devices compared with random-access analogue ones. Furthermore, it reduces the amount of special equipment needed for speech output. One can buy very cheap speech input/output interfaces for home computers which connect to standard hobby buses. Another advantage of digital over analogue recording is that integrated circuit read-only memories (ROMs) can be used for hand-held devices which need small quantities of speech. 
Hence this chapter begins by showing how waveforms are stored digitally, and then describes some techniques for reducing the data needed for a given utterance. .sh "3.1 Storing waveforms digitally" .pp When an analogue signal is converted to digital form, it is made discrete both in time and in amplitude. Discretization in time is the operation of .ul sampling, whilst in amplitude it is .ul quantizing. It is worth pointing out that the transmission of analogue information by digital means is called "PCM" (standing for "pulse code modulation") in telecommunications jargon. Much of the theory of digital signal processing investigates signals which are sampled but not quantized (or quantized into sufficiently many levels to avoid inaccuracies). The operation of quantization, being non-linear, is not very amenable to theoretical analysis. Quantization introduces issues such as accumulation of round-off noise in arithmetic operations, which, although they are very important in practical implementations, can only be treated theoretically under certain somewhat unrealistic assumptions (in particular, independence of the quantization error from sample to sample). .rh "Sampling." A fundamental theorem of telecommunications states that a signal can only be reconstructed accurately from a sampled version if it does not contain components whose frequency is greater than half the frequency at which the sampling takes place. Figure 3.1(a) shows how a component of slightly greater than half the sampling frequency can masquerade, as far as an observer with access only to the sampled data can tell, as a component at slightly less than half the sampling frequency. .FC "Figure 3.1" Call the sampling interval $T$ seconds, so that the sampling frequency is $1/T$\ Hz. Then components at $1/2T+f$, $3/2T-f$, $3/2T+f$ and so on all masquerade as a component at $1/2T-f$. Similarly, components at frequencies just under the sampling frequency masquerade as very low-frequency components, as shown in Figure 3.1(b). This phenomenon is often called "aliasing". .pp Thus the continuous, infinite, frequency axis for the unsampled signal, where two components at different frequencies can always be distinguished, maps into a repetitive frequency axis when the signal is sampled. As depicted in Figure 3.2, the frequency interval $[1/T,~ 2/T)$ \u\(dg\d .FN 3 .sp \u\(dg\dIntervals are specified in brackets, with a square bracket representing a closed end of the interval and a round one representing an open one. Thus the interval $[1/T,~ 2/T)$ specifies the range $1/T ~ <= ~ frequency ~ < ~ 2/T$. .EF is mapped back into the band $[0,~ 1/T)$, as are the intervals $[2/T,~ 3/T)$, $[3/T,~ 4/T)$, and so on. .FC "Figure 3.2" Furthermore, the interval $[1/2T,~ 1/T)$ between half the sampling frequency and the sampling frequency, is mapped back into the interval below half the sampling frequency; but this time the mapping is backwards, with frequencies at just under $1/T$ being mapped to frequencies slightly greater than zero, and frequencies just over $1/2T$ being mapped to ones just under $1/2T$. The best way to represent a repeating frequency axis like this is as a circle. Figure 3.3 shows how the linear frequency axis for continuous systems maps on to a circular axis for sampled systems. 
.FC "Figure 3.3" For present purposes it is easiest to imagine the bottom half of the circle as being reflected into the top half, so that traversing the upper semicircle in the anticlockwise direction corresponds to frequencies increasing from 0 to $1/2T$ (half the sample frequency), and returning along the lower semicircle is actually the same as coming back round the upper one, and corresponds to frequencies from $1/2T$ to $1/T$ being mapped into the range $1/2T$ to 0. .pp As far as speech is concerned, then, we must ensure that before sampling a signal no significant components at greater than half the sample frequency are present. Furthermore, the sampled signal will only contain information about frequency components less than this, so the sample frequency must be chosen as twice the highest frequency of interest. For example, consider telephone-quality speech. Telephones provide a familiar standard of speech quality which, although it can only be an approximate "standard", will be much used throughout this book. The telephone network aims to transmit only frequencies lower than 3.4\ kHz. We saw in the previous chapter that this region will contain the information-bearing formants, and some \(em but not all \(em of the fricative and aspiration energy. Actually, transmitting speech through the telephone system degrades its quality very significantly, probably more than you realize since everyone is so accustomed to telephone speech. Try the dial-a-disc service and compare it with high-fidelity music for a striking example of the kind of degradation suffered. .pp For telephone speech, the sampling frequency must be chosen to be at least 6.8\ kHz. Since speech contains significant amounts of energy above 3.4\ kHz, it should be filtered before sampling to remove this; otherwise the higher components would be mapped back into the baseband and distort the low-frequency information. Because it is difficult to make filters that cut off very sharply, the sampling frequency is chosen rather greater than twice the highest frequency of interest. For example, the digital telephone network samples at 8\ kHz. The pre-sampling filter should have a cutoff frequency of 4\ kHz; aim for negligible distortion below 3.4\ kHz; and transmit negligible components above 4.6\ kHz \(em for these are reflected back into the band of interest, namely 0 to 3.4\ kHz. Figure 3.4 shows a block diagram for the input hardware. .FC "Figure 3.4" .rh "Quantization." Before considering specifications for the pre-sampling filter, let us turn from discretization in time to discretization in amplitude, that is, quantization. This is performed by an A/D converter (analogue-to-digital), which takes as input a constant analogue voltage (produced by the sampler) and generates a corresponding binary value as output. The simplest correspondence is .ul uniform quantization, where the amplitude range is split into equal regions by points termed "quantization levels", and the output is a binary representation of the nearest quantization level to the input voltage. Typically, 11-bit conversion is used for speech, giving 2048 quantization levels, and the signal is adjusted to have zero mean so that half the levels correspond to negative input voltages and the other half to positive ones. .pp It is, at first sight, surprising that as many as 11 bits are needed for adequate representation of speech signals. 
Research on the digital telephone network, for example, has concluded that a signal-to-noise ratio of some 26\-27\ dB is enough to avoid undue harshness of quality, loss of intelligibility, and listener fatigue for speech at a comfortable level in an otherwise reasonably good channel. Rabiner and Schafer (1978) suggest that about 36\ dB signal-to-noise ratio would "most likely provide adequate quality in a communications system". .[ Rabiner Schafer 1978 Digital processing of speech signals .] But 11-bit quantization seems to give a very much better signal-to-noise ratio than these figures. To estimate its magnitude, note that for N-bit quantization the error for each sample will lie between .LB $ - ~ 1 over 2 ~. 2 sup -N$ and $+ ~ 1 over 2 ~. 2 sup -N . $ .LE Assuming that it is uniformly distributed in this range \(em an assumption which is likely to be justified if the number of levels is sufficiently large \(em leads to a mean-squared error of .LB .EQ integral from {-2 sup -N-1} to {2 sup -N-1} ~e sup 2 p(e) de, .EN .LE where $p(e)$, the probability density function of the error $e$, is a constant which satisfies the usual probability normalization constraint, namely .LB .EQ integral from {-2 sup -N-1} to {2 sup -N-1} ~ p(e) de ~~=~ 1. .EN .LE Hence $p(e)=2 sup N $, and so the mean-squared error is $2 sup -2N /12$. This is $10 ~ log sub 10 (2 sup -2N /12)$\ dB, or around \-77\ dB for 11-bit quantization. .pp This noise level is relative to the maximum amplitude range of the conversion. A maximum-amplitude sine wave has a power of \-9\ dB relative to the same reference, giving a signal-to-noise ratio of some 68\ dB. This is far in excess of that needed for telephone-quality speech. However, look at the very peaky nature of the typical speech waveform given in Figure 3.5. .FC "Figure 3.5" If clipping is to be avoided, the maximum amplitude level of the A/D converter must be set at a value which makes the power of the speech signal very much less than a maximum-amplitude sine wave. Furthermore, different people speak at very different volumes, and the overall level fluctuates constantly with just one speaker. Experience shows that while 8- or 9-bit quantization may provide sufficient signal-to-noise ratio to preserve telephone-quality speech if the overall speaker levels are carefully controlled, about 11 bits are generally required to provide high-quality representation of speech with a uniform quantization. With 11 bits, a sine wave whose amplitude is only 1/32 of the full-scale value would be digitized with a signal-to-noise ratio of around 36\ dB, the most pessimistic figure quoted above for adequate quality. Even then it is useful if the speaker is provided with an indication of the amplitude of his speech: a traffic-light indicator with red signifying clipping overload, orange a suitable level, and green too low a value, is often convenient for this. .rh "Logarithmic quantization." For the purposes of speech .ul processing, it is essential to have the signal quantized uniformly. This is because all of the theory applies to linear systems, and nonlinearities introduce complexities which are not amenable to analysis. Uniform quantization, although a nonlinear operation, is linear in the limiting case as the number of levels becomes large, and for most purposes its effect can be modelled by assuming that the quantized signal is obtained from the original analogue one by the addition of a small amount of uniformly-distributed quantizing noise, as in fact was done above. 
Usually the quantization noise is disregarded in subsequent analysis. .pp However, the peakiness of the speech signal illustrated in Figure 3.5 leads one to suspect that a non-linear representation, for example a logarithmic one, could provide a better signal-to-noise ratio over a wider range of input amplitudes, and hence be more useful than linear quantization \(em at least for speech storage (and transmission). And indeed this is the case. Linear quantization has the unfortunate effect that the absolute noise level is independent of the signal level, so that an excessive number of bits must be used if a reasonable signal-to-noise ratio is to be achieved for peaky signals. It can be shown that a logarithmic representation like .LB .EQ y ~ = ~ 1 ~ + ~ k ~ log ~ x, .EN .LE where $x$ is the original signal and $y$ is the value which is to be quantized, gives a signal-to-noise .ul ratio which is independent of the input signal level. This relationship cannot be realized physically, for it is undefined when the signal is negative and diverges when it is zero. However, realizable approximations to it can be made which retain the advantages of constant signal-to-noise ratio within a useful range of signal amplitudes. Figure 3.6 shows the logarithmic relation with one widely-used approximation to it, called the A-law. .FC "Figure 3.6" The idea of non-linearly quantizing a signal to achieve adequate signal-to-noise ratios for a wide variety of amplitudes is called "companding", a contraction of "compressing-expanding". The original signal can be retrieved from its A-law compression by antilogarithmic expansion. .pp Figure 3.6 also shows one common coding scheme which is a piecewise linear approximation to the A-law. This provides an 8-bit code, and gives the equivalent of 12-bit linear quantization for small signal levels. It approximates the A-law in 16 linear segments, 8 for positive and 8 for negative inputs. Consider the positive part of the curve. The first two segments, which are actually collinear, correspond exactly to 12-bit linear conversion. Thus the output codes 0 to 31 correspond to inputs from 0 to 31/2048, in equal steps. (Remember that both positive and negative signals must be converted, so a 12-bit linear converter will allocate 2048 levels for positive signals and 2048 for negative ones.) The next segment provides 11-bit linear quantization, output codes 32 to 47 corresponding to inputs from 16/1024 to 31/1024. Similarly, the next segment corresponds to 10-bit quantization, covering inputs from 16/512 to 31/512. And so on, the last section giving 6-bit quantization of inputs from 16/32 to 31/32, the full-scale positive value. Negative inputs are converted similarly. For signal levels of less than 32/2048, that is, $2 sup -6$, this implementation of the A-law provides full 12-bit precision. As the signal level increases, the precision decreases gradually to 6 bits at maximum amplitudes. .pp Logarithmic encoding provides what is in effect a floating-point representation of the input. The conventional floating-point format, however, is not used because many different codes can represent the same value. For example, with a 4-bit exponent preceding a 4-bit mantissa, the words 0000:1000, 0001:0100, 0010:0010, and 0011:0001 represent the numbers $0.1 ~ times ~ 2 sup 0$, $0.01 ~ times ~ 2 sup 1 $, $0.001 ~ times ~ 2 sup 2$, \c and $0.0001 ~ times ~ 2 sup 3$ respectively, which are the same.
(Some floating-point conventions assume that an unwritten "1" bit precedes the mantissa, except when the whole word is zero; but this gives decreased resolution around zero \(em which is exactly where we want the resolution to be greatest.) Table 3.1 shows the 8-bit A-law codes, .RF .in+0.7i .ta 1.6i +\w'bits 1-3 'u 8-bit codeword: bit 0 sign bit bits 1-3 3-bit exponent bits 4-7 4-bit mantissa .sp2 .ta 1.6i 3.5i .ul codeword interpretation .sp 0000 0000 \h'\w'\0-\0 + 'u'$.0000 ~ times ~ 2 sup -7$ \0\0\0... \0\0\0\0... 0000 1111 \h'\w'\0-\0 + 'u'$.1111 ~ times ~ 2 sup -7$ 0001 0000 $2 sup -7 ~~ + ~~ .0000 ~ times ~ 2 sup -7$ \0\0\0... \0\0\0\0... 0001 1111 $2 sup -7 ~~ + ~~ .1111 ~ times ~ 2 sup -7$ 0010 0000 $2 sup -6 ~~ + ~~ .0000 ~ times ~ 2 sup -6$ \0\0\0... \0\0\0\0... 0010 1111 $2 sup -6 ~~ + ~~ .1111 ~ times ~ 2 sup -6$ 0011 0000 $2 sup -5 ~~ + ~~ .0000 ~ times ~ 2 sup -5$ \0\0\0... \0\0\0\0... 0011 1111 $2 sup -5 ~~ + ~~ .1111 ~ times ~ 2 sup -5$ 0100 0000 $2 sup -4 ~~ + ~~ .0000 ~ times ~ 2 sup -4$ \0\0\0... \0\0\0\0... 0100 1111 $2 sup -4 ~~ + ~~ .1111 ~ times ~ 2 sup -4$ 0101 0000 $2 sup -3 ~~ + ~~ .0000 ~ times ~ 2 sup -3$ \0\0\0... \0\0\0\0... 0101 1111 $2 sup -3 ~~ + ~~ .1111 ~ times ~ 2 sup -3$ 0110 0000 $2 sup -2 ~~ + ~~ .0000 ~ times ~ 2 sup -2$ \0\0\0... \0\0\0\0... 0110 1111 $2 sup -2 ~~ + ~~ .1111 ~ times ~ 2 sup -2$ 0111 0000 $2 sup -1 ~~ + ~~ .0000 ~ times ~ 2 sup -1$ \0\0\0... \0\0\0\0... 0111 1111 $2 sup -1 ~~ + ~~ .1111 ~ times ~ 2 sup -1$ 1000 0000 \h'\w'\0-\0 'u'$- ~~ .0000 ~ times ~ 2 sup -7$ negative numbers treated as \0\0\0... \0\0\0\0... above, with a sign bit of 1 1111 1111 \h'-\w'\- 'u'\- $2 sup -1 ~~ - ~~ .1111 ~ times ~ 2 sup -1$ .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 3.1 8-bit A-law codes, with their floating-point equivalents" according to the piecewise linear approximation of Figure 3.6, written in a notation which suggests floating point. Each linear segment has a different exponent except the first two segments, which as explained above are collinear. .pp Logarithmic encoders and decoders are available from many semiconductor manufacturers as single-chip devices called "codecs" (for "coder/decoder"). Intended for use on digital communication links, these generally provide a serial output bit-stream, which should be converted to parallel by a shift register if the data is intended for a computer. Because of the potentially vast market for codecs in telecommunications, they are made in great quantities and are consequently very cheap. Estimates of the speech quality necessary for telephone applications indicate that somewhat less than this accuracy is needed \(em 7-bit logarithmic encoding was used in early digital communications links, and it may be that even 6 bits are adequate. However, during the transition period when digital networks must coexist with the present analogue one, it is anticipated that a particular telephone call may have to pass through several links, some using analogue technology and some being digital. The possibility of several successive encodings and decodings has led telecommunications engineers to standardize on 8-bit representations, leaving some margin before additional degradation of signal quality becomes unduly distracting. .pp Unfortunately, world telecommunications authorities cannot agree on a single standard for logarithmic encoding. 
The A-law, which we have described, is the European standard, but there is another system, called the $mu$-law, which is used universally in North America. It also is available in single-chip form with an 8-bit code. It has very similar quantization error characteristics to the A-law, and would be indistinguishable from it on the scale of Figure 3.6. .rh "The pre-sampling filter." Now that we have some idea of the accuracy requirements for quantization, let us discuss quantitative specifications for the pre-sampling filter. Figure 3.7 sketches the characteristics of this filter. .FC "Figure 3.7" Assume a sampling frequency of 8\ kHz and a range of interest from 0 to 3.4\ kHz. Although all components at frequencies above 4\ kHz will fold back into the 0\ \-\ 4\ kHz baseband, those below 4.6\ kHz fold back above 3.4\ kHz and are therefore outside the range of interest. This gives a "guard band" between 3.4 and 4.6\ kHz which separates the passband from the stopband. The filter should transmit negligible components in the stopband above 4.6\ kHz. To reduce the harmonic distortion caused by aliasing to the same level as the quantization noise in 11-bit linear conversion, the stopband attenuation should be around \-68\ dB (the signal-to-noise ratio for a full-scale sine wave). Passband ripple is not so critical, for two reasons. First, whilst the presence of aliased components means that information has been lost about the frequency components within the range of interest, passband ripple does not actually cause a loss of information but only a distortion, and could, if necessary, be compensated by a suitable filter acting on the digitized waveform. Secondly, distortion of the passband spectrum is not nearly so audible as the frequency images caused by aliasing. Hence one usually aims for a passband ripple of around 0.5\ dB. .pp The pass and stopband targets we have mentioned above can be achieved with a 9'th order elliptic filter. While such a filter is often used in high-quality signal-processing systems, for telephone-quality speech much less stringent specifications seem to be sufficient. Figure 3.8, for example, shows a template which has been recommended by telecommunications authorities. .FC "Figure 3.8" A 5'th order elliptic filter can easily meet this specification. Such filters, implemented by switched-capacitor means, are available in single-chip form. Integrated CCD (charge-coupled device) filters which meet the same specification are also marketed. Indeed, some codecs provide input filtering on the same chip as the A/D converter. .pp Instead of implementing a filter by analogue means to meet the aliasing specifications, digital filtering can be used. A high sample-rate A/D converter, operating at, say, 32\ kHz, and preceded by a very simple low-pass pre-sampling filter, is followed by a digital filter which meets the desired specification, and its output is subsampled to provide an 8\ kHz sample rate. While such implementations may be economic where a multichannel digitizing capability is required, as in local telephone exchanges where the subscriber connection is an analogue one, they are unlikely to prove cost-effective for a single channel. .rh "Reconstructing the analogue waveform." Once a signal has been digitized and stored, it needs to be passed through a D/A converter (digital-to-analogue) and a low-pass filter when it is replayed. D/A converters are cheaper than A/D converters, and the characteristics of the low-pass filter for output can be the same as those for input.
However, the desampling operation introduces an additional distortion, which has an effect on the component at frequency $f$ of .LB .EQ { sin ( pi f/f sub s )} over { pi f/f sub s } ~ , .EN .LE where $f sub s$ is the sampling frequency. An "aperture correction" filter is needed to compensate for this, although many systems simply do without it. Such a filter is sometimes incorporated into the codec chip. .rh "Summary." For telephone-quality speech, existing codec chips, coupled if necessary with integrated pre-sampling filters, can be used, at a remarkably low cost. For higher-quality speech storage the analogue interface can become quite complex. A comprehensive study of the problems as they relate to digitization of audio, which demands much greater fidelity than speech, has been made by Blesser (1978). .[ Blesser 1978 .] He notes the following sources of error (amongst others): .LB .NP slew-rate distortion in the pre-sampling filter for signals at the upper end of the audio band; .NP insufficient filtering of high-frequency input signals; .NP noise generated by the sample-and-hold amplifier or pre-sampling filter; .NP acquisition errors because of the finite settling time of the sample-and-hold circuit; .NP insufficient settling time in the A/D conversion; .NP errors in the quantization levels of the A/D and D/A converters; .NP noise in the converters; .NP jitter on the clock used for timing input or output samples; .NP aperture distortion in the output sampler; .NP noise in the output filter as a result of limited dynamic range of the integrated circuits; .NP power-supply noise injection or ground coupling; .NP changes in characteristics as a result of temperature or ageing. .LE Care must be taken with the analogue interface to ensure that the precision implied by the resolution of the A/D and D/A converters is not compromised by inadequate analogue circuitry. It is especially important to eliminate high-frequency noise caused by fast edges on nearby computer buses. .sh "3.2 Coding in the time domain" .pp There are several methods of coding the time waveform of a speech signal to reduce the data rate for a given signal-to-noise ratio, or alternatively to reduce the signal-to-noise ratio for a given data rate. They almost all require more processing, both at the encoding (for storage) and decoding (for regeneration) ends of the digitization process. They are sometimes used to economize on memory in systems using stored speech, for example the System\ X telephone exchange and the travel consultant described in Chapter 1, and so will be described here. However, it is to be expected that simple time-domain coding techniques will be superseded by the more complex linear predictive method, which is covered in Chapter 6, because this can give a much more substantial reduction in the data rate for only a small degradation in speech quality. Hence the aim of this section is to introduce the ideas in a qualitative way: theoretical development and summaries of results of listening tests can be found elsewhere (eg Rabiner and Schafer, 1978). .[ Rabiner Schafer 1978 Digital processing of speech signals .] The methods we will examine are summarized in Table 3.2. 
.RF .nr x0 \w'linear PCM 'u .nr x1 \n(x0+\w' adaptive quantization, or adaptive prediction,'u .nr x2 (\n(.l-\n(x1)/2 .in \n(x2u .ta \n(x0u \l'\n(x1u\(ul' .sp linear PCM linearly-quantized pulse code modulation .sp log PCM logarithmically-quantized pulse code modulation (instantaneous companding) .sp APCM adaptively quantized pulse code modulation (usually syllabic companding) .sp DPCM differential pulse code modulation .sp ADPCM differential pulse code modulation with either adaptive quantization, or adaptive prediction, or both .sp DM delta modulation (1-bit DPCM) .sp ADM delta modulation with adaptive quantization \l'\n(x1u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 3.2 Time-domain encoding techniques" .rh "Syllabic companding." We have already studied one time-domain encoding technique, namely logarithmic quantization, or log PCM (sometimes called "instantaneous companding"). A more sophisticated encoder could track slowly varying trends in the overall amplitude of the speech signal and use this information to adjust the quantization levels dynamically. Speech coding methods based on this principle are called adaptive pulse code modulation systems (APCM). Because the overall amplitude changes slowly, it is sufficient to adjust the quantization relatively infrequently (compared with the sampling rate), and this is often done at rates approximating the syllable rate of running speech, leading to the term "syllabic companding". A block floating-point format can be used, with a common exponent being stored every M samples (with M, say, 125 for a 100\ msec block rate at 8\ kHz sampling), but the mantissa being stored at the regular sample rate. The overall energy in the block, .LB $sum from n=h to h+M-1 ~x(n) sup 2$ ($M = 125$, say), .LE is used to determine a suitable exponent, and every sample in the block \(em namely $x(h)$, $x(h+1)$, ..., $x(h+M-1)$ \(em is scaled according to that exponent. Note that for speech transmission systems this method necessitates a delay of $M$ samples at the encoder, and indeed some methods base the exponent on the energy in the last block to avoid this. For speech storage, however, the delay is irrelevant. A rather different, nonsyllabic, method of adaptive PCM is continually to change the step size of a uniform quantizer, by multiplying it by a constant at each sample which is based on the magnitude of the previous code word. .pp Adaptive quantization exploits information about the amplitude of the signal, and, as a rough generalization, yields a reduction of one bit per sample in the data rate for telephone-quality speech over ordinary logarithmic quantization, for a given signal-to-noise ratio. Alternatively, for the same data rate an improvement of 6\ dB in signal-to-noise ratio can be obtained. Some results for actual schemes are given by Rabiner and Schafer (1978). .[ Rabiner Schafer 1978 Digital processing of speech signals .] However, there is other information in the time waveform of speech, namely, the sample-to-sample correlation, which can be exploited to give further reductions. .rh "Differential coding." Differential pulse code modulation (DPCM), in its simplest form, uses the present speech sample as a prediction of the next one, and stores the prediction error \(em that is, the sample-to-sample difference. This is a simple case of predictive encoding. 
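.pp
To make the scheme concrete, here is a minimal sketch of first-order differential encoding and decoding, written in Python purely for illustration; the function names and the step size are our own choices and correspond to no particular system described in this chapter. As is usual in practice, the encoder predicts from the value the decoder will reconstruct, so that quantization errors do not accumulate.
.LB
.nf
# First-order DPCM: store a quantized version of the difference between
# each sample and the prediction (the previously reconstructed sample).
def dpcm_encode(samples, step=64):
    codes = []
    prediction = 0
    for x in samples:
        code = int(round((x - prediction) / step))   # crude uniform quantizer
        codes.append(code)
        prediction += code * step                    # what the decoder will hold
    return codes

def dpcm_decode(codes, step=64):
    samples = []
    value = 0
    for code in codes:
        value += code * step
        samples.append(value)
    return samples
.fi
.LE
Because the differences cover a smaller range than the samples themselves, the codes can be held in fewer bits for a given accuracy.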
Referring back to the speech waveform displayed in Figure 3.5, it seems plausible that the data rate can be reduced by transmitting the difference between successive samples instead of their absolute values: fewer bits are required for the difference signal for a given overall accuracy because it does not assume such extreme values as the absolute signal level. Actually, the improvement is not all that great \(em about 4\ \-\ 5\ dB in signal-to-noise ratio, or just under one bit per sample for a given signal-to-noise ratio \(em for the difference signal can be nearly as large as the absolute signal level. .pp If DPCM is used in conjunction with adaptive quantization, giving one form of adaptive differential pulse code modulation (ADPCM), both the overall amplitude variation and the sample-to-sample correlation are exploited, leading to a combined gain of 10\ \-\ 11\ dB in signal-to-noise ratio (or just under two bits reduction per sample for telephone-quality speech). Another form of adaptation is to alter the predictor by multiplying the previous sample value by a parameter which is adjusted for best performance. Then the transmitted signal at time $n$ is .LB .EQ e(n) ~~ = ~~ x(n)~ - ~ax(n-1), .EN .LE where the parameter $a$ is adapted (and stored) on a syllabic time-scale. This leads to a slight improvement in signal-to-noise ratio, which can be combined with that achieved by adaptive quantization. Much more substantial benefits can be realized by using a weighted sum of the past several (up to 15) speech samples, and adapting all the weights. This is the basic idea of linear prediction, which is developed in Chapter 6. .rh "Delta modulation." The coding methods presented so far all increase the complexity of the analogue-to-digital interface (or, if the sampled waveform is coded digitally, they increase the processing required before and after storage). One method which considerably .ul simplifies the interface is the limiting case of DPCM with just 1-bit quantization. Only the sign of the difference between the current and last values is transmitted. Figure 3.9 shows the conversion hardware. .FC "Figure 3.9" The encoding part is essentially the same as a tracking A/D converter, where the value in a counter is forced to track the analogue input by incrementing or decrementing the counter according as the input exceeds or falls short of the analogue equivalent of the counter's contents. However, for this encoding scheme, called "delta modulation", the increment-decrement signal itself forms the discrete representation of the waveform, instead of the counter's contents. The analogue waveform can be reconstituted from the bit stream with another counter and D/A converter. Alternatively, an all-analogue implementation can be used, both for the encoder and decoder, with a capacitor as integrator whose charging current is controlled digitally. This is a much cheaper realization. .pp It is fairly obvious that the sampling frequency for delta modulation will need to be considerably higher than for straightforward PCM. Figure 3.10 shows an effect called "slope overload" which occurs when the sampling rate is too low. .FC "Figure 3.10" Either a higher sample rate or a larger step size will reduce the overload; however, larger steps increase the noise level of the alternate 1's and \-1's that occur when no input is present \(em called "granular noise". A compromise is necessary between slope overload and granular noise for a given bit rate.
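.pp
The digital part of this arrangement is small enough to sketch in a few lines of Python (a toy illustration with an arbitrary fixed step size, not a model of any particular chip). The encoder keeps a running estimate of the input and transmits only the sign of the error; the decoder integrates the bit stream in exactly the same way.
.LB
.nf
# Delta modulation: one bit per sample, the sign of the difference between
# the input and a running estimate of it.  Too small a step gives slope
# overload; too large a step gives granular noise in quiet passages.
def dm_encode(samples, step=32):
    bits = []
    estimate = 0
    for x in samples:
        bit = 1 if x >= estimate else 0
        bits.append(bit)
        estimate += step if bit else -step    # the counter tracking the input
    return bits

def dm_decode(bits, step=32):
    samples = []
    estimate = 0
    for bit in bits:
        estimate += step if bit else -step
        samples.append(estimate)
    return samples
.fi
.LE
The adaptive variant described below does nothing more than let the step size grow when successive bits agree and shrink when they alternate.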
Delta modulation results in lower data rates than logarithmic quantization for a given signal-to-noise ratio if that ratio is low (poor-quality speech). As the desired speech quality is increased, its data rate grows faster than that of logarithmic PCM. The crossover point occurs at a quality well below that of telephone speech, and so although delta modulation is used for some applications where the permissible data rate is severely constrained, it is not really suitable for speech output from computers. .pp It is profitable to adjust the step size, leading to .ul adaptive delta modulation. A common strategy is to increase or decrease the step size by a multiplicative constant, which depends on whether the new transmitted bit will be equal to or different from the last one. That is, .LB "nnnn" .NI "nn" $stepsize(n+1) = stepsize(n) times 2$ if $x(n+1)>x(n)>x(n-1)$ or $x(n+1)<x(n)<x(n-1)$ .br (slope overload condition); .NI "nn" $stepsize(n+1) = stepsize(n)/2$ if $x(n+1),~x(n-1)<x(n)$ or $x(n+1),~x(n-1)>x(n)$ .br (granular noise condition). .LE "nnnn" Despite these adaptive equations, the step size should be constrained to lie between a predetermined fixed maximum and minimum, to prevent it from becoming so large or so small that rapid accommodation to changing input signals is impossible. Then, in a period of potential slope overload the step size will grow, preventing overload, possibly reaching its maximum value, at which point overload may resume. In a quiet period it will decrease to its minimum value, which determines the granular noise in the idle condition. Note that the step size need not be stored, for it can be deduced from the bit changes in the digitized data. Although adaptation improves the performance of delta modulation, it is still inferior to PCM at telephone qualities. .rh "Summary." It seems that ADPCM, with adaptive quantization and adaptive prediction, can provide a worthwhile advantage for speech storage, reducing the number of bits needed per sample of telephone-quality speech from 7 for logarithmic PCM to perhaps 5, and the data rate from 56\ Kbit/s to 40\ Kbit/s. Disadvantages are additional complexity in the encoding and decoding processes, and the fact that byte-oriented storage, with 8 bits/sample in logarithmic PCM, is more convenient for computer use. For low-quality speech where hardware complexity is to be minimized, adaptive delta modulation could prove worthwhile \(em although the ready availability of PCM codec chips reduces the cost advantage. .sh "3.3 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "3.4 Further reading" .pp Probably the best single reference on time-domain coding of speech is the book by Rabiner and Schafer (1978), cited above. However, this does not contain a great deal of information on practical aspects of the analogue-to-digital conversion process; this is covered by Blesser (1978) above, who is especially interested in high-quality conversion for digital audio applications, and Garrett (1978) below. There are many textbooks in the telecommunications area which are relevant to the subject of the chapter, although they concentrate primarily on fundamental theoretical aspects rather than the practical application of the technology. .LB "nn" .\"Cattermole-1969-1 .]- .ds [A Cattermole, K.W. .ds [D 1969 .ds [T Principles of pulse code modulation .ds [I Iliffe .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is a standard, definitive, work on PCM, and provides a good grounding in the theory. It goes into the subject in much more depth than we have been able to here.
.in-2n .\"Garrett-1978-1 .]- .ds [A Garrett, P.H. .ds [D 1978 .ds [T Analog systems for microprocessors and minicomputers .ds [I Reston Publishing Company .ds [C Reston, Virginia .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Garrett discusses the technology of data conversion systems, including A/D and D/A converters and basic analogue filter design, in a clear and practical manner. .in-2n .\"Inose-1979-2 .]- .ds [A Inose, H. .ds [D 1979 .ds [T An introduction to digital integrated communications systems .ds [I Peter Peregrinus .ds [C Stevenage, England .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Inose's book is a recent one which covers the whole area of digital transmission and switching technology. It gives a good idea of what is happening to the telephone networks in the era of digital communications. .in-2n .\"Steele-1975-3 .]- .ds [A Steele, R. .ds [D 1975 .ds [T Delta modulation systems .ds [I Pentech Press .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Again a standard work, this time on delta modulation techniques. Steele gives an excellent and exhaustive treatment of the subject from a communications viewpoint. .in-2n .LE "nn" .EQ delim $$ .EN .CH "4 SPEECH ANALYSIS" .ds RT "Speech analysis .ds CX "Principles of computer speech .pp Digital recordings of speech provide a jumping-off point for further processing of the audio waveform, which is usually necessary for the purpose of speech output. It is difficult to synthesize natural sounds by concatenating individually-spoken words. Pitch is perhaps the most perceptually significant contextual effect which must be taken into account when forming connected speech out of isolated words. The intonation of an utterance, which manifests itself as a continually changing pitch, is a holistic property of the utterance and not the sum of components determined by the individual words alone. Happily, and quite coincidentally, communications engineers in their quest for reduced-bandwidth telephony have invented methods of coding speech that separate the pitch information from that carried by the articulation. .pp Although these analysis techniques, which were first introduced in the late 1930's (Dudley, 1939), were originally implemented by analogue means \(em and in many systems still are (Blankenship, 1978, describes a recent switched-capacitor realization) \(em there is a continuing trend towards digital implementations, particularly for the more sophisticated coding schemes. .[ Dudley 1939 .] .[ Blankenship 1978 .] It is hard to see how the technique of linear prediction of speech, which is described in detail in Chapter 6, could be accomplished in the absence of digital processing. Some groundwork is laid for the theory of digital signal analysis in this chapter. The ideas are not presented in a formal, axiomatic way; but are developed as and when they are needed to examine some of the structures that turn out to be useful in speech processing. .pp Most speech analysis views speech according to the source-filter model which was introduced in Chapter 2, and aims to separate the effects of the source from those of the filter. The frequency spectrum of the vocal tract filter is of great interest, and the technique of discrete Fourier transformation is discussed in this chapter. For many purposes it is better to extract the formant frequencies from the spectrum and use these alone (or in conjunction with their bandwidths) to characterize it. 
As far as the signal source in the source-filter model is concerned, its most interesting features are pitch and amplitude \(em the latter being easy to estimate. Hence we go on to look at pitch extraction. Related to this is the problem of deciding whether a segment of speech has voiced or unvoiced excitation, or both. .pp Estimating formant and pitch parameters is one of the messiest areas of speech processing. There is a delightful paper which points this out (Schroeder, 1970), entitled "Parameter estimation in speech: a lesson in unorthodoxy". .[ Schroeder 1970 .] It emphasizes that the most successful estimation procedures "have often relied on intuition based on knowledge of speech signals and their production in the human vocal apparatus rather than routine applications of well-established theoretical methods". Fortunately, the emphasis of the present book is on speech .ul output, which involves parameter estimation only in so far as it is needed to produce coded speech for storage, and to illuminate the acoustic nature of speech for the development of synthesis by rule from phonetics or text. Hence the many methods of formant and pitch estimation are treated rather cursorily and qualitatively here: our main interest is in how to .ul use such information for speech output. .pp If the incoming speech can be analysed into its formant frequencies, amplitude, excitation mode, and pitch (if voiced), it is quite easy to resynthesize it directly from these parameters. Speech synthesizers are described in the next chapter. They can be realized in either analogue or digital hardware, the former being predominant in production systems and the latter in research systems \(em although, as in other areas of electronics, the balance is changing in favour of digital implementations. .sh "4.1 The channel vocoder" .pp A direct representation of the frequency spectrum of a signal can be obtained by a bank of bandpass filters. This is the basis of the .ul channel vocoder, which was the first device that attempted to take advantage of the source-filter model for speech coding (Dudley, 1939). .[ Dudley 1939 .] The word "vocoder" is a contraction of .ul vo\c ice .ul coder. The energy in each filter band is estimated by rectification and smoothing, and the resulting approximation to the frequency spectrum is transmitted or stored. The source properties are represented by the type of excitation (voiced or unvoiced), and if voiced, the pitch. It is not necessary to include the overall amplitude of the speech explicitly, because this is conveyed by the energy levels from the separate bandpass filters. .pp Figure 4.1 shows the encoding part of a channel vocoder which has been used successfully for many years (Holmes, 1980). .[ Holmes 1980 JSRU channel vocoder .] .FC "Figure 4.1" We will discuss the block labelled "pre-emphasis" shortly. The shape of the spectrum is estimated by 19 bandpass filters, whose spacing and bandwidth decrease slightly with decreasing frequency to obtain the rather greater resolution that is needed in the lower frequency region, as shown in Table 4.1. 
.RF .nr x0 4n+2.6i+\w'\0\0'u+(\w'bandwidth'/2) .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 4n +1.3i +1.3i \l'\n(x0u\(ul' .sp .nr x1 (\w'channel'/2) .nr x2 (\w'centre'/2) .nr x3 (\w'analysis'/2) \0\h'-\n(x1u'channel \0\h'-\n(x2u'centre \0\0\h'-\n(x3u'analysis .nr x1 (\w'number'/2) .nr x2 (\w'frequency'/2) .nr x3 (\w'bandwidth'/2) \0\h'-\n(x1u'number \0\0\h'-\n(x2u'frequency \0\0\h'-\n(x3u'bandwidth .nr x2 (\w'(Hz)'/2) \0\h'-\n(x2u'(Hz) \0\0\h'-\n(x2u'(Hz) \l'\n(x0u\(ul' .sp \01 \0240 \0120 \02 \0360 \0120 \03 \0480 \0120 \04 \0600 \0120 \05 \0720 \0120 \06 \0840 \0120 \07 1000 \0150 \08 1150 \0150 \09 1300 \0150 10 1450 \0150 11 1600 \0150 12 1800 \0200 13 2000 \0200 14 2200 \0200 15 2400 \0200 16 2700 \0200 17 3000 \0300 18 3300 \0300 19 3750 \0500 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 4.1 Filter specifications for a vocoder analyser (after Holmes, 1980)" .[ Holmes 1980 JSRU channel vocoder .] The 3\ dB points of adjacent filters are halfway between their centre frequencies, so that there is some overlap between bands. The filter characteristics do not need to have very sharp edges, because the energy in neighbouring bands is fairly highly correlated. Indeed, there is a disadvantage in making them too sharp, because the phase delays associated with sharp cutoff filters induce "smearing" of the spectrum in the time domain. This particular channel vocoder uses second-order Butterworth bandpass filters. .pp For regenerating speech stored in this way, an excitation of unit impulses at the specified pitch period (for voiced sounds) or white noise (for unvoiced sounds) is produced and passed through a bank of bandpass filters similar to the analysis ones. The excitation has a flat spectrum, for regular impulses have harmonics at multiples of the repetition frequency which are all of the same size, and so the spectrum of the output signal is completely determined by the filter bank. The gain of each filter is controlled by the stored magnitude of the spectrum at that frequency. .pp The frequency spectrum and voicing pitch of speech change at much slower rates than the time waveform. The changes are due to movements of the articulatory organs (tongue, lips, etc) in the speaker, and so are limited in their speed by physical constraints. A typical rate of production of phonemes is 15 per second, but in fact the spectrum can change quite a lot within a single phoneme (especially a stop sound). Between 10 and 25\ msec (100\ Hz and 40\ Hz) is generally thought to be a satisfactory interval for transmitting or storing the spectrum, to preserve a reasonably faithful representation of the speech. Of course, the entire spectrum, as well as the source characteristics, must be stored at this rate. The channel vocoder described by Holmes (1980) uses 48 bits to encode the information. .[ Holmes 1980 JSRU channel vocoder .] Repeated every 20\ msec, this gives a data rate of 2400\ bit/s \(em very considerably less than any of the time-domain encoding techniques. .pp It needs some care to encode the output of 19 filters, the excitation type, and the pitch into 48 bits of information. Holmes uses 6 bits for pitch, logarithmically encoded, and one bit for excitation type. This leaves 41 bits to encode the output of the 19 filters, and so a differential technique is used which transmits just the difference between adjacent channels \(em for the spectrum does not change abruptly in the frequency domain. 
Three bits are used for the absolute level in channel 1, and two bits for each channel-to-channel difference, giving a total of 39 bits for the whole spectrum. The remaining two bits per frame are reserved for signalling or monitoring purposes. .pp A 2400 bit/s channel vocoder degrades the speech in a telephone channel quite perceptibly. It is sufficient for interactive communication, where if you do not understand something you can always ask for it to be repeated. It is probably not good enough for most voice response applications. However, the vocoder principle can be used with larger filter banks and much higher bit rates, and still reduce the data rate substantially below that required by log PCM. .sh "4.2 Pre-emphasis" .pp There is an overall \-6\ dB/octave trend in speech radiated from the lips, as frequency increases. We will discuss why this is so in the next chapter. Notice that this trend means that the signal power is reduced by a factor of 4, or the signal amplitude by a factor of 2, for each doubling in frequency. For vocoders, and indeed for other methods of spectral analysis of speech, it is usually desirable to equalize this by a +6\ dB/octave lift prior to processing, so that the channel outputs occupy a similar range of levels. On regeneration, the output speech is passed through an inverse filter which provides 6\ dB/octave of attenuation. .pp For a digital system, such pre-emphasis can either be implemented as an analogue circuit which precedes the presampling filter and digitizer, or as a digital operation on the sampled and quantized signal. In the former case, the characteristic is usually flat up to a certain breakpoint, which occurs somewhere between 100\ Hz and 1\ kHz \(em the exact position does not seem to be critical \(em at which point the +6\ dB/octave lift begins. Although de-emphasis on output ought to have an exactly inverse characteristic, it is sometimes modified or even eliminated altogether in an attempt to counteract approximately the $sin( pi f/f sub s )/( pi f/f sub s )$ distortion introduced by the desampling operation, which was discussed in an earlier section. Above half the sampling frequency, the characteristic of the pre-emphasis is irrelevant because any effect will be suppressed by the presampling filter. .pp The effect of a 6\ dB/octave lift can also be achieved digitally, by differencing the input. The operation .LB .EQ y(n)~~ = ~~ x(n)~ -~ ax(n-1) .EN .LE is suitable, where the constant parameter $a$ is usually chosen between 0.9 and 1. The latter value gives straightforward differencing, and this amounts to creating a DPCM signal as input to the spectral analysis. Figure 4.2 plots the frequency response of this operation, with a sample frequency of 8\ kHz, for two values of the parameter; together with that of a 6\ dB/octave lift above 100\ Hz. .FC "Figure 4.2" The vertical positions of the plots have been adjusted to give the same gain, 20\ dB, at 1\ kHz. The difference at 3.4\ kHz, the upper end of the telephone spectrum, is just over 2\ dB. At frequencies below the breakpoint, in this case 100\ Hz, the difference between analogue and digital pre-emphasis can be very great. For $a=0.9$ the attenuation at DC (zero frequency) is 18\ dB below that at 1\ kHz, which happens to be close to that of the analogue filter for frequencies below the breakpoint. However, if the breakpoint had been at 1\ kHz there would have been 20\ dB difference between the analogue and $a=0.9$ plots at DC.
And of course the $a=1$ characteristic has infinite attenuation at DC. In practice, however, the exact form of the pre-emphasis does not seem to be at all critical. .pp The above remarks apply only to voiced speech. For unvoiced speech there appears to be no real need for pre-emphasis; indeed, it may do harm by reinforcing the already large high-frequency components. There is a case for altering the parameter $a$ according to the excitation mode of the speech: $a=1$ for voiced excitation and $a=0$ for unvoiced gives pre-emphasis just when it is needed. This can be achieved by expressing the parameter in terms of the autocorrelation of the incoming signal, as .LB .EQ a ~~ = ~~ R(1) over R(0) ~ , .EN .LE where $R(1)$ is the correlation of the signal with itself delayed by one sample, and $R(0)$ is the correlation without delay (that is, the signal variance). This is reasonable intuitively because high sample-to-sample correlation is to be expected in voiced speech, so that $R(1)$ is very nearly as great as $R(0)$ and the ratio becomes 1; whereas little or no sample-to-sample correlation will be present in unvoiced speech, making the ratio close to 0. Such a scheme is reminiscent of ADPCM with adaptive prediction. .pp However, this sophisticated pre-emphasis method does not seem to be worthwhile in practice. Usually the breakpoint in an analogue pre-emphasis filter is chosen to be rather greater than 100\ Hz to limit the amplification of fricative energy. In fact, the channel vocoder described by Holmes (1980) has the breakpoint at 1\ kHz, limiting the gain to 12\ dB at 4\ kHz, two octaves above. .[ Holmes 1980 JSRU channel vocoder .] .sh "4.3 Digital signal analysis" .pp You may be wondering how the frequency response for the digital pre-emphasis filters, displayed in Figure 4.2, can be calculated. Suppose a digitized sinusoid is applied as input to the filter .LB .EQ y(n) ~~ = ~~ x(n)~ - ~ax(n-1). .EN .LE A sine wave of frequency $f$ has equation $x(t) ~ = ~ sin ~ 2 pi ft$, and when sampled at $t=0,~ T,~ 2T,~ ...$ (where $T$ is the sampling interval, 125\ $mu$sec for an 8\ kHz sample rate), this becomes $x(n) ~ = ~ sin ~ 2 pi fnT.$ It is much more convenient to consider a complex exponential input, $e sup { j2 pi fnT}$ \(em the response to a sinusoid can then be derived by taking imaginary parts, if necessary. The output for this input is .LB .EQ y(n) ~~ = ~~ e sup {j2 pi fnT} ~~-~ae sup {j2 pi f(n-1)T} ~~ = ~~ (1~-~ae sup {-j2 pi fT} )~e sup {j2 pi fnT} , .EN .LE a sinusoid at the same frequency as the input. The factor $1~-~ae sup {-j2 pi fT}$ is complex, with both amplitude and phase components. Thus the output will be a phase-shifted and scaled version of the input. The amplitude response at frequency $f$ is therefore .LB .EQ |1~ - ~ ae sup {-j2 pi fT} | ~~ = ~~ [1~ +~ a sup 2 ~-~ 2a~cos~2 pi fT ] sup 1/2 , .EN .LE or .LB .EQ 10 ~ log sub 10 (1~ +~ a sup 2 ~ - ~ 2a~ cos 2 pi fT) .EN dB. .LE Normalizing to 20\ dB at 1\ kHz, and assuming 8\ kHz sampling, yields .LB .EQ 20~ + ~~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ { pi f} over 4000 ) ~~ -~ 10~ log sub 10 (1~ +~ a sup 2 ~-~ 2a~ cos ~ pi over 4 ) .EN dB. .LE With $a=0.9$ and 1 this gives the graphs of Figure 4.2. .pp Frequency responses for analogue filters are often plotted with a logarithmic frequency scale, as well as a logarithmic amplitude one, to bring out the asymptotes in dB/octave as straight lines. For digital filters the response is usually drawn on a .ul linear frequency axis extending to half the sampling frequency.
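.pp
The normalized response expression just derived is easy to evaluate numerically. The short Python sketch below is for illustration only: the function names are ours, the analogue lift is modelled as a simple first-order characteristic with its breakpoint at 100\ Hz, and 8\ kHz sampling is assumed. It gives figures in line with those quoted above for $a=0.9$.
.LB
.nf
import math

FS = 8000.0     # sampling frequency assumed throughout this chapter

def digital_gain_db(f, a):
    # 10 log10(1 + a*a - 2a cos(2 pi f T)), before normalization
    return 10.0 * math.log10(1.0 + a * a - 2.0 * a * math.cos(2.0 * math.pi * f / FS))

def analogue_gain_db(f, breakpoint=100.0):
    # a first-order +6 dB/octave lift above the breakpoint
    return 10.0 * math.log10(1.0 + (f / breakpoint) ** 2)

def normalized_digital(f, a):
    return 20.0 + digital_gain_db(f, a) - digital_gain_db(1000.0, a)

def normalized_analogue(f):
    return 20.0 + analogue_gain_db(f) - analogue_gain_db(1000.0)

for f in (0.0, 100.0, 1000.0, 3400.0):
    print(f, round(normalized_digital(f, 0.9), 1), round(normalized_analogue(f), 1))
.fi
.LE
On the linear frequency axis just mentioned, such a plot need only extend to half the sampling frequency, 4\ kHz here.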
The response is symmetric about this point. .pp Analyses like the above are usually expressed in terms of the $z$-transform. Denote the unit delay operation by $z sup -1$. The choice of the inverse rather than $z$ itself is of course an arbitrary matter, but the convention has stuck. Then the filter can be characterized by Figure 4.3, which signifies that the output is the input minus a delayed and scaled version of itself. .FC "Figure 4.3" The transfer function of the filter is .LB .EQ H(z) ~~ = ~~ 1~ -~ az sup -1 , .EN .LE and we have seen that the effect of the system on a (complex) exponential of frequency $f$ is to multiply it by .LB .EQ 1~ -~ ae sup {-j2 pi fT}. .EN .LE To get the frequency response from the transfer function, replace $z sup -1$ by $e sup {-j2 pi fT}$. Amplitude and phase responses can then be found by taking the modulus and angle of the complex frequency response. .pp If $z sup -1$ is treated as an .ul operator, it is quite in order to summarize the action of the filter by .LB .EQ y(n) ~~ = ~~ x(n)~ - ~az sup -1 x(n) ~~ = ~~ (1~ -~ az sup -1 )x(n). .EN .LE However, it is usual to derive from the sequence $x(n)$ a .ul transform $X(z)$ upon which $z sup -1$ acts as a .ul multiplier. If the transform of $x(n)$ is defined as .LB .EQ X(z) ~~ = ~~ sum from {n=- infinity} to infinity ~x(n) z sup -n , .EN .LE then on multiplication by $z sup -1$ we get a new transform, say $V(z)$: .LB .EQ V(z) ~~ = ~~ z sup -1 X(z) ~~ = ~~ z sup -1 sum from {n=- infinity} to infinity ~x(n) z sup -n ~~ = ~~ sum ~x(n)z sup -n-1 ~~ = ~~ sum ~x(n-1)z sup -n . .EN .LE $V(z)$ can also be expressed as the transform of a new sequence, say $v(n)$, by .LB .EQ V(z) ~~ = ~~ sum from {n=- infinity} to infinity ~v(n) z sup -n , .EN .LE from which it becomes apparent that .LB .EQ v(n) ~~ = ~~ x(n-1). .EN .LE Thus $v(n)$ is a delayed version of $x(n)$, and we have accomplished what we set out to do, namely to show that the delay .ul operator $z sup -1$ can be treated as an ordinary .ul multiplier in the $z$-transform domain, where $z$-transforms are defined as the infinite sums given above. .pp In terms of $z$-transforms, the filter can be written .LB .EQ Y(z) ~~ = ~~ (1~ -~ az sup -1 )X(z), .EN .LE where $z sup -1$ is now treated as a multiplier. The transfer function of the filter is .LB .EQ H(z) ~~ = ~~ Y(z) over X(z) ~~ = ~~ 1 - az sup -1 , .EN .LE the ratio of the output to the input transform. .pp It may seem that little has been gained by inventing this rather abstract notion of transform, simply to change an operator to a multiplier. After all, the equation of the filter is no simpler in the transform domain than it was in the time domain using $z sup -1$ as an operator. However, we will need to go on to examine more complex filters. Consider, for example, the transfer function .LB .EQ H(z) ~~ = ~~ {1~+~az sup -1 ~+~bz sup -2} over {1~+~cz sup -1 ~+~dz sup -2} ~ . .EN .LE If $z sup -1$ is treated as an operator, it is not immediately obvious how this transfer function can be realized by a time-domain recurrence relation. However, with $z sup -1$ as an ordinary multiplier in the transform domain, we can make purely mechanical manipulations with infinite sums to see what the transfer function means as a recurrence relation. .pp It is worth noting the similarity between the $z$-transform in the discrete domain and the Fourier and Laplace transforms in the continuous domains. 
In fact, the $z$-transform plays an analogous role in digital signal processing to the Laplace transform in continuous theory, for the delay operator $z sup -1$ performs a similar service to the differentiation operator $s$. Recall first the continuous Fourier transform, .LB $ G(f) ~~ = ~~ integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt $, where $f$ is real, .LE and the Laplace transform, .LB $ F(s) ~~ = ~~ integral from 0 to infinity ~f(t)~e sup -st dt $, where $s$ is complex. .LE The main difference between these two transforms is that the range of integration begins at -$infinity$ for the Fourier transform and at 0 for the Laplace. Advocates of the Fourier transform, which typically include people involved with telecommunications, enjoy the freedom from initial conditions which is bestowed by an origin way back in the mists of time. Advocates of Laplace, including most analogue filter theorists, invariably consider systems where all is quiet before $t=0$ \(em altering the origin of measurement of time to achieve this if necessary \(em and welcome the opportunity to include initial conditions explicitly .ul without having to worry about what happens in the mists of time. Although there is a two-sided Laplace transform where the integration begins at -$infinity$, it is not generally used because it causes some convergence complications. Ignoring this difference between the transforms (by considering signals which are zero when $t<0$), the Fourier spectrum can be found from the Laplace transform by writing $s=j2 pi f$; that is, by considering values of $s$ which lie on the imaginary axis. .pp The $z$-transform is .LB $ H(z) ~~ = ~~ sum from n=0 to infinity ~h(n)~z sup -n $, or $ H(z) ~~ = ~~ sum from {n=- infinity} to infinity ~h(n)~z sup -n , $ .LE depending on whether a one-sided or two-sided transform is used. The advantages and disadvantages of one- and two-sided transforms are the same as in the analogue case. $z$ plays the role of $e sup sT $, and so it is not surprising that the response to a (sampled) sinusoid input can be found by setting .LB .EQ z ~~ = ~~ e sup {j2 pi fT} .EN .LE in $H(z)$, as we proved explicitly above for the pre-emphasis filter. .pp The above relation between $z$ and $f$ means that real-valued frequencies correspond to points where $|z|=1$, that is, the unit circle in the complex $z$-plane. As you travel anticlockwise around this unit circle, starting from the point $z=1$, the corresponding frequency increases from 0, to $1/2T$ half-way round ($z=-1$), to $1/T$ when you get back to the beginning ($z=1$) again. Frequencies greater than the sampling frequency are aliased back into the sampling band, corresponding to further circuits of $|z|=1$ with frequency going from $1/T$ to $2/T$, $2/T$ to $3/T$, and so on. In fact, this is the circle of Figure 3.3 which was used earlier to explain how sampling affects the frequency spectrum! .sh "4.4 Discrete Fourier transform" .pp Let us return from this brief digression into techniques of digital signal analysis to the problem of determining the frequency spectrum of speech. Although a bank of bandpass filters such as is used in the channel vocoder is the perhaps most straightforward way to obtain a frequency spectrum, there are other techniques which are in fact more commonly used in digital speech processing. .pp It is possible to define the Fourier transform of a discrete sequence of points. 
To motivate the definition, consider first the ordinary Fourier transform (FT), which is .LB $ g(t) ~~ = ~~ integral from {- infinity} to infinity ~G(f)~e sup {+j2 pi ft} df ~~~~~~~~~~~~~~~~ G(f) ~~ = ~~ integral from {- infinity} to infinity ~g(t)~e sup {-j2 pi ft} dt . $ .LE This takes a continuous time domain into a continuous frequency domain. Sometimes you see a normalizing factor $1/2 pi$ multiplying the integral in either the forward or the reverse transform. This is only needed when the frequency variable is expressed in radians/s, and we will find it more convenient to express frequencies in\ Hz. .pp The Fourier series (FS), which should also be familiar to you, operates on a periodic time waveform (or, equivalently, one that only exists for a finite period of time, which is notionally extended periodically). If a period lies in the time range $[0,b)$, then the transform is .LB $ g(t) ~~ = ~~ sum from {r = - infinity} to infinity ~G(r)~e sup {+j2 pi rt/b} ~~~~~~~~~~~~~~~~ G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} dt . $ .LE The Fourier series takes a periodic time-domain function into a discrete frequency-domain one. Because of the basic duality between the time and frequency domains in the Fourier transforms, it is not surprising that another version of the transform can be defined which takes a periodic .ul frequency\c -domain function into a discrete .ul time\c -domain one. .pp Fourier transforms can only deal with a finite stretch of a time signal by assuming that the signal is periodic, for if $g(t)$ is evaluated from its transform $G(r)$ according to the formula above, and $t$ is chosen outside the interval $[0,b)$, then a periodic extension of the function $g(t)$ is obtained automatically. Furthermore, periodicity in one domain implies discreteness in the other. Hence if we transform a .ul finite stretch of a .ul discrete time waveform, we get a frequency-domain representation which is also finite (or, equivalently, periodic), and discrete. This is the discrete Fourier transform (DFT), and takes a discrete periodic time-domain function into a discrete periodic frequency-domain one as illustrated in Figure 4.4. .FC "Figure 4.4" It is defined by .LB $ g(n) ~~ = ~~ 1 over N ~ sum from r=0 to N-1~G(r)~e sup { + j2 pi rn/N} ~~~~~~~~~~~~~~~~ G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~e sup { - j2 pi rn/N} , $ .LE or, writing $W=e sup {-j2 pi /N}$, .LB $ g(n) ~~ = ~~ 1 over N ~ sum from r=0 to N-1~G(r)~W sup -rn ~~~~~~~~~~~~~~~~ G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup rn . $ .LE .sp The $1/N$ in the first equation is the same normalizing factor as the $1/b$ in the Fourier series, for the finite time domain is $[0,N)$ in the discrete case and $[0,b)$ in the Fourier series case. It does not matter whether it is written into the forward or the reverse transform, but it is usually placed as shown above as a matter of convention. .pp As illustrated by Figure 4.5, discrete Fourier transforms take an input of $N$ real values, representing equally-spaced time samples in the interval $[0,b)$, and produce as output $N$ complex values, representing equally-spaced frequency samples in the interval $[0,N/b)$. .FC "Figure 4.5" Note that the end-point of this frequency interval is the sampling frequency. It seems odd that the input is real and the output is the same number of .ul complex quantities: we seem to be getting some numbers for nothing! 
However, this isn't so, for it is easy to show that if the input sequence is real, the output frequency spectrum has a symmetry about its mid-point (half the sampling frequency). This can be expressed as .LB DFT symmetry:\0\0\0\0\0\0 $ ~ mark G( half N +r) ~=~ G( half N -r) sup *$ if $g$ is real-valued, .LE where $*$ denotes the conjugate of a complex quantity (that is, $(a+jb) sup * = a-jb$). .pp It was argued above that the frequency spectrum in the DFT is periodic, with the spectrum from 0 to the sampling frequency being repeated regularly up and down the frequency axis. It can easily be seen from the DFT equation that this is so. It can be written .LB DFT periodicity:$ lineup G(N+r) ~=~ G(r)$ always. .LE Figure 4.6 illustrates the properties of symmetry and periodicity. .FC "Figure 4.6" .sh "4.5 Estimating the frequency spectrum of speech using the DFT" .pp Speech signals are not exactly periodic. Although the waveform in a particular pitch period will usually resemble those in the preceding and following pitch periods, it will certainly not be identical to them. As the articulation of the speech changes, the formant positions will alter. As we saw in Chapter 2, the pitch itself is certainly not constant. Hence the fundamental assumption of the DFT, that the waveform is periodic, is not really justified. However, the signal is quasi-periodic, for changes from period to period will not usually be very great. One way of computing the short-term frequency spectrum of speech is to use .ul pitch-synchronous Fourier transformation, where single pitch periods are isolated from the waveform and processed with the DFT. This gives a rather accurate estimate of the spectrum. Unfortunately, it is difficult to determine the beginning and end of each pitch cycle, as we shall see later in this chapter when discussing pitch extraction techniques. .pp If a finite stretch of a speech waveform is isolated and Fourier transformed, without regard to pitch of the speech, then the periodicity assumption will be grossly violated. Figure 4.7 illustrates that the effect is the same as multiplying the signal by a rectangular .ul window function, which is 0 except during the period to be analysed, where it is 1. .FC "Figure 4.7" The windowed sequence will almost certainly have discontinuities at its edges, and these will affect the resulting spectrum. The effect can be analysed quite easily, but we will not do so here. It is enough to say that the high frequencies associated with the edges of the window cause considerable distortion of the spectrum. The effect can be alleviated by using a smoother window than a rectangular one, and several have been investigated extensively. The commonly-used windows of Bartlett, Blackman, and Hamming are illustrated in Figure 4.8. .FC "Figure 4.8" .pp Because the DFT produces the same number of frequency samples, equally spaced, as there were points in the time waveform, there is a tradeoff between frequency resolution and time resolution (for a given sampling rate). For example, a 256-point transform with a sample rate of 8\ kHz gives the 256 equally-spaced frequency components between 0 and 8\ kHz that are shown in Table 4.2. 
.RF .nr x0 (\w'time domain'/2) .nr x1 (\w'frequency domain'/2) .in+1.0i .ta 1.0i 3.0i 4.0i \h'0.5i+2n-\n(x0u'time domain\h'|3.5i+2n-\n(x1u'frequency domain .sp sample time sample \h'-3n'frequency number number .nr x0 1i+\w'00000' \l'\n(x0u\(ul' \l'\n(x0u\(ul' .sp \0\0\00 \0\0\0\00 $mu$sec \0\0\00 \0\0\0\00 Hz \0\0\01 \0\0125 \0\0\01 \0\0\031 \0\0\02 \0\0250 \0\0\02 \0\0\062 \0\0\03 \0\0375 \0\0\03 \0\0\094 \0\0\04 \0\0500 \0\0\04 \0\0125 .nr x2 (\w'...'/2) \h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... \h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... \h'0.5i+4n-\n(x2u'...\h'|3.5i+4n-\n(x2u'... .sp \0254 31750 \0254 \07938 \0255 31875 $mu$sec \0255 \07969 Hz \l'\n(x0u\(ul' \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .MT 2 Table 4.2 Time domain and frequency domain samples for a 256-point DFT, with 8\ kHz sampling .TE The top half of the frequency spectrum is of no interest, because it contains the complex conjugates of the bottom half (in reverse order), corresponding to frequencies greater than half the sampling frequency. Thus for a 30\ Hz resolution in the frequency domain, 256 time samples, or a 32\ msec stretch of speech, needs to be transformed. A common technique is to take overlapping periods in the time domain to give a new frequency spectrum every 16\ msec. From the acoustic point of view this is a reasonable rate to re-compute the spectrum, for as noted above when discussing channel vocoders the rate of change in the spectrum is limited by the speed at which the speaker can move his vocal organs, and anything between 10 and 25\ msec is a reasonable figure for transmitting or storing the spectrum. .pp The DFT is a complex transform, and speech is a real signal. It is possible to do two DFT's at once by putting one time waveform into the real parts of the input and another into the imaginary parts. This destroys the DFT symmetry property, for it only holds for real inputs. But given the DFT of a complex sequence formed in this way, it is easy to separate out the DFT's of the two real time sequences. If the two time sequences are $x(n)$ and $y(n)$, then the transform of the complex sequence .LB .EQ g(n) ~~ = ~~ x(n) ~+~ jy(n) .EN .LE is .LB .EQ G(r) ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn ~+~ jy(n)W sup rn ] . .EN .LE It follows that the complex conjugate of the aliased parts of the spectrum, in the upper frequency region, is .LB .EQ G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup -(N-r)n ~-~ jy(n)W sup -(N-r)n ] , .EN .LE and this is the same as .LB .EQ G(N-r) sup * ~~ = ~~ sum from n=0 to N-1 ~[x(n)W sup rn ~-~ jy(n)W sup rn ] , .EN .LE because $W sup N$ is 1 (recall the definition of $W$), and so $W sup -Nn$ is 1 for any $n$. Thus .LB .EQ X(r) ~~ = ~~ {G(r) ~+~ G(N-r) sup * } over 2 ~~~~~~~~~~~~~~~~ Y(r) ~~ = ~~ {G(r) ~-~ G(N-r) sup * } over 2j .EN .LE extracts the transforms $X(r)$ and $Y(r)$ of the original sequences $x$ and $y$. .pp With speech, this trick is frequently used to calculate two spectra at once. Using 256-point transforms, a new estimate of the spectrum can be obtained every 16\ msec by taking overlapping 32\ msec stretches of speech, with a computational requirement of one 256-point transform every 32\ msec. .sh "4.6 The fast Fourier transform" .pp Straightforward calculation of the DFT, expressed as .LB .EQ G(r) ~~ = ~~ sum from n=0 to N-1 ~g(n)~W sup nr , .EN .LE for $r=0,~ 1,~ 2,~ ...,~ N-1$, takes $N sup 2$ operations, where each operation is a complex multiply and add (for $W$ is, of course, a complex number).
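.pp
As a point of reference, this definition can be transcribed directly into a few lines of Python using the built-in complex arithmetic (an illustrative sketch; the function name is ours). It performs exactly the $N sup 2$ complex multiply-and-add operations just counted.
.LB
.nf
import cmath

def slow_dft(g):
    # direct transcription of G(r) = sum of g(n) W**(r n), W = exp(-j 2 pi / N)
    N = len(g)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(g[n] * W ** (r * n) for n in range(N)) for r in range(N)]
.fi
.LE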
There is a better way, invented in the early sixties, which reduces this to $N ~ log sub 2 N$ operations \(em a very considerable improvement. Dubbed the "fast Fourier transform" (FFT) for historical reasons, it would actually be better called the "Fourier transform", with the straightforward method above known as the "slow Fourier transform"! There is no reason nowadays to use the slow method, except for tiny transforms. It is worth describing the basic principle of the FFT, for it is surprisingly simple. More details on actual implementations can be found in Brigham (1974). .[ Brigham 1974 .] .pp It is important to realize that the FFT involves no approximation. It is an .ul exact calculation of the values that would be obtained by the slow method (although it may be affected differently by round-off errors). Problems of aliasing and windowing occur in all discrete Fourier transforms, and they are neither alleviated nor exacerbated by the FFT. .pp To gain insight into the working of the FFT, imagine the sequence $g(n)$ split into two halves, containing the even and odd points respectively. .LB even half $e(n)$ is $g(0)~ g(2)~ .~ .~ .~ g(N-2)$ .br odd half $o(n)$ is $g(1)~ g(3)~ .~ .~ .~ g(N-1)$. .LE Then it is easy to show that if $G$ is the transform of $g$, $E$ the transform of $e$, and $O$ that of $o$, then .LB $ G(r) ~~ = ~~ E(r) ~+~ W sup r O(r)$ for $r=0,~ 1,~ ...,~ half N -1$, .LE and .LB $ G( half N +r ) ~~ = ~~ E(r) ~+~ W sup { half N +r} O(r)$ for $ r = 0,~ 1,~ ...,~ half N -1$. .LE Calculation of the $E$ and $O$ transforms involves $( half N) sup 2$ operations each, while combining them together according to the above relationship occupies $N$ operations. Thus the total is $N + half N sup 2 $ operations, which is considerably less than $N sup 2$. .pp But don't stop there! The even half can itself be broken down into even and odd parts to expedite its calculation, and the same with the odd half. The only constraint is that the number of elements in the sequences splits exactly into two at each stage. Providing $N$ is a power of 2, then, we are left at the end with some 1-point transforms to do. But transforming a single point leaves it unaffected! (Check the definition of the DFT.) A quick calculation shows that the number of operations needed is not $N + half N sup 2$, but $N~ log sub 2 N$. Figure 4.9 compares this with $N sup 2$, the number of operations for straightforward DFT calculation, and it can be seen that the FFT is very much faster. .FC "Figure 4.9" .pp The only restriction on the use of the FFT is that $N$ must be a power of two. If it is not, alternative, more complicated, algorithms can be used which give comparable computational advantages. However, for speech processing the number of samples that are transformed is usually arranged to be a power of two. If a pitch synchronous analysis is undertaken, the time stretch that is to be transformed is dictated by the length of the pitch period, and will vary from time to time. Then, it is usual to pad out the time waveform with zeros to bring the number of samples up to a power of two; otherwise, if different-length time stretches were transformed the scale of the resulting frequency components would vary too. .pp The FFT provides very worthwhile cost savings over the use of a bank of bandpass filters for spectral analysis. Take the example of a 256-point transform with 8\ kHz sampling, giving 128 frequency components spaced by 31.25\ Hz from 0 up to almost 4\ kHz. 
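.pp
The decimation into even and odd halves translates directly into a recursive routine, sketched below in Python (again purely for illustration, with $N$ restricted to a power of two as discussed above; production implementations are iterative and more careful about storage, but the arithmetic is the same).
.LB
.nf
import cmath

def fft(g):
    # recursive radix-2 decimation in time; len(g) must be a power of two
    N = len(g)
    if N == 1:
        return list(g)       # a 1-point transform leaves the point unchanged
    E = fft(g[0::2])         # transform of the even-numbered points
    O = fft(g[1::2])         # transform of the odd-numbered points
    G = [0] * N
    for r in range(N // 2):
        Wr = cmath.exp(-2j * cmath.pi * r / N)
        G[r] = E[r] + Wr * O[r]
        G[N // 2 + r] = E[r] - Wr * O[r]   # W**(N/2 + r) is -W**r
    return G
.fi
.LE
Applied to the 256-point, 8\ kHz example above it delivers the same 128 useful frequency components as the direct method, for a small fraction of the work.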
This can be computed on overlapping 32\ msec stretches of the time waveform, giving a new spectrum every 16\ msec, by a single FFT calculation every 32\ msec (putting successive pairs of time stretches in the real and imaginary parts of the complex input sequence, as described earlier). The FFT algorithm requires $N~ log sub 2 N$ operations, which is 2048 when $N=256$. An additional 512 operations are required for the windowing calculation. Repeated every 32\ msec, this gives a rate of 80,000 operations per second. To achieve a much lower frequency resolution with 20 bandpass filters, each of which is fourth-order, will require a great many more operations. Each filter will need between 4 and 8 multiplications per sample, depending on its exact digital implementation. But new samples appear every 125 .ul micro\c seconds, and so somewhere around a million operations will be required every second. If we increased the frequency resolution to that obtained by the FFT, 128 filters would be needed, requiring between 4 and 8 million operations every second! .sh "4.7 Formant estimation" .pp Once the frequency spectrum of a speech signal has been calculated, it may seem a simple matter to estimate the positions of the formants. But it is not! Spectra obtained in practice are not usually like the idealized ones of Figure 2.2. One reason for this is that, unless the analysis is pitch-synchronous, the frequency spectrum of the excitation source is mixed in with that of the vocal tract filter. There are other reasons, which will be discussed later in this section. But first, let us consider how to extract the vocal tract filter characteristics from the combined spectrum of source and filter. To do so we must begin to explore the theory of linear systems. .rh "Discrete linear systems." Figure 4.10 shows an input signal exciting a filter to produce an output signal. .FC "Figure 4.10" For present purposes, imagine the input to be a glottal waveform, the filter a vocal tract one, and the output a speech signal (which is then subjected to high-frequency de-emphasis by radiation from the lips). We will consider here .ul discrete systems, so that the input $x(n)$ and output $y(n)$ are sampled signals, defined only when $n$ is integral. The theory is quite similar for continuous systems. .pp Assume that the system is .ul linear, that is, if input $x sub 1 (n)$ produces output $y sub 1 (n)$ and input $x sub 2 (n)$ produces output $y sub 2 (n)$, then the sum of $x sub 1 (n)$ and $x sub 2 (n)$ will produce the sum of $y sub 1 (n)$ and $y sub 2 (n)$. It is easy to show from this that, for any constant multiplier $a$, the input $ax(n)$ will produce output $ay(n)$ \(em it is pretty obvious when $a=2$, or indeed any positive integer; for then $ax(n)$ can be written as $x(n)+x(n)+...$ . Assume further that the system is .ul time-invariant, that is, if input $x(n)$ produces output $y(n)$ then a time-shifted version of $x$, say $x(n+n sub 0 )$ for some constant $n sub 0$, will produce the same output, only time-shifted; namely $y(n+n sub 0)$. .pp Now consider the discrete delta function $delta (n)$, which is 0 except at $n=0$ when it is 1. If this single impulse is presented as input to the system, the output is called the .ul impulse response, and will be denoted by $h(n)$. The fact that the system is time-invariant guarantees that the response does not depend upon the particular time at which the impulse occurred, so that, for example, the impulsive input $delta (n+n sub 0 )$ will produce output $h(n+n sub 0 )$.
A delta-function input and corresponding impulse response are shown in Figure 4.10. .pp The impulse response of a linear, time-invariant system is an extremely useful thing to know, for it can be used to calculate the output of the system for any input at all! Specifically, an input signal $x(n)$ can be written .LB .EQ x(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) delta (n-k) , .EN .LE because $delta (n-k)$ is non-zero only when $k=n$, and so for any particular value of $n$, the summation contains only one non-zero term \(em that is, $x(n)$. The action of the system on each term of the sum is to produce an output $x(k)h(n-k)$, because $x(k)$ is just a constant, and the system is linear. Furthermore, the complete input $x(n)$ is just the sum of such terms, and since the system is linear, the output is the sum of $x(k)h(n-k)$. Hence the response of the system to an arbitrary input is .LB .EQ y(n)~ = ~~ sum from {k=- infinity} to infinity ~ x(k) h(n-k) . .EN .LE This is called a .ul convolution sum, and is sometimes written .LB .EQ y(n)~ =~ x(n) ~*~ h(n). .EN .LE .pp Let's write this in terms of $z$-transforms. The (two-sided) $z$-transform of $y(n)$ is .LB .EQ Y(z)~ = ~~ sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ = ~~ sum from n ~ sum from k ~x(k)h(n-k) ~z sup -n . .EN .LE Writing $z sup -n$ as $z sup -(n-k) z sup -k$, and interchanging the order of summation, this becomes .LB .EQ Y(z)~ mark = ~~ sum from k ~[~ sum from n ~ h(n-k)z sup -(n-k) ~]~x(k)z sup -k .EN .br .EQ lineup = ~~ sum from k ~H(z)~x(k)z sup -k ~~ = ~~ H(z)~ sum from k ~x(k)z sup -k ~~=~~H(z)X(z) . .EN .LE Thus convolution in the time domain is the same as multiplication in the $z$-transform domain; a very important result. Applied to the linear system of Figure 4.10, this means that the output $z$-transform is the input $z$-transform multiplied by the $z$-transform of the system's impulse response. .pp What we really want to do is to relate the frequency spectrum of the output to the response of the system and the spectrum of the input. In fact, frequency spectra are very closely connected with $z$-transforms. A periodic signal $x(n)$ which repeats every $N$ samples has DFT .LB .EQ sum from n=0 to N-1 ~x(n)~e sup {-j2 pi rn/N} , .EN .LE and its $z$-transform is .LB .EQ sum from {n=- infinity} to infinity ~x(n) ~z sup -n . .EN .LE Hence the DFT is the same as the $z$-transform of a single cycle of the signal, evaluated at the points $z= e sup {j2 pi r/N}$ for $r=0,~ 1,~ ...~ ,~ N-1$. In other words, the frequency components are samples of the $z$-transform at $N$ equally-spaced points around the unit circle. Hence the frequency spectrum at the output of a linear system is the product of the input spectrum and the frequency response of the system itself (that is, the transform of its impulse response function). It should be admitted that this statement is somewhat questionable, because to get from $z$-transforms to DFT's we have assumed that a single cycle only is transformed \(em and the impulse response function of a system is not necessarily periodic. The real action of the system is to multiply $z$-transforms, not DFT's. However, it is useful in imagining the behaviour of the system to think in terms of products of DFT's; and in practice it is always these rather than $z$-transforms which are computed because of the existence of the FFT algorithm. .pp Figure 4.11 shows the frequency spectrum of a typical voiced speech signal.
.FC "Figure 4.11" The overall shape shows humps at the formant positions, like those in the idealized Figure 2.2. However, superimposed on this is an "oscillation" (in the frequency domain!) at the pitch frequency. This occurs because the transform of the vocal tract filter has been multiplied by that of the pitch pulse, the latter having components at harmonics of the pitch frequency. The oscillation must be suppressed before the formants can be estimated to any degree of accuracy. .pp One way of eliminating the oscillation is to perform pitch-synchronous analysis. This removes the influence of pitch from the frequency domain by dealing with it in the time domain! The snag is, of course, that it is not easy to estimate the pitch frequency: some techniques for doing so are discussed in the next main section. Another way is to use linear predictive analysis, which really does get rid of pitch information without having to estimate the pitch period first. A smooth frequency spectrum can be produced using the analysis techniques described in Chapter 6, which provides a suitable starting-point for formant frequency estimation. The third method is to remove the pitch ripple from the frequency spectrum directly. This will be discussed in an intuitive rather than a theoretical way, because linear predictive methods are becoming dominant in speech processing. .rh "Cepstral processing of speech." Suppose the frequency spectrum of Figure 4.11 were actually a time waveform. To remove the high-frequency pitch ripple is easy: just filter it out! However, filtering removes .ul additive ripples, whereas this is a .ul multiplicative ripple. To turn multiplication into addition, take logarithms. Then the procedure would be .LB .NP compute the DFT of the speech waveform (windowed, overlapped); .NP take the logarithm of the transform; .NP filter out the high-frequency part, corresponding to pitch ripple. .LE .pp Filtering is often best done using the DFT. If the rippled waveform of Figure 4.11 is transformed, a strong component could be expected at the ripple frequency, with weaker ones at its harmonics. These components can be simply removed by setting them to zero, and inverse-transforming the result to give a smoothed version of the original frequency spectrum. A spectrum of the logarithm of a frequency spectrum is often called a .ul cepstrum \(em a sort of backwards spectrum. The horizontal axis of the cepstrum, having the dimension of time, is called "quefrency"! Note that high-frequency signals have low quefrencies and vice versa. In practice, because the pitch ripple is usually well above the quefrency of interest for formants, the upper end of the cepstrum is often simply cut off from a fixed quefrency which corresponds to the maximum pitch expected. However, identifying the pitch peaks of the cepstrum has the useful byproduct of giving the pitch period of the original speech. 
.pp To summarize, then, the procedure for spectral smoothing by the cepstral method is .LB .NP compute the DFT of the speech waveform (windowed, overlapped); .NP take the logarithm of the transform; .NP take the DFT of this log-transform, calling it the cepstrum; .NP identify the lowest-quefrency peak in the cepstrum as the pitch, confirming it by examining its harmonics, which should be equally spaced at the pitch quefrency; .NP remove pitch effects from the cepstrum by cutting off its high-quefrency part above either the pitch quefrency or some constant representing the maximum expected pitch (which is the minimum expected pitch quefrency); .NP inverse DFT the resulting cepstrum to give a smoothed spectrum. .LE .rh "Estimating formant frequencies from smoothed spectra." The difficulties of formant extraction are not over even when a smooth frequency spectrum has been obtained. A simple peak-picking algorithm which identifies a peak at the $k$'th frequency component whenever .LB $ X(k-1) ~<~ X(k) $ and $ X(k) ~>~ X(k+1) $ .LE will quite often identify formants incorrectly. It helps to specify in advance minimum and maximum formant frequencies \(em say 100\ Hz and 3\ kHz for three-formant identification, and ignore peaks lying outside these limits. It helps to estimate the bandwidth of the peaks and reject those with bandwidths greater than 500\ Hz \(em for real formants are never this wide. However, if two formants are very close, then they may appear as a single, wide, peak and be rejected by this criterion. It is usual to take account of formant positions identified in previous frames under these conditions. .pp Markel and Gray (1976) describe in detail several estimation algorithms. .[ Markel Gray 1976 Linear prediction of speech .] Their simplest uses the number of peaks identified in the raw spectrum (under 3\ kHz, and with bandwidths less than 500\ Hz) to determine what to do. If exactly three peaks are found, they are used as the formant positions. It is claimed that this happens about 85% to 90% of the time. If only one peak is found, the present frame is ignored and the previously-identified formant positions are used (this happens less than 1% of the time). The remaining cases are two peaks \(em corresponding to omission of one formant \(em and four peaks \(em corresponding to an extra formant being included. (More than four peaks never occurred in their data.) Under these conditions, a nearest-neighbour measure is used for disambiguation. The measure is .LB .EQ v sub ij ~ = ~ |{ F sup * } sub i (k) ~-~ F sub j (k-1)| , .EN .LE where $F sub j (k-1)$ is the $j$'th formant frequency defined in the previous frame $k-1$ and ${ F sup * } sub i (k)$ is the $i$'th raw data frequency estimate for frame $k$. If only two peaks are found, this measure is used to match them with the two closest formants of the previous frame; the remaining formant of that frame is then taken to supply the missing formant position. If four peaks are found, the measure is used to determine which of them is furthest from the previous formant values, and this one is discarded. .pp This procedure works forwards, using the previous frame to disambiguate peaks given in the current one. More sophisticated algorithms work backwards as well, identifying .ul anchor points in the data which have clearly-defined formant positions, and moving in both directions from these to disambiguate neighbouring frames of data.
Finally, absolute limits can be imposed upon the magnitude of formant movements between frames to give an overall smoothing to the formant tracks. .pp Very often, people will refine the result of such automatic formant estimation procedures by hand, looking at the tracks, knowing what was said, and making adjustments in the light of their experience of how formants move in speech. Unfortunately, it is difficult to obtain high-quality formant tracks by completely automatic means. .pp One of the most difficult cases in formant estimation is where two formants are so close together that the individual peaks cannot be resolved. One simple solution to this problem is to employ "analysis-by-synthesis", whereby once a formant is identified, a standard formant shape at this position is synthesized and subtracted from the logarithmic spectrum (Coker, 1963). .[ Coker 1963 .] Then, even if two formants are right on top of each other, the second is not missed because it remains after the first one has been subtracted. .pp Unfortunately, however, the single peak which appears when two formants are close together usually does not correspond exactly with the position of either one. There is one rather advanced signal-processing technique that can help in this case. The frequency spectrum of speech is determined by .ul poles which lie in the complex $z$-plane inside the unit circle. (They must be inside the unit circle if the system is stable. Those familiar with Laplace analysis of analogue systems may like to note that the left half of the $s$-plane corresponds with the inside of the unit circle in the $z$-plane.) As shown earlier, computing a DFT is tantamount to evaluating the $z$-transform at equally-spaced points around the unit circle. However, better resolution is obtained by evaluating around a circle which lies .ul inside the unit circle, but .ul outside the outermost pole position. Such a circle is sketched in Figure 4.12. .FC "Figure 4.12" .pp Recall that the FFT is a fast way of calculating the DFT of a sequence. Is there a similarly fast way of evaluating the $z$-transform inside the unit circle? The answer is yes, and the technique is known as the "chirp $z$-transform", because it involves considering a signal whose frequency increases linearly \(em just like a radar chirp signal. The chirp method allows the $z$-transform to be computed quickly at equally-spaced points along spirally-shaped contours around the origin of the $z$-plane \(em corresponding to signals of linearly increasing complex frequency. The spiral nature of these curves is not of particular interest in speech processing. What .ul is of interest, though, is that the spiral can begin at any point in the $z$-plane, and its pitch can be set arbitrarily. If we begin spiralling at $z=0.9$, say, and set the pitch to zero, the contour becomes a circle inside the unit one, with radius 0.9. Such a circle is exactly what is needed to refine formant resolution. .sh "4.8 Pitch extraction" .pp The last section discussed how to characterize the vocal tract filter in the source-filter model of speech production: this one looks at how the most important property of the source \(em that is, the pitch period \(em can be derived. In many ways pitch extraction is more important from a practical point of view than is formant estimation. In a voice-output system, formant estimation is only necessary if speech is to be stored in formant-coded form.
For linear predictive storage of speech, or for speech synthesis from phonetics or text, formant extraction is unnecessary \(em although of course general information about formant frequencies and formant tracks in natural speech is needed before a synthesis-from-phonetics system can be built. However, knowledge of the pitch contour is needed for many different purposes. For example, compact encoding of linearly predicted speech relies on the pitch being estimated and stored as a parameter separate from the articulation. Significant improvements in frequency analysis can be made by performing pitch-synchronous Fourier transformations, because the need to window is eliminated. Many synthesis-from-phonetics systems require the pitch contour for utterances to be stored rather than computed from markers in the phonetic text. .pp Another issue which is closely bound up with pitch extraction is the voiced-unvoiced distinction. A good pitch estimator ought to fail when presented with aperiodic input such as an unvoiced sound, and so give a reliable indication of whether the frame of speech is voiced or not. .pp One method of pitch estimation, which uses the cepstrum, has been outlined above. It involves a substantial amount of computation and is quite intricate. However, if implemented properly it gives excellent results, because the source-filter structure of the speech is fully utilized. Another method, using the linear prediction residual, will be described in Chapter 6. Again, this requires a great deal of computation of a fairly sophisticated nature, and gives good results \(em although it relies on a somewhat more restricted version of the source-filter model than cepstral analysis. .rh "Autocorrelation methods." The most reliable way of estimating the pitch of a periodic signal which is corrupted by noise is to examine its short-time autocorrelation function. The autocorrelation of a signal $x(n)$ with lag $k$ is defined as .LB .EQ phi (k) ~~ = ~~ sum from {n=- infinity} to infinity ~ x(n)x(n+k) . .EN .LE If the signal is quasi-periodic, with slowly varying period, a finite stretch of it can be isolated with a window $w(i)$, which is 0 when $i$ is outside the range $[0,N)$. Beginning this window at sample $m$ gives the windowed signal .LB .EQ x(n)w(n-m), .EN .LE whose autocorrelation, the .ul short-time autocorrelation of the signal $x$ at point $m$, is .LB .EQ phi sub m (k)~ = ~~ sum from n ~ x(n)w(n-m)x(n+k)w(n-m+k) . .EN .LE .pp The autocorrelation function exhibits peaks at lags which correspond to the pitch period and multiples of it. At such lags, the signal is in phase with a delayed version of itself, giving high correlation. The pitch of natural speech ranges over about three octaves, from 50\ Hz (low-pitched men) to around 400\ Hz (children). To ensure that at least two pitch cycles are seen, even at the low end, the window needs to be at least 40\ msec long, and the autocorrelation function calculated for lags up to 20\ msec. The peaks which occur at lags corresponding to multiples of the pitch become smaller as the multiple increases, because the speech waveform will change slightly and the pitch period is not perfectly constant. If signals at the high end of the pitch range, 400\ Hz, are viewed through a 40\ msec autocorrelation window, considerable smearing of pitch resolution in the time domain is to be expected. Finally, for unvoiced speech, no substantial peaks of autocorrelation will occur.
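.pp
A minimal sketch of this idea in Python is given below; the window length, lag range and voicing threshold are illustrative assumptions, not recommended values.
.LB
.nf
import numpy as np

def autocorrelation_pitch(frame, fs=8000, f_low=50.0, f_high=400.0):
    # Window the frame (40 msec at 8 kHz is 320 samples).
    x = frame * np.hamming(len(frame))

    # Short-time autocorrelation for lags up to the longest expected period.
    max_lag = int(fs / f_low)                 # 20 msec at 8 kHz
    min_lag = int(fs / f_high)
    phi = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])

    # The largest peak in the plausible lag range gives the pitch period.
    k_best = min_lag + int(np.argmax(phi[min_lag:]))

    # A weak peak relative to phi(0) suggests an unvoiced frame.
    if phi[k_best] < 0.3 * phi[0]:
        return None
    return fs / k_best
.fi
.LE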
.pp If all deviations from perfect periodicity can be attributed to additive, white, Gaussian noise, then it can be shown from standard detection theory that autocorrelation methods are appropriate for pitch identification. Unfortunately, this is certainly not the case for speech signals. Although the short-time autocorrelation of voiced speech exhibits peaks at multiples of the pitch period, it is not clear that it is any easier to detect these peaks in the autocorrelation function than it is in the original time waveform! To take a simple example, if a signal contains a fundamental and its second and third harmonics, all in phase, .LB .EQ x(n)~ =~ a sin 2 pi fnT ~+~ b sin 4 pi fnT ~+~ c sin 6 pi fnT , .EN .LE then its autocorrelation function is .LB .EQ phi (k) ~=~~ {a sup 2 ~cos~2 pi fkT~+~b sup 2 ~cos~4 pi fkT~+~c sup 2 ~cos~6 pi fkT} over 2 ~ . .EN .LE There is no reason to believe that detection of the fundamental period of this signal will be any easier in the autocorrelation domain than in the time domain. .pp The most common error of pitch detection by autocorrelation analysis is that the periodicities of the formants are confused with the pitch. This typically leads to the repetition time being identified as $T sub pitch ~ +- ~ T sub formant1$, where the $T$'s are the periods of the pitch and first formant. Fortunately, there are simple ways of processing the signal non-linearly to reduce the effect of formants on pitch estimation using autocorrelation. .pp One way is to low-pass filter the signal with a cut-off above the maximum pitch frequency, say 600\ Hz. However, formant 1 is often below this value. A different technique, which may be used in conjunction with filtering, is to "centre-clip" the signal as shown in Figure 4.13. .FC "Figure 4.13" This removes many of the ripples which are associated with formants. However, it entails the use of an adjustable clipping threshold to cater for speech of varying amplitudes. Sondhi (1968), who introduced the technique, set the clipping level at 30% of the maximum amplitude. .[ Sondhi 1968 .] An alternative, which achieves much the same effect without the need to fiddle with thresholds, is to cube the signal, or raise it to some other high (odd!) power, before taking the autocorrelation. This highlights the peaks and suppresses the effect of low-amplitude parts. .pp For very accurate pitch detection, it is best to combine the evidence from several different methods of analysis of the time waveform. The autocorrelation function provides one source of evidence; and the cepstrum provides another. A third source comes from the time waveform itself. McGonegal .ul et al (1975) have described a semi-automatic method of pitch detection which uses human judgement to make a final decision based upon these three sources of evidence. .[ McGonegal Rabiner Rosenberg 1975 SAPD .] This appears to provide highly accurate pitch contours at the expense of considerable human effort \(em it takes an experienced user 30 minutes to process each second of speech. .rh "Speeding up autocorrelation." Calculating the autocorrelation function is an arithmetic-intensive procedure. For large lags, it can best be done using FFT methods; although there are simpler arithmetic tricks which speed it up without going to such complexity. However, with the availability of analogue delay lines using charge-coupled devices, autocorrelation can now be done effectively and cheaply by analogue, sampled-data, hardware.
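.pp
As an illustration of the FFT route just mentioned, the following Python sketch computes all lags of the autocorrelation at once by transforming the signal, taking the squared magnitude, and transforming back; the zero-padding is there to stop the circular wrap-around of the DFT from contaminating the lags of interest.
.LB
.nf
import numpy as np

def autocorrelation_fft(x, max_lag):
    # Pad so that the circular autocorrelation of the padded sequence
    # agrees with the ordinary one for lags up to max_lag.
    n = len(x) + max_lag
    X = np.fft.rfft(x, n)

    # Back-transforming the power spectrum gives the autocorrelation.
    phi = np.fft.irfft(np.abs(X) ** 2, n)
    return phi[:max_lag + 1]
.fi
.LE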
.pp Nevertheless, some techniques to speed up digital calculation of short-time autocorrelations are in wide use. It is tempting to hard-limit the signal so that it becomes binary (Figure 4.14(a)), thus eliminating multiplication. .FC "Figure 4.14" This can be disastrous, however, because hard-limited speech is known to retain considerable intelligibility and therefore the formant structure is still there. A better plan is to take centre-clipped speech and hard-limit that to a ternary signal (Figure 4.14(b)). This simplifies the computation considerably with essentially no degradation in performance (Dubnowski .ul et al, 1976). .[ Dubnowski Schafer Rabiner 1976 Digital hardware pitch detector .] .pp A different approach to reducing the amount of calculation is to perform a kind of autocorrelation which does not use multiplications. The "average magnitude difference function", which is defined by .LB .EQ d(k)~ = ~~ sum from {n=- infinity} to infinity ~ |x(n)-x(n+k)| , .EN .LE has been used for this purpose with some success (Ross .ul et al, 1974). .[ Ross Schafer Cohen Freuberg Manley 1974 .] It exhibits dips at pitch periods (instead of the peaks of the autocorrelation function). .rh "Feature-extraction methods." Another possible way of extracting pitch in the time domain is to try to integrate information from different sources to give reliable pitch estimates. Several features of the time waveform can be defined, each of which provides an estimate of the pitch period, and an overall estimate can be obtained by majority vote. .pp For example, suppose that the only feature of the speech waveform which is retained is the height and position of the peaks, where a "peak" is defined by the simplistic criterion .LB $ x(n-1) ~<~ x(n) $ and $ x(n) ~>~ x(n+1) $. .LE Having found a peak which is thought to represent a pitch pulse, one could define a "blanking period", based upon the current pitch estimate, within which the next pitch pulse could not occur. When this period has expired, the next pitch pulse is sought. At first, a stringent criterion should be used for identifying the next peak as a pitch pulse; but it can gradually be relaxed if time goes on without a suitable pulse being located. Figure 4.15 shows a convenient way of doing this: a decaying exponential threshold is begun at the end of the blanking period, and when a peak rises above it, that peak is identified as a pitch pulse. .FC "Figure 4.15" One big advantage of this type of algorithm is that the data is greatly reduced by considering peaks only \(em which can be detected by simple hardware. Thus it can permit real-time operation on a small processor with minimal special-purpose hardware. .pp Such a pitch pulse detector is exceedingly simplistic, and will often identify the pitch incorrectly. However, it can be used in conjunction with other features to produce good pitch estimates. Gold and Rabiner (1969), who pioneered the approach, used six features: .[ Gold Rabiner 1969 Parallel processing techniques for pitch periods .] .LB .NP peak height .NP valley depth .NP valley-to-peak height .NP peak-to-valley depth .NP peak-to-peak height (if greater than 0) .NP valley-to-valley depth (if greater than 0). .LE The features are symmetric with regard to peaks and valleys. The first feature is the one described above, and the second one works in exactly the same way. The third feature records the height between each valley and the succeeding peak, and the fourth uses the depth between each peak and the succeeding valley.
The purpose of the final two detectors is to eliminate secondary, but rather large, peaks from consideration. Figure 4.16 shows the kind of waveform on which the other features might incorrectly double the pitch, but which the last two features identify correctly. .FC "Figure 4.16" .pp Gold and Rabiner also included the last two pitch estimates from each feature detector. Furthermore, for each feature, the present estimate was added to the previous one to make a fourth, and the previous one to the one before that to make a fifth, and all three were added together to make a sixth; so that for each feature there were 6 separate estimates of pitch. The reason for this is that if three consecutive estimates of the fundamental period are $T sub 0$, $T sub 1$ and $T sub 2$, then, if some peaks are being falsely identified, the actual period could be any of .LB .EQ T sub 0 ~+~ T sub 1 ~~~~ T sub 1 ~+~ T sub 2 ~~~~ T sub 0 ~+~ T sub 1 ~+~ T sub 2 . .EN .LE It is essential to do this, because a feature of a given type can occur more than once in a pitch period \(em secondary peaks usually exist. .pp Six features, each contributing six separate estimates, make 36 estimates of pitch in all. An overall figure was obtained from this set by selecting the most popular estimate (within some pre-specified tolerance). The complete scheme has been evaluated extensively (Rabiner .ul et al, 1976) and compares favourably with other methods. .[ Rabiner Cheng Rosenberg McGonegal 1976 .] .pp However, it must be admitted that this procedure seems to be rather .ul ad hoc (as are many other successful speech parameter estimation algorithms!). Specifically, it is not easy to predict what kinds of waveforms it will fail on, and evaluation of it can only be pragmatic. When used to estimate the pitch of musical instruments and singers over a 6-octave range (40\ Hz to 2.5\ kHz), instances were found where it failed dramatically (Tucker and Bates, 1978). .[ Tucker Bates 1978 .] This is, of course, a much more difficult problem than pitch estimation for speech, where the range is typically 3 octaves. In fact, for speech the feature detectors are usually preceded by a low-pass filter to attenuate the myriad of peaks caused by higher formants, and this is inappropriate for musical applications. .pp There is evidence which shows that additional features can assist with pitch identification. The above features are all based upon the signal amplitude, and could be described as .ul secondary features derived from a single .ul primary feature. Other primary features can easily be defined. Tucker and Bates (1978) used a centre-clipped waveform, and considered only the peaks rising above the central region. .[ Tucker Bates 1978 .] They defined two further primary features, in addition to the peak amplitude: the .ul time width of a peak (period for which it is outside the clipping level), and its .ul energy (again, outside the clipping level). The primary features are shown in Figure 4.17. .FC "Figure 4.17" Secondary features are defined, based on these three primary ones, and pitch estimates are made for each one. A further innovation was to combine the individual estimates in a way which is based upon autocorrelation analysis, reducing to some degree the .ul ad-hocery of the pitch detection process. .sh "4.9 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "4.10 Further reading" .pp There are a lot of books on digital signal analysis, although in general I find them rather turgid and difficult to read.
.LB "nn" .\"Ackroyd-1973-1 .]- .ds [A Ackroyd, M.H. .ds [D 1973 .ds [T Digital filters .ds [I Butterworths .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Here is the exception to prove the rule. This book .ul is easy to read. It provides a good introduction to digital signal processing, together with a wealth of practical design information on digital filters. .in-2n .\"Committee.I.D.S.P-1979-3 .]- .ds [A IEEE Digital Signal Processing Committee .ds [D 1979 .ds [T Programs for digital signal processing .ds [I Wiley .ds [C New York .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n This is a remarkable collection of tried and tested Fortran programs for digital signal analysis. They are all available from the IEEE in machine-readable form on magnetic tape. Included are programs for digital filter design, discrete Fourier transformation, and cepstral analysis, as well as others (like linear predictive analysis; see Chapter 6). Each program is accompanied by a concise, well-written description of how it works, with references to the relevant literature. .in-2n .\"Oppenheim-1975-4 .]- .ds [A Oppenheim, A.V. .as [A " and Schafer, R.W. .ds [D 1975 .ds [T Digital signal processing .ds [I Prentice Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is one of the standard texts on most aspects of digital signal processing. It treats the $z$-transform, digital filters, and discrete Fourier transformation in far more detail than we have been able to here. .in-2n .\"Rabiner-1975-5 .]- .ds [A Rabiner, L.R. .as [A " and Gold, B. .ds [D 1975 .ds [T Theory and application of digital signal processing .ds [I Prentice Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is the other standard text on digital signal processing. It covers the same ground as Oppenheim and Schafer (1975) above, but with a slightly faster (and consequently more difficult) presentation. It also contains major sections on special-purpose hardware for digital signal processing. .in-2n .\"Rabiner-1978-1 .]- .ds [A Rabiner, L.R. .as [A " and Schafer, R.W. .ds [D 1978 .ds [T Digital processing of speech signals .ds [I Prentice Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Probably the best single reference for digital speech analysis, as it is for the time-domain encoding techniques of the last chapter. Unlike the books cited above, it is specifically oriented to speech processing. .in-2n .LE "nn" .EQ delim $$ .EN .CH "5 RESONANCE SPEECH SYNTHESIZERS" .ds RT "Resonance speech synthesizers .ds CX "Principles of computer speech .pp This chapter considers the design of speech synthesizers which implement a direct electrical analogue of the resonance properties of the vocal tract by providing a filter for each formant whose resonant frequency is to be controlled. Another method is the channel vocoder, with a bank of fixed filters whose gains are varied to match the spectrum of the speech as described in Chapter 4. This is not generally used for synthesis from a written representation, however, because it is hard to get good quality speech. It .ul is used sometimes for low-bandwidth transmission and storage, for it is fairly easy to analyse natural speech into fixed frequency bands. A second alternative to the resonance synthesizer is the linear predictive synthesizer, which at present is used quite extensively and is likely to become even more popular. This is covered in the next chapter. 
Another alternative is the articulatory synthesizer, which attempts to model the vocal tract directly, rather than modelling the acoustic output from it. Although, as noted in Chapter 2, articulatory synthesis holds a promise of high-quality speech \(em for the coarticulation effects caused by tongue and jaw inertia can be modelled directly \(em this has not yet been realized. .pp The source-filter model of speech production indicates that an electrical analogue of the vocal tract can be obtained by considering the source excitation and the filter that produces the formant frequencies separately. This approach was pioneered by Fant (1960), and we shall present much of his work in this chapter. .[ Fant 1960 Acoustic theory of speech production .] There has been some discussion over whether the source-filter model really is a good one, and some synthesizers explicitly introduce an element of "sub-glottal coupling", which simulates the effect of the lung cavity on the vocal tract transfer function during the periods when the glottis is open (for an example see Rabiner, 1968). .[ Rabiner 1968 Digital formant synthesizer JASA .] However, this is very much a low-order effect when considering speech synthesized by rule from a written representation, for the software which calculates parameter values to drive the synthesizer is a far greater source of degradation in speech quality. .sh "5.1 Overall spectral considerations" .pp Figure 5.1 shows the source-filter model of speech production. .FC "Figure 5.1" For voiced speech, the excitation source produces a waveform whose frequency components decay at about 12\ dB/octave, as we shall see in a later section. The excitation passes into the vocal tract filter. Conceptually, this can best be viewed as an infinite series of formant filters, although for implementation purposes only the first few are modelled explicitly and the effect of the rest is lumped together into a higher-formant compensation network. In either case the overall frequency profile of the filter is a flat one, upon which humps are superimposed at the various formant frequencies. Thus the output of the vocal tract filter falls off at 12\ dB/octave just as the input does. However, measurements of actual speech show a 6\ dB/octave decay with increasing frequency. This is explained by the effect of radiation of speech from the lips, which in fact has a "differentiating" action, producing a 6\ dB/octave rise in the frequency spectrum. This 6\ dB/octave lift is similar to that provided by a treble boost control on a radio or amplifier. Speech synthesized without it sounds unnaturally heavy and bassy. .pp These overall spectral shapes, which are derived from considering the human vocal tract, are summarized in the upper annotations in Figure 5.1. But there is no real necessity for a synthesizer to model the frequency characteristics of the human vocal tract at intermediate points: only the output speech is of any concern. Because the system is a linear one, the filter blocks in the figure can be shuffled around to suit engineering requirements. One such requirement is the desire to minimize internally-generated noise in the electrical implementation, most of which will arise in the vocal tract filter (because it is much more complicated than the other components). For this reason an excitation source with a flat spectrum is often preferred, as shown in the lower annotations. 
This can be generated either by taking the desired glottal pulse shape, with its 12\ dB/octave fall-off, and passing it through a filter giving 12\ dB/octave lift at higher frequencies; or, if the pulse shape is to be stored digitally, by storing its second derivative instead. Then the radiation compensation, which is now more properly called "spectral equalization", will comprise a 6\ dB/octave fall-off to give the required trend in the output spectrum. .pp For a given pitch period, this scheme yields exactly the same spectral characteristics as the original system which modelled the human vocal tract. However, when the pitch varies there will be a difference, for components at higher excitation frequencies are attenuated at 6\ dB/octave by the final spectral equalization in the new system, whereas they are boosted at 6\ dB/octave by the radiation characteristic in the old. In practice, the pitch of the human voice lies quite low in the frequency range \(em usually below 400\ Hz \(em and if all filter characteristics begin their roll-off at this frequency the two systems will be the same. This simplifies the implementation with a slight compromise in its accuracy in modelling the spectral trend of human speech, for the overall \-6\ dB/octave decay actually begins at a frequency of around 100\ Hz. If this is implemented, some adjustment will need to be made to the amplitudes to ensure that high-pitched sounds are not attenuated unduly. .pp The discussion so far pertains to voiced speech only. The source spectrum of the random excitation in unvoiced sounds is substantially flat, and combines with the radiation from the lips to give a +6\ dB/octave rise in the output spectrum. Hence if spectral equalization is changed to \-6\ dB/octave to accommodate a voiced excitation with flat spectrum, the noise source should show a 12\ dB/octave rise to give the correct overall effect. .sh "5.2 The excitation sources" .pp In human speech, the excitation source for voiced sounds is produced by two flaps of skin called the "vocal cords". These are blown apart by pressure from the lungs. When they come apart the pressure is relieved, and the muscles tensioning the skin cause the flaps to come together again. Subsequently, the lung pressure \(em called "sub-glottal pressure" \(em builds up once more and the process is repeated. The factors which influence the rate and nature of vibration are muscular tension of the cords and the sub-glottal pressure. The detailed shape of the excitation is of considerable importance to speech synthesis because it greatly influences the apparent naturalness of the sound produced. For example, if you have inflamed vocal cords caused by laryngitis the sound quality changes dramatically. Old people who do not have proper muscular control over their vocal cord tension produce a quavering sound. Shouted speech can easily be distinguished from quiet speech even when the volume cue is absent \(em you can verify this by fiddling with the volume control of a tape recorder \(em because when shouting, the vocal cords stay apart for a much smaller fraction of the pitch cycle than at normal volumes. .rh "Voiced excitation in natural speech." There are two basic ways to examine the shape of the excitation source in people. One is to use a dentist's mirror and high-speed photography to observe the vocal cords directly. Although it seems a lot to ask someone to speak naturally with a mirror stuck down the back of his throat, the method has been used and photographs can be found, for example, in Flanagan (1972).
.[ Flanagan 1972 Speech analysis synthesis and perception .] The second technique is to process the acoustic waveform digitally, identifying the formant positions and deducting the formant contributions from the waveform by filtering. This leaves the basic excitation waveform, which can then be displayed. Such techniques lead to excitation shapes like those sketched in Figure 5.2, in which the gradual opening and abrupt closure of the vocal cords can easily be seen. .FC "Figure 5.2" .pp It is a fact that if a periodic function has one or more discontinuities, its frequency spectrum will decay at sufficiently high frequencies at the rate of 6\ dB/octave. For example, the components of the square wave .LB $ g(t) ~~ = ~~ mark 0 $ for $ 0 <= t < h $ .br $ lineup 1 $ for $ h <= t < b $ .LE can be calculated from the Fourier series .LB .EQ G(r) ~~ = ~~ 1 over b ~ integral from 0 to b ~g(t)~e sup {-j2 pi rt/b} ~dt ~~ = ~~ j over {2 pi r} ~(1~-~e sup {-j2 pi rh/b} ) , .EN .LE so $|G(r)|$ decays as $1/r$, apart from a bounded oscillatory factor, and the change in one octave of this $1/r$ trend is .LB .EQ 20~log sub 10 ~ |G(2r)| over |G(r)| ~~=~~20~log sub 10 ~ 1 over 2 ~~ = ~ .EN \-6\ dB. .LE However, if the discontinuities are ones of slope only, then the asymptotic decay at high frequencies is 12\ dB/octave. Thus the glottal excitation of Figure 5.2 will decay at this rate. Note that it is not the .ul number but the .ul type of discontinuities which is important in determining the asymptotic spectral trend. .rh "Voiced excitation in synthetic speech." There are several ways that glottal excitation can be simulated in a synthesizer, four of which are shown in Figure 5.3. .FC "Figure 5.3" The square pulse and the sawtooth pulse both exhibit discontinuities, and so will have the wrong asymptotic rate of decay (6\ dB/octave instead of 12\ dB/octave). A better bet is the triangular pulse. This has the correct decay, for there are only discontinuities of slope. However, although the asymptotic rate of decay is of first importance, the fine structure of the frequency spectrum at the lower end is also significant, and the fact that there are two discontinuities of slope instead of just one in the natural waveform means that the spectra cannot match closely. .pp Rosenberg (1971) has investigated several different shapes using listening tests, and he found that the polynomial approximation sketched in Figure 5.3 was preferred by listeners. .[ Rosenberg 1971 .] This has one slope discontinuity, and comprises three sections: .LB $g(t) ~~ = ~~ 0$ for $0 <= t < t sub 1$ (flat during the period of closure) .sp $g(t) ~~ = ~~ A~ u sup 2 (3 - 2u) $, where $u ~=~ {t-t sub 1} over {t sub 2 -t sub 1} $ , for $t sub 1 <= t < t sub 2$ (opening phase) .sp .sp $g(t) ~~ = ~~ A~ (1 - v sup 2 )$, where $v ~=~ {t-t sub 2} over {b-t sub 2} $ , for $t sub 2 <= t < b$ (closing phase). .LE It is easy to see that the joins between the first and second section, and between the second and third section, are smooth; but that the slope of the third section at the end of the cycle when $t=b$ is .LB .EQ dg over dt ~~ = ~~ -~ {2A} over {b ~-~ t sub 2} ~ . .EN .LE $A$ is the maximum amplitude of the pulse, and is reached when $t=t sub 2$. .pp A much simpler glottal pulse shape to implement is the filtered impulse. Passing an impulse through a filter with characteristic .LB .EQ 1 over {(1+sT) sup 2} .EN .LE imparts a 12\ dB/octave decay after frequency $1/T$. This gives a pulse shape of .LB .EQ g(t) ~~ = ~~ A~ t over T ~e sup {1-t/T} , .EN .LE which is sketched in Figure 5.4.
.FC "Figure 5.4" The pulse is the wrong way round in time when compared with the desired one; but this is not important under most listening conditions because phase differences are not noticeable (this point is discussed further below). The maximum is reached when $t=T$ and has height $A$. The value zero is never actually attained, for the decay to it is asymptotic, and if the slight discontinuity between pulses shown in the Figure is left, the asymptotic rate of decay of the frequency spectrum will be 6\ dB/octave rather than 12\ dB/octave. However, in a real implementation involving filtering an impulse there will be no such discontinuity, for the next pulse will start off where the last one ended. .pp This seems to be an attractive scheme because of its simplicity, and indeed is sometimes used in speech synthesis. However, it does not have the right properties when the pitch is varied, for in real glottal waveforms the maximum occurs at a fixed .ul fraction of the period, whereas the filtered impulse's maximum is at a fixed time, $T$. If $T$ is chosen to make the system correct at high pitch frequencies (say 400\ Hz), then the pulse will be much too narrow at low pitches and sound rather harsh. The only solution is to vary the filter parameters with the pitch, leading to complexity again. .pp Holmes (1973) has made an extensive study of the effect of the glottal waveshape on the naturalness of high-quality synthesized speech. .[ Holmes 1973 Influence of glottal waveform on naturalness .] He employed a rather special speech synthesizer, which provides far more comprehensive and sophisticated control than most. It was driven by parameters which were extracted from natural utterances by hand \(em but the process of generating and tuning them took many months of a skilled person's time. By using the pulse shape extracted from the natural utterance, he found that synthetic and natural versions could actually be made indistinguishable to most people, even under high-quality listening conditions using headphones. Performance dropped quite drastically when one of Rosenberg's pulse shapes, similar to the three-section one given above, was used. Holmes also investigated phase effects and found that whilst different pulse shapes with identical frequency spectra could easily be distinguished when listening over headphones, there was no perceptible difference if the listener was placed at a comfortable distance from a loudspeaker in a room. This is attributable to the fact that the room itself imposes a complex modification to the phase characteristics of the speech signal. .pp Although a great deal of care must be taken with the glottal pulse shape for very high-quality synthetic speech, for speech synthesized by rule from a written representation the degradation which stems from incorrect control of the synthesizer parameters is much greater than that caused by using a slightly inferior glottal pulse. The triangular pulse illustrated in Figure 5.3 has been found quite satisfactory for speech synthesis by rule. .rh "Unvoiced excitation." Speech quality is much less sensitive to the characteristics of the unvoiced excitation. Broadband white noise will serve admirably. It is quite acceptable to generate this digitally, using a pseudo-random feedback shift register. This gives a bit sequence whose autocorrelation is zero except at multiples of the repetition length. 
The repetition length can easily be made as long as the number of states in the shift register (less one) \(em in this case, the configuration is called "maximal length" (Gaines, 1969). .[ Gaines 1969 Stochastic computing advances in information science .] For example, an 18-bit maximal-length shift register will repeat every $2 sup 18 -1$ cycles. If the bit-stream is used as a source of analogue noise, the autocorrelation function will have triangular parts whose width is twice the clock period, as shown in Figure 5.5. .FC "Figure 5.5" According to a well-known result (the Wiener-Khinchine theorem; see for example Chirlian, 1973) the power density of the frequency spectrum is the same as the Fourier transform of the autocorrelation function. .[ Chirlian 1973 .] Since the feedback shift register gives a periodic autocorrelation function, its transform is a Fourier series. The $r$'th frequency component is .LB .EQ G(r) ~~ = ~~ {R sup 2} over {4 pi sup 2 r sup 2 T} ~(1~-~~cos~{{2 pi rT} over R}) ~ . .EN .LE Here, $T$ is the clock period and $R=(2 sup N -1)T$ is the repetition time of an $N$-bit shift register. .pp The spectrum is a bar spectrum, with components spaced at .LB $ {1 over R}~~=~~{1 over {(2 sup N -1)T}}$ Hz. .LE These are very close together \(em with $N=18$ and sampling at 20\ kHz (50\ $mu$sec) the spacing becomes under 0.1\ Hz \(em and so it is reasonable to treat the spectrum as continuous, with .LB .EQ G(f) ~~ = ~~ 1 over {4 pi sup 2 f sup 2 T}~~(1~-~cos 2 pi fT) . .EN .LE This spectrum is sketched in Figure 5.6(a), and the measured result of an actual implementation in Figure 5.6(b). .FC "Figure 5.6" The 3\ dB point occurs when .LB .EQ {G(f) over G(0)} ~~=~~{1 over 2} ~ , .EN .LE and $G(0)$ is $T/2$. Hence, at the 3\ dB point, .LB .EQ {1~-~cos 2 pi fT} over {2 pi sup 2 f sup 2 T sup 2} ~~ = ~~ 1 over 2 ~ , .EN .LE which has solution $f=0.45/T$. Thus a pseudo-random shift register generates noise whose spectrum is substantially flat up to half the clock frequency. Anything over 10\ kHz is therefore a suitable clocking rate for speech-quality noise. Choose 20\ kHz to err on the conservative side. If the repetition occurs in less than 3 or 4 seconds, it can be heard quite clearly; but above this figure it is not noticeable. An 18-bit shift register clocked at 20\ kHz repeats every $(2 sup 18 -1)/20000 ~ = ~ 13$ seconds, which is more than adequate. .sh "5.3 Simulating vocal tract resonances" .pp The vocal tract, from glottis to lips, can be modelled as an unconstricted tube of varying cross-section with no side branches and no sub-glottal coupling. This has an all-pole transfer function, which can be written in the form .LB .EQ H(s) ~~ = ~~ {w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} ~.~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~~ .~ .~ . .EN .LE There is an unspecified (conceptually infinite) number of terms in the product. Each of them produces a peak in the energy spectrum, and these are the formants we observed in Chapter 2. .pp Formants appear even in an over-simplified model of the tract as a tube of uniform cross-section, with a sound source at one end (the larynx) and open at the other (the lips). This extremely crude model was discussed in Chapter 2, and surprisingly, perhaps, it gives a good approximation to the observed formant frequencies for a neutral, relaxed vowel such as that in .ul "a\c bove". .pp Speech is made by varying the postures of the various organs of the vocal tract.
Different vowels, for example, result largely from different tongue positions and lip postures. Naturally, such physical changes alter the frequencies of the resonances, and successful automatic speech synthesis depends upon successful movement of the formants. Fortunately, only the first three or four resonances need to be altered even for extremely realistic synthesis, and virtually all existing synthesizers provide control over these formants only. .rh "Analysis of a single formant." Each formant is modelled as a second-order resonance, with transfer function .LB .EQ H(s) ~~ = ~~ {w sub c sup 2} over {s sup 2 ~+~ b s ~+~ w sub c sup 2} ~ . .EN .LE As will be shown below, $w sub c$ is the nominal resonant frequency in radians/s, and $b$ is the approximate 3\ dB bandwidth of the resonance. The term $w sub c sup 2$ in the numerator adjusts the gain to be unity at DC ($s=0$). .pp To calculate the frequency response of the formant, write $s=jw$. Then the energy spectrum is .LB .EQ |H(jw)| sup 2 ~~ mark = ~~ {w sub c sup 4} over {(w sup 2 - w sub c sup 2 ) sup 2 ~+~ b sup 2 w sup 2} .EN .sp .sp .EQ lineup = ~~ {w sub c sup 4} over {[w sup 2 ~-~(w sub c sup 2 -~ {b sup 2} over 2 )] sup 2 ~~ +~~b sup 2 (w sub c sup 2~-~{{b sup 2} over 4})} ~ . .EN .sp .LE This reaches a maximum when the squared term in the denominator of the second expression is zero, namely when $w=(w sub c sup 2 ~-~ b sup 2 /2) sup 1/2$. However, formant bandwidths are low compared with their centre frequencies, and so to a good approximation the peak occurs at $w=w sub c$ and is of amplitude $w sub c /b$, that is, $20~log sub 10 w sub c /b$\ dB above the DC gain. At frequencies higher than the peak the energy falls off as $1/w sup 4$, a factor of 1/16 for each doubling in frequency, and so the asymptotic decay is 12\ dB/octave. .pp At the points which are 3\ dB below the peak, .LB .EQ |H(jw sub 3dB )| sup 2 ~~ = ~~ 1 over 2 ~|H(jw sub max )| sup 2 ~~ = ~~ 1 over 2 ~ times ~ {w sub c sup 2} over {b sup 2} ~ , .EN .LE and it is easy to show that this is satisfied by $w sub 3dB ~ = ~ w sub c ~ +- ~ b/2$ to a good approximation (neglecting higher powers of $b/w sub c )$. Figure 5.7 summarizes the shape of an individual formant resonance. .FC "Figure 5.7" .pp The bandwidth of a formant is fairly constant, regardless of the formant frequency. This makes the formant filter a slightly unusual one: most engineering applications which use variable-frequency resonances require the bandwidth to be a constant proportion of the resonant frequency \(em the ratio $w sub c /b$, often called the "$Q$" of the filter, is to be constant. For formants, we wish the $Q$ to increase linearly with resonant frequency. Since the amplitude gain of the formant at resonance is $w sub c /b$, this peak gain increases as the formant frequency is increased. .pp Although it is easy to measure formant frequencies on a spectrogram (cf Chapter 2), it is not so easy to measure bandwidths accurately. One rather unusual method was reported by van den Berg (1955), who took a subject who had had a partial laryngectomy, an operation which left an opening into the vocal tract near the larynx position. Into this he inserted a sound source and made a swept-frequency calibration of the vocal tract! .[ Berg van den 1955 .] Almost as bizarre is a technique which involves setting off a spark inside the mouth of a subject as he holds his articulators in a given position.
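.pp
Before turning to measured bandwidth values, these relationships are easy to check numerically. The short Python fragment below evaluates $|H(jw)|$ for a single formant and confirms that the gain at resonance is close to $w sub c /b$ and that the response is about 3\ dB down at $w sub c ~+-~ b/2$; the particular centre frequency and bandwidth are arbitrary illustrative values.
.LB
.nf
import numpy as np

def formant_gain(f, fc, bw):
    # |H(jw)| for H(s) = wc^2/(s^2 + b*s + wc^2), all frequencies in Hz.
    w, wc, b = 2 * np.pi * f, 2 * np.pi * fc, 2 * np.pi * bw
    return wc**2 / np.sqrt((wc**2 - w**2)**2 + (b * w)**2)

fc, bw = 1000.0, 70.0
print(formant_gain(fc, fc, bw))           # close to fc/bw = 14.3
print(formant_gain(fc + bw / 2, fc, bw))  # close to 14.3/sqrt(2)
.fi
.LE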
.pp The results of several different kinds of experiment are reported by Dunn (1961), and are summarized in Table 5.1, along with the formant frequency ranges. .[ Dunn 1961 .] .RF .in+0.5i .ta 1.7i +2.5i .nr x1 (\w'range of formant'/2) .nr x2 (\w'range of bandwidths'/2) \h'-\n(x1u'range of formant \h'-\n(x2u'range of bandwidths .nr x1 (\w'frequencies (Hz)'/2) .nr x2 (\w'as measured in different'/2) \h'-\n(x1u'frequencies (Hz) \h'-\n(x2u'as measured in different .nr x1 (\w'experiments (Hz)'/2) \h'-\n(x1u'experiments (Hz) .nr x1 (\w'0000 \- 0000'/2) .nr x2 (\w'000 \- 000'/2) .nr x0 2.5i+(\w'range of formant'/2)+(\w'as measured in different'/2) .nr x3 (\w'range of formant'/2) \h'-\n(x3u'\l'\n(x0u\(ul' .sp formant 1 \h'-\n(x1u'\0100 \- 1100 \h'-\n(x2u'\045 \- 130 formant 2 \h'-\n(x1u'\0500 \- 2500 \h'-\n(x2u'\050 \- 190 formant 3 \h'-\n(x1u'1500 \- 3500 \h'-\n(x2u'\070 \- 260 \h'-\n(x3u'\l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in-0.5i .MT 2 Table 5.1 Different estimates of formant bandwidths, with range of formant frequencies for reference .TE Note that the bandwidths really are narrow compared with the resonant frequencies of the filters, except at the lower end of the formant 1 range. Choosing the lowest bandwidth estimate leads to an amplification factor at resonance of 50 for formant 2 when its frequency is at the top of its range; and formant 3 happens to give the same value. .rh "Series synthesizers." The simplest realization of the vocal tract filter is a chain of formant filters in series, as illustrated in Figure 5.8. .FC "Figure 5.8" This leads to particular difficulties if the frequencies of two formants stray close together. The worst case occurs if formants 2 and 3 have the same resonant frequencies, at the top of the range of formant 2, namely 2500\ Hz. In this case, and if the bandwidths of the formants are set to the lowest estimates, a combined amplification factor of $(2500/50) times (2500/70)=1800$ is obtained at the point of resonance \(em that is, 65\ dB above the DC value. This is enough to tax most analogue implementations, and can evoke clipping in the formant filters, with a very noticeable effect on speech quality. This extreme case will not occur during synthesis of realistic speech, for although the formant .ul ranges overlap, the values for any particular (human) sound will not coincide exactly. However, it illustrates the difficulty of designing a series synthesizer which copes sensibly with arbitrary parameter settings, and explains why designers often choose formant bandwidths in the top half of the ranges given in Table 5.1. .pp The problem of excessive amplification within a series synthesizer can be alleviated to a small extent by choosing carefully the order in which the filters are placed in the chain. In a linear system, of course, the order in which the components occur does not matter. In physical implementations, however, it is advantageous to minimize extreme amplification at intermediate points. By placing the formant 1 filter between formants 2 and 3, the formant 2 resonance is attenuated somewhat before it reaches formant 3. Continuing with the extreme example above, where both formants 2 and 3 were set to 2500\ Hz; assume that formant 1 is at its nominal value of 500\ Hz. It provides attenuation at approximately 12\ dB/octave above this, and so at the formant 2 peak, 2.3\ octaves higher, the attenuation is 28\ dB. 
Thus the gain at 2500\ Hz, which is $20 ~ log sub 10 ~ 2500/50 ~ = ~ 34$\ dB after passing through the formant 2 filter, is reduced to 6\ dB by formant 1, only to be increased by $20 ~ log sub 10 ~ 2500/70 ~ = ~ 31$\ dB to a value of 37\ dB by formant 3. This avoids the extreme 65\ dB gain of formants 2 and 3 combined. .pp Figure 5.8 shows only three formant filters modelled explicitly. The effect of the rest \(em and they do have an effect, although it is small at low frequencies \(em is incorporated by lumping them together into the "higher-formant correction" filter. To calculate the characteristics of this filter, assume that the lumped formants have the values given by the simple uniform-tube model of Chapter 2, namely 3500\ Hz for formant 4, 4500\ Hz for formant 5, and, in general, $500(2n-1)$\ Hz for formant $n$. The effect of each of these on the spectrum is .LB .EQ 10~ log sub 10 {w sub n sup 4} over {(w sup 2 ~-~w sub n sup 2 ) sup 2 ~~+~~b sub n sup 2 w sup 2} ~~ = ~~ -~ 10~ log sub 10 ~[(1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2 ~~+~~ {{b sub n sup 2 w sup 2} over {w sub n sup 4}}] .EN dB, .LE following from what was calculated above. We will have to approximate this by assuming that $b sub n sup 2 /w sub n sup 2$ is negligible \(em this is quite reasonable for these higher formants because Table 5.1 shows that the bandwidth does not increase in proportion to the formant frequency range \(em and approximate the logarithm by the first term of its series expansion: .LB .EQ -10 ~ log sub 10 ~ (1~-~~{{w sup 2} over {w sub n sup 2}}) sup 2 ~~ = ~~ -20~ log sub 10 ~ e ~ log sub e (1~-~~{{w sup 2} over {w sub n sup 2}}) ~~ = ~~ 20~ log sub 10 ~ e ~ times ~ {w sup 2} over {w sub n sup 2} ~ . .EN .LE .pp Now the total effect of formants 4, 5, ... at frequency $f$\ Hz (as distinct from $w$\ radians/s) is .LB .EQ 20~ log sub 10 ~ e ~ times ~ sum from n=4 to infinity ~{{f sup 2} over {500 sup 2 (2n-1) sup 2}} ~ . .EN .LE This expression is .LB .EQ 20~ log sub 10 ~ e ~ times ~ {{f sup 2} over {500 sup 2}}~~(~sum from n=1 to infinity ~{1 over {(2n-1) sup 2}} ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~) ~ . .EN .LE The infinite sum can actually be calculated in closed form, and is equal to $pi sup 2 /8$. Hence the total correction is .LB .EQ 20~ log sub 10 ~ e ~ times {{f sup 2} over {500 sup 2}} ~~(~{pi sup 2} over 8 ~~-~~ sum from n=1 to 3 ~{1 over {(2n-1) sup 2}}~) ~~ = ~~ 2.87 times 10 sup -6 f sup 2 .EN dB. .LE .pp Although this may at first seem to be a rather small correction, it is in fact 72\ dB when $f=5$\ kHz! On further reflection this is not an unreasonable figure, for the 12\ dB/octave decays contributed by formants 1, 2, and 3 must all be annihilated by the higher-formant correction to give an overall flat spectral trend. In fact, formant 1 will contribute 12\ dB/octave from 500\ Hz (3.3\ octaves to 5\ kHz, representing 40\ dB); formant 2 will contribute 12\ dB/octave from 1500\ Hz (1.7\ octaves to 5\ kHz, representing 21\ dB); and formant 3 will contribute 12\ dB/octave from 2500\ Hz (1\ octave to 5\ kHz, representing 12\ dB). These sum to 73\ dB. .pp If the first five formants are synthesized explicitly instead of just the first three, the correction is .LB .EQ 20~ log sub 10 ~ e ~ times ~ {{f sup 2} over {500 sup 2}} ~~(~{pi sup 2} over 8 ~-~~ sum from n=1 to 5 ~{1 over {(2n-1) sup 2}}~) ~~ = ~~ 1.73 times 10 sup -6 f sup 2 .EN dB, .LE giving a rather more reasonable value of 43\ dB when $f=5$\ kHz. 
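.pp
As a check on this arithmetic, the Python fragment below evaluates the correction formula just derived for lumping all formants above the $K$'th; it reproduces the figures of roughly 72\ dB and 43\ dB at 5\ kHz for $K=3$ and $K=5$.
.LB
.nf
import math

def higher_formant_correction_db(f_hz, k_explicit):
    # 20 log10(e) * (f/500)^2 * (pi^2/8 - sum over the first K formants)
    tail = math.pi**2 / 8 - sum(1.0 / (2*n - 1)**2
                                for n in range(1, k_explicit + 1))
    return 20 * math.log10(math.e) * (f_hz / 500.0)**2 * tail

print(higher_formant_correction_db(5000, 3))   # roughly 72 dB
print(higher_formant_correction_db(5000, 5))   # roughly 43 dB
.fi
.LE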
In actual implementations, fixed filters are sometimes included explicitly for formants 4 and 5. Although this lowers the gain of the higher-formant correction filter, the total amplification at 5\ kHz of the combined correction is still 72\ dB. If one is less demanding and aims for a synthesizer that produces a correct spectrum only up to 3.5\ kHz, it is 35\ dB. This places quite stringent requirements on the preceding formant filters if the stray noise that they generate internally is not to be amplified to perceptible magnitudes by the correction filter at high frequencies. .pp Explicit inclusion of fixed filters for formants 4 and 5 undoubtedly improves the accuracy of the higher-formant correction. Recall that the above derivation of the correction filter characteristic used the first-order approximation .LB .EQ log sub e (1~-~{{w sup 2} over {w sub n sup 2}}) ~~ = ~~ -~ {w sup 2} over {w sub n sup 2} ~ , .EN .LE which is only valid if $w << w sub n$. Thus it only holds at frequencies less than the highest explicitly synthesized formant, and so with formants 4 (3.5\ kHz) and 5 (4.5\ kHz) included a reasonable correction should be obtained for telephone-quality speech. However, detailed analysis with a second-order approximation shows that the coefficient of the neglected term is in fact small (Fant, 1960). .[ Fant 1960 Acoustic theory of speech production .] A second, perhaps more compelling, reason for explicitly including a couple of fixed formants is that the otherwise enormous amplification provided by the correction can be distributed throughout the formant chain. We saw earlier why there is reason to prefer the order F3\(emF1\(emF2 over F1\(emF2\(emF3. With explicit formants 4 and 5, a suitable order which helps to keep the amplification at intermediate points in the chain within reasonable bounds is F3\(emF5\(emF2\(emF4\(emF1. .rh "Parallel synthesizers." A series synthesizer models the vocal tract resonances by a chain of formant filters in series. A parallel synthesizer utilizes a parallel connection of filters as illustrated in Figure 5.9. .FC "Figure 5.9" .pp Consider a parallel combination of two formants with individually-controllable amplitudes. The combined transfer function is .LB .EQ H(s) ~~ mark = ~~ {A sub 1 w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} ~~+~~{A sub 2 w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} .EN .sp .sp .EQ lineup = ~~ { (A sub 1 w sub 1 sup 2 + A sub 2 w sub 2 sup 2 )s sup 2 ~+~(A sub 1 b sub 2 w sub 1 sup 2 + A sub 2 b sub 1 w sub 2 sup 2 )s ~+~ (A sub 1 +A sub 2 )w sub 1 sup 2 w sub 2 sup 2 } over { (s sup 2 ~+~b sub 1 s~+~w sub 1 sup 2 ) (s sup 2 ~+~b sub 2 s~+~w sub 2 sup 2 ) } .EN .LE If the formant bandwidths $b sub 1$ and $b sub 2$ are equal and the amplitudes are chosen as .LB .EQ A sub 1 ~~=~~ {w sub 2 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} ~~~~~~~~ A sub 2 ~~=~~-~ {w sub 1 sup 2} over {w sub 2 sup 2 -w sub 1 sup 2} ~ , .EN .LE then the transfer function becomes the same as that of a two-formant series synthesizer, namely .LB .EQ H(s) ~~ = ~~ {w sub 1 sup 2} over {s sup 2 ~+~ b sub 1 s ~+~ w sub 1 sup 2} ~ . ~{w sub 2 sup 2} over {s sup 2 ~+~ b sub 2 s ~+~ w sub 2 sup 2} ~ . .EN .LE The argument can be extended to any number of formants, under the assumption that the formant bandwidths are equal. Note that the signs of $A sub 1$ and $A sub 2$ differ: in general the formant amplitudes for a parallel synthesizer alternate in sign. 
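.pp
The equivalence is easy to verify numerically. The Python sketch below evaluates the parallel combination with the amplitudes given above, and the corresponding two-formant series product, at an arbitrary frequency; equal bandwidths are assumed, and the particular formant values are illustrative only.
.LB
.nf
import numpy as np

def parallel_and_series(f_hz, f1=500.0, f2=1500.0, bw=70.0):
    s = 2j * np.pi * f_hz
    w1, w2, b = 2 * np.pi * f1, 2 * np.pi * f2, 2 * np.pi * bw
    a1 = w2**2 / (w2**2 - w1**2)
    a2 = -w1**2 / (w2**2 - w1**2)
    parallel = (a1 * w1**2 / (s**2 + b*s + w1**2) +
                a2 * w2**2 / (s**2 + b*s + w2**2))
    series = (w1**2 / (s**2 + b*s + w1**2)) * (w2**2 / (s**2 + b*s + w2**2))
    return parallel, series

print(parallel_and_series(1000.0))    # the two values agree
.fi
.LE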
.pp In theory, therefore, it would be possible to use five parallel formants to model a five-formant series synthesizer exactly. Then the same higher-formant correction filter would be needed for the parallel synthesizer as for the series one. If the formant amplitudes were set slightly incorrectly, however, the five filters would not combine to give a total of 60\ dB/octave high-frequency decay above the resonances. It is easy to see this in the context of the simplified two-formant combination above: if the amplitudes were not chosen exactly right then the $s sup 2$ term in the numerator would not be quite zero. Then, the decay in the two-formant combination would be \-12\ dB/octave instead of \-24\ dB/octave, and in the five-formant case the decay would in fact still be \-12\ dB/octave. Advantage can be taken of this to equalize the levels within the synthesizer so that large amplitude variations do not occur. This can best be done by associating relatively low-gain fixed correction filters with each formant instead of providing one comprehensive correction to the combined spectrum: these are shown in Figure 5.9. Suitable correction filters have been determined empirically by Holmes (1972). .[ Holmes 1972 Speech synthesis .] They provide a 6\ dB/octave lift above 640\ Hz for formant 1, and 6\ dB/octave lift above 300\ Hz for formant 2. Formants 3 and 4 are uncorrected, whilst for formant 5 the correction begins as a 6\ dB/octave decay above 600\ Hz and increases to an 18\ dB/octave decay above 5.5\ kHz. .pp The disadvantage of a parallel synthesizer is that the amplitudes of the formants must be specified as well as their frequencies. (Furthermore, the formant bandwidths should all be equal, but they are often chosen to be such in series synthesizers because of the uncertainty as to their exact values.) However, the extra amplitude parameters clearly give greater control over the frequency spectrum of the synthesized speech. .pp A good example of how this extra control can usefully be exploited is the synthesis of nasal sounds. Nasalization introduces a cavity parallel to the oral tract, as illustrated in Figure 5.10, and this causes zeros in the transfer function. .FC "Figure 5.10" It is as if two different copies of the vocal tract transfer function, one for the oral and the other for the nasal passage, were added together. We have seen the effect of this above when considering parallel synthesis. The combination .LB .EQ H(s) ~~ = ~~ {A sub 1 w sub o sup 2} over {s sup 2 ~+~ b sub o s ~+~ w sub o sup 2} ~~+~~{A sub 2 w sub n sup 2} over {s sup 2 ~+~ b sub n s ~+~ w sub n sup 2} ~ , .EN .LE where the subscript "$o$" stands for oral and "$n$" for nasal, produces zeros in the numerator (unless the amplitudes are carefully adjusted to avoid them). These cannot be modelled by a series synthesizer, but they obviously can be by a parallel one. .pp Although they are certainly needed for accurate imitation of human speech, transfer function zeros to simulate nasal sounds are not essential for synthesis of intelligible English. It is not difficult to get a sound like a nasal consonant (\c .ul n, or .ul m\c ) with an all-pole synthesizer. Nevertheless, it is certainly true that a parallel synthesizer gives better .ul potential control over the spectrum than a series one. Whether the added flexibility can be used properly by a synthesis-by-rule computer program is another matter. .rh "Implementation of formant filters." Formant filters can be built in either analogue or digital form. 
A second-order resonance is needed, whose centre frequency can be controlled but whose bandwidth is fixed. If the control can be arranged as two tracking resistors, then the simple analogue configuration of Figure 5.11, with two operational amplifiers, will suffice. .FC "Figure 5.11" .pp The transfer function of this arrangement is .LB .EQ - ~~ { 1/C sub 1 R sub 1 C sub 2 R sub 2 } over { s sup 2 ~~+~~ {1 over {C sub 2 R sub 2}}~s ~~+~~{1 over {C sub 1 R' sub 1 C sub 2 R sub 2 }}} ~ , .EN .LE which characterizes it as a low-pass resonator with DC gain of $- R' sub 1 /R sub 1 $, bandwidth of $1/2 pi C sub 2 R sub 2$\ Hz, and centre frequency of $1/2 pi (C sub 1 R' sub 1 C sub 2 R sub 2 ) sup 1/2$\ Hz. Tracking $R' sub 1$ with $R sub 1$ ensures that the DC gain remains constant, and that the centre frequency follows $R sub 1 sup -1/2$. Moreover, neither is especially sensitive to slight departures from exact tracking of $R' sub 1$ with $R sub 1$. Such a filter has been used in a simple hand-controlled speech synthesizer, built for demonstration and amusement (Witten and Madams, 1978). .[ Witten Madams 1978 Chatterbox .] However, the need for tracking resistors, and the inverse square root variation of the formant frequency with $R sub 1$, makes it rather unsuitable for serious applications. .pp A better analogue filter is the ring-of-three configuration shown in Figure 5.12. .FC "Figure 5.12" (Ignore the secondary output for now.) Control is achieved over the centre frequency by two multipliers, driven from the same control input $k$. These have a high-impedance output, producing a current $kx$ if the input voltage is $x$. It is not too difficult to show that the transfer function of the circuit is .LB .EQ - ~~ { {k sup 2} over {C sup 2} } over { s sup 2 ~~+~~ 2 over RC ~s ~~+~~{1+k sup 2 R sup 2} over {R sup 2 C sup 2} } ~ . .EN .LE Suppose that $R$ is chosen so that $k sup 2 R sup 2 ~ >>~ 1$. Then this is a unity-gain resonator with constant bandwidth $1/ pi RC$\ Hz and centre frequency $k/2 pi C$\ Hz. Note that it is the combination of both multipliers that makes the centre frequency grow linearly with $k$: with one multiplier there would be a square-root relationship. .pp The ring-of-three filter of Figure 5.12 is arranged in a slightly unusual way, with an inverting stage at the beginning and the two resonant stages following it. This ensures that the signal level at intermediate points in the filter does not exceed that at the output, and gives the filter the best chance of coping with a wide range of input amplitudes without clipping. This contrasts markedly with the resonator of Figure 5.11, where the voltage at the output of the first integrator is $w/b$ times the final output \(em a factor of 50 in the worst case. .pp For a digital implementation of a formant, consider the recurrence relation .LB .EQ y(n)~ = ~~ a sub 1 y(n-1) ~-~ a sub 2 y(n-2) ~+~ a sub 0 x(n) , .EN .LE where $x(n)$ is the input and $y(n)$ the output at time $n$, $y(n-1)$ and $y(n-2)$ are the previous two values of the output, and $a sub 0$, $a sub 1$, and $a sub 2$ are (real) constants. The minus sign is in front of the second term because it makes $a sub 2$ turn out to be positive. 
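.pp
Anticipating the centre-frequency and bandwidth relations derived in the next
few paragraphs, the recurrence can be sketched directly in Pascal.
The coefficient values below give a 1\ kHz resonance with a 75\ Hz bandwidth
at an 8\ kHz sampling rate (the same figures as the example of Figure 5.13);
the program and variable names are illustrative only.
.LB
.nf
program digitalformant;
{ sketch: the recurrence y(n) = a1*y(n-1) - a2*y(n-2) + a0*x(n),
  with coefficients chosen from the centre-frequency and
  bandwidth formulae given below, excited by a unit impulse }
const
   pi = 3.14159265;
var
   T, F, B, a0, a1, a2, x, y, y1, y2: real;
   n: integer;
begin
   T := 1/8000;  F := 1000;  B := 75;
   a2 := exp(-2*pi*B*T);              { bandwidth = -(1/2piT) ln a2 }
   a1 := 2*sqrt(a2)*cos(2*pi*F*T);    { centre frequency formula, inverted }
   a0 := 1 - a1 + a2;                 { unity gain at DC }
   writeln('a0 = ', a0:8:5, '  a1 = ', a1:8:5, '  a2 = ', a2:8:5);
   y1 := 0;  y2 := 0;
   for n := 0 to 15 do begin
      if n = 0 then x := 1 else x := 0;     { unit impulse input }
      y := a1*y1 - a2*y2 + a0*x;
      writeln(n:3, y:12:6);
      y2 := y1;  y1 := y
   end
end.
.fi
.LE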
To calculate the $z$-transform version of this relationship, multiply through by $z sup -n$ and sum from $n=- infinity$ to $infinity$ : .LB "nn" .EQ sum from {n=- infinity} to infinity ~y(n)z sup -n ~~ mark =~~ a sub 1 sum from {n=- infinity} to infinity ~y(n-1)z sup -n ~~-~ a sub 2 sum from {n=- infinity} to infinity ~y(n-2)z sup -n ~~+~ a sub 0 sum from {n=- infinity} to infinity ~x(n)z sup -n .EN .sp .EQ lineup = ~~ a sub 1 z sup -1 ~ sum ~y(n-1)z sup -(n-1) ~~-~~ a sub 2 z sup -2 ~ sum ~y(n-2)z sup -(n-2) ~~+~~ a sub 0 ~ sum ~x(n)z sup -n ~ . .EN .LE "nn" Writing this in terms of $z$-transforms, .LB .EQ Y(z)~ = ~~ a sub 1 z sup -1 Y(z) ~-~ a sub 2 z sup -2 Y(z) ~+~ a sub 0 X(z) . .EN .LE Thus the input-output transfer function of the system is .LB .EQ H(z)~ = ~~ Y(z) over X(z) ~~=~~ {a sub 0 } over {1~-~a sub 1 z sup -1 ~+~a sub 2 z sup -2} ~ . .EN .LE .pp We learned in the previous chapter that the frequency response is obtained from the $z$-transform of a system by replacing $z sup -1$ by $e sup {-j2 pi fT}$, where $f$ is the frequency variable in\ Hz. Hence the amplitude response of the digital formant filter is .LB .EQ |H(e sup {j2 pi fT} )| sup 2 ~~ = ~~ left [ {a sub 0} over {1~-~a sub 1 e sup {-j2 pi fT} ~+~a sub 2 e sup {-j4 pi fT} } ~ right ] sup 2 ~ . .EN .sp .LE It is fairly obvious from this that a DC gain of 1 is obtained if .LB .EQ a sub 0 ~ = ~~ 1 ~-~ a sub 1 ~+~ a sub 2 , .EN .LE for $e sup {-j2 pi fT}$ is 1 at a frequency of 0\ Hz. Some manipulation is required to show that, under the usual assumption that the bandwidth is small, the centre frequency is .LB .EQ 1 over {2 pi T} ~~ cos sup -1 ~ {a sub 1} over {2 a sub 2 sup 1/2} ~ .EN Hz. .LE Furthermore, the 3\ dB bandwidth of the resonance is given approximately by .LB .EQ -~ 1 over {2 pi T} ~~ log sub e a sub 2 ~ .EN Hz. .LE .pp As an example, Figure 5.13 shows an amplitude response for this digital filter. .FC "Figure 5.13" The parameters $a sub 0$, $a sub 1$ and $a sub 2$ were generated from the above relationships for a sampling frequency of 8\ kHz, centre frequency of 1\ kHz, and bandwidth of 75\ Hz. It exhibits a peak of approximately the right bandwidth at the correct frequency, 1\ kHz. Note that the response is flat at half the sampling frequency, for the frequency response from 4\ kHz to 8\ kHz is just a reflection of that up to 4\ kHz. This contrasts sharply with that of an analogue formant filter, also shown in Figure 5.13, which slopes at \-12\ dB/octave at frequencies above resonance. .pp The behaviour of a digital formant filter at frequencies above resonance actually makes it preferable to an analogue implementation. We saw earlier that considerable trouble must be taken with the latter to compensate for the cumulative effect of \-12\ dB/octave at higher frequencies for each of the formants. This is not necessary with digital implementations, for the response of a digital formant filter is flat at half the sampling frequency. In fact, further study shows that digital synthesizers without any higher-pole correction give a closer approximation to the vocal tract than analogue ones with higher-pole correction (Gold and Rabiner, 1968). .[ Gold Rabiner 1968 Analysis of digital and analogue formant synthesizers .] .rh "Time-domain methods." An interesting alternative to frequency-domain speech synthesis is to construct the formants in the time domain. When a second-order resonance is excited by an impulse, an exponentially decaying sinusoid is produced, as illustrated by Figure 5.14.
.FC "Figure 5.14" The oscillation occurs at the resonant frequency of the filter, while the decay is related to the bandwidth. In fact, if the formant filter has transfer function .LB .EQ {w sup 2} over {s sup 2 ~+~ b s ~+~ w sup 2} ~ , .EN .LE the time waveform for impulsive excitation is .LB .EQ x(t)~ = ~~ w~ e sup -bt/2 ~ sin ~ wt ~~~~~~~~ .EN (neglecting $b sup 2 /w sup 2$). .LE It is the combination of several such time waveforms, coupled with the regular reappearance of excitation at the pitch period, that produces the characteristic wiggly waveform of voiced speech. .pp Now suppose we take a sine wave of frequency $w$ and multiply it by a decaying exponential $e sup -bt/2$. This gives a signal .LB .EQ x(t)~ = ~~ e sup -bt/2 ~ sin ~ wt , .EN .LE which is identical with the filtered impulse except for a factor $w$. If there are several formants in parallel, all with the same bandwidth, the exponential factor is the same for each: .LB .EQ x(t)~ = ~~ e sup -bt/2 ~ (A sub 1 ~ sin ~ w sub 1 t ~~+ ~~ A sub 2 ~ sin ~ w sub 2 t ~~ + ~~ A sub 3 ~ sin ~ w sub 3 t) . .EN .LE $A sub 1$, $A sub 2$, and $A sub 3$ control the formant amplitudes, as in an ordinary parallel synthesizer; except that they need adjusting to account for the missing factors $w sub 1$, $w sub 2$, and $w sub 3$. .pp A neat way of implementing such a synthesizer digitally is to store one cycle of a sine wave in a read-only memory (ROM). Then, the formant frequencies can be controlled by reading the ROM at different rates. For example, if twice the basic frequency is desired, every second value should be read. Multiplication is needed for amplitude control of each formant: this can be accomplished by shifting the digital word (each place shifted accounts for 6\ dB of attenuation). Finally, the exponential damping factor can be provided in analogue hardware by a single capacitor after the D/A converter. This implementation gives a system for hardware-software synthesis which involves an absolutely minimal amount of extra hardware apart from the computer, and does not need hardware multiplication for real-time operation. It could easily be made to work in real time with a microprocessor coupled to a D/A converter, damping capacitor, and fixed tone-control filter to give the required spectral equalization. .pp Because the overall spectral decay of an impulse exciting a second-order formant filter is 12\ dB/octave, the appropriate equalization is +6\ dB/octave lift at high frequencies, to give an overall \-6\ dB/octave spectral trend. .pp Note, however, that this synthesis model is an extremely basic one. Only impulsive excitation can be accomodated. For fricatives, which we will discuss in more detail below, a different implementation is needed. A hardware noise generator, with a few fixed filters \(em one for each fricative type \(em will suffice for a simple system. More damaging is the lack of aspiration, where random noise excites the vocal tract resonances. This cannot be simulated in the model. The .ul h sound can be provided by treating it as a fricative, and although it will not sound completely realistic, because there will be no variation with the formant positions of adjacent phonemes, this can be tolerated because .ul h is not too important for speech intelligibility. A bigger disadvantage is the lack of proper aspiration control for producing unvoiced stops, which as mentioned in Chapter 2 consist of an silent phase followed by a burst of aspiration. 
Experience has shown that although it is difficult to drive such a synthesizer from a software synthesis-by-rule system, quite intelligible output can be obtained if parameters are derived from real speech and tweaked by hand. Then, for each aspiration burst the most closely-matching fricative sound can be used. .sh "5.4 Aspiration and frication" .pp The model of the vocal tract as a filter which affects the frequency spectrum of the basic voiced excitation breaks down if there are constrictions in it, for these introduce new sound sources caused by turbulent air. The generation of unvoiced excitation has been discussed earlier in this chapter: now we must consider how to simulate the filtering action of the vocal tract for unvoiced sounds. .pp Aspiration and frication need to be dealt with separately. The former is caused by excitation at the vocal cords \(em the cords are held so close together that turbulent noise is produced. This noise passes through the same vocal tract filter that modifies voiced sounds, and the same kind of formant structure can be observed. All that is needed to simulate it is to replace the voiced excitation source by white noise, as shown in the upper part of Figure 5.15. .FC "Figure 5.15" .pp Speech can be whispered by substituting aspiration for voicing throughout. Of course, there is no fundamental frequency associated with aspiration. An interesting way of assessing informally the degradation caused by inadequate pitch control in a speech synthesis-by-rule system is to listen to whispered speech, in which pitch variations play no part. .pp Voiced and aspirative excitation are rarely produced at the same time in natural speech (but see the discussion in Chapter 2 about breathy voice). However, the excitation can change from one to the other quite quickly, and when this happens there is no discontinuity in the formant structure. .pp Fricative, or sibilant, excitation is quite different from aspiration, because it introduces a new sound source at a different place from the vocal cords. The constriction which produces the sound may be at the lips, the teeth, the hard ridge just behind the top front teeth, or further back along the palate. These positions each produce a different sound (\c .ul f, .ul th, .ul s, and .ul sh respectively). However, smooth transitions from one of these sounds to another do not occur in natural speech; and dynamical movement of the frequency spectrum during a fricative is unnecessary for speech synthesis. .pp It is necessary, however, to be able to produce an approximation to the noise spectrum for each of these sound types. This is commonly achieved by a single high-pass resonance whose centre frequency can be controlled. This is the purpose of the secondary output of the formant filter of Figure 5.12. Taking the output from this point gives a high-pass instead of a low-pass resonance, and this same filter configuration is quite acceptable for fricatives. Figure 5.15 shows the fricative sound path as a noise generator followed by such a filter. .pp Unlike aspiration, fricative excitation is frequently combined with voicing. This gives the voiced fricative sounds .ul v, .ul dh, .ul z, and .ul zh. It is possible to produce frication and aspiration together, and although there are no examples of this in English, speech synthesis-by-rule programs often use a short burst of aspiration .ul and frication when simulating the opening of unvoiced stops. 
Separate amplitude controls are therefore needed for voicing and frication, but the former can be used for aspiration as well, with a "glottal excitation type" switch to indicate aspiration rather than voicing. .sh "5.5 Summary" .pp A resonance speech synthesizer consists of a vocal tract filter, excited by either a periodic pitch pulse or aspiration noise. In addition, a set of sibilant sounds must be provided. The vocal tract filter is dynamic, with three controllable resonances. These, coupled with some fixed spectral compensation, give it a fairly high order \(em about 10 complex poles are needed. Although several different sibilant sound types must be simulated, dynamical movement is less important in fricative sound spectra than for voiced and aspirated sounds because smooth transitions between one fricative and another are not important in speech. However, fricative timing and amplitude must be controlled rather precisely. .pp The speech synthesizer is controlled by several parameters. These include fundamental frequency (if voiced), amplitude of voicing, frequency of the first few \(em typically three \(em formants, aspiration amplitude, sibilance amplitude, and frequency of one (or more) sibilance filters. Additionally, if the synthesizer is a parallel one, parameters for the amplitudes of individual formants will need to be included. It may be that some control over formant bandwidths is provided too. Thus synthesizers have from eight up to about 20 parameters (Klatt, 1980, describes one with 20 parameters). .[ Klatt 1980 Software for a cascade/parallel formant synthesizer .] .pp The parameters are supplied to the synthesizer at regular intervals of time. For a 10-parameter synthesizer, the control can be thought of as a set of 10 graphs, each representing the time evolution of one parameter. They are usually called parameter .ul tracks, the terminology dating from the days when a track was painted on a glass slide for each parameter to provide dynamic control of the synthesizer (Lawrence, 1953). .[ Lawrence 1953 .] The pitch track is often called a pitch .ul contour; this is a common phonetician's usage. Do not confuse this with the everyday meaning of "contour" as a line joining points of equal height on a map \(em a pitch contour is just the time evolution of the pitch frequency. .pp For computer-controlled synthesizers, of course, the parameter tracks are sampled, typically every 5 to 20\ msec. The rate is determined by the need to generate fast amplitude transitions for nasals and stop consonants. Contrast it with the 125\ $mu$sec sampling period needed to digitize telephone-quality speech. The raw data rate for a 10-parameter synthesizer updated every 10 msec is 1,000 parameters/sec, or 6\ Kbit/s if each parameter is represented by 6\ bits. This is a substantial reduction over the 56\ Kbit/s needed for PCM representation. For speech synthesis by rule (Chapter 7), these parameter tracks are generated by a computer program from a phonetic (or English) version of the utterance, lowering the data rate by a further one or two orders of magnitude. .pp Filters for speech synthesizers can be implemented in either analogue or digital form. High-order filters are usually broken down into second-order sections in parallel or in series. A third possibility, which has not been discussed above, is to implement a single high-order filter directly. Finally, the action of formant filters can be synthesized in the time domain. This gives eight possibilities which are summarized in Table 5.2. 
.RF .in +0.5i .ta 2.1i +2.0i .nr x1 (\w'Analogue'/2) .nr x2 (\w'Digital'/2) \h'-\n(x1u'Analogue \h'-\n(x2u'Digital .nr x0 2.0i+(\w'Liljencrants (1968)'/2)+(\w'Morris and Paillet (1972)'/2) .nr x3 (\w'Liljencrants (1968)'/2) \h'-\n(x3u'\l'\n(x0u\(ul' .sp .nr x1 (\w'Rice (1976)'/2) .nr x2 (\w'Rabiner \fIet al\fR'/2) Series \h'-\n(x1u'Rice (1976) \h'-\n(x2u'Rabiner \fIet al\fR .nr x1 (\w'Liljencrants (1968)'/2) .nr x2 (\w'Holmes (1973)'/2) Parallel \h'-\n(x1u'Liljencrants (1968) \h'-\n(x2u'Holmes (1973) .nr x1 (\w'unpublished'/2) .nr x2 (\w'unpublished'/2) Time-domain \h'-\n(x1u'unpublished \h'-\n(x2u'unpublished .nr x1 (\w'\(em'/2) .nr x2 (\w'Morris and Paillet (1972)'/2) High-order filter \h'-\n(x1u'\(em \h'-\n(x2u'Morris and Paillet (1972) \h'-\n(x3u'\l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in-0.5i .FG "Table 5.2 Implementation options for resonance speech synthesizers" .[ Rice 1976 Byte .] .[ Rabiner Jackson Schafer Coker 1971 .] .[ Liljencrants 1968 .] .[ Holmes 1973 Influence of glottal waveform on naturalness .] .[ Morris and Paillet 1972 .] All but one have certainly been used as the basis for synthesis, and the table includes reference to published descriptions. .pp Each method has advantages and disadvantages. Series decomposition obviates the need for control over the amplitudes of individual formants, but does not allow synthesis of sounds which use the nasal tract as well as the oral one; for these are in parallel. Analogue implementation of series synthesizers is complicated by the need for higher-pole correction, and the fact that the gains at different frequencies can vary widely throughout the system. Higher-pole correction is not so important for digital synthesizers. Parallel decomposition eliminates some of these problems: higher-pole correction can be implemented individually for each formant. However, the formant amplitudes must be controlled rather precisely to simulate the vocal tract, which is essentially serial. Time-domain synthesis is associated with low hardware costs but does not easily allow proper control over the excitation sources. In particular, it cannot simulate dynamical movement of the spectrum during aspiration. Implementation of the entire vocal tract model as a single high-order filter, without breaking it down into individual formants in series or parallel, is attractive from the computational point of view because fewer arithmetic operations are required. It is best analysed in terms of linear predictive coding, which is the subject of the next chapter. .sh "5.6 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "5.7 Further reading" .pp Historically-minded readers should look at the early speech synthesizer designed by Lawrence (1953). This and other classic papers on the subject are reprinted in Flanagan and Rabiner (1973). A good description of a quite sophisticated parallel synthesizer can be found in Holmes (1973), above, and another of a switchable series/parallel one in Klatt (1980), who even includes a listing of the Fortran program that implements it. Here are some useful books on speech synthesizers. .LB "nn" .\"Fant-1960-1 .]- .ds [A Fant, G. .ds [D 1960 .ds [T Acoustic theory of speech production .ds [I Mouton .ds [C The Hague .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Fant really started the study of the vocal tract as an acoustic system, and this book marks the beginning of modern speech synthesis. .in-2n .\"Flanagan-1972-1 .]- .ds [A Flanagan, J.L.
.ds [D 1972 .ds [T Speech analysis, synthesis, and perception (2nd, expanded, edition) .ds [I Springer Verlag .ds [C Berlin .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This book is the speech researcher's bible, and like the bible, it's not all that easy to read. However, it is an essential reference source for speech acoustics and speech synthesis (as well as for human speech perception). .in-2n .\"Flanagan-1973-2 .]- .ds [A Flanagan, J.L. .as [A " and Rabiner, L.R. (Editors) .ds [D 1973 .ds [T Speech synthesis .ds [I Dowden, Hutchinson and Ross .ds [C Stroudsburg, Pennsylvania .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n I recommended this book at the end of Chapter 1 as a collection of classic papers on the subject of speech synthesis and synthesizers. .in-2n .\"Holmes-1972-3 .]- .ds [A Holmes, J.N. .ds [D 1972 .ds [T Speech synthesis .ds [I Mills and Boon .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This little book, by one of Britain's foremost workers in the field, introduces the subject of speech synthesis and speech synthesizers. It has a particularly good discussion of parallel synthesizers. .in-2n .LE "nn" .EQ delim $$ .EN .CH "6 LINEAR PREDICTION OF SPEECH" .ds RT "Linear prediction of speech .ds CX "Principles of computer speech .pp The speech coding techniques which were discussed in Chapter 3 operate in the time domain, while the analysis and synthesis techniques of Chapters 4 and 5 are based in the frequency domain. Linear prediction is a relatively new method of speech analysis-synthesis, introduced in the early 1970's and used extensively since then, which is primarily a time-domain coding method but can be used to give frequency-domain parameters like formant frequency, bandwidth, and amplitude. .pp It has several advantages over other speech analysis techniques, and is likely to become increasingly dominant in speech output systems. As well as bridging the gap between time- and frequency-domain techniques, it is of equal value for both speech storage and speech synthesis, and forms an extremely convenient basis for speech-output systems which use high-quality stored speech for routine messages and synthesis from phonetics or text for unusual or exceptional conditions. Linear prediction can be used to separate the excitation source properties of pitch and amplitude from the vocal tract filter which governs phoneme articulation, or, in other words, to separate much of the prosodic from the segmental information. Hence it makes it easy to use stored segmentals with synthetic prosody, which is just what is needed to enhance the flexibility of stored speech by providing overall intonation contours for utterances formed by word concatenation (see Chapter 7). .pp The frequency-domain analysis technique of Fourier transformation necessarily involves approximation because it applies only to periodic waveforms, and so the artificial operation of windowing is required to suppress the aperiodicity of real speech. In contrast, the linear predictive technique, being a time-domain method, can \(em in certain forms \(em deal more rationally with aperiodic signals. .pp The basic idea of linear predictive coding is exactly the same as one form of adaptive differential pulse code modulation which was introduced briefly in Chapter 3. There it was noted that a speech sample $x(n)$ can be predicted quite closely by the previous sample $x(n-1)$. The prediction can be improved by multiplying the previous sample by a number, say $a sub 1$, which is adapted on a syllabic time-scale.
This can be utilized for speech coding by transmitting only the prediction error .LB .EQ e(n)~=~~x(n)~-~a sub 1 x(n-1), .EN .LE and using it (and the value of $a sub 1$) to reconstitute the signal $x(n)$ at the receiver. It is worthwhile noting that exactly the same relationship was used for digital preemphasis in Chapter 4, with the value of $a sub 1$ being constant at about 0.9 \(em although the possibility of adapting it to take into account the difference between voiced and unvoiced speech was discussed. .pp An obvious extension is to use several past values of the signal to form the prediction, instead of just one. Different multipliers for each would be needed, so that the prediction error could be written as .LB .EQ e(n)~~ mark =~~x(n)~-~a sub 1 x(n-1)~-~a sub 2 x(n-2)~-~...~-~a sub p x(n-p) .EN .sp .EQ lineup =~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k). .EN .LE The multipliers $a sub k$ should be adapted to minimize the error signal, and we will consider how to do this in the next section. It turns out that they must be re-calculated and transmitted on a time-scale that is rather faster than syllabic but much slower than the basic sampling rate: intervals of 10\-25\ msec are usually used (compare this with the 125\ $mu$sec sampling rate for telephone-quality speech). A configuration for high-order adaptive differential pulse code modulation is shown in Figure 6.1. .FC "Figure 6.1" .pp Figure 6.2 shows typical time waveforms for each of the ten coefficients over a 1-second stretch of speech. .FC "Figure 6.2" Notice that they vary much more slowly than, say, the speech waveform of Figure 3.5. .pp Turning the above relationship into $z$-transforms gives .LB .EQ E(z)~~=~~X(z)~-~~sum from k=1 to p ~a sub k z sup -k ~X(z)~~=~~(1~-~~ sum from k=1 to p ~a sub k z sup -k )~X(z). .EN .LE Rewriting the speech signal in terms of the error, .LB .EQ X(z)~~=~~1 over {1~-~~ sum ~a sub k z sup -k }~.~E(z) . .EN .LE .pp Now let us bring together some facts from the previous chapter which will allow the time-domain technique of linear prediction to be interpreted in terms of the frequency-domain formant model of speech. Recall that speech can be viewed as an excitation source passing through a vocal tract filter, followed by another filter to model the effect of radiation from the lips. The overall spectral levels can be reassigned as in Figure 5.1 so that the excitation source has a 0\ dB/octave spectral profile, and hence is essentially impulsive. Considering the vocal tract filter as a series connection of digital formant filters, its transfer function is the product of terms like .LB .EQ 1 over {1~-~b sub 1 z sup -1 ~+~b sub 2 z sup -2}~ , .EN .LE where $b sub 1$ and $b sub 2$ control the position and bandwidth of the formant resonances. The \-6\ dB/octave spectral compensation can be modelled by the first-order digital filter .LB .EQ 1 over {1~-~bz sup -1}~ . .EN .LE The product of all these terms, when multiplied out, will have the form .LB .EQ 1 over {1~-~c sub 1 z sup -1 ~-~c sub 2 z sup -2 ~-~...~-~ c sub q z sup -q }~ , .EN .LE where $q$ is twice the number of formants plus one, and the $c$'s are calculated from the positions and bandwidths of the formant resonances and the spectral compensation parameter. Hence the $z$-transform of the speech is .LB .EQ X(z)~=~~1 over {1~-~~ sum from k=1 to q ~c sub k z sup -k }~.~I(z) , .EN .LE where $I(z)$ is the transform of the impulsive excitation. .pp This is remarkably similar to the linear prediction relation given earlier! 
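.pp
The multiplication of these factors is easy to carry out explicitly.
The sketch below is illustrative only (the names and coefficient values are
ours; the second-order coefficients correspond roughly to formants near 500,
1500 and 2500\ Hz at an 8\ kHz sampling rate); it convolves three
second-order formant factors with one first-order compensation factor to
give the combined polynomial.
.LB
.nf
program multiplyfactors;
{ sketch: multiplies (1 - b z^-1) by three factors of the form
  (1 - b1 z^-1 + b2 z^-2), giving the combined denominator
  1 - c1 z^-1 - ... - cq z^-q with q = 7; the coefficients c(k)
  of the text are the negatives of d[k] below, for k >= 1 }
const
   q = 7;
var
   d, t: array [0..q] of real;        { running product and a copy }
   b1, b2: array [1..3] of real;
   b: real;
   i, k, m: integer;
begin
   b1[1] := 1.79;  b2[1] := 0.94;     { illustrative formant factors }
   b1[2] := 0.74;  b2[2] := 0.94;
   b1[3] := -0.74; b2[3] := 0.94;
   b := 0.9;                          { -6 dB/octave compensation }
   for k := 0 to q do d[k] := 0;
   d[0] := 1;  d[1] := -b;            { start with the first-order factor }
   m := 1;                            { current degree of the product }
   for i := 1 to 3 do begin
      for k := 0 to q do t[k] := d[k];
      for k := 0 to m + 2 do begin
         d[k] := t[k];
         if k >= 1 then d[k] := d[k] - b1[i]*t[k-1];
         if k >= 2 then d[k] := d[k] + b2[i]*t[k-2]
      end;
      m := m + 2
   end;
   for k := 0 to q do writeln('d[', k:1, '] = ', d[k]:10:5)
end.
.fi
.LE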
If $p$ and $q$ are the same, then the linear predictive coefficients $a sub k$ form a $p$'th order polynomial which is the same as that obtained by multiplying together the second-order polynomials representing the individual formants (together with the first-order one for spectral compensation). Furthermore, the predictive error $E(z)$ can be identified with the impulsive excitation $I(z)$. This raises the very interesting possibility of parametrizing the error signal by its frequency and amplitude \(em two relatively slowly-varying quantities \(em instead of transmitting it sample-by-sample (at an 8\ kHz rate). This is how linear prediction separates out the excitation properties of the source from the vocal tract filter: the source parameters can be derived from the error signal and the vocal tract filter is represented by the linear predictive coefficients. Figure 6.3 shows how this can be used for speech transmission. .FC "Figure 6.3" Note that .ul no signals need now be transmitted at the speech sampling rate; for the source parameters vary relatively slowly. This leads to an extremely low data rate. .pp Practical linear predictive coding schemes operate with a value of $p$ between 10 and 15, corresponding approximately to 4-formant and 7-formant synthesis respectively. The $a sub k$'s are re-calculated every 10 to 25\ msec, and transmitted to the receiver. Also, the pitch and amplitude of the speech are estimated and transmitted at the same rate. If the speech is unvoiced, there is no pitch value: an "unvoiced flag" is transmitted instead. Because the linear predictive coefficients are intimately related to formant frequencies and bandwidths, a "frame rate" in the region of 10 to 25\ msec is appropriate because this approximates the maximum rate at which acoustic events happen in speech production. .pp At the receiver, the excitation waveform is reconstituted. For voiced speech, it is impulsive at the specified frequency and with the specified amplitude, while for unvoiced speech it is random, with the specified amplitude. This signal $e(n)$, together with the transmitted parameters $a sub 1$, ..., $a sub p$, is used to regenerate the speech waveform by .LB .EQ x(n)~=~~e(n)~+~~sum from k=1 to p ~a sub k x(n-k) , .EN .LE \(em which is the inverse of the transmitter's formula for calculating $e(n)$, namely .LB .EQ e(n)~=~~x(n)~-~~sum from k=1 to p ~a sub k x(n-k) . .EN .LE This relies on knowing the past $p$ values of the speech samples. Many systems set these past values to zero at the beginning of each pitch cycle. .pp Linear prediction can also be used for speech analysis, rather than for speech coding, as shown in Figure 6.4. .FC "Figure 6.4" Instead of transmitting the coefficients $a sub k$, they are used to determine the formant positions and bandwidths. We saw above that the polynomial .LB .EQ 1~-~a sub 1 z sup -1 ~-~a sub 2 z sup -2 ~-~...~-~a sub p z sup -p , .EN .LE when factored into a product of second-order terms, gives the formant characteristics (as well as the spectral compensation term). Factoring is equivalent to finding the complex roots of the polynomial, and this is fairly demanding computationally \(em especially if done at a high rate. Consequently, peak-picking algorithms are sometimes used instead. The absolute value of the polynomial gives the frequency spectrum of the vocal tract filter, and the formants appear as peaks \(em just as they do in cepstrally smoothed speech (see Chapter 4). 
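.pp
Returning to the reconstruction formula above, the receiver's synthesis loop
is simple enough to sketch directly.
In the sketch below the coefficients, pitch and frame length are placeholders
only (a single non-zero coefficient is used so that the example stays short);
a real system would of course load them from the transmitted frames.
.LB
.nf
program lpcsynthesis;
{ sketch: regenerates one voiced frame of speech from an impulsive
  excitation and a set of prediction coefficients, using
  x(n) = e(n) + sum of a(k)*x(n-k) }
const
   p = 10;
   framelen = 80;                  { 10 msec at 8 kHz }
var
   a: array [1..p] of real;
   x: array [-10..79] of real;
   k, n, pitch: integer;
   e, amp: real;
begin
   for k := 1 to p do a[k] := 0;       { placeholder coefficients }
   a[1] := 0.9;                        { a one-pole "tract", for illustration }
   for n := -p to -1 do x[n] := 0;     { past samples, here simply zeroed }
   pitch := 64;  amp := 1.0;           { 125 Hz pitch at 8 kHz sampling }
   for n := 0 to framelen - 1 do begin
      if n mod pitch = 0 then e := amp else e := 0;   { impulsive excitation }
      x[n] := e;
      for k := 1 to p do x[n] := x[n] + a[k]*x[n-k];
      writeln(n:4, x[n]:12:6)
   end
end.
.fi
.LE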
.pp The chief deficiency in the linear predictive method, whether it is used for speech coding or for speech analysis, is that \(em like a series synthesizer \(em it implements an all-pole model of the vocal tract. We mentioned in Chapter 5 that this is rather simplistic, especially for nasalized sounds which involve a cavity in parallel with the oral one. Some research has been done on incorporating zeros into a linear predictive model, but it complicates the problem of calculating the parameters enormously. For most purposes people seem to be able to live with the limitations of the all-pole model. .sh "6.1 Linear predictive analysis" .pp The key problem in linear predictive coding is to determine the values of the coefficients $a sub 1$, ..., $a sub p$. If the error signal is to be transmitted on a sample-by-sample basis, as it is in adaptive differential pulse code modulation, then it can be most economically encoded if its mean power is as small as possible. Thus the coefficients are chosen to minimize .LB .EQ sum ~e(n) sup 2 .EN .LE over some period of time. The period of time used is related to the frame rate at which the coefficients are transmitted or stored, although there is no need to make it exactly the same as one frame interval. As mentioned above, the frame size is usually chosen to be in the region of 10 to 25\ msec. Some schemes minimize the error signal over as few as 30 samples (corresponding to 3\ msec at a 10\ kHz sampling rate). Others take longer; up to 250 samples (25\ msec). .pp However, if the error signal is to be considered as impulsive and parametrized by its frequency and amplitude before transmission, or if the coefficients $a sub k$ are to be used for spectral calculations, then it is not immediately obvious how the coefficients should be calculated. In fact, it is still best to choose them to minimize the above sum. This is at least plausible, for an impulsive excitation will have a rather small mean power \(em most of the samples are zero. It can be justified theoretically in terms of .ul spectral whitening, for it can be shown that minimizing the mean-squared error produces an error signal whose spectrum is maximally flat. Now the only two waveforms whose spectra are absolutely flat are a single impulse and white noise. Hence if the speech is voiced, minimizing the mean-squared error will lead to an error signal which is as nearly impulsive as possible. Provided the time-frame for minimizing is short enough, the impulse will correspond to a single excitation pulse. If the speech is unvoiced, minimization will lead to an error signal which is as nearly white noise as possible. .pp How does one choose the linear predictive coefficients to minimize the mean-squared error? The total squared prediction error is .LB .EQ M~=~~sum from n ~e(n) sup 2~~=~~sum from n ~[x(n)~-~ sum from k=1 to p ~a sub k x sub n-k ] sup 2 , .EN .LE leaving the range of summation unspecified for the moment. To minimize $M$ by choice of the coefficients $a sub j$, differentiate with respect to each of them and set the resulting derivatives to zero. .LB .EQ dM over {da sub j} ~~=~~-2 sum from n ~x(n-j)[x(n)~-~~ sum from k=1 to p ~a sub k x(n-k)]~~=~0~, .EN .LE so .LB .EQ sum from k=1 to p ~a sub k ~ sum from n ~x(n-j)x(n-k)~~=~~ sum from n ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p. .EN .LE .pp This is a set of $p$ linear equations for the $p$ unknowns $a sub 1$, ..., $a sub p$. Solving it is equivalent to inverting a $p times p$ matrix. 
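.pp
For $p=1$ the set collapses to a single equation,
$a sub 1 ~ sum ~x(n-1)x(n-1) ~=~ sum ~x(n)x(n-1)$,
so the coefficient is just the ratio of two sums.
The sketch below computes it for a short synthetic signal (the signal and all
names are illustrative only) and prints the resulting error power alongside
the signal power.
.LB
.nf
program firstorder;
{ sketch: the p = 1 case of the normal equations,
  a1 = (sum of x(n)x(n-1)) / (sum of x(n-1)x(n-1)) }
const
   N = 64;
var
   x: array [0..64] of real;
   num, den, a1, e, errpower: real;
   n: integer;
begin
   for n := 0 to N do                        { synthetic "speech" }
      x[n] := exp(-0.05*n)*cos(0.3*n);
   num := 0;  den := 0;
   for n := 1 to N do begin
      num := num + x[n]*x[n-1];
      den := den + x[n-1]*x[n-1]
   end;
   a1 := num/den;
   errpower := 0;
   for n := 1 to N do begin
      e := x[n] - a1*x[n-1];
      errpower := errpower + e*e
   end;
   writeln('a1 = ', a1:8:4);
   writeln('error power ', errpower:10:6, '   signal power ', den:10:6)
end.
.fi
.LE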
This job must be repeated at the frame rate, and so if real-time operation is desired quite a lot of calculation is needed. .rh "The autocorrelation method." So far, the range of the $n$-summation has been left open. The coefficients of the matrix equation have the form .LB .EQ sum from n ~x(n-j)x(n-k). .EN .LE If a doubly-infinite summation were made, with $x(n)$ being defined as zero whenever $n<0$, we could make use of the fact that .sp .ce .EQ sum from {n=- infinity} to infinity ~x(n-j)x(n-k)~=~~ sum from {n=- infinity} to infinity ~x(n-j+1)x(n-k+1)~=~...~=~~ sum from {n=- infinity} to infinity ~x(n)x(n+j-k) .EN .sp to simplify the matrix equation. This just states that the autocorrelation of an infinite sequence depends only on the lag at which it is computed, and not on absolute time. .pp Defining $R(m)$ as the autocorrelation at lag $m$, that is, .LB .EQ R(m)~=~ sum from n ~x(n)x(n+m), .EN .LE the matrix equation becomes .LB .ne7 .nf .EQ R(0)a sub 1 ~+~R(1)a sub 2 ~+~R(2)a sub 3 ~+~...~~=~R(1) .EN .EQ R(1)a sub 1 ~+~R(0)a sub 2 ~+~R(1)a sub 3 ~+~...~~=~R(2) .EN .EQ R(2)a sub 1 ~+~R(1)a sub 2 ~+~R(0)a sub 3 ~+~...~~=~R(3) .EN .EQ etc .EN .fi .LE An elegant method due to Durbin and Levinson exists for solving this special system of equations. It requires much less computational effort than is generally needed for symmetric matrix equations. .pp Of course, an infinite range of summation can not be used in practice. For one thing, the power spectrum is changing, and only the data from a short time-frame should be used for a realistic estimate of the optimum linear predictive coefficients. Hence a windowing procedure, .LB .EQ x(n) sup * ~=~w sub n x(n), .EN .LE is used to reduce the signal to zero outside a finite range of interest. Windows were discussed in Chapter 4 from the point of view of Fourier analysis of speech signals, and the same sort of considerations apply to choosing a window for linear prediction. .pp This is known as the .ul autocorrelation method of computing prediction parameters. Typically a window of 100 to 250 samples is used for analysis of one frame of speech. .rh "Algorithm for the autocorrelation method." The algorithm for obtaining linear prediction coefficients by the autocorrelation method is quite simple. It is straightforward to compute the matrix coefficients $R(m)$ from the speech samples and window coefficients. The Durbin-Levinson method of solving matrix equations operates directly on this $R$-vector to produce the coefficient vector $a sub k$. The complete procedure is given as Procedure 6.1, and is shown diagrammatically in Figure 6.5. 
.FC "Figure 6.5" .RF .fi .na .nh .ul const N=256; p=15; .ul type svec = .ul array [0..N\-1] .ul of real; cvec = .ul array [1..p] .ul of real; .sp .ul procedure autocorrelation(signal: vec; window: svec; .ul var coeff: cvec); .sp {computes linear prediction coefficients by autocorrelation method in coeff[1..p]} .sp .ul var R, temp: .ul array [0..p] .ul of real; n: [0..N\-1]; i,j: [0..p]; E: real; .sp .ul begin {window the signal} .in+6n .ul for n:=0 .ul to N\-1 .ul do signal[n] := signal[n]*window[n]; .sp {compute autocorrelation vector} .br .ul for i:=0 .ul to p .ul do begin .in+2n R[i] := 0; .br .ul for n:=0 .ul to N\-1\-i .ul do R[i] := R[i] + signal[n]*signal[n+i] .in-2n .ul end; .sp {solve the matrix equation by the Durbin-Levinson method} .br E := R[0]; .br coeff[1] := R[1]/E; .br .ul for i:=2 .ul to p .ul do begin .in+2n E := (1\-coeff[i\-1]*coeff[i\-1])*E; .br coeff[i] := R[i]; .br .ul for j:=1 .ul to i\-1 .ul do coeff[i] := coeff[i] \- R[i\-j]*coeff[j]; .br coeff[i] := coeff[i]/E; .br .ul for j:=1 .ul to i\-1 .ul do temp[j] := coeff[j] \- coeff[i]*coeff[i\-j]; .br .ul for j:=1 .ul to i\-1 .ul do coeff[j] := temp[j] .in-2n .ul end .in-6n .ul end. .nf .FG "Procedure 6.1 Pascal algorithm for the autocorrelation method" .pp This algorithm is not quite as efficient as it might be, for some multiplications are repeated during the calculation of the autocorrelation vector. Blankinship (1974) shows how the number of multiplications can be reduced by about half. .[ Blankinship 1974 .] .pp If the algorithm is performed in fixed-point arithmetic (as it often is in practice because of speed considerations), some scaling must be done. The maximum and minimum values of the windowed signal can be determined within the window calculation loop, and one extra pass over the vector will suffice to scale it to maximum significance. (Incidentally, if all sample values are the same the procedure cannot produce a solution because $E$ becomes zero, and this can easily be checked when scaling.) .pp The absolute value of the $R$-vector has no significance, and since $R(0)$ is always the greatest element, this can be set to the largest fixed-point number and the other $R$'s scaled down appropriately after they have been calculated. These scaling operations are shown as dashed boxes in Figure 6.5. $E$ decreases monotonically as the computation proceeds, so it is safe to initialize it to $R(0)$ without extra scaling. The remainder of the scaling is straightforward, with the linear prediction coefficients $a sub k$ appearing as fractions. .rh "The covariance method." One of the advantages of linear predictive methods that was promised earlier was that it allows us to escape from the problem of windowing. To do this, we must abandon the requirement that the coefficients of the matrix equation have the symmetry property of autocorrelations. Instead, suppose that the range of $n$-summation uses a fixed number of elements, say N, starting at $n=h$, to estimate the prediction coefficients between sample number $h$ and sample number $h+N$. .pp This leads to the matrix equation .LB .EQ sum from k=1 to p ~a sub k sum from n=h to h+N-1 ~x(n-j)x(n-k) ~~=~~ sum from n=h to h+N-1 ~x(n)x(n-j)~~~~j~=~1,~2,~...,~p. .EN .LE Alternatively, we could write .LB .EQ sum from k=1 to p ~a sub k ~ Q sub jk sup h~~=~~Q sub 0j sup h ~~~~j~=~1,~2,~...,~p; .EN .LE where .LB .EQ Q sub jk sup h~~=~~sum from n=h to h+N-1 ~x(n-j)x(n-k). 
.EN .LE Note that some values of $x(n)$ outside the range $h ~ <= ~ n ~ < ~ h+N$ are required: these are shown diagrammatically in Figure 6.6. .FC "Figure 6.6" .pp Now $Q sub jk sup h ~=~ Q sub kj sup h$, so the equation has a diagonally symmetric matrix; and in fact the matrix $Q sup h$ can be shown to be positive semidefinite \(em and is almost always positive definite in practice. Advantage can be taken of these facts to provide a computationally efficient method for solving the equation. According to a result called Cholesky's theorem, a positive definite symmetric matrix $Q$ can be factored into the form $Q ~ = ~ LL sup T$, where $L$ is a lower triangular matrix. This leads to an efficient solution algorithm. .pp This method of computing prediction coefficients has become known as the .ul covariance method. It does not use windowing of the speech signal, and can give accurate estimates of the prediction coefficients with a smaller analysis frame than the autocorrelation method. Typically, 50 to 100 speech samples might be used to estimate the coefficients, and they are re-calculated every 100 to 250 samples. .rh "Algorithm for the covariance method." An algorithm for the covariance method is given in Procedure 6.2, .RF .fi .na .nh .ul const N=100; p=15; .ul type svec = .ul array [\-p..N\-1] .ul of real; cvec = .ul array [1..p] .ul of real; .sp .ul procedure covariance(signal: svec; .ul var coeff: cvec); .sp {computes linear prediction coefficients by covariance method in coeff[1..p]} .sp .ul var Q: .ul array [0..p,0..p] .ul of real; n: 0..N\-1; i,j,r: 0..p; X: real; .sp .ul begin {calculate upper-triangular covariance matrix in Q} .in+6n .ul for i:=0 .ul to p .ul do .in+2n .ul for j:=i .ul to p .ul do begin .in+2n Q[i,j]:=0; .br .ul for n:=0 .ul to N\-1 .ul do .in+2n Q[i,j] := Q[i,j] + signal[n\-i]*signal[n\-j] .in-2n .in-2n .ul end; .in-2n .sp {calculate the square root of Q} .br .ul for r:=2 .ul to p .ul do .in+2n .ul begin .in+2n .ul for i:=2 .ul to r\-1 .ul do .in+2n .ul for j:=1 .ul to i\-1 .ul do .in+2n Q[i,r] := Q[i,r] \- Q[j,i]*Q[j,r]; .in-2n .ul for j:=1 .ul to r\-1 .ul do .in+2n .ul begin .in+2n X := Q[j,r]; .br Q[j,r] := Q[j,r]/Q[j,j]; .br Q[r,r] := Q[r,r] \- Q[j,r]*X .in-2n .ul end .in-2n .in-2n .in-2n .ul end; .in-2n .sp {calculate coeff[1..p]} .br .ul for r:=2 .ul to p .ul do .in+2n .ul for i:=1 .ul to r\-1 .ul do Q[0,r] := Q[0,r] \- Q[i,r]*Q[0,i]; .in-2n .ul for r:=1 .ul to p .ul do Q[0,r] := Q[0,r]/Q[r,r]; .br .ul for r:=p\-1 .ul downto 1 .ul do .in+2n .ul for i:=r+1 .ul to p .ul do Q[0,r] := Q[0,r] \- Q[r,i]*Q[0,i]; .in-2n .ul for r:=1 .ul to p .ul do coeff[r] := Q[0,r] .in-6n .ul end. .nf .FG "Procedure 6.2 Pascal algorithm for the covariance method" and is shown diagrammatically in Figure 6.7. .FC "Figure 6.7" The algorithm shown is not terribly efficient from a computation and storage point of view, although it is workable. For one thing, it uses the obvious method for computing the covariance matrix by calculating .EQ Q sub 01 sup h , .EN .EQ Q sub 02 sup h , ~ ..., .EN .EQ Q sub 0p sup h , .EN .EQ Q sub 11 sup h , ..., .EN in turn, which repeats most of the multiplications $p$ times \(em not an efficient procedure. A simple alternative is to precompute the necessary multiplications and store them in a $(N+h) times (p+1)$ diagonally symmetric table, but even apart from the extra storage required for this, the number of additions which must be performed subsequently to give the $Q$'s is far larger than necessary.
It is possible, however, to write a procedure which is both time- and space-efficient (Witten, 1980). .[ Witten 1980 Algorithms for linear prediction .] .pp The scaling problem is rather more tricky for the covariance method than for the autocorrelation method. The $x$-vector should be scaled initially in the same way as before, but now there are $p+1$ diagonal elements of the covariance matrix, any of which could be the greatest element. Of course, .LB .EQ Q sub jk ~~ <= ~~ Max ( Q sub 11 , Q sub 22 , ..., Q sub pp ), .EN .LE but despite the considerable communality in the summands of the diagonal elements, there are no .ul a priori bounds on the ratios between them. .pp The only way to scale the $Q$ matrix properly is to calculate each of its $p$ diagonal elements and use the greatest as a scaling factor. Alternatively, the fact that .LB .EQ Q sub jk ~~ <= ~~ N times Max( x sub n sup 2 ) .EN .LE can be used to give a bound for scaling purposes; however, this is usually a rather conservative bound, and as $N$ is often around 100, several bits of significance will be lost. .pp Scaling difficulties do not cease when $Q$ has been determined. It is possible to show that the elements of the lower-triangular matrix $L$ which represents the square root of $Q$ are actually .ul unbounded. In fact there is a slightly different variant of the Cholesky decomposition algorithm which guarantees bounded coefficients but suffers from the disadvantage that it requires square roots to be taken (Martin .ul et al, 1965). .[ Martin Peters Wilkinson 1965 .] However, experience with the method indicates that it is rare for the elements of $L$ to exceed 16 times the maximum element of $Q$, and the possibility of occasional failure to adjust the coefficients may be tolerable in a practical linear prediction system. .rh "Comparison of autocorrelation and covariance analysis." There are various factors which should be taken into account when deciding whether to use the autocorrelation or covariance method for linear predictive analysis. Furthermore, there is a rather different technique, called the "lattice method", which will be discussed shortly. The autocorrelation method involves windowing, which means that in practice a rather longer stretch of speech should be used for analysis. We have illustrated this by setting $N$=256 in the autocorrelation algorithm and 100 in the covariance one. Offsetting the extra calculation that this entails is the fact that the Durbin-Levinson method of inverting a matrix is much more efficient than Cholesky decomposition. In practice, this means that similar amounts of computation are needed for each method \(em a detailed comparison is made in Witten (1980). .[ Witten 1980 Algorithms for linear prediction .] .pp A factor which weighs against the covariance method is the difficulty of scaling intermediate quantities within the algorithm. The autocorrelation method can be implemented quite satisfactorily in fixed-point arithmetic, and this makes it more suitable for hardware implementation. Furthermore, serious instabilities sometimes arise with the covariance method, whereas it can be shown that the autocorrelation one is always stable. Nevertheless, the approximations inherent in the windowing operation, and the smearing effect of taking a larger number of sample points, mean that covariance-method coefficients tend to represent the speech more accurately, if they can be obtained. 
.pp One way of using the covariance method which has proved to be rather satisfactory in practice is to synchronize the analysis frame with the beginning of a pitch period, when the excitation is strongest. Pitch synchronous techniques were discussed in Chapter 4 in the context of discrete Fourier transformation of speech. The snag, of course, is that pitch peaks do not occur uniformly in time, and furthermore it is difficult to estimate their locations precisely. .sh "6.2 Linear predictive synthesis" .pp If the linear predictive coefficients and the error signal are available, it is easy to regenerate the original speech by .LB .EQ x(n)~=~~e(n)~+~~ sum from k=1 to p ~a sub k x(n-k) . .EN .LE If the error signal is parametrized into the sound source type (voiced or unvoiced), amplitude, and pitch (if voiced), it can be regenerated by an impulse repeated at the appropriate pitch frequency (if voiced), or white noise (if unvoiced). .pp However, it may be that the filter represented by the coefficients $a sub k$ is unstable, causing the output speech signal to oscillate wildly. In fact, it is only possible for the covariance method to produce an unstable filter, and not the autocorrelation method \(em although even with the latter, truncation of the $a sub k$'s for transmission may turn a stable filter into an unstable one. Furthermore, the coefficients $a sub k$ are not suitable candidates for quantization, because small changes in them can have a dramatic effect on the characteristics of the synthesis filter. .pp Both of these problems can be solved by using a different set of numbers, called .ul reflection coefficients, for quantization and transmission. Thus, for example, in Figures 6.1 and 6.3 these reflection coefficients could be derived at the transmitter, quantized, and used by the receiver to reproduce the speech waveform. They can be related to reflection and transmission parameters at the junctions of an acoustic tube model of the vocal tract; hence the name. Procedure 6.3 shows an algorithm for calculating the reflection coefficients from the filter coefficients $a sub k$. .RF .fi .na .nh .ul const p=15; .ul type cvec = .ul array [1..p] .ul of real; .sp .ul procedure reflection(coeff: cvec; .ul var refl: cvec); .sp {computes reflection coefficients in refl[1..p] corresponding to linear prediction coefficients in coeff[1..p]} .sp .ul var temp: cvec; i, m: 1..p; .sp .ul begin .in+6n .ul for m:=p .ul downto 1 .ul do begin .in+2n refl[m] := coeff[m]; .br .ul for i:=1 .ul to m\-1 .ul do temp[i] := coeff[i]; .br .ul for i:=1 .ul to m\-1 .ul do .ti+2n coeff[i] := .ti+4n (coeff[i] + refl[m]*temp[m\-i]) / (1 \- refl[m]*refl[m]); .in-2n .ul end .in-6n .ul end. .nf .MT 2 Procedure 6.3 Pascal algorithm for producing reflection coefficients from filter coefficients .TE .pp Although we will not go into the theoretical details here, reflection coefficients are bounded by $+-$1 for stable filters, and hence form a useful test for stability. Having a limited range makes them easy to quantize for transmission, and in fact they behave better under quantization than do the filter coefficients. One could resynthesize speech from reflection coefficients by first converting them to filter coefficients and using the synthesis method described above. However, it is natural to seek a single-stage procedure which can regenerate speech directly from reflection coefficients. .pp Such a procedure does exist, and is called a .ul lattice filter. Figure 6.8 shows one form of lattice for speech synthesis. 
.FC "Figure 6.8" The error signal (whether transmitted or synthesized) enters at the upper left-hand corner, passes along the top forward signal path, being modified on the way, to give the output signal at the right-hand side. Then it passes back through a chain of delays along the bottom, backward, path, and is used to modify subsequent forward signals. Finally it is discarded at the lower left-hand corner. .pp There are $p$ stages in the lattice structure of Figure 6.8, where $p$ is the order of the linear predictive filter. Each stage involves two multiplications by the appropriate reflection coefficients, one by the backward signal \(em the result of which is added into the forward path \(em and the other by the forward signal \(em the result of which is subtracted from the backward path. Thus the number of multiplications is twice the order of the filter, and hence twice as many as for the realization using coefficients $a sub k$. If the labour necessary to turn the reflection coefficients into $a sub k$'s is included, the computational load becomes the same. Moreover, since the reflection coefficients need fewer quantization bits than the $a sub k$'s (for a given speech quality), the word lengths are smaller in the lattice realization. .pp The advantages of the lattice method of synthesis over direct evaluation of the prediction using filter coefficients $a sub k$, then, are: .LB .NP the reflection coefficients are used directly .NP the stability of the filter is obvious from the reflection coefficient values .NP the system is more tolerant to quantization errors in fixed-point implementations. .LE Although it may seem unlikely that an unstable filter would be produced by linear predictive analysis, instability is in fact a real problem in non-lattice implementations. For example, coefficients are often interpolated at the receiver, to allow longer frame times and smooth over sudden transitions, and it is quite likely that an unstable configuration is obtained when interpolating filter coefficients between two stable configurations. This cannot happen with reflection coefficients, however, because a necessary and sufficient condition for stability is that all coefficients lie in the interval $(-1,+1)$. .sh "6.3 Lattice filtering" .pp Lattice filters are an important new method of linear predictive .ul analysis as well as synthesis, and so it is worth considering the theory behind them a little further. .rh "Theory of the lattice synthesis filter." Figure 6.9 shows a single stage of the synthesis lattice given earlier. .FC "Figure 6.9" There are two signals at each side of the lattice, and the $z$-transforms of these have been labelled $X sup +$ and $X sup -$ at the left-hand side and $Y sup +$ and $Y sup -$ at the right-hand side. The direction of signal flow is forwards along the upper ("positive") path and backwards along the lower ("negative") one. .pp The signal flows show that the following two relationships hold: .LB .EQ Y sup + ~=~~ X sup + ~+~ k z sup -1 Y sup - ~~~~~~ .EN for the forward (upper) path .br .EQ X sup - ~ =~ -kY sup + ~+~ z sup -1 Y sup - ~~~~~~~ .EN \h'-\w'\-'u'for the backward (lower) path. 
.LE Re-arranging the first equation yields .LB .EQ X sup + ~ =~~ Y sup + ~-~ k z sup -1 Y sup - , .EN .LE and so we can describe the function of the lattice by a single matrix equation: .LB .ne4 .EQ left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~ . .EN .LE It would be nice to be able to call this an input-output equation, but it is not; for the input signals to the lattice stage are $X sup +$ and $Y sup -$, and the outputs are $X sup -$ and $Y sup +$. We have written it in this form because it allows a multi-stage lattice to be described by cascading these matrix equations. .pp A single-stage lattice filter has $Y sup +$ and $Y sup -$ connected together, forming its output (call this $X sub output$), while the input is $X sup +$ ($X sub input$). Hence the input is related to the output by .LB .EQ left [ matrix {ccol {X sub input above \(sq }} right ] ~~ = ~~ left [ matrix {ccol {1 above -k} ccol {-k z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ , .EN .LE so .LB .EQ X sub input ~ = ~~ (1~-~ k z sup -1 )~X sub output , .EN .LE or .LB .EQ {X sub output} over {X sub input} ~~=~~ 1 over {1~-~ k sub 1 z sup -1} ~ . .EN .LE (The symbol \(sq is used here and elsewhere to indicate an unimportant element of a vector or matrix.) This certainly has the form of a linear predictive synthesis filter, which is .LB .EQ X(z) over E(z) ~~=~~ 1 over {1~-~~ sum from k=1 to p ~a sub k z sup -k}~~=~~ 1 over {1~-~a sub 1 z sup -1 } ~~~~~~ .EN when $p=1$. .LE .pp The behaviour of a second-order lattice filter, shown in Figure 6.10, can be described by .LB .ne4 .EQ left [ matrix {ccol {X sub 3 sup + above X sub 3 sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ] .EN .sp .ne4 .EQ left [ matrix {ccol {X sub 2 sup + above X sub 2 sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub 1 sup + above X sub 1 sup -}} right ] .EN .LE with .LB .ne3 .EQ X sub 3 sup + ~=~X sub input .EN .br .EQ X sub 1 sup + ~=~ X sub 1 sup - ~=~ X sub output . .EN .LE .FC "Figure 6.10" $X sub 2 sup +$ and $X sub 2 sup -$ can be eliminated by substituting the second equation into the first, which yields .LB .EQ left [ matrix {ccol {X sub input above \(sq }} right ] ~~ mark = ~~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] .EN .sp .sp .EQ lineup = ~~ left [ matrix {ccol {1+k sub 1 k sub 2 z sup -1 above \(sq } ccol { -k sub 1 z sup -1 -k sub 2 z sup -2 above \(sq }} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ . .EN .LE This leads to an input-output relationship .LB .EQ {X sub output} over {X sub input} ~~ = ~~ 1 over {1~+~k sub 1 (k sub 2 -1)z sup -1 ~-~k sub 2 z sup -2} ~ , .EN .LE which has the required form, namely .LB .EQ 1 over {1~-~~ sum from k=1 to p ~a sub k z sup -k } ~~~~~~ (p=2) .EN .LE when .LB .EQ a sub 1 ~=~-k sub 1 (k sub 2 -1) .EN .br .EQ a sub 2 ~=~k sub 2. 
.EN .LE .pp A third-order filter is described by .LB .EQ left [ matrix {ccol {X sub input above \(sq }} right ] ~~ = ~~ left [ matrix {ccol {1 above -k sub 3 } ccol {-k sub 3 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k sub 2 } ccol {-k sub 2 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k sub 1 } ccol {-k sub 1 z sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sub output above X sub output}} right ] ~ , .EN .LE and brave souls can verify that this gives an input-output relationship .LB .EQ {X sub output} over {X sub input} ~~ = ~~ 1 over {1~+~[k sub 2 k sub 3 ~+~ k sub 1 (k sub 2 -1)] z sup -1 ~+~ [k sub 1 k sub 3 (1-k sub 2 ) -k sub 2 ] z sup -2 ~-~ k sub 3 z sup -3 } ~ . .EN .LE It is fairly obvious that a $p$'th order lattice filter will give the required all-pole $p$'th order synthesis form, .LB .EQ 1 over { 1~-~~ sum from k=1 to p ~a sub k z sup -k } ~ . .EN .LE .pp We have not shown that the algorithm given in Procedure 6.3 for producing reflection coefficients from filter coefficients gives those values for $k sub i$ which are necessary to make the lattice filter equivalent to the ordinary synthesis filter. However, this is the case, and it is easy to verify by hand for the first, second, and third-order cases. .rh "Different lattice configurations." The lattice filters of Figures 6.8, 6.9, and 6.10 have two multipliers per section. This is called a "two-multiplier" configuration. However, there are other configurations which achieve the same effect, but require different numbers of multiplications. Figure 6.11 shows one-multiplier and four-multiplier configurations, along with the familiar two-multiplier one. .FC "Figure 6.11" It is easy to verify that the three configurations can be modelled in matrix terms by .LB .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ two-multiplier configuration .sp .sp .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~ left [ {1-k} over {1+k} right ] sup 1/2 ~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ one-multiplier configuration .sp .sp .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~ = ~~ 1 over {(1-k sup 2) sup 1/2} ~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ four-multiplier configuration. .LE Each of the three has the same frequency-domain response apart from a constant gain factor, which differs from one configuration to another. The effect of this can be annulled by performing a single multiply operation on the output of a complete lattice chain. The multiplier has the form .LB .EQ left [ {1 - k sub p} over {1 + k sub p} ~.~ {1 - k sub p-1} over {1 + k sub p-1} ~.~...~.~ {1 - k sub 1} over {1 + k sub 1} right ] sup 1/2 .EN .sp .LE for single-multiplier lattices, and .LB .EQ left [ 1 over {1 - k sub p sup 2} ~.~ 1 over {1 - k sub p-1 sup 2} ~.~...~.~ 1 over {1 - k sub 1 sup 2} right ] sup 1/2 .EN .LE for four-multiplier lattices, where the reflection coefficients in the lattice are $k sub p$, $k sub p-1$, ..., $k sub 1$. .pp There are important differences between these three configurations. If multiplication is time-consuming, the one-multiplier model has obvious computational advantages over the other two methods.
However, the four-multiplier structure behaves substantially better in finite word-length implementations. It is easy to show that, with this configuration, .LB .EQ (X sup - ) sup 2 ~+~ (Y sup + ) sup 2 ~~ = ~~ (X sup + ) sup 2 ~+~ (z sup -1 Y sup - ) sup 2 , .EN .LE \(em a relationship which suggests that the "energy" in the input signals, namely $X sup +$ and $Y sup -$, is preserved in the output signals, $X sup -$ and $Y sup +$. Notice that care must be taken with the $z$-transforms, since squaring is a non-linear operation. $(z sup -1 Y sup - ) sup 2$ means the square of the previous value of $Y sup -$, which is not the same as $z sup -2 (Y sup - ) sup 2$. .pp It has been shown (Gray and Markel, 1975) that the four-multiplier configuration has some stability properties which are not shared by other digital filter structures. .[ Gray Markel 1975 Normalized digital filter structure .] When a linear predictive filter is used for synthesis, the parameters of the filter \(em the $k$-parameters in the case of lattice filters, and the $a$-parameters in the case of direct ones \(em change with time. It is usually rather difficult to guarantee stability in the case of time-varying filter parameters, but some guarantees can be made for a chain of four-multiplier lattices. Furthermore, if the input is a discrete delta function, the cumulative energies at each stage of the lattice are the same, and so maximum dynamic range will be achieved for the whole filter if each section is implemented with the same word size. .rh "Lattice analysis." It is quite easy to construct a filter which is inverse to a single-stage lattice. The structure of Figure 6.12(a) does the job. (Ignore for a moment the dashed lines connecting Figure 6.12(a) and (b).) Its matrix transfer function is .FC "Figure 6.12" .LB .ne4 $ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~=~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {X sup + above X sup -}} right ] $ analysis lattice (Figure 6.12(a)). .LE Notice that this is exactly the same as the transfer function of the synthesis lattice of Figure 6.9, which is reproduced in Figure 6.12(b), except that the $X$'s and $Y$'s are reversed: .LB .ne4 $ left [ matrix {ccol {X sup + above X sup -}} right ] ~~=~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {Y sup + above Y sup -}} right ] $ synthesis lattice (Figure 6.12(b)), .LE or, in other words, .LB .ne4 $ left [ matrix {ccol {Y sup + above Y sup -}} right ] ~~ = ~~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] sup -1 ~ left [ matrix {ccol {X sup + above X sup -}} right ] $ synthesis lattice (Figure 6.12(b)). .LE Hence if the filters of Figures 6.12(a) and (b) were connected together as shown by the dashed lines, they would cancel each other out, and the overall transfer would be unity: .LB .ne4 .EQ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] ~ left [ matrix {ccol {1 above -k} ccol {-kz sup -1 above z sup -1}} right ] sup -1 ~~ = ~~ left [ matrix {ccol {1 above 0} ccol {0 above 1}} right ] ~ . .EN .LE Actually, such a connection is not possible in physical terms, for although the upper paths can be joined together the lower ones can not. The right-hand lower point of Figure 6.12(a) is an .ul output terminal, and so is the left-hand lower one of Figure 6.12(b)! However, there is no need to envisage a physical connection of the lower paths.
It is sufficient for cancellation just to assume that the signals at both of the points turn out to be the same. .pp And they do. The general case of a $p$-stage analysis lattice connected to a $p$-stage synthesis lattice is shown in Figure 6.13. .FC "Figure 6.13" Notice that the forward and backward paths are connected together at both of the extreme ends of the system. It is not difficult to show that under these conditions the signal at the lower right-hand terminal of the analysis chain will equal that at the lower left-hand terminal of the synthesis chain, even though they are not connected, provided the upper terminals are connected together as shown by the dashed line. Of course, the reflection coefficients $k sub 1$, $k sub 2$, ..., $k sub p$ in the analysis lattice must equal those in the synthesis lattice, and as Figure 6.13 shows the order is reversed in the synthesis lattice. Successive analysis and synthesis sections pair off, working from the middle outwards. At each stage the sections cancel each other out, giving a unit transfer function as demonstrated above. .rh "Estimating reflection coefficients." As stated earlier in this chapter, the key problem in linear prediction is to determine the values of the predictive coefficients \(em in this case, the reflection coefficients. If this is done correctly, we have shown using Procedure 6.3 that the synthesis part of Figure 6.13 performs the same calculation that a conventional direct-form linear predictive synthesizer would, and hence the signal that excites it \(em that is, the signal represented by the dashed line \(em must be the prediction residual, or error signal, discussed earlier. The system is effectively the same as the high-order adaptive differential pulse code modulation one of Figure 6.1. .pp One of the most interesting features of the lattice structure for analysis filters is that calculation of suitable values for the reflection coefficients can be done locally at each stage of the lattice. For example, consider the $i$'th section of the analysis lattice in Figure 6.13. It is possible to determine a suitable value of $k sub i$ simply by performing a calculation on the inputs to the $i$'th section (ie $X sup +$ and $X sup -$ in Figure 6.12). No longer need the complicated global optimization technique of matrix inversion be used, as in the autocorrelation and covariance methods discussed earlier. .pp A suitable value for $k$ in the single lattice section of Figure 6.12 is .LB .EQ k~ = ~~ {E[ x sup + (n) x sup - (n-1)]} over {( E[ x sup + (n) sup 2 ] E[ x sup - (n-1) sup 2 ] ) sup 1/2} ~~ ; .EN .LE that is, the statistical correlation between $x sup + (n)$ and $x sup - (n-1)$. Here, $x sup + (n)$ and $x sup - (n)$ represent the input signals to the upper and lower paths (recall that $X sup +$ and $X sup -$ are their $z$-transforms). $x sup - (n-1)$ is just $x sup - (n)$ delayed by one time unit, that is, the output of the $z sup -1$ box in the Figure. .pp The criterion of optimality for the autocorrelation and covariance methods was that the prediction error, that is, the signal which emerges from the right-hand end of the upper path of a lattice analysis filter, should be minimized in a mean-square sense. The reflection coefficients obtained from the above formula do not necessarily satisfy any such global minimization criterion. Nevertheless, they do keep the error signal small, and have been used with success in speech analysis systems.
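.pp To make this concrete, here is a minimal Pascal sketch of frame-based lattice analysis using the correlation estimate just given, with time averages over one analysis frame standing in for the expectations. The procedure name, the frame length, and the treatment of the first sample of the frame are assumptions made purely for illustration; they are not taken from any particular published system.
.LB
.nf
const p = 10;      {order of the linear prediction}
      N = 160;     {samples per frame, eg 20 msec at 8 kHz}
type  frame = array [1..N] of real;
      kvec  = array [1..p] of real;

procedure latticeanalysis(x: frame; var k: kvec);
{estimates reflection coefficients k[1..p] for one frame of speech by
 correlating the forward signal with the delayed backward signal at
 each stage of an analysis lattice, then updating both paths}
var f, b: frame;   {forward and backward signals at the current stage}
    num, ff, bb, fn: real;
    m, n: integer;
begin
  for n:=1 to N do begin f[n] := x[n]; b[n] := x[n] end;
  for m:=1 to p do begin
    num := 0; ff := 0; bb := 0;
    for n:=2 to N do begin
      num := num + f[n]*b[n-1];      {estimate of E[x+(n) x-(n-1)]}
      ff  := ff  + f[n]*f[n];        {estimate of E[x+(n) squared]}
      bb  := bb  + b[n-1]*b[n-1]     {estimate of E[x-(n-1) squared]}
    end;
    if ff*bb > 0 then k[m] := num / sqrt(ff*bb) else k[m] := 0;
    {y+(n) = x+(n) - k x-(n-1) and y-(n) = -k x+(n) + x-(n-1);
     work backwards through the frame so b[n-1] is still the old value}
    for n:=N downto 2 do begin
      fn   := f[n];
      f[n] := fn - k[m]*b[n-1];
      b[n] := b[n-1] - k[m]*fn
    end;
    b[1] := -k[m]*f[1]               {treat x-(0) at the frame edge as zero}
  end
end;
.fi
.LE
Because each estimate is a normalized correlation, every coefficient produced in this way has magnitude at most 1, so the corresponding synthesis lattice cannot be unstable.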
.pp It is easy to minimize the output from either the upper or the lower path of the lattice filter at each stage. For example, the $z$-transform of the upper output is given by .LB .EQ Y sup + ~~=~~ X sup + ~-~ k z sup -1 X sup - , .EN .LE or .LB .EQ y sup + (n) ~~=~~ x sup + (n) ~-~ k x sup - (n-1) . .EN .LE Hence .LB .EQ E[y sup + (n) sup 2 ] ~~ = ~~ E[x sup + (n) sup 2 ] ~-~ 2kE[x sup + (n) x sup - (n-1) ] ~+~ k sup 2 E [x sup - (n-1) sup 2 ] , .EN .LE where $E$ stands for expected value, and this reaches a minimum when the derivative with respect to $k$ becomes zero: .LB .EQ -2E[x sup + (n) x sup - (n-1) ] ~+~ 2kE[x sup - (n-1) sup 2 ] ~~=~0 , .EN .LE that is, when .LB .EQ k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup - (n-1) sup 2 ] } ~ . .EN .LE A similar calculation shows that the output of the lower path is minimized when .LB .EQ k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over {E[x sup + (n) sup 2 ] } ~ . .EN .LE Unfortunately, either of these expressions can exceed 1 in magnitude, leading to an unstable filter. The value of $k$ cited earlier is the geometric mean of these two expressions, and since it is a normalized correlation coefficient it cannot exceed 1 in magnitude. .pp Another possibility is to minimize the expected value of the sum of the squares of the upper and lower outputs: .LB .EQ y sup + (n) sup 2 ~+~ y sup - (n) sup 2 ~~ = ~~ (1+k sup 2 )x sup + (n) sup 2 ~-~ 4kx sup + (n) x sup - (n-1) ~+~ (1+k sup 2 )x sup - (n-1) sup 2 . .EN .LE Taking expected values and setting the derivative with respect to $k$ to zero leads to .LB .EQ k~ = ~~ {E[x sup + (n) x sup - (n-1) ]} over { half ~ E[x sup + (n) sup 2 ~+~ x sup - (n-1) sup 2 ]} ~. .EN .LE This too is guaranteed not to exceed 1 in magnitude, and has given good results in speech analysis systems. .pp Figure 6.14 shows the implementation of a single section of an analysis lattice. .FC "Figure 6.14" The signals $x sup + (n)$ and $x sup - (n-1)$ are fed to a correlator, which produces a suitable value for $k$. This value is used to calculate the output of the lattice section, and hence the input to the next lattice section. The reflection coefficient needs to be low-pass filtered, because it will only be transmitted to the synthesizer occasionally (say every 20\ msec) and so a short-term average is required. .pp One implementation of the correlator is shown in Figure 6.15 (Kang, 1974). .[ Kang 1974 .] .FC "Figure 6.15" This calculates the value of $k$ given by the last equation above, and does it by summing and differencing the two signals $x sup + (n)$ and $x sup - (n-1)$, squaring the results to give .LB .EQ x sup + (n) sup 2 + 2x sup + (n mark ) x sup - (n-1) +x sup - (n-1) sup 2 ~~~~~~~~ x sup + (n) sup 2 - 2x sup + (n) x sup - (n-1) +x sup - (n-1) sup 2 ~ , .EN .LE and summing and differencing these, to yield .LB .EQ lineup 2x sup + (n) sup 2 + 2x sup - (n-1) sup 2 ~~~~~~~~ 4x sup + (n) x sup - (n-1) ~ . .EN .LE .sp Before these are divided to give the final coefficient $k$, they are individually low-pass filtered. While some rather complex schemes have been proposed, based upon Kalman filter theory (eg Matsui .ul et al, 1972), .[ Matsui Nakajima Suzuki Omura 1972 .] a simple exponential weighted past average has been found to be satisfactory. This has $z$-transform .LB .EQ 1 over {64 - 63 z sup -1} ~ , .EN .LE that is, in the time domain, .LB .EQ y(n)~ = ~~ 63 over 64 ~ y(n-1) ~+~ 1 over 64 ~ x(n) ~ , .EN .LE where $x(n)$ here denotes the input to the smoothing filter. This filter exponentially averages past sample values with a time-constant of 64 sampling intervals \(em that is, 8\ msec at an 8\ kHz sampling rate.
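.pp As an illustration, the following Pascal fragment sketches the sample-by-sample operation of a correlator of this kind: it forms the sum and difference of $x sup + (n)$ and $x sup - (n-1)$, squares them, recovers the cross-product and energy terms, smooths each with the exponential averager just described, and divides. The function name and the guard against division by zero are assumptions made for the example; they are not details of Kang's implementation.
.LB
.nf
var num, den: real;   {smoothing filter states; initialize to zero}

function kestimate(f, bdel: real): real;
{f is x+(n), bdel is x-(n-1); returns the current smoothed
 estimate of the reflection coefficient k for this section}
var s, d: real;
begin
  s := f + bdel;  d := f - bdel;
  {s*s - d*d = 4 f bdel;   s*s + d*d = 2 f*f + 2 bdel*bdel}
  num := (63.0/64.0)*num + (1.0/64.0)*(s*s - d*d);
  den := (63.0/64.0)*den + (1.0/64.0)*(s*s + d*d);
  if den > 0 then kestimate := num/den else kestimate := 0
end;
.fi
.LE
The quotient is the smoothed value of $4x sup + (n) x sup - (n-1)$ divided by the smoothed value of $2x sup + (n) sup 2 + 2x sup - (n-1) sup 2$, which is exactly the last formula for $k$ given above.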
.sh "6.4 Pitch estimation" .pp It is sometimes useful to think of linear prediction as a kind of curve-fitting technique. Figure 6.16 illustrates how four samples of a speech signal can predict the next one. .FC "Figure 6.16" In essence, a curve is drawn through four points to predict the position of the fifth, and only the prediction error is actually transmitted. Now if the order of linear prediction is high enough (at least 10), and if the coefficients are chosen correctly, the prediction will closely model the resonances of the vocal tract. Thus the error will actually be zero, except at pitch pulses. .pp Figure 6.17 shows a segment of voiced speech together with the prediction error (often called the prediction residual). .FC "Figure 6.17" It is apparent that the error is indeed small, except at pitch pulses. This suggests that a good way to determine the pitch period is to examine the error signal, perhaps by looking at its autocorrelation function. As with all pitch detection methods, one must be careful: spurious peaks can occur, especially in nasal sounds when the all-pole model provided by linear prediction fails. Continuity constraints, which use previous values of pitch period when determining which peak to accept as a new pitch impulse, can eliminate many of these spurious peaks. Unvoiced speech should produce an error signal with no prominent peaks, and this needs to be detected. Voiced fricatives are a difficult case: peaks should be present but the general noise level of the error signal will be greater than it is in purely voiced speech. Such considerations have been taken into account in a practical pitch estimation system based upon this technique (Markel, 1972). .[ Markel 1972 SIFT .] .pp This method of pitch detection highlights another advantage of the lattice analysis technique. When using autocorrelation or covariance analysis to determine the filter (or reflection) coefficients, the error signal is not normally produced. It can, of course, be found by taking the speech samples which constitute the current frame and running them through an analysis filter whose parameters are those determined by the analysis, but this is a computationally demanding exercise, for the filter must run at the speech sampling rate (say 8\ kHz) instead of at the frame rate (say 50\ Hz). Usually, pitch is estimated by other methods, like those discussed in Chapter 4, when using autocorrelation or covariance linear prediction. However, we have seen above that with the lattice method, the error signal is produced as a byproduct: it appears at the right-hand end of the upper path of the lattice chain. Thus it is already available for use in determining pitch periods. .sh "6.5 Parameter coding for linear predictive storage or transmission" .pp In this section, the coding requirements of linear predictive parameters will be examined. The parameters that need to be stored or transmitted are: .LB .NP pitch .NP voiced-unvoiced flag .NP overall amplitude level .NP filter coefficients or reflection coefficients. .LE The first three are parameters of the excitation source. They can be derived directly from the error signal as indicated above, if it is generated (as it is in lattice implementations); or by other methods if no error signal is calculated. The filter or reflection coefficients are, of course, the main product of linear predictive analysis. .pp It is generally agreed that around 60 levels, logarithmically spaced, are needed to represent pitch for telephone quality speech. 
The voiced-unvoiced indication requires one bit, but since pitch is irrelevant in unvoiced speech it can be coded as one of the pitch levels. For example, with 6-bit coding of pitch, the value 0 can be reserved to indicate unvoiced speech, with values 1\-63 indicating the pitch of voiced speech. The overall gain has not been discussed above: it is simply the average amplitude of the error signal. Five bits on a logarithmic scale are sufficient to represent it. .pp Filter coefficients are not very amenable to quantization. At least 8\-10\ bits are required for each one. However, reflection coefficients are better behaved, and 5\-6\ bits each seems adequate. The number of coefficients that must be stored or transmitted is the same as the order of the linear prediction: 10 is commonly used for low-quality speech, with as many as 15 for higher qualities. .pp These figures give around 100\ bits/frame for a 10'th order system using filter coefficients, and around 65\ bits/frame for a 10'th order system using reflection coefficients. Frame lengths vary between 10\ msec and 25\ msec, depending on the quality desired. Thus for 20\ msec frames, the data rates work out at around 5000\ bit/s using filter coefficients, and 3250\ bit/s using reflection coefficients. .pp Substantially lower data rates can be achieved by more careful coding of parameters. In 1976, the US Government defined a standard coding scheme for 10-pole linear prediction with a data rate of 2400\ bit/s \(em conveniently chosen as one of the commonly-used rates for serial data transmission. This standard, called LPC-10, tackles the difficult problem of protection against transmission errors (Fussell .ul et al, 1978). .[ Fussell Boudra Abzug Cowing 1978 .] .pp Whenever data rates are reduced, redundancy inherent in the signal is necessarily lost and so the effect of transmission errors becomes greatly magnified. For example, a single corrupted sample in PCM transmission of speech will probably not be noticed, and even a short burst of errors will be perceived as a click which can readily be distinguished from the speech. However, any error in LPC transmission will last for one entire frame \(em say 20\ msec \(em and worse still, it will be integrated into the speech signal and not easily discriminated from it by the listener's brain. A single corruption may, for example, change a voiced frame into an unvoiced one, or vice versa. Even if it affects only a reflection coefficient it will change the resonance characteristics of that frame, and change them in a way that does not simply sound like superimposed noise. .pp Table 6.1 shows the LPC-10 coding scheme. .RF .in+0.1i .ta 2.0i +1.8i +0.6i .nr x1 (\w'voiced sounds'/2) .nr x2 (\w'unvoiced sounds'/2) .ul \h'-\n(x1u'voiced sounds \h'-\n(x2u'unvoiced sounds .sp pitch/voicing 7 7 60 pitch levels, Hamming \h'\w'00 'u'and Gray coded energy 5 5 logarithmically coded $k sub 1$ 5 5 coded by table lookup $k sub 2$ 5 5 coded by table lookup $k sub 3$ 5 5 $k sub 4$ 5 5 $k sub 5$ 4 \- $k sub 6$ 4 \- $k sub 7$ 4 \- $k sub 8$ 4 \- $k sub 9$ 3 \- $k sub 10$ 2 \- synchronization 1 1 alternating 1,0 pattern error detection/ \- \h'-\w'0'u'21 correction \h'-\w'__'u+\w'0'u'__ \h'-\w'__'u+\w'0'u'__ .sp \h'-\w'0'u'54 \h'-\w'0'u'54 .sp .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i frame rate: 44.4\ Hz (22.5\ msec frames) .in 0 .FG "Table 6.1 Bit requirements for each parameter in LPC-10 coding scheme" Different coding is used for voiced and unvoiced frames. 
Only four reflection coefficients are transmitted for unvoiced frames, because it has been determined that no perceptible increase in speech quality occurs when more are used. The bits saved are more fruitfully employed to provide error detection and correction for the other parameters. Seven bits are used for pitch and the voiced-unvoiced flag, and they are redundant in that only 60 possible pitch values are allowed. Most transmission errors in this field will be detected by the receiver, which can then use an estimate of pitch based on previous values and discard the erroneous one. Pitch values are also Gray coded so that even if errors are not detected, there is a good chance that an adjacent pitch value is read instead. Different numbers of bits are allocated to the various reflection coefficients: experience shows that the lower-numbered ones contribute most to intelligibility and so these are quantized most finely. In addition, a table lookup operation is performed on the code generated for the first two, providing a non-linear quantization which is chosen to minimize the error on a statistical basis. .pp With 54\ bits/frame and 22.5\ msec frames, LPC-10 requires a 2400\ bit/s data rate. Even lower rates have been used successfully for lower-quality speech. The Speak 'n Spell toy, described in Chapter 11, has an average data rate of 1200\ bit/s. Rates as low as 600\ bit/s have been achieved (Kang and Coulter, 1976) by pattern recognition techniques operating on the reflection coefficients; however, the speech quality is not good. .[ Kang Coulter 1976 .] .sh "6.6 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "6.7 Further reading" .pp Most recent books on digital signal processing contain some information on linear prediction (see Oppenheim and Schafer, 1975; Rabiner and Gold, 1975; and Rabiner and Schafer, 1978; all referenced at the end of Chapter 4). .LB "nn" .\"Atal-1971-1 .]- .ds [A Atal, B.S. .as [A " and Hanauer, S.L. .ds [D 1971 .ds [T Speech analysis and synthesis by linear prediction of the acoustic wave .ds [J JASA .ds [V 50 .ds [P 637-655 .nr [P 1 .ds [O August .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n This paper is of historical importance because it introduced the idea of linear prediction to the speech processing community. .in-2n .\"Makhoul-1975-2 .]- .ds [A Makhoul, J.I. .ds [D 1975 .ds [K * .ds [T Linear prediction: a tutorial review .ds [J Proc IEEE .ds [V 63 .ds [N 4 .ds [P 561-580 .nr [P 1 .ds [O April .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n An interesting, informative, and readable survey of linear prediction. .in-2n .\"Markel-1976-3 .]- .ds [A Markel, J.D. .as [A " and Gray, A.H. .ds [D 1976 .ds [T Linear prediction of speech .ds [I Springer Verlag .ds [C Berlin .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is the only book which is entirely devoted to linear prediction of speech. It is an essential reference work for those interested in the subject. .in-2n .\"Wiener-1947-4 .]- .ds [A Wiener, N. .ds [D 1947 .ds [T Extrapolation, interpolation and smoothing of stationary time series .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Linear prediction is often thought of as a relatively new technique, but it is only its application to speech processing that is novel. Wiener develops all of the basic mathematics used in linear prediction of speech, except the lattice filter structure.
.in-2n .LE "nn" .EQ delim $$ .EN .CH "7 JOINING SEGMENTS OF SPEECH" .ds RT "Joining segments of speech .ds CX "Principles of computer speech .pp The obvious way to provide speech output from computers is to select the basic acoustic units to be used; record them; and generate utterances by concatenating together appropriate segments from this pre-stored inventory. The crucial question then becomes, what are the basic units? Should they be whole sentences, words, syllables, or phonemes? .pp There are several trade-offs to be considered here. The larger the units, the more utterances have to be stored. It is not so much the length of individual utterances that is of concern, but rather their variety, which tends to increase exponentially instead of linearly with the size of the basic unit. Numbers provide an easy example: there are $10 sup 7$ 7-digit telephone numbers, and it is certainly infeasible to record each one individually. Note that as storage technology improves the limitation is becoming more and more one of recording the utterances in the first place rather than finding somewhere to store them. At a PCM data rate of 50\ Kbit/s, a 100\ Mbyte disk can hold over 4\ hours of continuous speech. With linear predictive coding at 1\ Kbit/s it holds 0.8 of a megasecond \(em well over a week. And this is a 24-hour 7-day week, which corresponds to a working month; and continuous speech \(em without pauses \(em which probably requires another factor of five for production by a person. Setting up a recording session to fill the disk would be a formidable task indeed! Furthermore, the use of videodisks \(em which will be common domestic items by the end of the decade \(em could increase these figures by a factor of 50. .pp The word seems to be a sensibly-sized basic unit. Many applications use a rather limited vocabulary \(em 190 words for the airline reservation system described in Chapter 1. Even at PCM data rates, this will consume less than 0.5\ Mbyte of storage. Unfortunately, coarticulation and prosodic factors now come into play. .pp Real speech is connected \(em there are few gaps between words. Coarticulation, where sounds are affected by those on either side, naturally operates across word boundaries. And the time constants of coarticulation are associated with the mechanics of the vocal tract and hence measure tens or hundreds of msec. Thus the effects straddle several pitch periods (100\ Hz pitch has 10\ msec period) and cannot be simulated by simple interpolation of the speech waveform. .pp Prosodic features \(em notably pitch and rhythm \(em span much longer stretches of speech than single words. As far as most speech output applications are concerned, they operate at the utterance level of a single, sentence-sized, information unit. They cannot be accomodated if speech waveforms of individual words of the utterance are stored, for it is rarely feasible to alter the fundamental frequency or duration of a time waveform without changing all the formant resonances as well. However, both word-to-word coarticulation and the essential features of rhythm and intonation can be incorporated if the stored words are coded in source-filter form. .pp For more general applications of speech output, the limitations of word storage soon become apparent. Although people's daily vocabularies are not large, most words have a variety of inflected forms which need to be treated separately if a strict policy is adopted of word storage. 
For instance, in this book there are 84,000 words, and 6,500 (8%) different ones (counting inflected forms). In Chapter 1 alone, there are 6,800 words and 1,700 (25%) different ones. .pp It seems crazy to treat a simple inflection like "$-s$" or its voiced counterpart, "$-z$" (as in "inflection\c .ul s\c "), as a totally different word from the base form. But once you consider storing roots and endings separately, it becomes apparent that there is a vast number of different endings, and it is difficult to know where to draw the line. It is natural to think instead of simply using the syllable as the basic unit. .pp A generous estimate of the number of different syllables in English is 10,000. At three a second, only about an hour's storage is required for them all. But waveform storage will certainly not do. Although coarticulation effects between words are needed to make speech sound fluent, coarticulation between syllables is necessary for it even to be .ul comprehensible. Adopting a source-filter form of representation is essential, as is some scheme of interpolation between syllables which simulates coarticulation. Unfortunately, a great deal of acoustic action occurs at syllable boundaries \(em stops are exploded, the sound source changes between voicing and frication, and so on. It may be more appropriate to consider inverse syllables, comprising a vowel-consonant-vowel sequence instead of consonant-vowel-consonant. (These have jokingly been dubbed "lisibles"!) .pp There is again some considerable practical difficulty in creating an inventory of syllables, or lisibles. Now it is not so much the recording that is impractical, but the editing needed to ensure that the cuts between syllables are made at exactly the right point. As units get smaller, the exact placement of the boundaries becomes ever more critical; and several thousand sensitive editing jobs is no easy task. .pp Since quite general effects of coarticulation must be accommodated with syllable synthesis, there will not necessarily be significant deterioration if smaller, demisyllable, units are employed. This reduces the segment inventory to an estimated 1000\-2000 entries, and the tedious job of editing each one individually becomes at least feasible, if not enviable. Alternatively, the segment inventory could be created by artificial means involving cut-and-try experiments with resonance parameters. .pp The ultimate in economy of inventory size, of course, is to use phonemes as the basic unit. This makes the most critical part of the task interpolation between units, rather than their construction or recording. With only about 40 phonemes in English, each one can be examined in many different contexts to ascertain the best data to store. There is no need to record them directly from a human voice \(em it would be difficult anyway, for most cannot be produced in isolation. In fact, a phoneme is an abstract unit, not a particular sound (recall the discussion of phonology in Chapter 2), and so it is most appropriate that data be abstracted from several different realizations rather than an exact record made of any one. .pp If information is stored about phonological units of speech \(em phonemes \(em the difficult task of phonological-to-phonetic conversion must necessarily be performed automatically. Allophones are created by altering the transitions between units, and to a lesser extent by modifying the central parts of the units themselves.
The rules for making transitions will have a big effect on the quality of the resulting speech. Instead of trying to perform this task automatically by a computer program, the allophones themselves could be stored. This will ease the job of generating transitions between segments, but will certainly not eliminate it. The total number of allophones will depend on the narrowness of the transcription system: 60\-80 is typical, and it is unlikely to exceed one or two hundred. In any case there will not be a storage problem. However, now the burden of producing an allophonic transcription has been transferred to the person who codes the utterance prior to synthesizing it. If he is skilful and patient, he should be able to coax the system into producing fairly understandable speech, but the effort required for this on a per-utterance basis should not be underestimated. .RF .nr x0 \w'sentences ' .nr x1 \w' ' .nr x2 \w'depends on ' .nr x3 \w'generalized or ' .nr x4 \w'natural speech ' .nr x5 \w'author of segment' .nr x6 \n(x0u+\n(x1u+\n(x2u+\n(x3u+\n(x4u+\n(x5u .nr x7 (\n(.l-\n(x6)/2 .in \n(x7u .ta \n(x0u +\n(x1u +\n(x2u +\n(x3u +\n(x4u | size of storage source of principal | utterance method utterance burden is | inventory inventory placed on |\h'-1.0i'\l'\n(x6u\(ul' | sentences | depends on waveform or natural speech recording artist, | application source-filter storage medium | parameters | words | depends on source-filter natural speech recording artist | application parameters and editor, | storage medium | syllables/ | \0\0\010000 source-filter natural speech recording editor lisibles | parameters | demi- | \0\0\0\01000 source-filter natural speech recording editor syllables | parameters or artificially or inventory | generated compiler | phonemes | \0\0\0\0\0\040 generalized artificially author of segment | parameters generated concatenation | program | allophones | \0\050\-100 generalized or artificially coder of | source-filter generated or synthesized | parameters natural speech utterances |\h'-1.0i'\l'\n(x6u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 7.1 Some issues relevant to choice of basic unit" .pp Table 7.1 summarizes in broad brush-strokes the issues which relate to the choice of basic unit for concatenation. The sections which follow provide more detail about the different methods of joining segments of speech together. Only segmental aspects are considered, for the important problems of prosody will be treated in the next chapter. All of the methods rely to some extent on the acoustic properties of speech, and as smaller basic units are considered the role of speech acoustics becomes more important. It is impossible in a book like this to give a detailed account of acoustic phonetics, for it would take several volumes! What I aim to do in the following pages is to highlight some salient features which are relevant to segment concatenation, without attempting to be complete. .sh "7.1 Word concatenation" .pp For general speech output, word concatenation is an inherently limited technique because of the large number of phonetically different words. Despite this fact, it is at present the most widely-used synthesis method, and is likely to remain so for several years. We have seen that the primary problems are word-to-word coarticulation and prosody; and both can be overcome, at least to a useful approximation, by coding the words in source-filter form. .rh "Time-domain techniques." 
Nevertheless, a surprising number of applications simply store the time waveform, coded, usually, by one of the techniques described in Chapter 3. From an implementation point of view there are many advantages to this. Speech quality can easily be controlled by selecting a suitable sampling rate and coding scheme. A natural-sounding voice is guaranteed; male or female as desired. The equipment required is minimal \(em a digital-to-analogue converter and post-sampling filter will do for synthesis if PCM coding is used, and DPCM, ADPCM, and delta modulation decoders are not much more complicated. .pp From a speech point of view, the resulting utterances can never be made convincingly fluent. We discussed the early experiments of Stowe and Hampton (1961) at the beginning of Chapter 3. .[ Stowe Hampton 1961 .] A major drawback to word concatenation in the analogue domain is the introduction of clicks and other interference between words: it is difficult to prevent the time waveform transitions from adding extraneous sounds. This poses no problem with digital storage, however, for the waveforms can be edited accurately prior to storage so that they start and finish at an exactly zero level. Rather, the lack of fluency stems from the absence of proper control of coarticulation and prosody. .pp But this is not necessarily a serious drawback if the application is a sufficiently limited one. Complete, invariant utterances can be stored as one unit. Often they must contain data-dependent slot-fillers, as in .LB This flight makes \(em stops .LE and .LB Flight number \(em leaves \(em at \(em , arrives in \(em at \(em .LE (taken from the airline reservation system of Chapter 1 (Levinson and Shipley, 1980)). .[ Levinson Shipley 1980 .] Then, each slot-filling word is recorded in an intonation consistent both with its position in the template utterance and with the intonation of that utterance. This could be done by embedding the word in the utterance for recording, and excising it by digital editing before storage. It would be dangerous to try to take into account coarticulation effects, for the coarticulation could not be made consistent with both the several slot-fillers and the single template. This could be overcome if several versions of the template were stored, but then the scheme becomes subject to combinatorial explosion if there is more than one slot in a single utterance. But it is not really necessary, for the lack of fluency will probably be interpreted by a benevolent listener as an attempt to convey the information as clearly as possible. .pp Difficulties will occur if the same slot-filler is used in different contexts. For instance, the first gap in each of the sentences above contains a number; yet the intonation of that number is different. Many systems simply ignore this problem. Then one does notice anomalies, if one is attentive: the words come, as it were, from different mouths, without fluency. However, the problem is not necessarily acute. If it is, two or more versions of each slot-filler can be recorded, one for each context. .pp As an example, consider the synthesis of 7-digit telephone numbers, like 289\-5371. If one version only of each digit is stored, it should be recorded in a level tone of voice. A pause should be inserted after the third digit of the synthetic number, to accord with common elocution. The result will certainly be unnatural, although it should be clear and intelligible. Any pitch errors in the recordings will make certain numbers audibly anomalous. 
At the other extreme, 70 single digits could be stored, one version of each digit for each position in the number. The recording will be tedious and error-prone, and the synthetic utterances will still not be fluent \(em for coarticulation is ignored \(em but instead unnaturally clearly enunciated. A compromise is to record only three versions of each digit, one for any of the five positions .nr x1 \w'\(ul' .nr x2 (8*\n(x1) .nr x3 0.2m \zx\h'\n(x1u'\zx\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\zx\h'\n(x1u'\zx\h'\n(x1u'\c \zx\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' , another one for the third position \h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c \h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' , and the last for the final position \h'\n(x1u'\h'\n(x1u'\h'\n(x1u'\z\-\h'\n(x1u'\h'\n(x1u'\c \h'\n(x1u'\h'\n(x1u'\zx\h'\n(x1u'\v'\n(x3u'\l'-\n(x2u\(ul'\v'-\n(x3u' . The first version will be in a level voice, the second an incomplete, rising tone; and the third a final, dropping pitch. .rh "Joining formant-coded words." The limitations of the time-domain method are lack of fluency caused by unnatural transitions between words, and the combinatorial explosion created by recording slot-fillers several times in different contexts. Both of these problems can be alleviated by storing formant tracks, concatenating them with suitable interpolation, and applying a complete pitch contour suitable for the whole utterance. But one can still not generate conversational speech, for natural speech rhythms cause non-linear warpings of the time axis which cannot reasonably be imitated by this method. .pp Solving problems often creates others. As we saw in Chapter 4, it is not easy to obtain reliable formant tracks automatically. Yet hand-editing of formant parameters adds a whole new dimension to the problem of vocabulary construction, for it is an exceedingly tiresome and time-consuming task. Even after such tweaking, resynthesized utterances will be degraded considerably from the original, for the source-filter model is by no means a perfect one. A hardware or real-time software formant synthesizer must be added to the system, presenting design problems and creating extra cost. Should a serial or parallel synthesizer be used? \(em the latter offers potentially better speech (especially in nasal sounds), but requires additional parameters, namely formant amplitudes, to be estimated. Finally, as we will see in the next chapter, it is not an easy matter to generate a suitable pitch contour and apply it to the utterance. .pp Strangely enough, the interpolation itself does not present any great difficulty, for there is not enough information in the formant-coded words to make possible sophisticated coarticulation. The need for interpolation is most pressing when one word ends with a voiced sound and the next begins with one. If either the end of the first or the beginning of the second word (or both) is unvoiced, unnatural formant transitions do not matter for they will not be heard. Actually, this is only strictly true for fricative transitions: if the juncture is aspirated then formants will be perceived in the aspiration. However, .ul h is the only fully aspirated sound in English, and it is relatively uncommon. It is not absolutely necessary to interpolate the fricative filter resonance, because smooth transitions from one fricative sound to another are rare in natural speech. 
.pp Hence unless both sides of the junction are voiced, no interpolation is needed: simple abuttal of the stored parameter tracks will do. Note that this is .ul not the same as joining time waveforms, for the synthesizer will automatically ensure a relatively smooth transition from one segment to another because of energy storage in the filters. A new set of resonance parameters for the formant-coded words will be stored every 10 or 20 msec (see Chapter 5), and so the transition will automatically be smoothed over this time period. .pp For voiced-to-voiced transitions, some interpolation is needed. An overlap period of duration, say, 50\ msec, is established, and the resonance parameters in the final 50\ msec of the first word are averaged with those in the first 50\ msec of the second. The average is weighted, with the first word's formants dominating at the beginning and their effect progressively dying out in favour of the second word. .pp More sophisticated than a simple average is to weight the components according to how rapidly they are changing. If the spectral change in one word is much greater than that in the other, we might expect that this will dominate the transition. A simple measure of spectral derivative at any given time can be found by adding the magnitude of the discrepancies in each formant frequency between one sample and the next. The spectral change in the transition region can be obtained by summing the spectral derivatives at each sample in the region. Such a measure can perhaps be made more accurate by taking into account the relative importance of the formants, but will probably never be more than a rough and ready yardstick. At any rate, it can be used to load the average in favour of the dominant side of the junction. .pp Much more important for naturalness of the speech are the effects of rhythm and intonation, discussed in the next chapter. .pp Such a scheme has been implemented and tested on \(em guess what! \(em 7-digit telephone numbers (Rabiner .ul et al, 1971). .[ Rabiner Schafer Flanagan 1971 .] Significant improvement (at the 5% level of statistical significance) in people's ability to recall numbers was found for this method over direct abuttal of either natural or synthetic versions of the digits. Although the method seemed, on balance, to produce utterances that were recalled less accurately than completely natural spoken telephone numbers, the difference was not significant (at the 5% level). The system was also used to generate wiring instructions by computer directly from the connection list, as described in Chapter 1. As noted there, synthetic speech was actually preferred to natural speech in the noisy environment of the production line. .rh "Joining linear predictive coded words." Because obtaining accurate formant tracks for natural utterances by Fourier transform methods is difficult, it is worth considering the use of linear prediction as the source-filter model. Actually, formant resonances can be extracted from linear predictive coefficients quite easily, but there is no need to do this because the reflection coefficients themselves are quite suitable for interpolation. .pp A slightly different interpolation scheme from that described in the previous section has been reported (Olive, 1975). .[ Olive 1975 .] The reflection coefficients were spliced during an overlap region of only 20\ msec. 
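In outline, such a splice is just a frame-by-frame weighted average of the two stored coefficient tracks across the overlap. The following Pascal sketch shows the idea, with the weight moving linearly from the first word to the second; the procedure and variable names, the 10'th order filter, and the frame layout are assumptions for illustration, not Olive's actual implementation.
.LB
.nf
const p = 10; maxframes = 500;
type  kvec   = array [1..p] of real;
      ktrack = array [1..maxframes] of kvec;

procedure splice(a, b: ktrack; enda, overlap: integer; var out: ktrack);
{copies track a up to frame enda, cross-fading its last overlap
 frames with the first overlap frames of track b, coefficient by
 coefficient; the remaining frames of b then follow unchanged}
var i, j: integer; w: real;
begin
  for i:=1 to enda-overlap do out[i] := a[i];
  for i:=1 to overlap do begin
    w := i/(overlap+1);         {near 0 at the word-a end, near 1 at word b}
    for j:=1 to p do
      out[enda-overlap+i][j] :=
        (1-w)*a[enda-overlap+i][j] + w*b[i][j]
  end
end;
.fi
.LE
Because both sets of reflection coefficients lie between \-1 and +1, every interpolated frame does too, so the spliced filter remains stable, as noted earlier in Chapter 6.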
More interestingly, attempts were made to suppress the plosive bursts of stop sounds in cases where they were followed by another stop at the beginning of the next word. This is a common coarticulation, occurring, for instance, in the phrase "stop burst". In running speech, the plosion on the .ul p of "stop" is normally suppressed because it is followed by another stop. This is a particularly striking case because the place of articulation of the two stops .ul p and .ul b is the same: complete suppression is not as likely to happen in "stop gap", for example (although it may occur). Here is an instance of how extra information could improve the quality of the synthetic transitions considerably. However, automatically identifying the place of articulation of stops is a difficult job, of a complexity far above what is appropriate for simply joining words stored in source-filter form. .pp Another innovation was introduced into the transition between two vowel sounds, when the second word began with an accented syllable. A glottal stop was placed at the juncture. Although the glottal stop was not described in Chapter 2, it is a sound used in many dialects of English. It frequently occurs in the utterance "uh-uh", meaning "no". Here it .ul is used to separate two vowel sounds, but in fact this is not particularly common in most dialects. One could say "the apple", "the orange", "the onion" with a neutral vowel in "the" (to rhyme with "\c .ul a\c bove") and a glottal stop as separator, but it is much more usual to rhyme "the" with "he" and introduce a .ul y between the words. Similarly, even speakers who do not normally pronounce an .ul r at the end of words will introduce one in "bigger apple", rather than using a glottal stop. Note that it would be wrong to put an .ul r in "the apple", even for speakers who usually terminate "the" and "bigger" with the same sound. Such effects occur at a high level of processing, and are practically impossible to simulate with word-interpolation rules. Hence the expedient of introducing a glottal stop is a good one, although it is certainly unnatural. .sh "7.2 Concatenating whole or partial syllables" .pp The use of segments larger than a single phoneme or allophone but smaller than a word as the basic unit for speech synthesis has an interesting history. It has long been realized that transitions between phonemes are extremely sensitive and critical components of speech, and thus are essential for successful synthesis. Consider the unvoiced stop sounds .ul p, t, and .ul k. Their central portion is actually silence! (Try saying a word like "butter" with a very long .ul t.\c ) Hence in this case it is .ul only the transitional information which can distinguish these sounds from each other. .pp Sound segments which comprise the transition from the centre of one phoneme to the centre of the next are called .ul dyads or .ul diphones. The possibility of using them as the basic units for concatenation was first mooted in the mid 1950's. The idea is attractive because there is relatively little spectral movement in the central, so-called "steady-state", portion of many phonemes \(em in the extreme case of unvoiced stops there is not only no spectral movement, but no spectrum at all in the steady state! At that time the resonance synthesizer was in its infancy, and so recorded segments of live speech were used. 
The early experiments met with little success because of the technical difficulties of joining analogue waveforms and inevitable discrepancies between the steady-state parts of a phoneme recorded in different contexts \(em not to mention the problems of coarticulation and prosody which effectively preclude the use of waveform concatenation at such a low level. .pp In the mid 1960's, with the growing use of resonance synthesizers, it became possible to generate diphones by copying resonance parameters manually from a spectrogram, and improving the result by trial and error. It was not feasible to extract formant frequencies automatically from real speech, though, because the fast Fourier transform was not yet widely known and the computational burden of slow Fourier transformation was prohibitive. For example, a project at IBM stored manually-derived parameter tracks for diphones, identified by pairs of phoneme names (Dixon and Maxey, 1968). .[ Dixon Maxey 1968 .] To generate a synthetic utterance it was coded in phonetic form and used to access the diphone table to give a set of parameter tracks for the complete utterance. Note that this is the first system we have encountered whose input is a phonetic transcription which relates to an inventory of truly synthetic character: all previous schemes used recordings of live speech, albeit processed in some form. Since the inventory was synthetic, there was no difficulty in ensuring that discontinuities did not arise between segments beginning and ending with the same phoneme. Thus interpolation was irrelevant, and the synthesis procedure concentrated on prosodic questions. The resulting speech was reported to be quite impressive. .pp Strictly speaking, diphones are not demisyllables but phoneme pairs. In the simplest case they happen to be similar, for two primary diphones characterize a consonant-vowel-consonant syllable. There is an advantage to using demisyllables rather than diphones as the basic unit, for many syllables begin or end with complicated consonant clusters which are not easy to produce convincingly by diphone concatenation. But they are not easy to produce by hand-editing resonance parameters either! Now that speech analysis methods have been developed and refined, resonance parameters or linear predictive coefficients can be extracted automatically from natural utterances, and there has been a resurgence of interest in syllabic and demisyllabic synthesis methods. The wheel has turned full circle, from segments of natural speech to hand-tailored parameters and back again! .pp The advantage of storing demisyllables over syllables (or lisibles) from the point of view of storage capacity has already been pointed out (perhaps 1,000\-2,000 demisyllables as opposed to 4,000\-10,000 syllables). But it is probably not too significant with the continuing decline of storage costs. The requirements are of the order of 25\ Kbyte versus 0.5\ Mbyte for 1200\ bit/s linear predictive coding, and the latter could almost be accommodated today \(em 1981 \(em on a state-of-the-art read-only memory chip. A bigger advantage comes from rhythmic considerations. As we will see in the next chapter, the rhythms of fluent speech cause dramatic variations in syllable duration, but these seem to affect the vowel and closing consonant cluster much more than the initial consonant cluster.
Thus if a demisyllable is deemed to begin shortly (say 60\ msec) after onset of the vowel, when the formant structure has settled down, the bulk of the vowel and the closing consonant cluster will form a single demisyllable. The opening cluster of the next syllable will lie in the next demisyllable. Then differential lengthening can be applied to that part of the syllable which tends to be stretched in live speech. .pp One system for demisyllable concatenation has produced excellent results for monosyllabic English words (Lovins and Fujimura, 1976). .[ Lovins Fujimura 1976 .] Complex word-final consonant clusters are excluded from the inventory by using syllable affixes .ul s, z, t, and .ul d; these are attached to the syllabic core as a separate exercise (Macchi and Nigro, 1977). .[ Macchi Nigro 1977 .] Prosodic rather than segmental considerations are likely to prove the major limiting factor when this scheme is extended to running speech. .pp Monosyllabic words spoken in isolation are coded as linear predictive reflection coefficients, and segmented by digital editing into the initial consonant cluster and the vocalic nucleus plus final cluster. The cut is made 60\ msec into the vowel, as suggested above. This minimizes the difficulty of interpolation when concatenating segments, for there is ample voicing on either side of the juncture. The reflection coefficients should not differ radically because the vowel is the same in each demisyllable. A 40\ msec overlap is used, with the usual linear interpolation. An alternative smoothing rule applies when the second segment has a nasal or glide after the vowel. In this case anticipatory coarticulation occurs, affecting even the early part of the vowel. For example, a vowel is frequently nasalized when followed by a nasal sound \(em even in English where nasalization is not a distinctive feature in vowels (see Chapter 2). Under these circumstances the overlap area is moved forward in time so that the colouration applies throughout almost the whole vowel. .sh "7.3 Phoneme synthesis" .pp Acoustic phonetics is the study of how the acoustic signal relates to the phonetic sequence which was spoken or heard. People \(em especially engineers \(em often ask, how could phonetics not be acoustic? In fact it can be articulatory, auditory, or linguistic (phonological), for example, and we have touched on the first and last in Chapter 2. The invention of the sound spectrograph in the late 1940's was an event of colossal significance for acoustic phonetics, for it somehow seemed to make the intricacies of speech visible. (This was thought to be a greater advance than it actually turned out to be: historically-minded readers should refer to Potter .ul et al, 1947, for an enthusiastic contemporary appraisal of the invention.) A .[ Potter Kopp Green 1947 .] result of several years of research at Haskins Laboratories in New York during the 1950's was a set of "minimal rules for synthesizing speech", which showed how stylized formant patterns could generate cues for identifying vowels and, particularly, consonants (Liberman, 1957; Liberman .ul et al, 1959). .[ Liberman 1957 Some results of research on speech perception .] .[ Liberman Ingemann Lisker Delattre Cooper 1959 .] .pp These were to form the basis of many speech synthesis-by-rule computer programs in the ensuing decades. Such programs take as input a phonetic transcription of the utterance and generate a spoken version of it. The transcription may be broad or narrow, depending on the system.
Experience has shown that the Haskins rules really are minimal, and the success of a synthesis-by-rule program depends on a vast collection of minutiae, each seemingly insignificant in isolation but whose effects combine to influence the speech quality dramatically. The best current systems produce clearly understandable speech which is nevertheless something of a strain to listen to for long periods. However, many are not good; and some are execrable. In recent times commercial influences have unfortunately restricted the free exchange of results and programs between academic researchers, thus slowing down progress. Research attention has turned to prosodic factors, which are certainly less well understood than segmental ones, and to synthesis from plain English text rather than from phonetic transcriptions. .pp The remainder of this chapter describes the techniques of segmental synthesis. First it is necessary to introduce some elements of acoustic phonetics. It may be worth re-reading Chapter 2 at this point, to refresh your memory about the classification of speech sounds. .sh "7.4 Acoustic characterization of phonemes" .pp Shortly after the invention of the sound spectrograph an inverse instrument was developed, called the "pattern playback" synthesizer. This took as input a spectrogram, either in its original form or painted by hand. An optical arrangement was used to modulate the amplitude of some fifty harmonically-related oscillators by the lightness or darkness of each point on the frequency axis of the spectrogram. As it was drawn past the playing head, sound was produced which had approximately the frequency components shown on the spectrogram, although the fundamental frequency was constant. .pp This device allowed the complicated acoustic effects seen on a spectrogram (see for example Figures 2.3 and 2.4) to be replayed in either original or simplified form. Hence the features which are important for perception of the different sounds could be isolated. The procedure was to copy from an actual spectrogram the features which were most prominent visually, and then to make further changes by trial and error until the result was judged to have reasonable intelligibility when replayed. .pp For the purpose of acoustic characterization of particular phonemes, it is useful to consider the central, steady-state part separately from transitions into and out of the segment. The steady-state part is that sound which is heard when the phoneme is prolonged. The term "phoneme" is being used in a rather loose sense here: it is more appropriate to think of a "sound segment" rather than the abstract unit which forms the basis of phonological classification, and this is the terminology I will adopt. .pp The essential auditory characteristics of some sound segments are inherent in their steady states. If a vowel, for example, is spoken and prolonged, it can readily be identified by listening to any part of the utterance. This is not true for diphthongs: if you say "I" very slowly and freeze your vocal tract posture at any time, the resulting steady-state sound will not be sufficient to identify the diphthong. Rather, it will be a vowel somewhere between .ul aa (in "had") or .ul ar (in "hard") and .ul ee (in "heed"). Neither is it true for glides, for prolonging .ul w (in "want") or .ul y (in "you") results in vowels resembling respectively .ul u ("hood") or .ul ee ("heed").
Fricatives, voiced or unvoiced, can be identified from the steady state; but stops cannot, for theirs is silent (or \(em in the case of voiced stops \(em something close to it). .pp Segments which are identifiable from their steady state are easy to synthesize. The difficulty lies with the others, for it must be the transitions which carry the information. Thus "transitions" are an essential part of speech, and perhaps the term is unfortunate for it calls to mind an unimportant bridge between one segment and the next. It is tempting to use the words "continuant" and "non-continuant" to distinguish the two categories; unfortunately they are used by phoneticians in a different sense. We will call them "steady-state" and "transient" segments. The latter term is not particularly appropriate, for even sounds in this class .ul can be prolonged: the point is that the identifying information is in the transitions rather than the steady state. .RF .nr x1 (\w'excitation'/2) .nr x2 (\w'formant resonance'/2) .nr x3 (\w'fricative'/2) .nr x4 (\w'frequencies (Hz)'/2) .nr x5 (\w'resonance (Hz)'/2) .nr x0 4n+1.7i+0.8i+0.6i+0.6i+1.0i+\w'00'+\n(x5 .nr x6 (\n(.l-\n(x0)/2 .in \n(x6u .ta 4n +1.7i +0.8i +0.6i +0.6i +1.0i \h'-\n(x1u'excitation \0\0\h'-\n(x2u'formant resonance \0\0\h'-\n(x3u'fricative \0\0\h'-\n(x4u'frequencies (Hz) \0\0\c \h'-\n(x5u'resonance (Hz) \l'\n(x0u\(ul' .sp .nr x1 (\w'voicing'/2) \fIuh\fR (the) \h'-\n(x1u'voicing \0500 1500 2500 \fIa\fR (bud) \h'-\n(x1u'voicing \0700 1250 2550 \fIe\fR (head) \h'-\n(x1u'voicing \0550 1950 2650 \fIi\fR (hid) \h'-\n(x1u'voicing \0350 2100 2700 \fIo\fR (hod) \h'-\n(x1u'voicing \0600 \0900 2600 \fIu\fR (hood) \h'-\n(x1u'voicing \0400 \0950 2450 \fIaa\fR (had) \h'-\n(x1u'voicing \0750 1750 2600 \fIee\fR (heed) \h'-\n(x1u'voicing \0300 2250 3100 \fIer\fR (heard) \h'-\n(x1u'voicing \0600 1400 2450 \fIar\fR (hard) \h'-\n(x1u'voicing \0700 1100 2550 \fIaw\fR (hoard) \h'-\n(x1u'voicing \0450 \0750 2650 \fIuu\fR (food) \h'-\n(x1u'voicing \0300 \0950 2300 .nr x1 (\w'aspiration'/2) \fIh\fR (he) \h'-\n(x1u'aspiration .nr x1 (\w'frication'/2) .nr x2 (\w'frication and voicing'/2) \fIs\fR (sin) \h'-\n(x1u'frication 6000 \fIz\fR (zed) \h'-\n(x2u'frication and voicing 6000 \fIsh\fR (shin) \h'-\n(x1u'frication 2300 \fIzh\fR (vision) \h'-\n(x2u'frication and voicing 2300 \fIf\fR (fin) \h'-\n(x1u'frication 4000 \fIv\fR (vat) \h'-\n(x2u'frication and voicing 4000 \fIth\fR (thin) \h'-\n(x1u'frication 5000 \fIdh\fR (that) \h'-\n(x2u'frication and voicing 5000 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.2 Resonance synthesizer parameters for steady-state sounds" .rh "Steady-state segments." Table 7.2 shows appropriate values for the resonance parameters and excitation sources of a resonance synthesizer, for steady-state segments only. There are several points to note about it. Firstly, all the frequencies involved obviously depend upon the speaker \(em the size of his vocal tract, his accent and speaking habits. The values given are nominal ones for a male speaker with a dialect of British English called "received pronunciation" (RP) \(em for it is what used to be "received" on the wireless in the old days before the British Broadcasting Corporation adopted a policy of more informal, more regional, speech. Female speakers have formant frequencies approximately 15% higher than male ones.
Secondly, the third formant is relatively unimportant for vowel identification; it is the first and second that give the vowels their character. Thirdly, formant values for .ul h are not given, for they would be meaningless. Although it is certainly a steady-state sound, .ul h changes radically in context. If you say "had", "heed", "hud", and so on, and freeze your vocal tract posture on the initial .ul h, you will find it already configured for the following vowel \(em an excellent example of anticipatory coarticulation. Fourthly, amplitude values do play some part in identification, particularly for fricatives. .ul th is the weakest sound, closely followed by .ul f, with .ul s and .ul sh the strongest. It is necessary to get a reasonable mix of excitation in the voiced fricatives; the voicing amplitude is considerably less than in vowels. Finally, there are other sounds that might be considered steady state ones. You can probably identify .ul m, n, and .ul ng just by their steady states. However, the difference is not particularly strong; it is the transitional parts which discriminate most effectively between these sounds. The steady state of .ul r is quite distinctive, too, for most speakers, because the top of the tongue is curled back in a so-called "retroflex" action and this causes a radical change in the third formant resonance. .rh "Transient segments." Transient sounds include diphthongs, glides, nasals, voiced and unvoiced stops, and affricates. The first two are relatively easy to characterize, for they are basically continuous, gradual transitions from one vocal tract posture to another \(em sort of dynamic vowels. Diphthongs and glides are similar to each other. In fact "you" could be transcribed as a triphthong, .ul i e uu, except that in the initial posture the tongue is even higher, and the vocal tract correspondingly more constricted, than in .ul i ("hid") \(em though not as constricted as in .ul sh. Both categories can be represented in terms of target formant values, on the understanding that these are not to be interpreted as steady state configurations but strictly as extreme values at the beginning or end of the formant motion (for transitions out of and into the segment, respectively). .pp Nasals have a steady-state portion comprising a strong nasal formant at a fairly low frequency, on account of the large size of the combined nasal and oral cavity which is resonating. Higher formants are relatively weak, because of attenuation effects. Transitions into and out of nasals are strongly nasalized, as indeed are adjacent vocalic segments, with the oral and nasal tract operating in parallel. As discussed in Chapter 5, this cannot be simulated on a series synthesizer. However, extremely fast motions of the formants occur on account of the binary switching action of the velum, and it turns out that fast formant transitions are sufficient to simulate nasals because the speech perception mechanism is accustomed to hearing them only in that context! Contrast this with the extremely slow transitions in diphthongs and glides. .pp Stops form the most interesting category, and research using the pattern playback synthesizer was instrumental in providing adequate acoustic characterizations for them. Consider unvoiced stops. They each have three phases: transition in, silent central portion, and transition out. There is a lot of action on the transition out (and many phoneticians would divide this part alone into several "phases"). 
First, as the release occurs, there is a small burst of fricative noise. Say "t\ t\ t\ ..." as in "tut-tut", without producing any voicing. Actually, when used as an admonishment this is accompanied by an ingressive, inhaling air-stream instead of the normal egressive, exhaling one used in English speech (although some languages do have ingressive sounds). In any case, a short fricative somewhat resembling a tiny .ul s can be heard as the tongue leaves the roof of the mouth. Frication is produced when the gap is very narrow, and ceases rapidly as it becomes wider. Next, a significant amount of aspiration follows the release of an unvoiced stop. Say "pot", "tot", "cot" with force and you will hear the .ul h\c -like aspiration quite clearly. It doesn't always occur, though; for example you will hear little aspiration when a fricative like .ul s precedes the stop in the same syllable, as in "spot", "scot". The aspiration is a distinguishing feature between "white spot" and the rather unlikely "White's pot". It tends to increase as the emphasis on the syllable increases, and this is an example of a prosodic feature influencing segmental characteristics. Finally, at the end of the segment, the aspiration \(em if any \(em will turn to voicing. .pp What has been described applies to .ul all unvoiced stops. What distinguishes one from another? The tiny fricative burst will be different because the noise is produced at different places in the vocal tract \(em at the lips for .ul p, tongue and front of palate for .ul t, and tongue and back of palate for .ul k. The most important difference, however, is the formant motion illuminated by the last vestiges of voicing at closure and by both aspiration and the onset of voicing at opening. Each stop has target formant values which, although they cannot be heard during the stopped portion (for there is no sound there), do affect the transitions in and out. An added complexity is that the target positions themselves vary to some extent depending on the adjacent segments. If the stop is heavily aspirated, the vocal posture will have almost attained that for the following vowel before voicing begins, but the formant transitions will be perceived because they affect the sound quality of the aspiration. .pp The voiced stops .ul b, d, and .ul g are quite similar to their unvoiced analogues .ul p, t, and .ul k. What distinguishes them from each other are the formant transitions to target positions, heard during closure and opening. They are distinguished from their unvoiced counterparts by the fact that more voicing is present: it lingers on longer at closure and begins earlier on opening. Thus little or no aspiration appears during the opening phase. If an unvoiced stop is uttered in a context where aspiration is suppressed, as in "spot", it is almost identical to the corresponding voiced stop, "sbot". Luckily no words in English require us to make a distinction in such contexts. Voicing sometimes pervades the entire stopped portion of a voiced stop, especially when it is surrounded by other voiced segments. When saying a word like "baby" slowly you can choose whether or not to prolong voicing throughout the second .ul b. If you do, creating what is called a "voice bar" in spectrograms, the sound escapes through the cheeks, for the lips are closed \(em try doing it for a very long time and your cheeks will fill up with air! This severely attenuates high-frequency components, and can be simulated with a weak first formant at a low resonant frequency.
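.pp
The voice bar is simple to imitate on a resonance synthesizer. The following fragment is a sketch only, assuming a frame-oriented control interface; the parameter names and figures are illustrative and do not belong to any particular device.
.sp
.nf
# Sketch: synthesizer control frames approximating a "voice bar" during the
# closure of a voiced stop. Names and values are illustrative only.

def voice_bar_frames(n_frames, f0=110.0):
    """Weak, low first formant with voicing on and everything else silent."""
    frame = {
        "f0": f0,                   # voicing continues throughout the closure
        "av": 0.15,                 # voicing amplitude well below that of a vowel
        "f1": 200.0, "a1": 0.2,     # low, weak first-formant resonance (Hz)
        "f2": 1200.0, "a2": 0.0,    # higher formants silenced: the closed lips
        "f3": 2500.0, "a3": 0.0,    #   attenuate the high-frequency components
        "af": 0.0,                  # no frication
    }
    return [dict(frame) for _ in range(n_frames)]

if __name__ == "__main__":
    print(voice_bar_frames(3)[0])
.fi
.sp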
.RF .nr x0 \w'unvoiced stops: 'u .nr x1 4n .nr x2 \n(x0+\n(x1+\w'aspiration burst (context- and emphasis-dependent)'u .nr x3 (\n(.l-\n(x2)/2 .in \n(x3u .ta \n(x0u +\n(x1u unvoiced stops: closure (early cessation of voicing) silent steady state opening, comprising short fricative burst aspiration burst (context- and emphasis-dependent) onset of voicing .sp voiced stops: closure (late cessation of voicing) steady state (possibility of voice bar) opening, comprising pre-voicing short fricative burst .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.3 Acoustic phases of stop consonants" .pp Table 7.3 summarizes some of the acoustic phases of voiced and unvoiced stops. There are many variations that have not been mentioned. Nasal plosion ("good news") occurs (at the word boundary, in this case) when the nasal formant pervades the opening phase. Stop bursts are suppressed when the next sound is a stop too (the burst on the .ul p of "apt", for example). It is difficult to distinguish a voiced stop from an unvoiced one at the end of a word ("cab" and "cap"); if the speaker is trying to make himself particularly clear he will put a short neutral vowel after the voiced stop to emphasize its early onset of voicing. (If he is Italian he will probably do this anyway, for it is the norm in his own language.) .pp Finally, we turn to affricates, of which there are only two in English: .ul ch ("chin") and .ul j ("djinn"). They are very similar to the stops .ul t and .ul d followed by the fricatives .ul sh and .ul zh respectively, and their acoustic characterization is similar to that of the phoneme pair. .ul ch has a closing phase, a stopped phase, and a long fricative burst. There is no aspiration, for the vocal cords are not involved. .ul j is the same except that voicing extends further into the stopped portion, and the terminating fricative is also voiced. It may be pronounced with a voice bar if the preceding segment is voiced ("adjunct"). .sh "7.5 Speech synthesis by rule" .pp Generation of speech by rules acting upon a phonetic transcription was first investigated in the early 1960's (Kelly and Gerstman, 1961). .[ Kelly Gerstman 1961 .] Most systems employ a hardware resonance synthesizer, analogue or digital, series or parallel, to reduce the load on the computer which operates the rules. The speech-by-rule program, rather than the synthesizer, inevitably contributes by far the greater part of the degradation in the resulting speech. Although parallel synthesizers offer greater potential control over the spectrum, it is not clear to what extent a synthesis program can take advantage of this. Parameter tracks for a series synthesizer can easily be converted into linear predictive coefficients, and systems which use a linear predictive synthesizer will probably become popular in the near future. .pp The phrase "synthesis by rule", which is in common use, does not make it clear just what sort of features the rules are supposed to accommodate, and what information must be included explicitly in the input transcription. Early systems made no attempt to simulate prosodics. Pitch and rhythm could be controlled, but only by inserting pitch specifiers and duration markers in the input. Some kind of prosodic control was often incorporated later, but usually as a completely separate phase from segmental synthesis. This does not allow interaction effects (such as the extra aspiration for voiceless stops in accented syllables) to be taken into account easily.
Even systems which perform prosodic operations invariably need to have prosodic specifications embedded explicitly in the input. .pp Generating parameter tracks for a synthesizer from a phonetic transcription is a process of data .ul expansion. Six bits are ample to specify a phoneme, and a speaking rate of 12 phonemes/sec leads to an input data rate of 72 bit/s. The data rate required to control the synthesizer will depend upon the number of parameters and the rate at which they are sampled, but a typical figure is 6 Kbit/s (Chapter 5). Hence there is something like a hundredfold data expansion. .pp Figure 7.1 shows the parameter tracks for a series synthesizer's rendering of the utterance .ul s i k s. .FC "Figure 7.1" There are eight parameters. You can see the onset of frication at the beginning and end (parameter 5), and the amplitude of voicing (parameter 1) come on for the .ul i and off again before the .ul k. The pitch (parameter 0) is falling slowly throughout the utterance. These tracks are stylized: they come from a computer synthesis-by-rule program and not from a human utterance. With a parameter update rate of 10 msec, the graphs can be represented by 90 sets of eight parameter values, a total of 720 values or 4320 bits if a 6-bit representation is used for each value. Contrast this with the input of only four phoneme segments, or say 24 bits. .rh "A segment-by-segment system." A seminal paper appearing in 1964 was the first comprehensive description of a computer-based synthesis-by-rule system (Holmes .ul et al, 1964). .[ Holmes Mattingly Shearme 1964 .] The same system is still in use and has been reimplemented in a more portable form (Wright, 1976). .[ Wright 1976 .] The inventory of sound segments includes the phonemes listed in Table 2.1, as well as diphthongs and a second allophone of .ul l. (Many British speakers use quite a different vocal posture for pre- and post-vocalic .ul l\c \&'s, called clear and dark .ul l\c \&'s respectively.) Some phonemes are expanded into sub-phonemic "phases" by the program. Stops have three phases, corresponding to the closure, silent steady state, and opening. Diphthongs have two phases. We will call individual phases and single-phase phonemes "segments", for they are subject to exactly the same transition rules. .pp Parameter tracks are constructed out of linear pieces. Consider a pair of adjacent segments in an utterance to be synthesized. Each one has a steady-state portion and an internal transition. The internal transition of one phoneme is dubbed "external" as far as the other is concerned. This is important because instead of each segment being responsible for its own internal transition, one of the pair is identified as "dominant" and it controls the duration of both transitions \(em its internal one and its external (the other's internal) one. For example, in Figure 7.2 the segment .ul sh dominates .ul ee and so it governs the duration of both transitions shown. .FC "Figure 7.2" Note that each segment contributes as many as three linear pieces to the parameter track. .pp The notion of domination is similar to that discussed earlier for word concatenation. The difference is that for word concatenation the dominant segment was determined by computing the spectral derivative over the transition region, whereas for synthesis-by-rule segments are ranked according to a static precedence, and the higher-ranking segment dominates. 
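.pp
The flavour of the dominance rule can be conveyed by a small sketch. The class ranks, durations and names below are invented for illustration; the actual system stores such quantities in a per-segment table, as described below.
.sp
.nf
# Sketch of the dominance rule for a pair of adjacent segments.
# Ranks and transition durations here are invented, not taken from the table.

RANK = {"stop": 4, "fricative": 3, "nasal": 2, "glide": 1, "vowel": 0}

def plan_transitions(seg_a, seg_b):
    """The higher-ranking segment dominates the boundary and fixes the
    duration of both transitions: its own internal one and the external
    (i.e. the neighbour's internal) one."""
    if RANK[seg_a["class"]] >= RANK[seg_b["class"]]:
        dom = seg_a
    else:
        dom = seg_b
    return {"dominant": dom["name"],
            "internal_ms": dom["internal_ms"],
            "external_ms": dom["external_ms"]}

sh = {"name": "sh", "class": "fricative", "internal_ms": 40, "external_ms": 70}
ee = {"name": "ee", "class": "vowel", "internal_ms": 60, "external_ms": 60}
print(plan_transitions(sh, ee))    # sh dominates ee, as in Figure 7.2
.fi
.sp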
Segments of stop consonants have the highest rank (and also the greatest spectral derivative), while fricatives, nasals, glides, and vowels follow in that order. .pp The concatenation procedure is controlled by a table which associates 25 quantities with each segment. They are .LB .NI rank .NI 2\ \ overall durations (for stressed and unstressed occurrences) .NI 4\ \ transition durations (for internal and external transitions of formant frequencies and amplitudes) .NI 8\ \ target parameter values (amplitudes and frequencies of three formant resonances, plus fricative information) .NI 5\ \ quantities which specify how to calculate boundary values for formant frequencies (two for each formant except the third, which has only one) .NI 5\ \ quantities which specify how to calculate boundary values for amplitudes. .LE This table is rather large. There are 80 segments in all (remember that many phonemes are represented by more than one segment), and so it has 2000 entries. The system was an offline one which ran on what was then \(em 1964 \(em a large computer. .pp The advantage of such a large table of "rules" is the flexibility it affords. Notice that transition durations are specified independently for formant frequency and amplitude parameters \(em this permits fine control which is particularly useful for stops. For each parameter the boundary value between segments is calculated using a fixed contribution from the dominant one and a proportion of the steady state value of the other. .pp It is possible that the two transition durations which are calculated for a segment actually exceed the overall duration specified for it. In this case, the steady-state target values will be approached but not actually attained, simulating a situation where coarticulation effects prevent a target value from being reached. .rh "An event-based system." The synthesis system described above, in common with many others, takes an uncompromisingly segment-by-segment view of speech. The next phoneme is read, perhaps split into a few segments, and these are synthesized one by one with due attention being paid to transitions between them. Some later work has taken a more syllabic view. Mattingly (1976) urges a return to syllables for both practical and theoretical reasons. .[ Mattingly 1976 Syllable synthesis .] Transitional effects are particularly strong within a syllable and comparatively weak (but by no means negligible) from one syllable to the next. From a theoretical viewpoint, there are much stronger phonetic restrictions on phoneme sequences than there are on syllable sequences: pretty well any syllable can follow another (although whether the pair makes sense is a different matter), but the linguistically acceptable phoneme sequences are only a fraction of those formed by combining phonemes in all possible ways. Hill (1978) argues against what he calls the "segmental assumption" that progress through the utterance should be made one segment at a time, and recommends a description of speech based upon perceptually relevant "events". .[ Hill 1978 A program structure for event-based speech synthesis by rules .] This framework is interesting because it provides an opportunity for prosodic considerations to be treated as an integral part of the synthesis process. .pp The phonetic segments and other information that specify an utterance can be regarded as a list of events which describes it at a relatively high level.
Synthesis-by-rule is the act of taking this list and elaborating on it to produce lower-level events which are realized by the vocal tract, or acoustically simulated by a resonance synthesizer, to give a speech waveform. In articulatory terms, an event might be "begin tongue motion towards upper teeth with a given effort", while in resonance terms it could be "begin second formant transition towards 1500\ Hz at a given rate". (These two examples are .ul not intended to describe the same event: a tongue motion causes much more than the transition of a single formant.) Coarticulation issues such as stop burst suppression and nasal plosion should be easier to imitate within an event-based scheme than a segment-to-segment one. .pp The ISP system (Witten and Abbess, 1979) is event-based. .[ Witten Abbess 1979 .] The key to its operation is the .ul synthesis list. To prepare an utterance for synthesis, the lexical items which specify it are joined into a linked list. Figure 7.3 shows the start of the list created for .LB 1 .ul dh i z i z /*d zh aa k s /h aa u s .LE (this is Jack's house); the "1\ ...\ /*\ ...\ /\ ..." are prosodic markers which will be discussed in the next chapter. .FC "Figure 7.3" Next, the rhythm and pitch assignment routines augment the list with syllable boundaries, phoneme cluster identifiers, and duration and pitch specifications. Then it is passed to the segmental synthesis routine which chains events into the appropriate places and, as it proceeds, removes the no longer useful elements (phoneme names, pitch specifiers, etc) which originally constituted the synthesis list. Finally, an interrupt-driven speech synthesizer handler removes events from the list as they become due and uses them to control the hardware synthesizer. .pp By adopting the synthesis list as a uniform data structure for holding utterances at every stage of processing, the problems of storage allocation and garbage collection are minimized. Each list element has a forward pointer and five data words, the first indicating what type of element it is. Lexical items which may appear in the input are .LB .NI end of utterance (".", "!", ",", ";") .NI intonation indicator ("1", ...) .NI rhythm indicator ("/", "/*") .NI word boundary (" ") .NI syllable boundary ("'") .NI phoneme segment (\c .ul ar, b, ng, ...\c ) .NI explicit duration or pitch information. .LE Several of these have to do with prosodic features \(em a prime advantage of the structure is that it does not create an artificial division between segmentals and prosody. Syllable boundaries and duration and pitch information are optional. They will normally be computed by ISP, but the user can override them in the input in a natural way. The actual characters which identify lexical items are not fixed but are taken from the rule table. .pp As synthesis proceeds, new elements are chained in to the synthesis list. For segmental purposes, three types of event are defined \(em target events, increment events, and aspiration events. With each event is associated a time at which the event becomes due. For a target event, a parameter number, target parameter value, and time-increment are specified. When it becomes due, motion of the parameter towards the target is begun. If no other event for that parameter intervenes, the target value will be reached after the given time-increment. However, another target event for the parameter may change its motion before the target has been attained. 
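.pp
The way a target event drives a parameter track can be sketched as follows. The record layout and the names are illustrative and are not the actual ISP data structures; the point is simply that a due time, a target value, and a time-increment define a linear motion which a later event for the same parameter may cut short.
.sp
.nf
# Sketch of target-event handling. When an event becomes due, the parameter
# starts a linear motion towards the target, arriving after time_increment
# unless a later event for the same parameter intervenes first.
# Names and units are illustrative, not those of the ISP implementation.

from dataclasses import dataclass

@dataclass
class TargetEvent:
    due: int              # time at which the event becomes due (msec)
    param: int            # parameter number, e.g. second-formant frequency
    target: float         # value to move towards
    time_increment: int   # msec allowed to reach the target

def parameter_value(events, param, t):
    """Piecewise-linear value of one parameter at time t. The events are
    assumed sorted by due time, and the track starts at the first target."""
    events = [e for e in events if e.param == param]
    value = events[0].target
    for e, nxt in zip(events, events[1:] + [None]):
        if t < e.due:
            break
        end = e.due + e.time_increment
        if nxt is not None:
            end = min(end, nxt.due)    # a later event may cut the motion short
        frac = min(1.0, (min(t, end) - e.due) / e.time_increment)
        value = value + frac * (e.target - value)
    return value

f2 = [TargetEvent(0, 2, 1800.0, 10), TargetEvent(100, 2, 1500.0, 80)]
print(parameter_value(f2, 2, 140))     # half-way from 1800 Hz towards 1500 Hz
.fi
.sp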
Increment events contain a parameter number, a parameter increment, and a time-increment. The fixed increment is added to the parameter value throughout the time specified. This provides an easy way to make a fricative burst during the opening phase of a stop consonant. Aspiration events switch the mode of excitation from voicing to aspiration for a given period of time. Thus the aspirated part of unvoiced stops can be accomodated in a natural manner, by changing the mode of excitation for the duration of the aspiration. .RF .nr x1 (\w'excitation'/2) .nr x2 (\w'formant resonance'/2) .nr x3 (\w'fricative'/2) .nr x4 (\w'type'/2) .nr x5 (\w'frequencies (Hz)'/2) .nr x6 (\w'resonance (Hz)'/2) .nr x0 1.0i+0.7i+0.6i+0.6i+1.0i+1.2i+(\w'long vowel'/2) .nr x7 (\n(.l-\n(x0)/2 .in \n(x7u .ta 1.0i +0.7i +0.6i +0.6i +1.0i +1.2i \h'-\n(x1u'excitation \0\0\h'-\n(x2u'formant resonance \0\0\h'-\n(x3u'fricative \h'-\n(x4u'type \0\0\h'-\n(x5u'frequencies (Hz) \0\0\h'-\n(x6u'resonance (Hz) \l'\n(x0u\(ul' .sp .nr x1 (\w'voicing'/2) .nr x2 (\w'vowel'/2) \fIuh\fR \h'-\n(x1u'voicing \0490 1480 2500 \c \h'-\n(x2u'vowel \fIa\fR \h'-\n(x1u'voicing \0720 1240 2540 \h'-\n(x2u'vowel \fIe\fR \h'-\n(x1u'voicing \0560 1970 2640 \h'-\n(x2u'vowel \fIi\fR \h'-\n(x1u'voicing \0360 2100 2700 \h'-\n(x2u'vowel \fIo\fR \h'-\n(x1u'voicing \0600 \0890 2600 \h'-\n(x2u'vowel \fIu\fR \h'-\n(x1u'voicing \0380 \0950 2440 \h'-\n(x2u'vowel \fIaa\fR \h'-\n(x1u'voicing \0750 1750 2600 \h'-\n(x2u'vowel .nr x2 (\w'long vowel'/2) \fIee\fR \h'-\n(x1u'voicing \0290 2270 3090 \h'-\n(x2u'long vowel \fIer\fR \h'-\n(x1u'voicing \0580 1380 2440 \h'-\n(x2u'long vowel \fIar\fR \h'-\n(x1u'voicing \0680 1080 2540 \h'-\n(x2u'long vowel \fIaw\fR \h'-\n(x1u'voicing \0450 \0740 2640 \h'-\n(x2u'long vowel \fIuu\fR \h'-\n(x1u'voicing \0310 \0940 2320 \h'-\n(x2u'long vowel .nr x1 (\w'aspiration'/2) .nr x2 (\w'h'/2) \fIh\fR \h'-\n(x1u'aspiration \h'-\n(x2u'h .nr x1 (\w'voicing'/2) .nr x2 (\w'glide'/2) \fIr\fR \h'-\n(x1u'voicing \0240 1190 1550 \h'-\n(x2u'glide \fIw\fR \h'-\n(x1u'voicing \0240 \0650 \h'-\n(x2u'glide \fIl\fR \h'-\n(x1u'voicing \0380 1190 \h'-\n(x2u'glide \fIy\fR \h'-\n(x1u'voicing \0240 2270 \h'-\n(x2u'glide .nr x2 (\w'nasal'/2) \fIm\fR \h'-\n(x1u'voicing \0190 \0690 2000 \h'-\n(x2u'nasal .nr x1 (\w'none'/2) .nr x2 (\w'stop'/2) \fIb\fR \h'-\n(x1u'none \0100 \0690 2000 \h'-\n(x2u'stop \fIp\fR \h'-\n(x1u'none \0100 \0690 2000 \h'-\n(x2u'stop .nr x1 (\w'voicing'/2) .nr x2 (\w'nasal'/2) \fIn\fR \h'-\n(x1u'voicing \0190 1780 3300 \h'-\n(x2u'nasal .nr x1 (\w'none'/2) .nr x2 (\w'stop'/2) \fId\fR \h'-\n(x1u'none \0100 1780 3300 \h'-\n(x2u'stop \fIt\fR \h'-\n(x1u'none \0100 1780 3300 \h'-\n(x2u'stop .nr x1 (\w'voicing'/2) .nr x2 (\w'nasal'/2) \fIng\fR \h'-\n(x1u'voicing \0190 2300 2500 \h'-\n(x2u'nasal .nr x1 (\w'none'/2) .nr x2 (\w'stop'/2) \fIg\fR \h'-\n(x1u'none \0100 2300 2500 \h'-\n(x2u'stop \fIk\fR \h'-\n(x1u'none \0100 2300 2500 \h'-\n(x2u'stop .nr x1 (\w'frication'/2) .nr x2 (\w'voice + fric'/2) .nr x3 (\w'fricative'/2) \fIs\fR \h'-\n(x1u'frication 6000 \h'-\n(x3u'fricative \fIz\fR \h'-\n(x2u'voice + fric \0190 1780 3300 6000 \h'-\n(x3u'fricative \fIsh\fR \h'-\n(x1u'frication 2300 \h'-\n(x3u'fricative \fIzh\fR \h'-\n(x2u'voice + fric \0190 2120 2700 2300 \h'-\n(x3u'fricative \fIf\fR \h'-\n(x1u'frication 4000 \h'-\n(x3u'fricative \fIv\fR \h'-\n(x2u'voice + fric \0190 \0690 3300 4000 \h'-\n(x3u'fricative \fIth\fR \h'-\n(x1u'frication 5000 \h'-\n(x3u'fricative \fIdh\fR \h'-\n(x2u'voice + fric \0190 1780 3300 5000 \h'-\n(x3u'fricative \l'\n(x0u\(ul' .ta 
0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.4 Rule table for an event-based synthesis-by-rule program" .pp Now the rule table, which is shown in Table 7.4, holds simple target positions for each phoneme segment, as well as the segment type. The latter is used to trigger events by computer procedures which have access to the context of the segment. In principle, this allows considerably more sophistication to be introduced than does a simple segment-by-segment approach. .RF .nr x1 0.5i+0.5i+\w'preceding consonant in this syllable (suppress burst if fricative)'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 0.5i +0.5i fricative bursts on stops aspiration bursts on unvoiced stops, affected by preceding consonant in this syllable (suppress burst if fricative) following consonant (suppress burst if another stop; introduce nasal plosion if a nasal) prosodics (increase burst if syllable is stressed) voice bar on voiced stops (in intervocalic position) post-voicing on terminating voiced stops, if syllable is stressed anticipatory coarticulation for \fIh\fR vowel colouring when a nasal or glide follows .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 7.5 Some coarticulation effects" .pp For example, Table 7.5 summarizes some of the subtleties of the speech production process which have been mentioned earlier in this chapter. Most of them are context-dependent, with the prosodic context (whether two segments are in the same syllable; whether a syllable is stressed) playing a significant role. A scheme where data-dependent "demons" fire on particular patterns in a linked list seems to be a sensible approach towards incorporating such rules. .rh "Discussion." There are two opposing trends in speech synthesis by rule. On the one hand larger and larger segment inventories can be used, containing more and more allophones explicitly. This is the approach of the Votrax sound-segment synthesizer, discussed in Chapter 11. It puts an increasing burden on the person who codes the utterances for synthesis, although, as we shall see, computer programs can assist with this task. On the other hand the segment inventory can be kept small, perhaps comprising just the logical phonemes as in the ISP system. This places the onus on the computer program to accommodate allophonic variations, and to do so it must take account of the segmental and prosodic context of each phoneme. An event-based approach seems to give the best chance of incorporating contextual modification whilst avoiding undesired interactions. .pp The second trend brings synthesis closer to the articulatory process of speech production. In fact an event-based system would be an ideal way of implementing an articulatory model for speech synthesis by rule. It would be much more satisfying to have the rule table contain articulatory target positions instead of resonance ones, with events like "begin tongue motion towards upper teeth with a given effort". The problem is that hard data on articulatory postures and constraints is much more difficult to gather than resonance information. .pp An interesting question that relates to articulation is whether formant motion can be simulated adequately by a small number of linear pieces. The segment-by-segment system described above had as many as nine pieces for a single phoneme, for some phonemes had three phases and each one contributes up to three pieces (transition in, steady state, and transition out).
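.pp
To make the question concrete, here is how one parameter might be rendered as three linear pieces \(em steady state, transition, steady state. It is a sketch only; the frequencies, durations and frame rate are invented.
.sp
.nf
# Sketch: one formant track built from three linear pieces.
# All figures are invented for illustration.

def linear_pieces(start_hz, end_hz, steady_ms, trans_ms, frame_ms=10):
    """Hold start_hz, move linearly to end_hz, then hold end_hz."""
    track = [start_hz] * (steady_ms // frame_ms)
    steps = trans_ms // frame_ms
    for i in range(1, steps + 1):
        track.append(start_hz + (end_hz - start_hz) * i / steps)
    track += [end_hz] * (steady_ms // frame_ms)
    return track

# Second formant moving from a vowel steady state towards a consonant locus.
print(linear_pieces(1800.0, 1200.0, steady_ms=40, trans_ms=60))
.fi
.sp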
Another system used curves of decaying exponential form which ensured that all transitions started rapidly towards the target position but slowed down as it was approached (Rabiner, 1968, 1969). .[ Rabiner 1968 Speech synthesis by rule Bell System Technical J .] .[ Rabiner 1969 A model for synthesizing speech by rule .] The time-constant of decay was stored with each segment in the rule table. The rhythm of the synthetic speech was controlled at this level, for the next segment was begun when all the formants had attained values sufficiently close to the current targets. This is a poor model of the human speech production process, where rhythm is dictated at a relatively high level and the next phoneme is not simply started when the current one happens to end. Nevertheless, the algorithm produced smooth, continuous formant motions not unlike those found in spectrograms. .pp There is, however, by no means universal agreement on decaying exponential formant motions. Lawrence (1974) divided segments into "checked" and "free" categories, corresponding roughly to consonants and vowels; and postulated .ul increasing exponential transitions into checked segments, and decaying transitions into free ones. .[ Lawrence 1974 .] This is a reasonable supposition if you consider the mechanics of articulation. The speed of movement of the tongue (for example) is likely to increase until it is physically stopped by reaching the roof of the mouth. When moving away from a checked posture into a free one the transition will be rapid at first but slow down to approach the target asymptotically, governed by proprioceptive feedback. .pp The only thing that seems to be agreed is that the formant tracks should certainly .ul not be piecewise linear. However, in the face of conflicting opinions as to whether exponentials should be decaying or increasing, piecewise linear motions seem to be a reasonable compromise! It is likely that the precise shape of formant tracks is unimportant so long as the gross features are imitated correctly. Nevertheless, this is a question which an articulatory model could help to answer. .sh "7.6 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "7.7 Further reading" .pp There are unfortunately few books to recommend on the subject of joining segments of speech. The references form a representative and moderately comprehensive bibliography. Here is some relevant background reading in linguistics. .LB "nn" .\"Fry-1976-1 .]- .ds [A Fry, D.B.(Editor) .ds [D 1976 .ds [T Acoustic phonetics .ds [I Cambridge Univ Press .ds [C Cambridge, England .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n This book of readings contains many classic papers on acoustic phonetics published from 1922\-1965. It covers much of the history of the subject, and is intended primarily for students of linguistics. .in-2n .\"Lehiste-1967-2 .]- .ds [A Lehiste, I.(Editor) .ds [D 1967 .ds [T Readings in acoustic phonetics .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n Another basic collection of references which covers much the same ground as Fry (1976), above. .in-2n .\"Sivertsen-1961-3 .]- .ds [A Sivertsen, E. .ds [D 1961 .ds [K * .ds [T Segment inventories for speech synthesis .ds [J Language and Speech .ds [V 4 .ds [P 27-89 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n This is a careful early study of the quantitative implications of using phonemes, demisyllables, syllables, and words as the basic building blocks for speech synthesis. 
.in-2n .LE "nn" .EQ delim $$ .EN .CH "8 PROSODIC FEATURES IN SPEECH SYNTHESIS" .ds RT "Prosodic features .ds CX "Principles of computer speech .pp Prosodic features are those which characterize an utterance as a whole, rather than having a local influence on individual sound segments. For speech output from computers, an "utterance" usually comprises a single unit of information which stretches over several words \(em a clause or sentence. In natural speech an utterance can be very much longer, but it will be broken into prosodic units which are again roughly the size of a clause or sentence. These prosodic units are certainly closely related to each other. For example, the pitch contour used when introducing a new topic is usually different from those employed to develop it subsequently. However, for the purposes of synthesis the successive prosodic units can be treated independently, and information about pitch contours to be used will have to be specified in the input for each one. The independence between them is not complete, though, and lower-level contextual effects, such as interpolation of pitch between the end of one prosodic unit and the start of the next, must still be imitated. .pp Prosodic features were introduced briefly in Chapter 2. Variations in voice dynamics occur in three dimensions: pitch of the voice, time, and amplitude. These dimensions are inextricably intertwined in living speech. Variations in voice quality are much less important for the factual kind of speech usually sought in voice response applications, although they can play a considerable part in conveying emotions (for a discussion of the acoustic manifestations of emotion in speech, see Williams and Stevens, 1972). .[ Williams Stevens 1972 .] .pp The distinction between prosodic and segmental effects is a traditional one, but it becomes rather fuzzy when examined in detail. It is analogous to the distinction between hardware and software in computer science: although useful from some points of view the borderline becomes blurred as one gets closer to actual systems \(em with microcode, interrupts, memory management, and the like. At a trivial level, prosodics cannot exist without segmentals, for there must be some vehicle to carry the prosodic contrasts. Timing \(em a prosodic feature \(em is actually realized by the durations of individual segments. Pauses are tantamount to silent segments. .pp While pitch may seem to be relatively independent of segmentals \(em and this view is reinforced by the success of the source-filter model which separates the frequency of the excitation source from the filter characteristics \(em there are some subtle phonetic effects of pitch. It has been observed that pitch drops on the transition into certain consonants, and rises again on the transition out (Haggard .ul et al, 1970). .[ Haggard Ambler Callow 1970 .] This can be explained in terms of variations in pressure from the lungs on the vocal cords (Ladefoged, 1967). .[ Ladefoged 1967 .] Briefly, the increase in mouth pressure which occurs during some consonants causes a reduction in the pressure difference across the vocal cords and in the rate of flow of air between them. This results in a decrease in their frequency of vibration. When the constriction is released, there is a temporary increase in the air flow which increases the pitch again. The phenomenon is called "microintonation". It is particularly noticeable in voiced stops, but also occurs in voiced fricatives and unvoiced stops.
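.pp
As a rough indication of how a synthesis program might superimpose the effect on an otherwise smooth contour, consider the following sketch. The dip of 10\ Hz, the ramp length, and the frame-oriented representation are all illustrative assumptions.
.sp
.nf
# Sketch: superimpose a microintonation dip on a pitch contour around the
# closure and release of a consonant. Depth and timing are illustrative.

def add_microintonation(f0_track, closure_frame, release_frame,
                        dip_hz=10.0, ramp_frames=4):
    """Lower F0 going into the consonant and let it recover after release."""
    track = list(f0_track)
    for i in range(len(track)):
        if closure_frame <= i <= release_frame:
            track[i] -= dip_hz                    # reduced airflow: lower F0
        elif closure_frame - ramp_frames <= i < closure_frame:
            frac = (i - (closure_frame - ramp_frames)) / ramp_frames
            track[i] -= dip_hz * frac             # fall into the closure
        elif release_frame < i <= release_frame + ramp_frames:
            frac = (release_frame + ramp_frames - i) / ramp_frames
            track[i] -= dip_hz * frac             # rise again after the release
    return track

flat = [120.0] * 20                               # 10 msec frames, say
print(add_microintonation(flat, closure_frame=8, release_frame=12))
.fi
.sp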
Simulation of the effect in synthesis-by-rule has often been found to give noticeable improvements in the speech quality. .pp Loudness also has a segmental role. For example, we noted in the last chapter that amplitude values play a small part in identification of fricatives. In fact loudness is a very .ul weak prosodic feature. It contributes little to the perception of stress. Even for shouting the distinction from normal speech is as much in the voice quality as in amplitude .ul per se. It is not necessary to consider varying loudness on a prosodic basis in most speech synthesis systems. .pp The above examples show how prosodic features have segmental influences as well. The converse is also true: some segmental features have a prosodic effect. The last chapter described how stress is associated with increased aspiration of syllable-initial unvoiced stops. Furthermore, stressed syllables are articulated with greater effort than unstressed ones, and hence the formant transitions are more likely to attain their target values under circumstances which would otherwise cause them to fall short. In unstressed syllables, extreme vowels (like .ul ee, aa, uu\c ) tend to more centralized sounds (like .ul i, uh, u respectively). Although all British English vowels .ul can appear in unstressed syllables, they often become "reduced" into a centralized form. Consider the following examples. .LB .NI diplomat \ .ul d i p l uh m aa t .NI diplomacy \ .ul d i p l uh u m uh s i .NI diplomatic \ .ul d i p l uh m aa t i k. .LE The vowel of the second syllable is reduced to .ul uh in "diplomat" and "diplomatic", whereas the root form "diploma", and also "diplomacy", has a diphthong (\c .ul uh u\c ) there. The third syllable has an .ul aa in "diplomat" and "diplomatic" which is reduced to .ul uh in "diplomacy". In these cases the reduction is shown explicitly in the phonetic transcription; but in more marginal examples where it is less extreme it will not be. .pp I have tried to emphasize in previous chapters that prosodic features are important in speech synthesis. There is something very basic about them. Rhythm is an essential part of all bodily activity \(em of breathing, walking, working and playing \(em and so it pervades speech too. Mothers and babies communicate effectively using intonation alone. Some experiments have indicated that the language environment of an infant affects his babbling at an early age, before he has effective segmental control. There is no doubt that "tone of voice" plays a large part in human communication. .pp However, early attempts at synthesis did not pay too much attention to prosodics, perhaps because it was thought sufficient to get the meaning across by providing clear segmentals. As artificial speech grows more widespread, however, it is becoming apparent that its acceptability to users, and hence its ultimate success, depends to a large extent on incorporating natural-sounding prosodics. Flat, arhythmic speech may be comprehensible in short stretches, but it strains the concentration in significant discourse and people are not usually prepared to listen to it. Unfortunately, current commercial speech output systems do not really tackle prosodic questions, which indicates our present rather inadequate state of knowledge. .pp The importance of prosodics for automatic speech .ul recognition is beginning to be appreciated too. 
Some research projects have attended to the automatic identification of points of stress, in the hope that the clear articulation of stressed syllables can be used to provide anchor points in an unknown utterance (for example, see Lea .ul et al, 1975). .[ Lea Medress Skinner 1975 .] .pp But prosodics and segmentals are closely intertwined. I have chosen to treat them in separate chapters in order to split the material up into manageable chunks rather than to enforce a deep division between them. It is also true that synthesis of prosodic features is an uncharted and controversial area, which gives this chapter rather a different flavour from the last. It is hard to be as definite about alternative strategies and methods as you can for segment concatenation. In order to make the treatment as concrete and down-to-earth as possible, I will describe in some detail two example projects in prosodic synthesis. The first treats the problem of transferring pitch from one utterance to another, while the second considers how artificial timing and pitch can be assigned to synthetic speech. These examples illustrate quite different problems, and are reasonably representative of current research activity. (Other systems are described by Mattingly, 1966; Rabiner .ul et al, 1969.) Before .[ Mattingly 1966 .] .[ Rabiner Levitt Rosenberg 1969 .] looking at the two examples, we will discuss a feature which is certainly prosodic but does not appear in the list given earlier \(em stress. .sh "8.1 Stress" .pp Stress is an everyday notion, and when listening to natural speech people can usually agree on which syllables are stressed. But it is difficult to characterize in acoustic terms. From the speaker's point of view, a stressed syllable is produced by pushing more air out of the lungs. For a listener, the points of stress are "obvious". You may think that stressed syllables are louder than the others: however, instrumental studies show that this is not necessarily (nor even usually) so (eg Lehiste and Peterson, 1959). .[ Lehiste Peterson 1959 .] Stressed syllables frequently have a longer vowel than unstressed ones, but this is by no means universally true \(em if you say "little" or "bigger" you will find that the vowel in the first, stressed, syllable is short and shows little sign of lengthening as you increase the emphasis. Moreover, experiments using bisyllabic nonsense words have indicated that some people consistently judge the .ul shorter syllable to be stressed in the absence of other clues (Morton and Jassem, 1965). .[ Morton Jassem 1965 .] Pitch often helps to indicate stress. It is not that stressed syllables are always higher- or lower-pitched than neighbouring ones, or even that they are uttered with a rising or falling pitch. It is the .ul rate of change of pitch that tends to be greater for stressed syllables: a sharp rise or fall, or a reversal of direction, helps to give emphasis. .pp Stress is acoustically manifested in timing and pitch, and to a much lesser extent in loudness. However it is a rather subtle feature and does .ul not correspond simply to duration increases or pitch rises. It seems that listeners unconsciously put together all the clues that are present in an utterance in order to deduce which syllables are stressed. It may be that speech is perceived by a listener with reference to how he would have produced it himself, and that this is how he detects which syllables were given greater vocal effort. 
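.pp
To give a feel for how such clues might be weighed automatically, the sketch below scores syllables by the rate of change of pitch across them. It is illustrative only: the figures are invented, and a serious system would fold in duration, and perhaps amplitude, as well.
.sp
.nf
# Sketch: rank syllables by rate of change of pitch, one of several possible
# clues to salience. Frame rate and example figures are invented.

def pitch_change_rate(f0_track, start, end, frame_ms=10):
    """Mean absolute frame-to-frame F0 change (Hz/sec) over one syllable."""
    pairs = zip(f0_track[start:end - 1], f0_track[start + 1:end])
    deltas = [abs(b - a) for a, b in pairs]
    if not deltas:
        return 0.0
    return sum(deltas) / len(deltas) * (1000.0 / frame_ms)

def most_salient(f0_track, syllable_bounds):
    """Index of the syllable with the sharpest pitch movement."""
    scores = [pitch_change_rate(f0_track, s, e) for s, e in syllable_bounds]
    return max(range(len(scores)), key=scores.__getitem__)

f0 = [120, 121, 122, 150, 170, 160, 140, 130, 128, 127]
print(most_salient(f0, [(0, 3), (3, 7), (7, 10)]))   # the middle syllable
.fi
.sp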
.pp The situation is confused by the fact that certain syllables in words are often said in ordinary language to be "stressed" on account of their position in the word. For example, the words "diplomat", "diplomacy", and "diplomatic" have stress on the first, second, and third syllables respectively. But here we are talking about the word itself rather than any particular utterance of it. The "stress" is really .ul latent in the indicated syllables and only made manifest upon uttering them, and then to a greater or lesser degree depending on exactly how they are uttered. .pp Some linguists draw a careful distinction between salient syllables, accented syllables, and stressed syllables, although the words are sometimes used differently by different authorities. I will not adopt a precise terminology here, but it is as well to be aware of the subtle distinctions involved. The term "salience" is applied to actual utterances, and salient syllables are those that are perceived as being more prominent than their neighbours. "Accent" is the potential for salience, as marked, for example, in a dictionary or lexicon. Thus the discussion of the "diplo-" words above is about accent. Stress is an articulatory phenomenon associated with increased muscular activity. Usually, syllables which are perceived as salient were produced with stress, but in shouting, for example, all syllables can be stressed \(em even non-salient ones. Furthermore, accented syllables may not be salient. For instance, the first syllable of the word "very" is accented, that is, potentially salient, but in a sentence as uttered it may or may not be salient. One can say .LB "\c .ul he's very good" .LE with salience on "he" and possibly "good", or .LB "he's .ul very good" .LE with salience on the first syllable of "very", and possibly "good". .pp Non-standard stress patterns are frequently used to bring out contrasts. Words like "a" and "the" are normally unstressed, but can be stressed in contexts where ambiguity has arisen. Thus factors which operate at a much higher level than the phonetic structure of the utterance must be taken into account when deciding where stress should be assigned. These include syntactic and semantic considerations, as well as the attitude of the speaker and the likely attitude of the listener to the material being spoken. For example, I might say .LB "Anna .ul and Nikki should go", .LE with emphasis on the "and" purely because I was aware that my listener might quibble about the expense of sending them both. Clearly some notation is needed to communicate to the synthesis process how the utterance is supposed to be rendered. .sh "8.2 Transferring pitch from one utterance to another" .pp For speech stored in source-filter form and concatenated on a slot-filling basis, it would be useful to have stored typical pitch contours which can be applied to the synthetic utterances. From a practical point of view it is important to be able to generate natural-sounding pitch for high-quality artificial speech. Although several algorithms for creating completely synthetic contours have been proposed \(em and we will examine one later in this chapter \(em they are unsuitable for high-quality speech. They are generally designed for use with synthesis-by-rule from phonetics, and the rather poor quality of articulation does not encourage the development of excellent pitch assignment procedures. 
With speech synthesized by rule there is generally an emphasis on keeping the data storage requirements to a minimum, and so it is not appropriate to store complete contours. Moreover, if speech is entered in textual form as phoneme strings, it is natural to attach pitch information as markers in the text rather than by entering a complete and detailed contour. .pp The picture is rather different for concatenated segments of natural speech. In the airline reservation system, with utterances formed from templates like .LB Flight number \(em leaves \(em at \(em , arrives in \(em at \(em , .LE it is attractive to store the pitch contour of one complete instance of the utterance and apply it to all synthetic versions. .pp There is an enormous literature on the anatomy of intonation, and much of it rests upon the notion of a pitch contour as a descriptive aid to analysis. Underlying this is the assumption, usually unstated, that a contour can be discussed independently of the particular stream of words that manifests it; that a single contour can somehow be bound to any sentence (or phrase, or clause) to produce an acceptable utterance. But the contour, and its binding, are generally described only at the grossest level, the details being left unspecified. .pp There are phonetic influences on pitch \(em the characteristic lowering during certain consonants was mentioned above \(em and these are not normally considered as part of intonation. Such effects will certainly spoil attempts to store contours extracted from living speech and apply them to different utterances, but the impairment may not be too great, for pitch is only one of many segmental clues to consonant identification. .pp In the system mentioned earlier which generated 7-digit telephone numbers by concatenating formant-coded words, a single natural pitch contour was applied to all utterances. It was taken to match as well as possible the general shape of the contours measured in naturally-spoken telephone numbers. However, this is a very restricted environment, for telephone numbers exhibit almost no variety in the configuration of stressed and unstressed syllables \(em the only digit which is not a monosyllable is "seven". Significant problems arise when more general utterances are considered. .pp Suppose the pitch contour of one utterance (the "source") is to be transferred to another (the "target"). Assume that the utterances are encoded in source-filter form, either as parameter tracks for a formant synthesizer or as linear predictive coefficients. Then there are no technical obstacles to combining pitch and segmentals. The source must be available as a complete utterance, while the target may be formed by concatenating smaller units such as words. .pp For definiteness, we will consider utterances of the form .LB The price is \(em dollars and \(em cents, .LE where the slots are filled by numbers less than 100; and of the form .LB The price is \(em cents. .LE The domain of prices encompasses a wide range of syllable configurations. There are between one and five syllables in each variable part, if the numbers are restricted to be less than 100. The sentences have a constant pragmatic, semantic, and syntactic structure. As in the vast majority of real-life situations, minimal phonetic distinctions between utterances do not occur. .pp Pitch transfer is complicated by the fact that values of the source pitch are only known during the voiced parts of the utterance. 
Although it would certainly be possible to extrapolate pitch over unvoiced parts, this would introduce some artificiality into the otherwise completely natural contours. Let us assume, therefore, that the pitch contour of the voiced nucleus of each syllable in the source is applied to the corresponding syllable nucleus in the target. .pp The primary factors which might tend to inhibit successful transfer are .LB .NP different numbers of syllables in the utterances; .NP variations in the pattern of stressed and unstressed syllables; .NP different syllable durations; .NP pitch discontinuities; .NP phonetic differences between the utterances. .LE .rh "Syllabification." It is essential to take into account the syllable structures of the utterances, so that pitch is transferred between corresponding syllables rather than over the utterance as a whole. Fortunately, syllable boundaries can be detected automatically with a fair degree of accuracy, especially if the speech is carefully enunciated. It is worth considering briefly how this can be done, even though it takes us off the main topic of synthesis and into speech analysis. .pp A procedure developed by Mermelstein (1975) involves integrating the spectral energy at each point in the utterance. .[ Mermelstein 1975 Automatic segmentation of speech into syllabic units .] First the low (<500\ Hz) and high (>4000\ Hz) ends are filtered out with 12\ dB/octave cutoffs. The resulting energy signal is smoothed by a 40\ Hz lowpass filter, giving a so-called "loudness" function. All this can be accomplished with simple recursive digital filters. .pp Then, the loudness function is compared with its convex hull. The convex hull is the shape a piece of elastic would assume if stretched over the top of the loudness function and anchored down at both ends, as illustrated in Figure 8.1. .FC "Figure 8.1" The point of maximum difference between the hull and loudness function is taken to be a tentative syllable boundary. The hull is recomputed, but anchored to the actual loudness function at the tentative boundary, and the points of maximum hull-loudness difference in each of the two halves are selected as further tentative boundaries. The procedure continues recursively until the maximum hull-loudness difference, with the hull anchored at each tentative boundary, falls below a certain minimum (say 4\ dB). .pp At this stage, the number of tentative boundaries will greatly exceed the actual number of syllables (by a factor of around 5). Many of the extraneous boundaries are eliminated by the following constraints: .LB .NP if two boundaries lie within a certain time of each other (say 120\ msec), one of them is discarded; .NP if the maximum loudness within a tentative syllable falls too far short of the overall maximum for the utterance (more than 20\ dB), one boundary is discarded. .LE The question of which boundary to discard can be decided by examining the voicing continuity of the utterance. If possible, voicing across a syllable boundary should be avoided. Otherwise, the boundary with the smallest hull-loudness difference should be rejected. 
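.pp
The recursive part of the procedure is easily expressed in program form. The sketch below takes a smoothed loudness function (in dB, one value per frame) as given, and returns tentative boundaries using the 4\ dB threshold mentioned above; the filtering and the pruning constraints are omitted. It is an illustration of the idea rather than a reconstruction of Mermelstein's implementation.
.sp
.nf
# Sketch of the convex-hull boundary finder. Place a tentative syllable
# boundary at the point of greatest hull-loudness difference, then recurse
# on each half until the difference falls below a threshold.

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull(points):
    """Upper convex hull of (x, y) points given in order of increasing x:
    the shape of elastic stretched over the curve and pinned at both ends."""
    hull = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def hull_value(hull, x):
    """Height of the hull at x, by linear interpolation between vertices."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return hull[-1][1]

def boundaries(loudness, lo, hi, min_dip_db=4.0):
    """Tentative syllable boundaries between the anchor frames lo and hi."""
    hull = upper_hull([(i, loudness[i]) for i in range(lo, hi + 1)])
    dip, at = max((hull_value(hull, i) - loudness[i], i)
                  for i in range(lo, hi + 1))
    if dip < min_dip_db or hi - lo < 2:
        return []
    return (boundaries(loudness, lo, at, min_dip_db) + [at] +
            boundaries(loudness, at, hi, min_dip_db))

loud = [0, 30, 38, 35, 20, 33, 36, 10, 28, 31, 0]      # invented loudness, dB
print(boundaries(loud, 0, len(loud) - 1))              # dips at frames 4 and 7
.fi
.sp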
.RF .nr x0 \w'boundaries moved slightly to correspond better with voicing:' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 3.4i +0.5i \l'\n(x0u\(ul' .sp total syllable count: 332 boundaries missed by algorithm: \0\09 (3%) extra boundaries inserted by algorithm: \029 (9%) boundaries moved slightly to correspond better with voicing: \0\03 (1%) .sp total errors: \041 (12%) \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 8.1 Success of the syllable segmentation procedure" .pp Table 8.1 illustrates the success of this syllabification procedure, in a particular example. Segmentation is performed with less than 10% of extraneous boundaries being inserted, and much less than 10% of actual boundaries being missed. These figures are rather sensitive to the values of the three thresholds. The values were chosen to err on the side of over-zealous syllabification, because all the boundaries need to be checked by ear and eye and it is easier to delete a boundary by hand than to insert one at an appropriate place. It may well be that with careful optimization of thresholds, better figures could be achieved. .rh "Stressed and unstressed syllables." If the source and target utterances have the same number of syllables, and the same pattern of stressed and unstressed syllables, pitch can simply be transferred from a syllable in the source to the corresponding one in the target. But if the pattern differs \(em even though the number of syllables may be the same, as in "eleven" and "seventeen" \(em then a one-to-one mapping will conflict with the stress points, and certainly sound unnatural. Hence an attempt should be made to ensure that the pitch is mapped in a plausible way. .pp The syllables of each utterance can be classified as "stressed" and "unstressed". This distinction could be made automatically by inspection of the pitch contour, within the domain of utterances used, and possibly even in general (Lea .ul et al, 1975). .[ Lea Medress Skinner 1975 .] However, in many cases it is expedient to perform the job by hand. In our example, the sentences have fixed "carrier" parts and variable "number" parts. The stressed carrier syllables, namely .LB "... price ... dol\- ... cents", .LE can be marked as such, by hand, to facilitate proper alignment between the source and target. This marking would be difficult to do automatically because it would be hard to distinguish the carrier from the numbers. .pp Even after classifying the syllables as "carrier stressed", "stressed", and "unstressed", alignment still presents problems, because the configuration of syllables in the variable parts of the utterances may differ. Syllables in the source which have no correspondence in the target can be ignored. The pitch track of the source syllable can be replicated for each additional syllable in corresponding position in the target. Of course, a stressed syllable should be selected for copying if the unmatched target syllable is stressed, and similarly for unstressed ones. It is rather dangerous to copy exactly a part of a pitch contour, for the ear is very sensitive to the juxtaposition of identically intoned segments of speech \(em especially when the segment is stressed. To avoid this, whenever a stressed syllable is replicated the pitch values should be decreased by, say, 20%, on the second copy. 
It sometimes happens that a single stressed syllable in the source needs to cover a stressed-unstressed pair in the target: in this case the first part of the source pitch track can be used for the stressed syllable, and the remainder for the unstressed one. .pp The example of Figure 8.2 will help to make these rules clear. .FC "Figure 8.2" Note that the marking alone is done by hand. The detailed mapping decisions can be left to the computer. The rules were derived intuitively, and do not have any sound theoretical basis. They are intended to give reasonable results in the majority of cases. .pp Figure 8.3 shows the result of transferring the pitch from "the price is ten cents" to "the price is seventy-seven cents". .FC "Figure 8.3" The syllable boundaries which are marked were determined automatically. The use of the last 30% of the "ten" contour to cover the first "-en" syllable, and its replication to serve the "-ty" syllable, can be seen. However, the 70%\(em30% proportion is applied to the source contour, and the linear distortion (described next) upsets the proportion in the target utterance. The contour of the second "seven" can be seen to be a replication of that of the first one, lowered by 20%. Notice that the pitch extraction procedure has introduced an artifact into the final part of one of the "cents" contours by doubling the pitch. .rh "Stretching and squashing." The pitch contour over a source syllable nucleus must be stretched or squashed to match the duration of the target nucleus. It is difficult to see how anything other than linear stretching and squashing could be done without considerably increasing the complexity of the procedure. The gross non-linearities will have been accounted for by the syllable alignment process, and so simple linear time-distortion should not cause too much degradation. .rh "Pitch discontinuities." Sudden jumps in pitch during voiced speech sound peculiar, although they can in fact be produced naturally (by yodelling). People frequently burst into laughter on hearing them in synthetic speech. It is particularly important to avoid this diverting effect in voice response applications, for the listener's attention is instantly directed away from what is said to the voice that speaks. .pp Discontinuities can arise in the pitch-transfer procedure either by a voiced-unvoiced-voiced transition between syllables mapping on to a voiced-voiced transition in the target, or by voicing continuity being broken when the syllable alignment procedure drops or replicates a syllable. There are several ways in which at least some of the possibilities can be avoided. For example, one could hold unstressed syllables at a constant pitch whose value coincides with either the end of the previous syllable's contour or the beginning of the next syllable's contour, depending on which transition is voiced. Alternatively, the policy of reserving the trailing part of a stressed syllable in the source to cover an unmatched following unstressed syllable in the target could be generalized to allow use of the leading 30% of the next stressed syllable's contour instead, if that maintained voicing continuity. A third solution is simply to merge the pitch contours at a discontinuity by mixing the average pitch value at the break with the pitch contour on either side of it in a proportion which increases linearly from the edges of the domain of influence to the discontinuity. 
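.pp
The third solution is simple enough to show in a few lines of Python. The fragment below is merely a sketch of the idea, not the routine actually used: it takes the voiced stretch on either side of the break as an array of pitch values, and the width of the domain of influence (here a fixed number of frames, chosen arbitrarily) is an assumption of the sketch.
.nf
import numpy as np

def merge_at_break(left, right, span=5):
    """Mix the average of the two pitch values at the break into the contour
    on either side, with a weight that rises linearly from almost nothing at
    the edge of the domain of influence to one at the discontinuity itself."""
    left, right = np.array(left, dtype=float), np.array(right, dtype=float)
    avg = 0.5 * (left[-1] + right[0])
    n = min(span, len(left), len(right))
    w = np.linspace(0.0, 1.0, n + 1)[1:]               # 1/n, 2/n, ..., 1
    left[-n:] = (1 - w) * left[-n:] + w * avg           # weight 1 at the break
    right[:n] = (1 - w[::-1]) * right[:n] + w[::-1] * avg
    return left, right
.fi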
Figure 8.4 shows the effect of this merging, when the pitch contour of "the price is seven cents" is transferred to "the price is eleven cents". .FC "Figure 8.4" Of course, the interpolated part will not necessarily be linear. .rh "Results of an experiment on pitch transfer." Some experiments have been conducted to evaluate the performance of this pitch transfer method on the kind of utterances discussed above (Witten, 1979). .[ Witten 1979 On transferring pitch from one utterance to another .] First, the source and target sentences were chosen to be lexically identical, that is, the same words were spoken. For this experiment alone, expert judges were employed. Each sentence was recorded twice (by the same person), and pitch was transferred from copy A to copy B and vice versa. Also, the originals were resynthesized from their linear predictive coefficients with their own pitch contours. Although all four often sounded extremely similar, sometimes the pitch contours of originals A and B were quite different, and in these cases it was immediately obvious to the ear that two of the four utterances shared the same intonation, which was different to that shared by the other two. .pp Experienced researchers in speech analysis-synthesis served as judges. In order to make the test as stringent as possible it was explained to them exactly what had been done, except that the order of the utterances in each quadruple was kept secret. They were asked to identify which two of the four sentences did not have their original contours, and were allowed to listen to each quadruple as often as they liked. On occasion they were prepared to identify only one, or even none, of the sentences as artificial. .pp The result was that an utterance with pitch transferred from another, lexically identical, one is indistinguishable from a resynthesized version of the original, even to a skilled ear. (To be more precise, this hypothesis could not be rejected even at the 1% level of statistical significance.) This gave confidence in the transfer procedure. However, one particular judge was quite successful at identifying the bogus contours, and he attributed his success to the fact that on occasion the segmental durations did not accord with the pitch contour. This casts a shadow of suspicion on the linear stretching and squashing mechanism. .pp The second experiment examined pitch transfers between utterances having only one variable part each ("the price is ... cents") to test the transfer method under relatively controlled conditions. Ten sentences of the form .LB "The price is \(em cents" .LE were selected to cover a wide range of syllable structures. Each one was regenerated with pitch transferred from each of the other nine, and these nine versions were paired with the original resynthesized with its natural pitch. The $10 times 9=90$ resulting pairs were recorded on tape in random order. .pp Five males and five females, with widely differing occupations (secretaries, teachers, academics, and students), served as judges. Written instructions explained that the tape contained pairs of sentences which were lexically identical but had a slight difference in "tone of voice", and that the subjects were to judge which of each pair sounded "most natural and intelligible". The response form gave the price associated with each pair \(em a preliminary experiment had shown that there was never any difficulty in identifying this \(em and a column for decision. 
With each decision, the subjects recorded their confidence in the decision. Subjects could rest at any time during the test, which lasted for about 30 minutes, but they were not permitted to hear any pair a second time. .pp Defining a "success" to be a choice of the utterance with natural pitch as the best of a pair, the overall success rate was about 60%. If choices were random, one would of course expect only a 50% success rate, and the figure obtained was significantly different from this. Almost half the choices were correct and made with high confidence; high-confidence but incorrect choices accounted for a quarter of the judgements. .pp To investigate structural effects in the pitch transfer process, low confidence decisions were ignored to eliminate noise, and the others lumped together and tabulated by source and target utterance. The number of stressed and unstressed syllables does not appear to play an important part in determining whether a particular utterance is an easy target. For example, it proved to be particularly difficult to tell .EQ delim @@ .EN natural from transferred contours with utterances $0.37 and $0.77. .EQ delim $$ .EN In fact, the results showed no better than random discrimination for them, even though the decisions in which listeners expressed little confidence had been discarded. Hence it seems that the syllable alignment procedure and the policy of replication were successful. .pp .EQ delim @@ .EN The worst target scores were for utterances $0.11 and $0.79. .EQ delim $$ .EN Both of these contained large unbroken voiced periods in the "variable" part \(em almost twice as long as the next longest voiced period. The first has an unstressed syllable followed by a stressed one with no break in voicing, involving, in a natural contour, a fast but continuous climb in pitch over the juncture, and it is not surprising that it proved to be the most difficult target. A more sophisticated "smoothing" algorithm than the one used may be worth investigating. .pp In a third experiment, sentences with two variable parts were used to check that the results of the second experiment extended to more complex utterances. The overall success rate was 75%, significantly different from chance. However, a breakdown of the results by source and target utterance showed that there was one contour (for the utterance "the price is 19 dollars and 8 cents") which exhibited very successful transfer, subjects identifying the transferred-pitch utterances at only a chance level. .pp Finally, transfers of pitch from utterances with two variable parts to those with one variable part were tested. Pitch contours were transferred to sentences with the same "cents" figure but no "dollars" part; for example, "the price is five dollars and thirteen cents" to "the price is thirteen cents". The contour was simply copied between the corresponding syllables, so that no adjustment needed to be made for different syllable structures. The overall score was 60 successes in 100 judgements \(em the same percentage as in the second experiment. 
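.pp
The claims of statistical significance made above amount to asking how likely such a score would be if the listeners were merely guessing. For the last experiment the counts are stated explicitly (60 successes in 100 judgements), so the exact binomial tail probability can be computed directly; the few lines of Python below are purely illustrative and are not the analysis used in the original study.
.nf
from math import comb

def binomial_tail(successes, trials, p=0.5):
    """Probability of at least this many successes under pure guessing."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

print(binomial_tail(60, 100))    # roughly 0.028
.fi
A tail probability of about 0.028 means that a score this high would arise from pure guessing fewer than three times in a hundred.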
.pp To summarize the results of these four experiments, .LB .NP even accomplished linguists cannot distinguish an utterance from one with pitch transferred from a different recording of it; .NP when the utterance contained only one variable part embedded in a carrier sentence, lay listeners identified the original correctly in 60% of cases, over a wide variety of syllable structures: this figure differs significantly from the chance value of 50%; .NP lay listeners identified the original confidently and correctly in 50% of cases; confidently but incorrectly in 25% of cases; .NP the greatest hindrance to successful transfer was the presence of a long uninterrupted period of voicing in the target utterance; .NP the performance of the method deteriorates as the number of variable parts in the utterances increases; .NP some utterances seemed to serve better than others as the pitch source for transfer, although this was not correlated with complexity of syllable structure; .NP even when the utterance contained two variable parts, there was one source utterance whose pitch contour was transferred to all the others so successfully that listeners could not identify the original. .LE .pp The fact that only 60% of originals in the second experiment were spotted by lay listeners in a stringent paired-comparison test \(em many of them being identified without confidence \(em does encourage the use of the procedure for generating stereotyped, but different, utterances of high quality in voice-response systems. The experiments indicate that although different syllable patterns can be handled satisfactorily by this procedure, long voiced periods should be avoided if possible when designing the message set, and that if individual utterances must contain multiple variable parts the source utterance should be chosen with the aid of listening tests. .sh "8.3 Assigning timing and pitch to synthetic speech" .pp The pitch transfer method can give good results within a fairly narrow domain of application. But like any speech output technique which treats complete utterances as a single unit, with provision for a small number of slot-fillers to accommodate data-dependent messages, it becomes unmanageable in more general situations with a large variety of utterances. As with segmental synthesis it becomes necessary to consider methods which use a textual rather than an acoustically-based representation of the prosodic features. .pp This raises a problem with prosodics that was not there for segmentals: how .ul can prosodic features be written in text form? The standard phonetic transcription method does not give much help with notation for prosodics. It does provide a diacritical mark to indicate stress, but this is by no means enough information for synthesis. Furthermore, text-to-speech procedures (described in the next chapter) promise to allow segmentals to be specified by an ordinary orthographic representation of the utterance; but we have seen that considerable intelligence is required to derive prosodic features from text. (More than mere intelligence may be needed: this is underlined by a paper (Bolinger, 1972) delightfully entitled "Accent is predictable \(em if you're a mind reader"!) .[ Bolinger 1972 Accent is predictable \(em if you're a mind reader .] .pp If synthetic speech is to be used as a computer output medium rather than as an experimental tool for linguistic research, it is important that the method of specifying utterances is natural and easy to learn.
Prosodic features must be communicated to the computer in a manner considerably simpler than individual duration and pitch specifications for each phoneme, as was required in early synthesis-by-rule systems. Fortunately, a notation has been developed for conveying some of the prosodic features of utterances, as a by-product of the linguistically important task of classifying the intonation contours used in conversational English (Halliday, 1967). .[ Halliday 1967 .] This system has even been used to help foreigners speak English (Halliday, 1970) \(em which emphasizes the fact that it was designed for use by laymen, not just linguists! .[ Halliday 1970 Course in spoken English: Intonation .] .pp Here are examples of the way utterances can be conveyed to the ISP speech synthesis system which was described in the previous chapter. The notation is based upon Halliday's. .LB .NI 3 .ul ^ aw\ t\ uh/m\ aa\ t\ i\ k /s\ i\ n\ th\ uh\ s\ i\ s uh\ v /*s\ p\ ee\ t\ sh, .NI 1 .ul ^ f\ r\ uh\ m uh f\ uh/*n\ e\ t\ i\ k /r\ e\ p\ r\ uh\ z\ e\ n/t\ e\ i\ sh\ uh\ n. .LE (Automatic synthesis of speech, from a phonetic representation.) Three levels of stress are distinguished: tonic or "sentence" stress, marked by "*" before the syllable; foot stress (marked by "/"); and unstressed syllables. The notion of a "foot" controls the rhythm of the speech in a way that will be described shortly. A fourth level of stress is indicated on a segmental basis when a syllable contains a reduced vowel. .pp Utterances are divided by punctuation into .ul tone groups, which are the basic prosodic unit \(em there are two in the example. The shape of the pitch contour is governed by a numeral at the start of each tone group. Crude control over pauses is achieved by punctuation marks: full stop, for example, signals a pause while comma does not. (Longer pauses can be obtained by several full stops as in "...".) The "^" character stands for a so-called "silent stress" or breath point. Word boundaries are marked by two spaces between phonemes. As mentioned in the previous chapter, syllable boundaries and explicit pitch and duration specifiers can also be included in the input. If they are not, the ISP system will attempt to compute them. .rh "Rhythm." Our understanding of speech rhythm knows many laws but little order. In the mid 1970's there was a spate of publications reporting new data on segmental duration in various contexts, and there is a growing awareness that segmental duration is influenced by a great many factors, ranging from the structure of a discourse, through semantic and syntactic attributes of the utterances, their phonemic and phonetic make-up, right down to physiological constraints (these multifarious influences are ably documented and reviewed by Klatt, 1976). .[ Klatt 1976 Linguistic uses of segment duration in English .] What seems to be lacking in this work is a conceptual framework on to which new information about segmental duration can be nailed. .pp One starting-point for imitating the rhythm of English speech is the hypothesis of regularly recurring stresses. These stresses are primarily .ul rhythmic ones, and should be distinguished from the tonic stress mentioned above which is primarily an .ul intonational one. Rhythmic stresses are marked in the transcription by a "/". The stretch between one and the next is called a "foot", and the hypothesis above is often referred to as that of isochronous feet ("isochronous" means "of equal time"). There is considerable controversy about this hypothesis. 
It is most popular among British linguists and, it must be admitted, amongst those who work by introspection and intuition and do not actually .ul measure things. Although the question of isochrony of feet has long been debated, there seems to be general agreement \(em even amongst American linguists \(em that there is at least a tendency towards equal spacing of foot boundaries. However, little is known about the strength of this tendency and the extent of deviations from it (see Hill .ul et al, 1979, for an attempt to quantify it) \(em and there is even evidence to suggest that it may in part be a .ul perceptual phenomenon (Lehiste, 1973). .[ Hill Jassem Witten 1979 .] .[ Lehiste 1973 .] On this basic point, as on many others, the designer of a prosodic synthesis strategy must needs make assumptions which cannot be properly justified. .pp From a pragmatic point of view there are two advantages to basing a synthesis strategy on this hypothesis. Firstly, it provides a way to represent the many influences of higher-level processes (like syntax and semantics) on rhythm using a simple notation which fits naturally into the phonetic utterance representation, and which people find quite easy to understand and generate. Secondly, it tends to produce a heavily accentuated, but not unnatural, speech rhythm which can easily be moderated into a more acceptable rhythm by departing from isochrony in a controlled manner. .pp The ISP procedure does not make feet exactly isochronous. It starts with a standard foot time and attempts to fit the syllables of the foot into this time. If doing so would result in certain syllables having less than a preset minimum duration, the isochrony constraint is relaxed and the foot is expanded. There is no preset .ul maximum syllable length. However, when the durations of individual phoneme postures are adjusted to realize the calculated syllable durations, limits are imposed on the amount by which individual phonemes can be expanded or contracted. Thus a hierarchy of limits exists. .pp The rate of talking is determined by the standard foot time. If this time is short, many feet will be forced to have durations longer than the standard, and the speech will be "less isochronous". This seems to accord with common human experience. If the standard time is longer, however, the minimum syllable limit will always be exceeded and the speech will be completely isochronous. If it is too long, the above-mentioned limits to phoneme expansion will come into play and again partially destroy the isochrony. .pp It has often been observed that the final foot of an utterance tends to be longer than others; as does the tonic foot \(em that which bears the major stress. This is easy to accommodate, simply by making the target duration longer for these feet. .rh "From feet to syllables." A foot is a succession of syllables, one or more. And it is obvious that since there are more syllables in some feet than in others, some syllables must occupy less time than others in order to preserve the tendency towards isochrony of feet. .pp However, the duration of a foot is not divided evenly between its constituent syllables. The syllables have a definite rhythm of their own, which seems to be governed by .LB .NP the nature of the salient (that is, the first) syllable of the foot .NP the presence of word boundaries within the foot.
.LE A salient syllable tends to be long either if it contains one of a class of so-called "long" vowels, or if there is a cluster of two or more consonants following the vowel. The pattern of syllables and word boundaries governs the rhythm of the foot, and Table 8.2 shows the possibilities for one-, two-, and three-syllable feet. This theory of speech rhythm is due to Abercrombie (1964). .[ Abercrombie 1964 Syllable quantity and enclitics in English .] .RF .nr x2 \w'three-syllable feet 'u .nr x3 \w'sal-short 'u .nr x4 \w'weak [#] 'u .nr x5 \w'weak 'u .nr x6 \w'/\fIit s incon\fR/ceivable 'u .nr x1 (\w'syllable rhythm'/2) .nr x7 \n(x2+\n(x3+\n(x4+\n(x5+\n(x6+\n(x1+\n(x1 .nr x7 (\n(.l-\n(x7)/2 .in \n(x7u .ta \n(x2u +\n(x3u +\n(x4u +\n(x5u +\n(x6u .ul syllable pattern example \0\0\h'-\n(x1u'syllable rhythm .sp one-syllable feet salient /\fIgood\fR /show 1 ^ weak /\fI^ good\fR/bye 2:1 .sp two-syllable feet sal-long weak /\fIcentre\fR /forward 1:1 sal-short weak /\fIatom\fR /bomb 1:2 salient # weak /\fItea for\fR /two 2:1 .sp three-syllable feet salient # weak [#] weak /\fIone for the\fR /road 2:1:1 /\fIit's incon\fR/ceivable sal-long weak # weak /\fIafter the\fR /war 2:3:1 sal-short weak # weak /\fImiddle to\fR /top 1:3:2 sal-long weak weak /\fInobody\fR /knows 3:1:2 sal-short weak weak /\fIanything\fR /more 1:1:1 .sp # denotes a word boundary; [#] is an optional word boundary .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 8.2 Syllable patterns and rhythms" .pp A foot may have the rhythmical characteristics of a two-syllable foot while having only one syllable, if the first place in it is filled by a silent stress (marked by "^"). This is shown in the second one-syllable example of Table 8.2. A similar effect may occur with two- and three-syllable feet, although examples are not given in the table. Feet of four and five syllables \(em with or without a silent stress \(em are considerably rarer. .pp Syllabification \(em splitting an utterance into syllables \(em is a job which had to be done for the pitch-transfer procedure described earlier, and the nature of syllable rhythms calls for it here too. Even though the utterance is now specified phonetically instead of acoustically, the same basic principle applies. Syllables normally coincide with peaks of sonority, where "sonority" measures the inherent loudness of a sound relative to other sounds of the same duration and pitch. However, difficult cases exist where it seems to be unclear how many syllables there are in a word. (Ladefoged, 1975, discusses this problem with examples such as "real", "realistic", and "reality".) Furthermore, .[ Ladefoged 1975 .] care must be taken to avoid counting two syllables in a word like "sky" because of its two peaks of sonority \(em for the stop .ul k has lower sonority than the fricative .ul s. .pp Three levels of notional sonority are enough for syllabification. Dividing phoneme segments into .ul sonorants (glides and nasals), .ul obstruents (stops and fricatives), and vowels; a general syllable has the form .LB .EQ obstruent sup * ~ sonorant sup * ~ vowel sup * ~ sonorant sup * ~ obstruent sup * ~ , .EN .LE where "*" means repetition, that is, occurrence zero or more times. This sidesteps the "sky" problem by giving fricatives the same sonority as stops. It is easy to use the above structure to count the number of syllables in a given utterance by counting the sonority peaks. .pp However, what is required is an indication of syllable .ul boundaries as well as a syllable count.
For slow conversational speech, these can be approximated as follows. Word divisions obviously form syllable boundaries, as should foot markers \(em but it may be wise not to assume that the latter do if the utterance has been prepared by someone with little knowledge of linguistics. Syllable boundaries should be made to coincide with sonority minima. As an .ul ad hoc pragmatic rule, if only one segment has the minimum sonority the boundary is placed before it. If there are two segments, each with the minimum sonority, it is placed between them, while for three or more it is placed after the first two. .pp These rules produce obviously acceptable divisions in many cases (to'day, ash'tray, tax'free), with perhaps unexpected positioning of the boundary in others (ins'pire, de'par'tment). Actually, people do differ in placement of syllable boundaries (Abercrombie, 1967). .[ Abercrombie 1967 .] .rh "From syllables to segments." The theory of isochronous feet (with the caveats noted earlier) and that of syllable rhythms provide a way of producing durations for individual syllables. But where are these durations supposed to be measured? There is a beat point, or tapping point, near the beginning of each syllable. This is the place where a listener will tap if asked to give one tap to each syllable; it has been investigated experimentally by Allen (1972). .[ Allen 1972 Location of rhythmic stress beats in English One .] It is not necessarily at the very beginning of the syllable. For example, in "straight", the tapping point is certainly after the .ul s and the stopped part of the .ul t. .pp Another factor which relates to the division of the syllable duration amongst phonetic segments is the often-observed fact that the length of the vocalic nucleus is a strong clue to the degree of voicing of the terminating cluster (Lehiste, 1970). .[ Lehiste 1970 Suprasegmentals .] If you say in pairs words like "cap", "cab"; "cat", "cad"; "tack", "tag" you will find that the vowel in the first word of each pair is significantly shorter than that in the second. In fact, the major difference between such pairs is the vowel length, not the final consonant. .pp Such effects can be taken into account by considering a syllable to comprise an initial consonant cluster, followed by a vocalic nucleus and a final consonant cluster. Any of these elements can be missing \(em the most unusual case where the nucleus is absent occurs, for example, in so-called syllabic .ul n\c \&'s (as in renderings of "button", "pudding" which might be written "butt'n", "pudd'n"). However, it is convenient to modify the definition of the nucleus so as to rule out the possibility of it being empty. Using the characterization of the syllable given above, the clusters can be defined as .LB .NI initial cluster = obstruent\u*\d sonorant\u*\d .NI nucleus = vowel\u*\d sonorant\u*\d .NI final cluster = obstruent\u*\d. .LE Sonorants are included in the nucleus so that it is always present, even in the case of a syllabic consonant. .pp Then, rules can be used to divide the syllable duration between the initial cluster, nucleus, and final cluster. These must distinguish between situations where the terminating cluster is voiced or unvoiced so that the characteristic differences in vowel lengths can be accommodated. .pp Finally, the cluster durations must be apportioned amongst their constituent phonetic segments. There is little published data on which to base this. Two simple schemes which have been used in ISP are described in Witten (1977) and Witten & Smith (1977).
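.pp
The boundary-placement rule itself is simple enough to state as a program. The short Python sketch below is not the ISP code: it assumes that each segment in the stretch between two sonority peaks has already been classified as obstruent, sonorant or vowel, and it merely applies the ad hoc rule just given.
.nf
SONORITY = {"obstruent": 0, "sonorant": 1, "vowel": 2}

def boundary_in_dip(kinds):
    """kinds: the classes of the segments lying between two sonority peaks.
    Returns the index of the segment before which the boundary is placed."""
    son = [SONORITY[k] for k in kinds]
    low = min(son)
    start = son.index(low)                 # first segment of minimum sonority
    run = 1
    while start + run < len(son) and son[start + run] == low:
        run += 1
    if run == 1:
        return start                       # boundary before the single segment
    if run == 2:
        return start + 1                   # boundary between the two
    return start + 2                       # boundary after the first two

# the dip between the two vowels of "ashtray" is sh, t, r:
print(boundary_in_dip(["obstruent", "obstruent", "sonorant"]))   # 1, i.e. ash'tray
.fi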
.[ Witten 1977 A flexible scheme for assigning timing and pitch to synthetic speech .] .[ Witten Smith 1977 Synthesizing British English rhythm .] .rh "Pitch." There are two basically different ways of looking at the pitch of an utterance. One is to imagine pitch .ul levels attached to individual syllables. This has been popular amongst American linguists, and some people have even gone so far as to associate pitch levels with levels of stress. The second approach is to consider pitch .ul contours, as we did earlier when examining how to transfer pitch from one utterance to another. This seems to be easier for the person who transcribes the utterances to produce, for the information required is much less detailed than levels attached to each syllable. Some indication needs to be given of how the contour is to be bound to the utterance, and in the notation introduced above the most prominent, or "tonic", syllable is indicated in the transcription. .pp Halliday's (1970) classification identifies five different primary intonation contours, each hinging on the tonic syllable. .[ Halliday 1970 Course in spoken English: Intonation .] These are sketched in Figure 8.5, in the style of Halliday. .FC "Figure 8.5" Several secondary contours, which are variations on the primary ones, are defined as well. However, this classification scheme is intended for consumption by people, who bring to the problem a wealth of prior knowledge of speech and years of experience with it! It captures only the gross features of the infinite variety of pitch contours found in living speech. In a sense, the classification is .ul phonological rather than .ul phonetic, for it attempts to distinguish the features which make a logical difference to the listener instead of the acoustic details of the pitch contours. .pp It is necessary to take these contours and subject them to a sort of phonological-to-phonetic embellishment before applying them in synthetic speech. For example, the stretches with constant pitch which precede the tonic syllable in tone groups 1, 2, and 3 sound most unnatural when synthesized \(em for pitch is hardly ever exactly constant in living speech. Some pretonic pitch variation is necessary, and this can be made to emphasize the salient syllable of each foot. A "lilting" effect which reaches a peak at each foot boundary, and drops rather faster at the beginning of the foot than it rises at the end, sounds more natural. The magnitude of this inflection can be altered slightly to add interest, but a considerable increase in it produces a semantic change by making the utterance sound more emphatic. It is a major problem to pin down exactly the turning points of pitch in the falling-rising and rising-falling contours (4 and 5 in Figure 8.5). And even deciding on precise values for the pitch frequencies involved is not always easy. .pp The aim of the pitch assignment method of ISP is to allow the person (or program) which originates a spoken message to exercise a great deal of control over its intonation, without having to concern himself with foot or syllable structure. The message to be spoken must be broken down into tone groups, which correspond roughly to Halliday's tone groups. Each one comprises a .ul tonic of one or more feet, which is optionally preceded by a .ul pretonic, also with a number of feet. It is advantageous to allow a tone group boundary to occur in the middle of a foot (whereas Halliday's scheme insists that it occurs at a foot boundary). 
The first foot of the tonic, the .ul tonic foot, is marked by an asterisk at the beginning. It is on the first syllable of this foot \(em the "tonic" or "nuclear" syllable \(em that the major stress of the tone group occurs. If there is no asterisk in a tone group, ISP takes the final foot as the tonic (since this is the most common case). .pp The pitch contour on a tone group is specified by an array of ten numbers. Of course, the system cannot generate all conceivable contours for a tone group, but the definitions of the ten specifiable quantities have been chosen to give a useful range of contours. If necessary, more precise control over the pitch of an utterance can be achieved by making the tone groups smaller. .pp The overall pitch movement is controlled by specifying the pitch at three places: the beginning of the tone group, the beginning of the tonic syllable, and the end of the tone group. Provision is made for an abrupt pitch break at the start of the tonic syllable in order to simulate tone groups 2 and 3, and, to a lesser extent, tone groups 4 and 5. The pitch is interpolated linearly over the first part of the tone group (up to the tonic syllable) and over the last part (from there to the end), except that it is possible to specify a non-linearity on the tonic syllable, for emphasis, as shown in Figure 8.6. .FC "Figure 8.6" .pp On this basic shape are superimposed two finer pitch patterns. One of these is an initialization-continuation option which allows the pitch to rise (or fall) independently on the initial and final feet to specified values, without affecting the contour on the rest of the tone group (Figure 8.7). .FC "Figure 8.7" The other is a foot pattern which is superimposed on each pretonic foot, to give the stressed syllables of the pretonic added prominence and avoid the monotony of constant pitch. This is specified by a .ul non-linearity parameter which distorts the contour on the foot at a pre-determined point along it. Figure 8.8 shows the effect. .FC "Figure 8.8" .pp The ten quantities that define a pitch contour are summarized in Table 8.3, and shown diagrammatically in Figure 8.9. .FC "Figure 8.9" .RF .nr x0 \w'H: 'u .nr x1 \n(x0+\w'fraction along foot of the non-linearity position, for the tonic foot'u .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta \n(x0u +4n A: continuation from previous tone group zero gives no continuation non-zero gives pitch at start of tone group B: notional pitch at start C: pitch range on whole of pretonic D: departure from linearity on each foot of pretonic E: pitch change at start of tonic F: pitch range on tonic G: departure from linearity on tonic H: continuation to next tone group zero gives no continuation non-zero gives pitch at end of tone group I: fraction along foot of the non-linearity position, for pretonic feet J: fraction along foot of the non-linearity position, for the tonic foot .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 8.3 The quantities that define a pitch contour" .pp The intention of this parametric method of specifying contours is that the parameters should be easily derivable from semantic variables like emphasis, novelty of idea, surprise, uncertainty, incompleteness. Here we really are getting into controversial, unresearched areas. Roughly speaking, parameters D and G control emphasis, G by itself controls novelty and surprise, and H and the relative sizes of E and F control uncertainty and incompleteness. 
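.pp
To make the scheme more concrete, here is a much-simplified sketch in Python of how such a ten-number specification might be turned into a frame-by-frame contour. It is emphatically not the interpolation performed by ISP: the continuation options A and H are ignored, each departure from linearity is treated as a simple triangular excursion, and the non-linearity G is spread over the whole of the tonic rather than just the tonic foot.
.nf
import numpy as np

def tone_group_pitch(n_frames, tonic_frame, pretonic_feet,
                     B, C, D, E, F, G, I=0.33, J=0.5):
    """pretonic_feet is a list of (start, end) frame pairs before the tonic.
    Pitch runs linearly from B to B+C over the pretonic, jumps by E at the
    tonic syllable, then runs linearly to B+C+E+F at the end of the tone
    group; D and G add triangular departures from linearity peaking at
    fraction I along each pretonic foot and fraction J along the tonic."""
    t = np.arange(n_frames, dtype=float)
    pitch = np.where(
        t < tonic_frame,
        B + C * t / max(tonic_frame, 1),
        B + C + E + F * (t - tonic_frame) / max(n_frames - tonic_frame, 1))

    def bend(start, end, size, frac):
        peak = start + frac * (end - start)
        idx = np.arange(start, end)
        shape = np.where(idx <= peak,
                         (idx - start) / max(peak - start, 1e-6),
                         (end - idx) / max(end - peak, 1e-6))
        pitch[start:end] += size * shape

    for start, end in pretonic_feet:        # the "lilt" on each pretonic foot
        bend(start, end, D, I)
    bend(tonic_frame, n_frames, G, J)       # emphasis on the tonic
    return pitch
.fi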
Certain parameters (notably I and J) are defined because although they do not appear to correspond to semantic distinctions, we do not yet know how to generate them automatically. .RF .nr x0 0.6i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+0.5i+\w'0000' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 0.6i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i +0.5i Halliday's tone group \0\0A \0\0B \0\0C \0\0D \0\0E \0\0F \0\0G \0\0H \0\0I \0\0J \l'\n(x0u\(ul' .sp 1 \0\0\00 \0175 \0\0\00 \0\-40 \0\0\00 \-100 \0\-40 \0\0\00 0.33 \00.5 2 \0\0\00 \0280 \0\0\00 \0\-40 \-190 \0100 \0\0\00 \0\0\00 0.33 \00.5 3 \0\0\00 \0175 \0\0\00 \0\-40 \0\-70 \0\045 \0\-10 \0\0\00 0.33 \00.5 4 \0\0\00 \0280 \-100 \0\-40 \0\020 \0\045 \0\-45 \0\0\00 0.33 \00.5 5 \0\0\00 \0175 \0\060 \0\-40 \0\-20 \0\-45 \0\045 \0\0\00 0.33 \00.5 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 8.4 Pitch contour table for Halliday's primary tone groups" .pp One basic requirement of the pitch assignment scheme was the ability to generate contours which approximate Halliday's five primary tone groups. Values of the ten specifiable quantities are given in Table 8.4, for each tone group. All pitches are given in\ Hz. A distinctly dipping pitch movement has been given to each pretonic foot (parameter D), to lend prominence to the salient syllables. .sh "8.4 Evaluating prosodic synthesis" .pp It is extraordinarily difficult to evaluate schemes for prosodic synthesis, and this is surely a large part of the reason why prosodics are among the least advanced aspects of artificial speech. Segmental synthesis can be tested by playing people minimal pairs of words which differ in just one feature that is being investigated. For example, one might experiment with "pit", "bit"; "tot", "dot"; "cot", "got" to test the rules which discriminate unvoiced from voiced stops. There are standard word-lists for intelligibility tests which can be used to compare systems, too. No equivalent of such micro-level evaluation exists for prosodics, for they by definition have a holistic effect on utterances. They are most noticeable, and most important, in longish stretches of speech. Even monotonous, arhythmic speech will be intelligible in sufficiently short samples provided the segmentals are good enough; but it is quite impossible to concentrate on such speech in quantity. Some attempts at evaluation appear in Ainsworth (1974) and McHugh (1976), but these are primarily directed at assessing the success of pronunciation rules, which are discussed in the next chapter. .[ Ainsworth 1974 Performance of a speech synthesis system .] .[ McHugh 1976 Listener preference and comprehension tests .] .pp One evaluation technique is to compare synthetic with natural versions of utterances, as was done in the pitch transfer experiment. The method described earlier used a sensitive paired-comparison test, where subjects heard both versions in quick succession and were asked to judge which was "most natural and intelligible". This is quite a stringent test, and one that may not be so useful for inferior, completely synthetic, contours. It is essential to degrade the "natural" utterance so that it is comparable segmentally to the synthetic one: this was done in the experiment described by extracting its pitch and resynthesizing it from linear predictive coefficients. .pp Several other experiments could be undertaken to evaluate artificial prosody. 
For example, one could compare .LB .NP natural and artificial rhythms, using artificial segmental synthesis in both cases; .NP natural and artificial pitch contours, using artificial segmental synthesis in both cases; .NP natural and artificial pitch contours, using segmentals extracted from natural utterances. .LE There are many other topics which have not yet been fully investigated. It would be interesting, for example, to define rules for generating speech at different tempos. Elisions, where phonemes or even whole syllables are suppressed, occur in fast speech; these have been analyzed by linguists but not yet incorporated into synthetic models. It should be possible to simulate emotion by altering parameters such as pitch range and mean pitch level; but this seems exceptionally difficult to evaluate. One situation where it would perhaps be possible to measure emotion is in the reading of sports results \(em in fact a study has already been made of intonation in soccer results (Bonnet, 1980)! .[ Bonnet 1980 .] Even the synthesis of voices with different pitch ranges requires investigation, for, as noted earlier, it is difficult to place precise frequency specifications on phonological contours such as those sketched in Figure 8.5. Clearly the topic of prosodic synthesis is a rich and potentially rewarding area of research. .sh "8.5 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "8.6 Further reading" .pp There are quite a lot of books in the field of linguistics which describe prosodic features. Here is a small but representative sample from both sides of the Atlantic. .LB "nn" .\"Abercrombie-1965-1 .]- .ds [A Abercrombie, D. .ds [D 1965 .ds [T Studies in phonetics and linguistics .ds [I Oxford Univ Press .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Abercrombie is one of the leading English authorities on phonetics, and this is a collection of essays which he has written over the years. Some of them treat prosodics explicitly, and others show the influence of verse structure on Abercrombie's thinking. .in-2n .\"Bolinger-1972-2 .]- .ds [A Bolinger, D.(Editor) .ds [D 1972 .ds [T Intonation .ds [I Penguin .ds [C Middlesex, England .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n A collection of papers that treat a wide variety of different aspects of intonation in living speech. .in-2n .\"Crystal-1969-3 .]- .ds [A Crystal, D. .ds [D 1969 .ds [T Prosodic systems and intonation in English .ds [I Cambridge Univ Press .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This book attempts to develop a theoretical basis for the study of British English intonation. .in-2n .\"Gimson-1966-3 .]- .ds [A Gimson, A.C. .ds [D 1966 .ds [T The linguistic relevance of stress in English .ds [B Phonetics and linguistics .ds [E W.E.Jones and J.Laver .ds [P 94-102 .nr [P 1 .ds [I Longmans .ds [C London .nr [T 0 .nr [A 1 .nr [O 0 .][ 3 article-in-book .in+2n Here is a careful discussion of what is meant by "stress", with much more detail than has been possible in this chapter. .in-2n .\"Lehiste-1970-4 .]- .ds [A Lehiste, I. .ds [D 1970 .ds [T Suprasegmentals .ds [I MIT Press .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This is a comprehensive study of suprasegmental phenomena in natural speech. It is divided into three major sections: quantity (timing), tonal features (pitch), and stress. .in-2n .\"Pike-1945-5 .]- .ds [A Pike, K.L. 
.ds [D 1945 .ds [T The intonation of American English .ds [I Univ of Michigan Press .ds [C Ann Arbor, Michigan .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n A classic, although somewhat dated, study. Notice that it deals specifically with American English. .in-2n .LE "nn" .EQ delim $$ .EN .CH "9 GENERATING SPEECH FROM TEXT" .ds RT "Generating speech from text .ds CX "Principles of computer speech .pp In the preceding two chapters I have described how artificial speech can be produced from a written phonetic representation with additional markers indicating intonation contours, points of major stress, rhythm, and pauses. This representation is substantially the same as that used by linguists when recording natural utterances. What we will discuss now are techniques for generating this information, or at least some of it, from text. .pp Figure 9.1 shows various levels of the speech synthesis process. .FC "Figure 9.1" Starting from the top with plain text, the first box splits it into intonation units (tone groups), decides where the major emphases (tonic stresses) should be placed, and further subdivides the tone group into rhythmic units (feet). For intonation analysis it is necessary to decide on an "interpretation" of the text, which in turn, as was emphasized at the beginning of the previous chapter, depends both on the semantics of what is being said and on the attitude of the speaker to his material. The resulting representation will be at the level of Halliday's notation for utterances, with the words still in English rather than phonetics. Table 9.1 illustrates the utterance representation at the various levels of the Figure. .RF .nr x0 \w'pitch and duration '+\w'at 8 kHz sampling rate a 4-second utterance' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'pitch and duration 'u +\w'pause 'u +\w'00 msec 'u representation example \l'\n(x0u\(ul' .sp plain text Automatic synthesis of speech, from a phonetic representation. .sp text adorned with 3\0^ auto/matic /synthesis of /*speech, prosodic markers 1\0^ from a pho/*netic /represen/tation. .sp phonetic text with 3\0\fI^ aw t uh/m aa t i k /s i n th uh s i s\fR prosodic markers \0\0\fIuh v /*s p ee t sh\fR , 1\0\fI^ f r uh m uh f uh/*n e t i k\fR \0\0\fI/r e p r uh z e n/t e i sh uh n\fR . .sp phonemes with pause 80 msec pitch and duration \fIaw\fR 70 msec 105 Hz \fIt\fR 40 msec 136 Hz \fIuh\fR 50 msec 148 Hz \fIm\fR 70 msec 175 Hz \fIaa\fR 90 msec 140 Hz ... ... ... .sp parameters for 10 parameters, each updated at a frame formant or linear rate of 10 msec predictive (4 second utterance gives 400 frames, synthesizer or 4,000 data values) .sp acoustic wave at 8 kHz sampling rate a 4-second utterance has 32,000 samples \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.1 Utterance representations at various levels in speech synthesis" .pp The next job is to translate the plain text into a broad phonetic transcription. This requires knowledge of letter-to-sound pronunciation rules for the language under consideration. But much more is needed. The structure of each word must be examined for prefixes and suffixes, because they \(em especially the latter \(em have a strong influence on pronunciation. This is called "morphological" analysis. Actually it is also required for rhythmical purposes, because prefixes are frequently unstressed (note that the word "prefix" is itself an exception to this!). 
Thus the appealing segmentation of the overall problem shown in Figure 9.1 is not very accurate, for the individual processes cannot be rigidly separated as it implies. In fact, we saw earlier how this intermixing of levels occurs with prosodic and segmental features. Nevertheless, it is helpful to structure discussion of the problem by separating levels as a first approximation. Further influences on pronunciation come from the semantics and syntax of the utterance \(em and both also play a part in intonation and rhythm analysis. The result of this second process is a phonetic representation, still adorned with prosodic markers. .pp Now we move down from higher-level intonation and rhythm considerations to the details of the pitch contour and segment durations. This process was the subject of the previous chapter. The problems are twofold: to map an appropriate acoustic pitch contour on to the utterance, using tonic stress point and foot boundaries as anchor points; and to assign durations to segments using the foot\(emsyllable\(emcluster\(emsegment hierarchy. If it is accepted that the overall rhythm can be captured adequately by foot markers, this process does not interact with earlier ones. However, many researchers do not, believing instead that rhythm is syntactically determined at a very detailed level. This will, of course, introduce strong interaction between the duration assignment process and the levels above. (Klatt, 1975, puts it into his title \(em "Vowel lengthening is syntactically determined in a connected discourse". .[ Klatt 1975 Vowel lengthening is syntactically determined .] Contrast this with the paper cited earlier (Bolinger, 1972) entitled "Accent is predictable \(em if you're a mind reader". .[ Bolinger 1972 Accent is predictable \(em if you're a mind reader .] No-one would disagree that "accent" is an influential factor in vowel length!) .pp Notice incidentally that the representation of the result of the pitch and duration assignment process in Table 9.1 is inadequate, for each segment is shown as having just one pitch. In practice the pitch varies considerably throughout every segment, and can easily rise and fall on a single one. For example, .LB "he's .ul very good" .LE may have a rise-fall on the vowel of "very". The linked event-list data-structure of ISP is much more suitable than a textual string for utterance representation at this level. .pp The fourth and fifth processes of Figure 9.1 have little interaction with the first two, which are the subject of this chapter. Segmental concatenation, which was treated in Chapter 7, is affected by prosodic features like stress; but a notation which indicates stressed syllables (like Halliday's) is sufficient to capture this influence. Contextual modification of segments, by which I mean the coarticulation effects which govern allophones of phonemes, is included explicitly in the fourth process to emphasize that the upper levels need only provide a broad phonemic transcription rather than a detailed phonetic one. Signal synthesis can be performed by either a formant synthesizer or a linear predictive one (discussed in Chapters 5 and 6). This will affect the details of the segmental concatenation process but should have no impact at all on the upper levels. .pp Figure 9.1 performs a useful function by summarizing where we have been in earlier chapters \(em the lower three boxes \(em and introducing the remaining problems that must be faced by a full text-to-speech system. 
It also serves to illustrate an important point: that a speech output system can demand that its utterances be entered in any of a wide range of representations. Thus one can enter at a low level with a digitized waveform or linear predictive parameters; or higher up with a phonetic representation that includes detailed pitch and duration specification at the phoneme level; or with a phonetic text or plain text adorned with prosodic markers; or at the very top with plain text as it would appear in a book. A heavy price in naturalness and intelligibility is paid by moving up .ul any of these levels \(em and this is just as true at the top of the Figure as at the bottom. .sh "9.1 Deriving prosodic features" .pp If you really need to start with plain text, some very difficult problems present themselves. The text should be understood, first of all, and then decisions need to be made about how it is to be interpreted. For an excellent speaker \(em like an actor \(em these decisions will be artistic, at least in part. They should certainly depend upon the opinion and attitude of the speaker, and his perception of the structure and progress of the dialogue. Very little is known about this upper level of speech synthesis from text. In practice it is almost completely ignored \(em and the speech is at most barely intelligible, and certainly uncomfortable to listen to. Hence anybody contemplating building or using a speech output system which starts from something close to plain text should consider carefully whether some extra semantic information can be coded into the initial utterances to help with prosodic interpretation. Only rarely is this impossible \(em and reading machines for the blind are a prime example of a situation where arbitrary, unannotated, texts must be read. .rh "Intonation analysis." One distinction which a program can usefully try to make is between basically rising and basically falling pitch contours. It is often said that pitch rises on a question and falls on a statement, but if you listen to speech you will find this to be a gross oversimplification. It normally falls on statements, certainly; but it falls as often as it rises on questions. It is more accurate to say that pitch rises on "yes-no" questions and falls on other utterances, although this rule is still only a rough guide. A simple test which operates lexically on the input text is to determine whether a sentence is a question by looking at the punctuation mark at its end, and then to examine the first word. If it is a "wh"-word like "what", "which", "when", "why" (and also "how") a falling contour is likely to fit. If not, the question is probably a yes-no one, and the contour should rise. Such a crude rule will certainly not be very accurate (it fails, for example, when the "wh"-word is embedded in a phrase as in "at what time are you going?"), but at least it provides a starting-point. .pp An air of finality is given to an utterance when it bears a definite fall in pitch, dropping to a rather low value at the end. This should accompany the last intonation unit in an utterance (unless it is a yes-no question). However, a rise-fall contour such as Halliday's tone group 5 (Figure 8.5) can easily be used in utterance-final position by one person in a conversation \(em although it would be unlikely to terminate the dialogue altogether. A new topic is frequently introduced by a fall-rise contour \(em such as Halliday's tone group 4 \(em and this often begins a paragraph. 
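.pp
The lexical question test described above is trivial to implement. The Python fragment below is a sketch of it (the list of question words is illustrative rather than definitive), and it inherits all the weaknesses just noted.
.nf
WH_WORDS = {"what", "which", "when", "where", "why", "who", "how"}

def falling_contour(sentence):
    """Crude lexical rule: statements and "wh"-questions fall;
    anything else ending in a question mark is taken to be yes-no and rises."""
    words = sentence.strip().rstrip(".?!").lower().split()
    if not sentence.strip().endswith("?"):
        return True                        # a statement: falling contour
    return words[0] in WH_WORDS            # "wh"-question falls, otherwise rise

print(falling_contour("At what time are you going?"))   # False: the rule wrongly predicts a rise here, as noted above
.fi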
.pp Determining the type of pitch contour is only one part of intonation assignment. There are really three separate problems: .LB .NP dividing the utterance into tone groups .NP choosing the tonic syllable, or major stress point, of each one .NP assigning a pitch contour to each tone group. .LE Let us continue to use the Halliday notation for intonation, which was introduced in simplified form in the previous chapter. Moreover, assume that the foot boundaries can be placed correctly \(em this problem will be discussed in the next subsection. Then a scheme which considers only the lexical form of the utterance and does not attempt to "understand" it (whatever that means) is as follows: .LB .NP place a tone group boundary at every punctuation mark .NP place the tonic at the first syllable of the last foot in a tone group .NP use contour 4 for the first tone group in a paragraph and contour 1 elsewhere, except for a yes-no question which receives contour 2. .LE .RF .nr x0 \w'From Scarborough to Whitby\0\0\0\0'+\w'4 ^ from /Scarborough to /*Whitby is a' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'From Scarborough to Whitby\0\0\0\0\0\0'u plain text text adorned with prosodic markers \l'\n(x0u\(ul' .sp From Scarborough to Whitby is a 4 ^ from /Scarborough to /*Whitby is a very pleasant journey, with 1\- very /pleasant /*journey with very beautiful countryside. 1\- very /beautiful /*countryside ... In fact the Yorkshire coast is 1+ ^ in /fact the /Yorkshire /coast is \0\0\0\0lovely, \0\0\0\0/*lovely all along, ex- 1+ all a/*long ex cept the parts that are covered _4 cept the /parts that are /covered \0\0\0\0in caravans of course; and \0\0\0\0in /*caravans of /course and if you go in spring, 4 if you /go in /*spring when the gorse is out, 4 ^ when the /*gorse is /out or in summer, 4 ^ or in /*summer when the heather's out, 4 ^ when the /*heather's /out it's really one of the most 13 ^ it's /really /one of the /most \0\0\0\0delightful areas in the \0\0\0\0de/*lightful /*areas in the whole country. 1 whole /*country .sp The moorland is 4 ^ the /*moorland is rather high up, and 1 rather /high /*up and fairly flat \(em a 1 fairly /*flat a sort of plateau. 1 sort of /*plateau ... At least, 1 ^ at /*least it isn't really flat, 13 ^ it /*isn't /really /*flat when you get up on the top; \-3 ^ when you /get up on the /*top it's rolling moorland 1 ^ it's /rolling /*moorland cut across by steep valleys. But 1 cut across by /steep /*valleys but seen from the coast it's 4 seen from the /*coast it's ... "up there on the moors", and you 1 up there on the /*moors and you always think of it as a _4 always /*think of it as a kind of tableland. 1 kind of /*tableland \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.2 Example of intonation and rhythm analysis (from Halliday, 1970)" .[ Halliday 1970 Course in spoken English: Intonation .] .pp These extremely crude and simplistic rules are really the most that one can do without subjecting the utterance to a complicated semantic analysis. In statistical terms, they are actually remarkably effective. Table 9.2 shows part of a spontaneous monologue which was transcribed by Halliday and appears in his teaching text on intonation (Halliday, 1970, p 133). .[ Halliday 1970 Course in Spoken English: Intonation .] Among the prosodic markers are some that were not introduced in Chapter 8. Firstly, each tone group has secondary contours which are identified by "1+", "1\-" (for tone group 1), and so on. 
Secondly, the mark "..." is used to indicate a pause which disrupts the speech rhythm. Notice that its positioning belies the advice of the old elocutionists: .br .ev2 .in 0 .LB .fi A Comma stops the Voice while we may privately tell .NI .ul one, a Semi-colon .ul two; a Colon .ul three:\c and a Period .ul four. .br .nr x0 \w'\fIone,\fR a Semi-colon \fItwo;\fR a Colon \fIthree:\fR and a Period \fIfour.'-\w'(Mason,\fR 1748)' .NI \h'\n(x0u'(Mason, 1748) .nf .LE .br .ev Thirdly, compound tone groups such as "13" appear which contain .ul two tonic syllables. This differs from a simple concatenation of tone groups (with contours 1 and 3 in this case) because the second is in some sense subsidiary to the first. Typically it forms an adjunct clause, while the first clause gives the main information. Halliday provides many examples, such as .LB .NI /Jane goes /shopping in /*town /every /*Friday .NI /^ I /met /*Arthur on the /*train. .LE But he does not comment on the .ul acoustic difference between a compound tone group and a concatenation of simple ones \(em which is, after all, the information needed for synthesis. A final, minor, difference between Halliday's scheme and that outlined earlier is that he compels tone group boundaries to occur at the beginning of a foot. .RF .nr x0 3.3i+1.3i+\w'complete' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 3.3i +1.3i excerpt in complete Table 9.2 passage \l'\n(x0u\(ul' .sp number of tone groups 25 74 .sp number of boundaries correctly 19 (76%) 47 (64%) placed .sp number of boundaries incorrectly \00 \01 (\01%) placed .sp number of tone groups having a 22 (88%) 60 (81%) tonic syllable at the beginning of the final foot .sp number of tone groups whose 17 (68%) 51 (69%) contours are correctly assigned \l'\n(x0u\(ul' .sp number of compound tone groups \02 (\08%) \06 (\08%) .sp number of secondary intonation \07 (28%) 13 (17%) contours \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.3 Success of simple intonation assignment rules" .pp Applying the simple rules given above to the text of Table 9.2 leads to the results in the first column of Table 9.3. Three-quarters of the tone group boundaries are flagged by punctuation marks, with no extraneous ones being included. 88% of tone groups have a tonic syllable at the start of the final foot. However, the compound tone groups each have two tonic syllables, and of course only the second one is predicted by the final-foot rule. Assigning intonation contours on the extremely simple basis of using contour 4 for the first tone group in a paragraph, and contour 1 thereafter, also seems to work quite well. Secondary contours such as "1+" and "1\-" have been mapped into the appropriate primary contour (1, in this case) for the present purpose, and compound tone groups have been assigned the first contour of the pair. The result is that 68% of contours are given correctly. .pp In order to give some idea of the reliability of these figures, the results for the whole passage transcribed by Halliday \(em of which Table 9.2 is an excerpt \(em are shown in the second column of Table 9.3. Although it looks as though the rules may have been slightly lucky with the excerpt, the general trends are the same, with 65% to 80% of features being assigned correctly. It could be argued, though, that the complete text is punctuated fairly liberally by present-day standards, so that the tone-group boundary rule is unusually successful.
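.pp
For concreteness, the three lexical rules can be strung together in a few lines of Python. The sketch below assumes that foot boundaries have already been marked with "/" in the manner of Table 9.2; the function and its output format are my own inventions, not any published program, and the tiny example is not meant to reproduce Halliday's transcription exactly.
.LB
import re

WH_WORDS = {"what", "which", "when", "where", "why", "who", "how"}

def assign_intonation(paragraph):
    # Rule 1: a tone group boundary at every punctuation mark.
    pieces = re.findall(r"[^,.;:?!]+[,.;:?!]?", paragraph)
    annotated = []
    for i, piece in enumerate(pieces):
        is_question = piece.rstrip().endswith("?")
        feet = [f.strip() for f in piece.strip(" ,.;:?!").split("/") if f.strip()]
        if not feet:
            continue
        first_word = feet[0].split()[0].lower()
        # Rule 3: contour 2 for a yes-no question, contour 4 for the first
        # tone group of a paragraph, contour 1 elsewhere.
        if is_question and first_word not in WH_WORDS:
            contour = "2"
        elif i == 0:
            contour = "4"
        else:
            contour = "1"
        # Rule 2: the tonic goes on the last foot of the tone group.
        feet[-1] = "*" + feet[-1]
        annotated.append(contour + " " + " /".join(feet))
    return annotated

for group in assign_intonation("the /moorland is /rather /high /up, and /fairly /flat"):
    print(group)
# prints:   4 the /moorland is /rather /high /*up
#           1 and /fairly /*flat
.LE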
.pp These results are really astonishingly good, considering the crudeness of the rules. However, they should be interpreted with caution. What is missed by the rules, although appearing to comprise only 20% to 35% of the features, is certain to include the important, information-bearing, and variety-producing features that give the utterance its liveliness and interest. It would be rash to assume that all tone-group boundaries, all tonic positions, and all intonation contours, are equally important for intelligibility and naturalness. It is much more likely that the rules predict a default pattern, while most information is borne by deviations from them. To give an engineering analogy, it may be as though the carrier waveform of a modulated transmission is being simulated, instead of the information-bearing signal! Certainly the utterance will, if synthesized with intonation given by these rules, sound extremely dull and repetitive, mainly because of the overwhelming predominance of tone group 1 and the universal placement of tonic stress on the final foot. .pp There are certainly many different ways to orate any particular text, and that given by Halliday and reproduced in Table 9.2 is only one possible version. However, it is fair to say that the default intonation discussed above could only occur naturally under very unusual circumstances \(em such as a petulant child, unwilling and sulky, having been forced to read aloud. This is hardly how we want our computers to speak! .rh "Rhythm analysis." Consider now how to decide where foot boundaries should be placed in English text. Clearly semantic considerations sometimes play a part in this \(em one could say .LB /^ is /this /train /going /*to /London .LE instead of the more usual .LB /^ is /this /train /going to /*London .LE in circumstances where the train might be going .ul to or .ul from London. Such effects are ignored here, although it is worth noting in passing that the rogue words will often be marked by underscoring or italicizing (as in the previous sentence). If the text is liberally underlined, semantic analysis may be unnecessary for the purposes of rhythm. .pp A rough and ready rule for placing foot boundaries is to insert one before each word which is not in a small closed set of "function words". The set includes, for example, "a", "and", "but", "for", "is", "the", "to". If a verb or adjective begins with a prefix, the boundary should be moved between it and the root \(em but not for a noun. This will give the distinction between .ul con\c vert (noun) and con\c .ul vert (verb), .ul ex\c tract and ex\c .ul tract, and for many North American speakers, will help to distinguish .ul in\c quiry from in\c .ul quire. However, detecting prefixes by a simple splitting algorithm is dangerous. For example, "predate" is a verb with stress on what appears to be a prefix, contrary to the rule; while the "pre" in "predator" is not a prefix \(em at least, it is not pronounced as the prefix "pre" normally is. Moreover, polysyllabic words like "/diplomat", "dip/lomacy", "diplo/matic"; or "/telegraph", "te/legraphy", "tele/graphic" cannot be handled on such a simple basis. .pp In 1968, a remarkable work on English sound structure was published (Chomsky and Halle, 1968) which proposes a system of rules to transform English text into a phonetic representation in terms of distinctive features, with the aid of a lexicon. .[ Chomsky Halle 1968 .] 
A great deal of attention is paid to stress, and rules are given which perform well in many tricky cases. .pp It uses the American system of levels of stress, marking so-called primary stress with a superscript 1, secondary stress with a superscript 2, and so on. The superscripts are written on the vowel of the stressed syllable: completely unstressed syllables receive no annotation. For example, the sentence "take John's blackboard eraser" is written .LB ta\u2\dke Jo\u3\dhn's bla\u1\dckboa\u5\drd era\u4\dser. .LE In foot notation this utterance is .LB /take /John's /*blackboard e/raser. .LE It undoubtedly contains less information than the stress-level version. For example, the second syllable of "blackboard" and the first one of "erase" are both unstressed, although the rhythm rules given in Chapter 8 will cause them to be treated differently because they occupy different places in the syllable pattern of the foot. "Take", "John's", and the second syllable of "erase" are all non-tonic foot-initial syllables and hence are not distinguished in the notation; although the pitch contours schematized in Figure 8.9 will give them different intonations. .pp An indefinite number of levels of stress can be used. For example, according to the rules given by Chomsky and Halle, the word "sad" in .LB my friend can't help being shocked at anyone who would fail to consider his sad plight .LE has level-8 stress, the final two words being annotated as "sa\u8\dd pli\u1\dght". However, only the first few levels are used regularly, and it is doubtful whether acoustic distinctions are made in speech between the weaker ones. .pp Chomsky and Halle are concerned to distinguish between such utterances as .LB .NI bla\u2\dck boa\u1\drd-era\u3\dser ("board eraser that is black") .NI bla\u1\dckboa\u3\drd era\u2\dser ("eraser for a blackboard") .NI bla\u3\dck boa\u1\drd era\u2\dser ("eraser of a black board"), .LE and their stress assignment rules do indeed produce each version when appropriate. In foot notation the distinctions can still be made: .LB .NI /black /*board-eraser/ .NI /*blackboard e/raser/ .NI /black /*board e/raser/ .LE .pp The rules operate on a grammatical derivation tree of the text. For instance, input for the three examples would be written .LB .NI [\dNP\u[\dA\u black ]\dA\u [\dN\u[\dN\u board]\dN\u [\dN\u eraser ]\dN\u]\dN\u]\dNP\u .NI [\dN\u[\dN\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dN\u [\dN\u eraser ]\dN\u]\dN\u .NI [\dN\u[\dNP\u[\dA\u black ]\dA\u [\dN\u board ]\dN\u]\dNP\u [\dN\u eraser ]\dN\u]\dN\u, .LE representing the trees shown in Figure 9.2. .FC "Figure 9.2" Here, N stands for a noun, NP for a noun phrase, and A for an adjective. These categories appear explicitly as nodes in the tree. In the linearized textual representation they are used to label brackets which represent the tree structure. An additional piece of information which is needed is the lexical entry for "eraser", which would show that it has only one accented (that is, potentially stressed) syllable, namely, the second. .pp Consider now how to account for stress in prefixed and suffixed words, and those polysyllabic ones with more than one potential stress point. For these, the morphological structure must appear in the input. .pp Now .ul morphemes are well-defined minimal units of grammatical analysis from which a word may be composed. For example, [went]\ =\ [go]\ +\ [ed] is a morphemic decomposition, where "[ed]" denotes the past-tense morpheme. 
This representation is not particularly suitable for speech synthesis for the obvious reason that the result bears no phonetic resemblance to the input. What is needed is a decomposition into .ul morphs, which occur only when the lexical or phonetic representation of a word may easily be segmented into parts. Thus [wanting]\ =\ [want]\ +\ [ing] and [bigger]\ =\ [big]\ +\ [er] are simultaneously morphic and morphemic decompositions. Notice that in the second example, a rule about final consonant doubling has been applied at the lexical level (although it is not needed in a phonetic representation): this comes into the sphere of "easy" segmentation. Contrast this with [went]\ =\ [go]\ +\ [ed] which is certainly not an easy segmentation and hence a morphemic but not a morphic decomposition. But between these extremes there are some difficult cases: [specific]\ =\ [specify]\ +\ [ic] is probably morphic as well as morphemic, but it is not clear that [galactic]\ =\ [galaxy]\ +\ [ic] is. .pp Assuming that the input is given as a derivation tree with morphological structure made explicit, Chomsky and Halle present rules which assign stress correctly in nearly all cases. For example, their rules give .LB .NI [\dA\u[\dN\u incident ]\dN\u + al]\dA\u \(em> i\u2\dncide\u1\dntal; .LE and if the stem is marked by [\dS\u\ ...\ ]\dS\u in prefixed words, they can deduce .LB .NI [\dN\u tele [\dS\u graph ]\dS\u]\dN\u \(em> te\u1\dlegra\u3\dph .NI [\dN\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u y ]\dN\u \(em> tele\u1\dgraphy .NI [\dA\u[\dN\u tele [\dS\u graph ]\dS\u]\dN\u ic ]\dA\u \(em> te\u3\dlegra\u1\dphi\u2\dc. .LE .pp There are two rules which account for the word-level stress on such examples: the "main stress" rule and the "alternating stress" rule. In essence, the main stress rule emphasizes the last strong syllable of a stem. A syllable is "strong" either if it contains one of a class of so-called "long" vowels, or if there is a cluster of two or more consonants following the vowel; otherwise it is "weak". (If you are exceptionally observant you will notice that this strong\(emweak distinction has been used before, when discussing the rhythm of syllables within a foot.) Thus the verb "torment" receives stress on the second syllable, for it is a strong one. A noun like "torment" is treated as being derived from the corresponding verb, and the rule assigns stress to the verb first and then modifies it for the noun. The second, "alternating stress", rule gives some stress to alternate syllables of polysyllabic words like "form\c .ul al\c de\c .ul hyde\c ". .pp It is quite easy to incorporate the word-level rules into a computer program which uses feet rather than stress levels as the basis for prosodic description. A foot boundary is simply placed before the primary-stressed (level-1) syllable, except for function words, which do not begin a foot. The other stress levels should be ignored, except that for slow, deliberate speech, secondary (level-2) stress is mapped into a foot boundary too, if it precedes the primary stress. There is also a rule which reduces vowels in unstressed syllables. .pp The stress assignment rules can work on phonemic script, as well as on English text. For example, starting from the phonetic form [\d\V\u\ \c .ul aa\ s\ t\ o\ n\ i\ sh\ \c ]\dV\u, the stress assignment rules produce \c .ul aa\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c the vowel reduction rule generates \c .ul uh\ s\ t\ o\u1\d\ n\ i\ sh\ ;\c and the foot conversion process gives \c .ul uh\ s/t\ o\ n\ i\ sh.
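.pp
A minimal sketch in Python of this final conversion step is given below. The function, its calling conventions, and the syllable spellings in the examples are assumptions made for illustration only; the stress levels themselves are taken as given, as though already supplied by the Chomsky-Halle rules or by a lexicon.
.LB
FUNCTION_WORDS = {"a", "an", "and", "but", "for", "is", "the", "to"}

def to_feet(syllables, stresses, word=None, deliberate=False):
    # Place "/" before the primary-stressed (level-1) syllable; in slow,
    # deliberate speech, also before a secondary (level-2) stress that
    # precedes it.  Function words begin no foot.
    if word and word.lower() in FUNCTION_WORDS:
        return " ".join(syllables)
    primary = stresses.index(1)
    marked = []
    for i, syllable in enumerate(syllables):
        starts_foot = (stresses[i] == 1 or
                       (deliberate and stresses[i] == 2 and i < primary))
        marked.append(("/" if starts_foot else "") + syllable)
    return " ".join(marked)

# the "astonish" example, after vowel reduction
print(to_feet(["uh s", "t o", "n i sh"], [0, 1, 0]))    # uh s /t o n i sh
# a function word never begins a foot
print(to_feet(["dh uh"], [0], word="the"))              # dh uh
.LE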
This appears to provide a fairly reliable algorithm for foot boundary placement. .rh "Speech synthesis from concept." I argued earlier that in order to derive prosodic features of an utterance from text it is necessary to understand its role in the dialogue, its semantics, its syntax, and \(em as we have just seen \(em its morphological structure. This is a very tall order, and the problem of natural language comprehension by machine is a vast research area in its own right. However, in many applications requiring speech output, utterances are generated by the computer from internally stored data rather than being read aloud from pre-prepared text. Then the problem of comprehending text may be evaded, for presumably the language-generation module can provide a semantic, syntactic, and even morphological decomposition of the utterance, as well as some indication of its role in the dialogue (that is, why it is necessary to say it). .pp This forms the basis of the appealing notion of "speech synthesis from concept". It has some advantages over speech generation from text, and in principle should provide more natural-sounding speech. Every word produced by the system can have a complete lexical entry which shows its morphological decomposition and potential stress points. The full syntactic history of each utterance is known. The Chomsky-Halle rules described above can therefore be used to place foot boundaries accurately, without the need for a complex parsing program and without the risk of having to make guesses about unknown words. .pp However, it is not clear how to take advantage of any semantic information which is available. Ideally, it should be possible to place tone group boundaries and tonic stress points, and assign intonation contours, in a natural-sounding way. But look again at the example text of Table 9.2 and imagine that you have at your disposal as much semantic information as is needed. It is .ul still far from obvious how the intonation features could be assigned! It is, in the ultimate analysis, interpretive and stylistic .ul choices that add variety and interest to speech. .pp Take the problem of determining pitch contours, for instance. Some of them may be explicable. Contour 4 on .LB .NI except the parts that are covered in caravans of course .LE is due to its being a contrastive clause, for it presents essentially new information. Similarly, the succession .LB .NI if you go in spring .NI when the gorse is out .NI or in summer .NI when the heather's out .LE could be considered contrastive, being in the subjunctive mood, and this could explain why contour 4's were used. But this is all conjecture, and it is difficult to apply throughout the passage. Halliday (1970) explains the contexts in which each tone group is typically used, but in an extremely high-level manner which would be impossible to embody directly in a computer program. .[ Halliday 1970 Course in spoken English: Intonation .] At the other end of the spectrum, computer systems for written discourse production do not seem to provide the subtle information needed to make intonation decisions (see, for example, Davey, 1978, for a fairly complete description of such a system). .[ Davey 1978 .] .pp One project which uses such a method for generating speech has been described (Young and Fallside, 1980). .[ Young Fallside 1980 .] Although some attention is paid to rhythm, the intonation contours which are generated are disappointingly repetitive and lacking in richness.
In fact, very little semantic information is used to assign contours; really just that inferred by the crude punctuation-driven method described earlier. .pp The higher-level semantic problems associated with speech output were studied some years ago under the title "synthetic elocution" (Vanderslice, 1968). .[ Vanderslice 1968 .] A set of rules was generated and tested by hand on a sample passage, the first part of which is shown in Table 9.4. However, no attempt was made to formalize the rules in a computer program, and indeed it was recognized that a number of important questions, such as the form of the semantic information assumed at the input, had been left unanswered. .RF .nr x0 \w'\0\0 psychologist '+\w'emphasis assigned because of antithesis with ' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'\0\0 psychologist 'u \l'\n(x0u\(ul' .sp Human experience and human behaviour are accessible to observation by everyone. The psychologist tries to bring them under systematic study. What he perceives, however, anyone can perceive; for his task he requires no microscope or electronic gear. .sp2 \0\0 word comments \l'\n(x0u\(ul' .sp \01 Human special treatment because paragraph-initial \04 human accent deleted because it echoes word 1 13 psychologist emphasis assigned because of antithesis with "everyone" 17 them anaphoric to "Human experience and human behaviour" 19 systematic emphasis assigned because of contrast with "observation" 20 study emphasis? \(em text is ambiguous whether "observation" is a kind of study that is nonsystematic, or an activity contrasting with the entire concept of "systematic study" 21 What increase in pitch for "What he perceives" because it is not the subject 22 he accented although anaphoric to word 13 because of antithesis with word 25 24 however decrease in pitch because it is parenthetical 25 anyone emphasized by antithesis with word 22 27 perceive unaccented because it echoes word 23, "perceives" \0\0 ; semicolon assigns falling intonation 30 task unaccented because it is anaphoric with "tries to bring them under systematic study" \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.4 Sample passage and comments pertinent to synthetic elocution" .pp The comments in the table, which are selected and slightly edited versions of those appearing in the original work (Vanderslice, 1968), are intended as examples of the nature and subtlety of the prosodic influences which were examined. .[ Vanderslice 1968 .] The concepts of "accent" and "emphasis" are used; these relate to stress but are not easy to define precisely in our tone-group terminology. Fortunately we do not need an exact characterization of them for the present purpose. Roughly speaking, "accent" encompasses both foot-initial stress and tonic stress, whereas "emphasis" is something more than this, typically being realized by the fall-rise or rise-fall contours of Halliday's tone groups 4 and 5 (Figure 8.5). .pp Particular attention is paid to anaphora and antithesis (amongst other things). The first term means the repetition of a word or phrase in the text, and is often applied to pronoun references. In the example, the word "human" is repeated in the first few words; "them" in the second sentence refers to "human experience and human behaviour"; "he" in the third sentence is the previously-mentioned psychologist; and "task" is anaphoric with "tries to bring them under systematic study". Other things being equal, anaphoric references are unaccented.
In our terms this means that they certainly do not receive tonic stress and may not even receive foot stress. .pp Antithesis is defined as the contrast of ideas expressed by parallelism of strongly contrasting words or phrases; and the second element taking part in it is generally emphasized. "Psychologist" in the passage is an antithesis of "everyone"; "systematic" and possibly "study" of "observation". Thus .LB .NI /^ the psy/*chologist .LE would probably receive intonation contour 4, since it is also introducing a new actor; while .LB .NI /tries to /bring them /under /syste/*matic /study .LE could receive contour 5. "He" and "anyone" are antithetical; not only does the latter receive emphasis but the former has its accent restored \(em for otherwise it would have been removed because of anaphora with "psychologist". Hence it will certainly begin a foot, possibly a tonic foot. .pp A factor that does not affect the sample passage is the accentuation of unusual syllables of similar words to bring out a contrast. For example, .LB .NI he went .ul out\c side, not .ul in\c side. .LE Although this may seem to be just another facet of antithesis, Vanderslice points out that it is phonetic rather than structural similarity that is contrasted: .LB .NI I said .ul de\c plane, not .ul com\c plain. .LE This introduces an interesting interplay between the phonetic and prosodic levels. .pp Anaphora and antithesis provide an ideal domain for speech synthesis from concept. Determining them from plain text is a very difficult problem, requiring a great deal of real-world knowledge. The first has received some attention in the field of natural language understanding. Finding pronoun referents is an important problem for language translation, for their gender is frequently distinguished in, say, French where it is not in English. Examples such as .LB .NI I bought the wine, sat on a table, and drank it .NI I bought the wine, sat on a table, and broke it .LE have been closely studied (Wilks, 1975); for if they were to be translated into French the pronoun "it" would be rendered differently in each case (\c .ul le vin, .ul la table). .[ Wilks 1975 An intelligent analyzer and understander of English .] .pp In spoken language, emphasis is used to indicate the referent of a pronoun when it would not otherwise be obvious. Vanderslice gives the example .LB .NI Bill saw John across the room and he ran over to him .NI Bill saw John across the room and .ul he ran over to .ul him, .LE where the emphasis reverses the pronoun referents (so that John did the running). He suggests accenting a personal pronoun whenever the true antecedent is not the same as the "unmarked" or default one. Unfortunately he does not elaborate on what is meant by "unmarked". Does it mean that the referent cannot be predicted from knowledge of the words alone \(em as in the second example above? If so, this is a clear candidate for speech synthesis from concept, for the distinction cannot be made from text! .sh "9.2 Pronunciation" .pp English pronunciation is notoriously irregular. A poem by Charivarius, the pseudonym of the Dutch high school teacher and linguist G.N. Trenite (1870\-1946), surveys the problems in an amusing way and is worth quoting in full. .br .ev2 .in 0 .LB "nnnnnnnnnnnnnnnn" .ul The Chaos .sp2 .ne4 Dearest creature in Creation Studying English pronunciation, .in +5n I will teach you in my verse Sounds like corpse, corps, horse and worse.
.ne4 .in -5n It will keep you, Susy, busy, Make your head with heat grow dizzy; .in +5n Tear in eye your dress you'll tear. So shall I! Oh, hear my prayer: .ne4 .in -5n Pray, console your loving poet, Make my coat look new, dear, sew it. .in +5n Just compare heart, beard and heard, Dies and diet, lord and word. .ne4 .in -5n Sword and sward, retain and Britain, (Mind the latter, how it's written). .in +5n Made has not the sound of bade, Say \(em said, pay \(em paid, laid, but plaid. .ne4 .in -5n Now I surely will not plague you With such words as vague and ague, .in +5n But be careful how you speak: Say break, steak, but bleak and streak, .ne4 .in -5n Previous, precious; fuchsia, via; Pipe, shipe, recipe and choir; .in +5n Cloven, oven; how and low; Script, receipt; shoe, poem, toe. .ne4 .in -5n Hear me say, devoid of trickery; Daughter, laughter and Terpsichore; .in +5n Typhoid, measles, topsails, aisles; Exiles, similes, reviles; .ne4 .in -5n Wholly, holly; signal, signing; Thames, examining, combining; .in +5n Scholar, vicar and cigar, Solar, mica, war and far. .ne4 .in -5n Desire \(em desirable, admirable \(em admire; Lumber, plumber; bier but brier; .in +5n Chatham, brougham; renown but known, Knowledge; done, but gone and tone, .ne4 .in -5n One, anemone; Balmoral, Kitchen, lichen; laundry, laurel; .in +5n Gertrude, German; wind and mind; Scene, Melpemone, mankind; .ne4 .in -5n Tortoise, turquoise, chamois-leather, Reading, Reading; heathen, heather. .in +5n This phonetic labyrinth Gives: moss, gross; brook, brooch; ninth, plinth. .ne4 .in -5n Billet does not end like ballet; Bouquet, wallet, mallet, chalet; .in +5n Blood and flood are not like food, Nor is mould like should and would. .ne4 .in -5n Banquet is not nearly parquet, Which is said to rime with darky .in +5n Viscous, viscount; load and broad; Toward, to forward, to reward. .ne4 .in -5n And your pronunciation's O.K. When you say correctly: croquet; .in +5n Rounded, wounded; grieve and sieve; Friend and fiend, alive and live .ne4 .in -5n Liberty, library; heave and heaven; Rachel, ache, moustache; eleven. We say hallowed, but allowed; People, leopard; towed, but vowed. .in +5n Mark the difference moreover Between mover, plover, Dover; .ne4 .in -5n Leeches, breeches; wise, precise; Chalice, but police and lice. .in +5n Camel, constable, unstable, Principle, discipline, label; .ne4 .in -5n Petal, penal and canal; Wait, surmise, plait, promise; pal. .in +5n Suit, suite, ruin; circuit, conduit, Rime with: "shirk it" and "beyond it"; .ne4 .in -5n But it is not hard to tell Why it's pall, mall, but Pall Mall. .in +5n Muscle, muscular; goal and iron; Timber, climber; bullion, lion; .ne4 .in -5n Worm and storm; chaise, chaos, chair; Senator, spectator, mayor. .in +5n Ivy, privy; famous, clamour and enamour rime with "hammer". .ne4 .in -5n Pussy, hussy and possess, Desert, but dessert, address. .in +5n Golf, wolf; countenants; lieutenants Hoist, in lieu of flags, left pennants. .ne4 .in -5n River, rival; tomb, bomb, comb; Doll and roll, and some and home. .in +5n Stranger does not rime with anger, Neither does devour with clangour. .ne4 .in -5n Soul, but foul; and gaunt, but aunt; Font, front, won't; want, grand and grant; .in +5n Shoes, goes, does. Now first say: finger, And then; singer, ginger, linger. .ne4 .in -5n Real, zeal; mauve, gauze and gauge; Marriage, foliage, mirage, age. .in +5n Query does not rime with very, Nor does fury sound like bury. 
.ne4 .in -5n Dost, lost, post; and doth, cloth, loth; Job, Job; blossom, bosom, oath. .in +5n Though the difference seems little We say actual, but victual; .ne4 .in -5n Seat, sweat; chaste, caste; Leigh, eight, height; Put, nut; granite but unite. .in +5n Reefer does not rime with deafer, Feoffer does, and zephyr, heifer. .ne4 .in -5n Dull, bull; Geoffrey, George; ate, late; Hint, pint; senate, but sedate. .in +5n Scenic, Arabic, Pacific; Science, conscience, scientific. .ne4 .in -5n Tour, but our, and succour, four; Gas, alas and Arkansas! .in +5n Sea, idea, guinea, area, Psalm, Maria, but malaria. .ne4 .in -5n Youth, south, southern; cleanse and clean; Doctrine, turpentine, marine. .in +5n Compare alien with Italian. Dandelion with battalion, .ne4 .in -5n Sally with ally, Yea, Ye, Eye, I, ay, aye, whey, key, quay. Say aver, but ever, fever, Neither, leisure, skein, receiver. .in +5n Never guess \(em it is not safe; We say calves, valves, half, but Ralf. .ne4 .in -5n Heron, granary, canary; Crevice and device and eyrie; .in +5n Face, preface, but efface, Phlegm, phlegmatic; ass, glass, bass; .ne4 .in -5n Large, but target, gin, give, verging; Ought, out, joust and scour, but scourging; .in +5n Ear, but earn; and wear and tear Do not rime with "here", but "ere". .ne4 .in -5n Seven is right, but so is even; Hyphen, roughen, nephew, Stephen; .in +5n Monkey, donkey; clerk and jerk; Asp, grasp, wasp; and cork and work. .ne4 .in -5n Pronunciation \(em think of psyche - Is a paling, stout and spikey; .in +5n Won't it make you lose your wits, Writing groats and saying "groats"? .ne4 .in -5n It's a dark abyss or tunnel, Strewn with stones, like rowlock, gunwale, .in +5n Islington and Isle of Wight, Housewife, verdict and indict. .ne4 .in -5n Don't you think so, reader, rather Saying lather, bather, father? .in +5n Finally: which rimes with "enough", Though, through, plough, cough, hough or tough? .ne4 .in -5n Hiccough has the sound of "cup", My advice is ... give it up! .LE "nnnnnnnnnnnnnnnn" .br .ev .rh "Letter-to-sound rules." Despite such irregularities, it is surprising how much can be done with simple letter-to-sound rules. These specify phonetic equivalents of word fragments and single letters. The longest stored fragment which matches the current word is translated, and then the same strategy is adopted on the remainder of the word. Table 9.5 shows some English fragments and their pronunciations. .RF .nr x0 1.5i+\w'pronunciation ' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 1.5i fragment pronunciation \l'\n(x0u\(ul' .sp -p- \fIp\fR -ph- \fIf\fR -phe| \fIf ee\fR -phe|s \fIf ee z\fR -phot- \fIf uh u t\fR -place|- \fIp l e i s\fR -plac|i- \fIp l e i s i\fR -ple|ment- \fIp l i m e n t\fR -plie|- \fIp l aa i y\fR -post \fIp uh u s t\fR -pp- \fIp\fR -pp|ly- \fIp l ee\fR -preciou- \fIp r e s uh\fR -proce|d- \fIp r uh u s ee d\fR -prope|r- \fIp r o p uh r\fR -prov- \fIp r uu v\fR -purpose- \fIp er p uh s\fR -push- \fIp u sh\fR -put \fIp u t\fR -puts \fIp u t s\fR \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.5 Word fragments and their pronunciations" .pp It is sometimes important to specify that a rule applies only when the fragment is matched at the beginning or end of a word. In the Table "-" means that other fragments can precede or follow this one. The "|" sign is used to separate suffixes from a word stem, as will be explained shortly. 
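.pp
The longest-fragment strategy itself takes only a few lines of code. Here is a minimal sketch in Python; the handful of fragments merely imitates the style of Table 9.5, the phoneme spellings follow the notation of this book, and the begin- and end-of-word markers are ignored, so none of this should be read as a published rule set.
.LB
FRAGMENTS = {
    "p": "p", "ph": "f", "preciou": "p r e s uh", "push": "p u sh",
    "put": "p u t", "s": "s", "r": "r", "e": "e", "c": "k",
    "i": "i", "o": "o", "u": "uh", "t": "t",
}

def transcribe(word):
    # Translate the longest matching fragment at the left-hand end,
    # then repeat the process on the remainder of the word.
    phonemes = []
    while word:
        for length in range(len(word), 0, -1):
            fragment = word[:length]
            if fragment in FRAGMENTS:
                phonemes.append(FRAGMENTS[fragment])
                word = word[length:]
                break
        else:
            word = word[1:]      # no rule matches: drop the letter
    return " ".join(phonemes)

print(transcribe("precious"))    # p r e s uh s
print(transcribe("put"))         # p u t
.LE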
.pp An advantage of the longest-string search strategy is that it is easy to account for exceptions simply by incorporating them into the fragment table. If they occur in the input, the complete word will automatically be matched first, before any fragment of it is translated. The exception list of complete words can be surprisingly small for quite respectable performance. Table 9.6 shows the entire dictionary for an excellent early pronunciation system written at Bell Laboratories (McIlroy, 1974). .[ McIlroy 1974 .] Some of the words are notorious exceptions in English, while others are included simply because the rules would run amok on them. Notice that the exceptions are all quite short, with only a few of them having more than two syllables. .RF .nr x1 0.9i+0.9i+0.9i+0.9i+0.9i+0.9i .nr x1 (\n(.l-\n(x1)/2 .in \n(x1u .ta 0.9i +0.9i +0.9i +0.9i +0.9i a doesn't guest meant reader those alkali doing has moreover refer to always done have mr says today any dr having mrs seven tomorrow april early heard nature shall tuesday are earn his none someone two as eleven imply nothing something upon because enable into nowhere than very been engine is nuisance that water being etc island of the wednesday below evening john on their were body every july once them who both everyone live one there whom busy february lived only thereby whose copy finally living over these woman do friday many people they women does gas maybe read this yes .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.6 Exception table for a simple pronunciation program" .pp Special action has to be taken with final "e"'s. These lengthen and alter the quality of the preceding vowel, so that "bit" becomes "bite" and so on. Unfortunately, if the word has a suffix the "e" must be detected even though it is no longer final, as in "lonely", and it is even dropped sometimes ("biting") \(em otherwise these would be pronounced "lonelly", "bitting". To make matters worse the suffix may be another word: we do not want "kiteflying" to have an extra syllable which rhymes with "deaf"! Although simple procedures can be developed to take care of common word endings like "-ly", "-ness", "-d", it is difficult to decompose compound words like "wisecrack" and "bumblebee" reliably \(em but this must be done if they are not to be articulated with three syllables instead of two. Of course, there are exceptions to the final "e" rule. Many common words ("some", "done", "[live]\dV\u") disobey the rule by not lengthening the main vowel, while in other, rarer, ones ("anemone", "catastrophe", "epitome") the final "e" is actually pronounced. There are also some complete anomalies ("fete"). .pp McIlroy's (1974) system is a superb example of a robust program which takes a pragmatic approach to these problems, accepting that they will never be fully solved, and which is careful to degrade gracefully when stumped. .[ McIlroy 1974 .] 
The pronunciation of each word is found by a succession of increasingly desperate trials: .LB .NP replace upper- by lower-case letters, strip punctuation, and try again; .NP remove final "-s", replace final "ie" by "y", and try again; .NP reject a word without a vowel; .NP repeatedly mark any suffixes with "|"; .NP mark with "|" probable morph divisions in compound words; .NP mark potential long vowels indicated by "e|", and long vowels elsewhere in the word; .NP mark voiced medial "s" as in "busy", "usual"; replace final "-s" if stripped; .NP scanning the word from left to right, apply letter-to-sound rules to word fragments; .NP when all else fails spell the word, punctuation and all (burp on letters for which no spelling rule exists). .LE .RF .nr x0 \w'| ment\0\0\0'+\w'replace final ie by y\0\0\0'+\w'except when no vowel would remain in ' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'| ment\0\0\0'u +\w'replace final ie by y\0\0\0'u suffix action notes and exceptions \l'\n(x0u\(ul' .sp s strip off final s except in context us \&' strip off final ' ie replace final ie by y e replace final e by E when it is the only vowel in a word (long "e") | able place suffix mark as except when no vowel would remain in | ably shown the rest of the word e | d e | n e | r e | ry e | st e | y | ful | ing | less | ly | ment | ness | or | ic place suffix mark as | ical shown and terminate e | final e processing \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 9.7 Rules for detecting suffixes for final 'e' processing" .pp Table 9.7 shows the suffixes which the program recognizes, with some comments on their processing. Multiple suffixes are detected and marked in words like "force|ful|ly" and "spite|ful|ness". This allows silent "e"'s to be spotted even when they occur far back in a word. Notice that the suffix marks are available to the word-fragment rules of Table 9.5, and are frequently used by them. .pp The program has some .ul ad hoc rules for dealing with compound words like "race|track", "house|boat"; these are applied as well as normal suffix splitting so that multiple decompositions like "pace|make|r" can be accomplished. The rules look for short letter sequences which do not usually appear in monomorphemic words. It is impossible, however, to detect every morph boundary by such rules, and the program inevitably makes mistakes. Examples of boundaries which go undetected are "edge|ways", "fence|post", "horse|back", "large|mouth", "where|in"; while boundaries are incorrectly inserted into "comple|mentary", "male|volent", "prole|tariat", "Pame|la". .pp We now seem to have presented two opposing points of view on the pronunciation problem. Charivarius, the Dutch poet, shows that an enormous number of exceptional words exist; whereas McIlroy's program makes do with a tiny exception dictionary. These views can be reconciled by noting that most of Charivarius' words are relatively uncommon. McIlroy tested his program against the 2000 most frequent words in a large corpus (Kucera and Francis, 1967), and found that 97% were pronounced correctly if word frequencies were taken into account. .[ Kucera Francis 1967 .] (The notion of "correctness" is of course a rather subjective one.) However, he estimated that on the remaining words the success rate was only 88%. 
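.pp
Returning for a moment to the suffix-marking step of this sequence, here is a minimal sketch of it in Python. The suffix list is a small illustrative subset of Table 9.7, and the real program also adjusts spelling as it goes (stripping a final "-s", replacing a final "ie" by "y", and so on); the code is mine, not McIlroy's.
.LB
SUFFIXES = ["ness", "ment", "less", "ful", "ing", "ly", "er", "ed", "y"]

def mark_suffixes(word):
    # Peel suffixes off the right-hand end, inserting "|" before each,
    # but never leave a stem with no vowel in it (cf. Table 9.7).
    marked = ""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            stem = word[:len(word) - len(suffix)]
            if word.endswith(suffix) and any(v in stem for v in "aeiouy"):
                marked = "|" + suffix + marked
                word = stem
                stripped = True
                break
    return word + marked

print(mark_suffixes("spitefulness"))    # spite|ful|ness
print(mark_suffixes("forcefully"))      # force|ful|ly
.LE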
.pp The system is particularly impressive in that it is prepared to say anything: if used, for example, on source programs in a high-level computer language it will say the keywords and pronouncable identifiers, spell the other identifiers, and even give the names of special symbols (like +, <, =) correctly! .rh "Morphological analysis." The use of letter-to-sound rules provides a cheap and fast technique for pronunciation \(em the fragment table and exception dictionary for the program described above occupy only 11 Kbyte of storage, and can easily be kept in solid-state read-only memory. It produces reasonable results if careful attention is paid to rules for suffix-splitting. However, it is inherently limited because it is not possible in general to detect compound words by simple rules which operate on the lexical structure of the word. .pp Compounds can only be found reliably by using a morph dictionary. This gives the added advantage that syntactic information can be stored with the morphs to assist with rhythm assignment according to the Chomsky-Halle theory. However, it was noted earlier that morphs, unlike the grammatically-determined morphemes, are not very well defined from a linguistic point of view. Some morphemic decompositions are obviously not morphic because the constituents do not in any way resemble the final word; while others, where the word is simply a concatenation of its components, are clearly morphic. Between these extremes lies a hazy region where what one considers to be a morph depends upon how complex one is prepared to make the concatenation rules. The following description draws on techniques used in a project at MIT in which a morph-based pronunciation system has been implemented (Lee, 1969; Allen, 1976). .[ Lee 1969 .] .[ Allen 1976 Synthesis of speech from unrestricted text .] .pp Estimates of the number of morphs in English vary from 10,000 to 30,000. Although these seem to be very large numbers, they are considerably less than the number of words in the language. For example, Webster's .ul New Collegiate Dictionary (7'th edition) contains about 100,000 entries. If all forms of the words were included, this number would probably double. .pp There are several classes of morphs, with restrictions on the combinations that occur. A general word has prefixes, a root, and suffixes, as shown in Figure 9.3; only the root is mandatory. .FC "Figure 9.3" Suffixes usually perform a grammatical role, affecting the conjugation of a verb or declension of a noun; or transforming one part of speech into another ("-al" can make a noun into an adjective, while "-ness" performs the reverse transformation.) Other suffixes, such as "-dom" or "-ship", only apply to certain parts of speech (nouns, in this case), but do not change the grammatical role of the word. Such suffixes, and all prefixes, alter the meaning of a word. .pp Some root morphs cannot combine with other morphs but always stand alone \(em for instance, "this". Others, called free morphs, can either occur on their own or combine with further morphs to form a word. Thus the root "house" can be joined on either side by another root, such as "boat", or by a suffix such as "ing". A third type of root morph is one which .ul must combine with another morph, like "crimin-", "-ceive". .pp Even with a morph dictionary, decomposing a word into a sequence of morphs is not a trivial operation. The process of lexical concatenation often results in a minor change in the constituents. 
How big this change is allowed to be governs the morph system being used. For example, Allen (1976) gives three concatenation rules: a final "e" can be omitted, as in .ta 1.1i .LB .NI give + ing \(em> giving; .LE the last consonant of the root can be doubled, as in .LB .NI bid + ing \(em> bidding; .LE or a final "y" can change to an "i", as in .LB .NI handy + cap \(em> handicap. .[ Allen 1976 Synthesis of speech from unrestricted text .] .LE If these are the only rules permitted, the morph dictionary will have to include multiple versions of some suffixes. For example, the plural morpheme [-s] needs to be represented both by "-s" and "-es", to account for .LB .NI pea + s \(em> peas .LE and .LB .NI baby + es \(em> babies (using the "y" \(em> "i" rule). .LE This would not be necessary if a "y" \(em> "ie" rule were included too. Similarly, the morpheme [-ic] will include morphs "-ic" and "-c"; the latter to cope with .LB .NI specify + c \(em> specific (using the "y" \(em> "i" rule). .LE Furthermore, non-morphemic roots such as "galact" need to be included because the concatenation rules do not capture the transformation .LB .NI galaxy + ic \(em> galactic. .LE There is clearly a trade-off between the size of the morph dictionary and the complexity of the concatenation rules. .pp Since a text-to-speech system is presented with already-concatenated morphs, it must be prepared to reverse the effects of the concatenation rules to deduce the constituents of a word. When two morphs combine with any of the three rules given above, the changes in spelling occur only in the lefthand one. Therefore the word is best scanned in a right-to-left direction to split off the morphs starting with suffixes, as McIlroy's program does. If the procedure fails at any point, one of the three rules is hypothesized, its effect is undone, and splitting continues. For example, consider the word .LB .NI grasshoppers <\(em grass + hop + er + s .LE (Lee, 1969). .[ Lee 1969 .] The "-s" is detected first, then "-er"; these are both stored in the dictionary as suffixes. The remainder, "grasshopp", cannot be decomposed and does not appear in the dictionary. So each of the rules above is hypothesized in turn, and the result investigated. (The "y" \(em> "i" rule is obviously not applicable.) When the final-consonant-doubling rule is considered, the sequence "grasshop" is investigated. "Shop" could be split off this, but then the unknown morph "gras" would result. The alternative, to remove "hop", leaves a remainder "grass" which .ul is a free morph, as desired. Thus a unique and correct decomposition is obtained. Notice that the procedure would fail if, for example, "grass" had been inadvertently omitted from the dictionary. .pp Sometimes, several seemingly valid decompositions present themselves (Allen, 1976). .[ Allen 1976 Synthesis of speech from unrestricted text .] For example: .LB .NI scarcity <\(em scar + city .NI <\(em scarce + ity (using final-"e" deletion) .NI <\(em scar + cite + y (using final-"e" deletion) .NI resting <\(em rest + ing .NI <\(em re + sting .NI biding <\(em bide + ing (using final-"e" deletion) .NI <\(em bid + ing .NI unionized <\(em un + ion + ize + d .NI <\(em union + ize + d .NI winding <\(em [wind]\dN\u + ing .NI <\(em [wind]\dV\u + ing. .LE The last distinction is important because the pronunciation of "wind" depends on whether it is a noun or a verb. 
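.pp
The backtracking search just described can be sketched in a dozen lines of Python. The tiny dictionary and the rule-undoing transformations below are purely illustrative \(em they are not the MIT programs \(em but they do reproduce the "grasshoppers" analysis above.
.LB
MORPHS = {"grass", "hop", "shop", "rest", "sting", "bid", "bide",
          "er", "ing", "s"}

def undo_rules(left):
    # Candidate spellings of the left-hand constituent: as written, with a
    # doubled final consonant removed, with a final "e" restored, or with
    # a final "i" turned back into "y".
    yield left
    if len(left) > 2 and left[-1] == left[-2]:
        yield left[:-1]                 # hopp -> hop
    yield left + "e"                    # bid  -> bide
    if left.endswith("i"):
        yield left[:-1] + "y"           # babi -> baby

def decompose(word):
    # Split morphs off the right-hand end; when the remainder is unknown,
    # hypothesize one of the concatenation rules, undo it, and continue.
    if word in MORPHS:
        return [word]
    for split in range(len(word) - 1, 0, -1):
        right = word[split:]
        if right not in MORPHS:
            continue
        for left in undo_rules(word[:split]):
            rest = decompose(left)
            if rest is not None:
                return rest + [right]
    return None

print(decompose("grasshoppers"))   # ['grass', 'hop', 'er', 's']
print(decompose("resting"))        # ['rest', 'ing']  ('re' + 'sting' needs 're' in the dictionary)
.LE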
.ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .pp Several sources of information can be used to resolve these ambiguities. The word structure of Figure 9.3, together with the division of root morphs into bound and free ones, may eliminate some possibilities. Certain letter sequences (such as "rp") do not appear at the beginning of a word or morph, and others never occur at the end. Knowledge of these sequences can reject some unacceptable decompositions \(em or perhaps more importantly, can enable intelligent guesses to be made in cases where a constituent morph has been omitted from the dictionary. The grammatical function of suffixes allows suffix sequences to be checked for compatibility. The syntax of the sentence, together with suffix knowledge, can rule out other combinations. Semantic knowledge will occasionally be necessary (as in the "unionized" and "winding" examples above \(em compare a "winding road" with a "winding blow"). Finally, Allen (1976) suggests that a preference structure on composition rules can be used to resolve ambiguity. .[ Allen 1976 Synthesis of speech from unrestricted text .] .pp Once the morphological structure has been determined, the rest of the pronunciation process is relatively easy. A phonetic transcription of each morph may be stored in the morph dictionary, or else letter-to-sound rules can be used on individual morphs. These are likely to be quite successful because final-"e" processing can now be done with confidence: there are no hidden final "e"'s in the middle of morphs. In either case the resulting phonetic transcriptions of the individual morphs must be concatenated to give the transcription of the complete word. Although some contextual modification has to be accounted for, it is relatively straightforward and easy to predict. For example, the plural morphs "-s" and "-es" can be realized phonetically by .ul uh\ z, .ul s, or .ul z depending on context. Similarly the past-tense suffix "-ed" may be rendered as .ul uh\ d, .ul t, or .ul d. The suffixes "-ion" and "-ure" sometimes cause modification of the previous morph: for example .LB .NI act + ion \(em> \c .ul a k t\c + ion \(em> \c .ul a k sh uh n. .LE .pp The morph dictionary does not remove the need for a lexicon of exceptional words. The irregular final-"e" words mentioned earlier ("done", "anemone", "fete") need to be treated on an individual basis, as do words such as "quadruped" which have misleading endings (it should not be decomposed as "quadrup|ed"). .rh "Pronunciation of languages other than English." Text-to-speech systems for other languages have been reported in the literature. (For example, French, Esperanto, Italian, Russian, Spanish, and German are covered by Lesmo .ul et al, 1978; O'Shaughnessy .ul et al, 1981; Sherwood, 1978; Mangold and Stall, 1978). .[ Lesmo 1978 .] .[ O'Shaughnessy Lennig Mermelstein Divay 1981 .] .[ Sherwood 1978 .] .[ Mangold Stall 1978 .] Generally speaking, these present fewer difficulties than does English. Esperanto is particularly easy because each letter in its orthography has only one sound, making the pronunciation problem trivial. Moreover, stress in polysyllabic words always occurs on the penultimate syllable. .pp It is tempting and often sensible when designing a synthesis system for English to use an utterance representation somewhere between phonetics and ordinary spelling.
This may happen in practice even if it is not intended: a user, finding that a given word is pronounced incorrectly, will alter the spelling to make it work. The World English Spelling alphabet (Dewey, 1971), amongst others (Haas, 1966), is a simplified and apparently natural scheme which was developed by the spelling reform movement. .[ Dewey 1971 .] .[ Haas 1966 .] It maps very simply on to a phonetic representation, just like Esperanto. However, it can provide little help with the crucial problem of stress assignment, except perhaps by explicitly indicating reduced vowels. .sh "9.3 Discussion" .pp This chapter has really only touched the tip of a linguistic iceberg. I have given some examples of representations, rules, algorithms, and exceptions, to make the concepts more tangible; but a whole mass of detail has been swept under the carpet. .pp There are two important messages that are worth reiterating once more. The first is that the representation of the input \(em that is, whether it be a "concept" in some semantic domain, a syntactic description of an utterance, a decomposition into morphs, plain text or some contrived re-spelling of it \(em is crucial to the quality of the output. Almost any extra information about the utterance can be taken into account and used to improve the speech. It is difficult to derive such information if it is not provided explicitly, for the process of climbing the tree from text to semantic representation is at least as hard as descending it to a phonetic transcription. .pp Secondly, simple algorithms perform remarkably well \(em witness the punctuation-driven intonation assignment scheme, and word fragment rules for pronunciation. However, the combined degradation contributed by several imperfect processes is likely to impair speech quality very seriously. And great complexity is introduced when these simple algorithms are discarded in favour of more sophisticated ones. There is, for example, a world of difference between a pronunciation program that copes with 97% of common words and one that deals correctly with 99% of a random sample from a dictionary. .pp Some of the options that face the system designer are recapitulated in Figure 9.4. .FC "Figure 9.4" Starting from text, one can take the simple approach of lexically-based suffix-splitting, letter-to-sound rules, and prosodics derived from punctuation, to generate a phonetic transcription. This will provide a cheap system which is relatively easy to implement but whose speech quality will probably not be acceptable to any but the most dedicated listener (such as a blind person with no other access to reading material). .pp The biggest improvement in speech quality from such a system would almost certainly come from more intelligent prosodic control \(em particularly of intonation. This, unfortunately, is also by far the most difficult improvement to make unless intonation contours, tonic stresses, and tone-group boundaries are hand-coded into the input. To generate the appropriate information from text one has to climb to the upper levels in Figure 9.4 \(em and even when these are reached, the problems are by no means over. Still, let us climb the tree. .pp For syntax analysis, part-of-speech information is needed; and for this the grammatical roles of individual words in the text must be ascertained. A morph dictionary is the most reliable way to do this. A linguist may prefer to go from morphs to syntax by way of morphemes; but this is not necessary for the present purpose.
Just the information that the morph "went" is a verb can be stored in the dictionary, instead of its decomposition [went]\ =\ [go]\ +\ [ed]. .pp Now that we have the morphological structure of the text, stress assignment rules can be applied to produce more accurate speech rhythms. The morph decomposition will also allow improvements to be made to the pronunciation, particularly in the case of silent "e"'s in compound words. But the ability to assign intonation has hardly been improved at all. .pp Let us proceed upwards. Now the problems become really difficult. A semantic representation of the text is needed; but what exactly does this mean? We certainly must have .ul morphemic knowledge, for now the fact that "went" is a derivative of "go" (rather than any other verb) becomes crucial. Very well, let us augment the morph dictionary with morphemic information. But this does not attack the problem of semantic representation. We may wish to resolve pronoun references to help assign stress. Parts of the problem are solved in principle and reported in the artificial intelligence literature, but if such an ability is incorporated into the speech synthesis system it will become enormously complicated. In addition, we have seen that knowledge of antitheses in the text will greatly assist intonation assignment, but procedures for extracting this information constitute a research topic in their own right. .pp Now step back and take a top-down approach. What could we do with this semantic understanding and knowledge of the structure of the discourse if we had it? Suppose the input were a "concept" in some as yet undetermined representation. What are the .ul acoustic manifestations of such high-level features as anaphoric references or antithetical comparisons, of parenthetical or satirical remarks, of emotions: warmth, sarcasm, sadness and despair? Can we program the art of elocution? These are good questions. .sh "9.4 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "9.5 Further reading" .pp Books on pronunciation give surprisingly little help in designing a text-to-speech procedure. The best aid is a good on-line dictionary and flexible software to search it and record rules, examples, and exceptions. Here are some papers that describe existing systems. .LB "nn" .\"Ainsworth-1974-1 .]- .ds [A Ainsworth, W.A. .ds [D 1974 .ds [T A system for converting text into speech .ds [J IEEE Trans Audio and Electroacoustics .ds [V AU-21 .ds [P 288-290 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Colby-1978-2 .]- .ds [A Colby, K.M. .as [A ", Christinaz, D. .as [A ", and Graham, S. .ds [D 1978 .ds [K * .ds [T A computer-driven, personal, portable, and intelligent speech prosthesis .ds [J Computers and Biomedical Research .ds [V 11 .ds [P 337-343 .nr [P 1 .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Elovitz-1976-3 .]- .ds [A Elovitz, H.S. .as [A ", Johnson, R.W. .as [A ", McHugh, A. .as [A ", and Shore, J.E. .ds [D 1976 .ds [K * .ds [T Letter-to-sound rules for automatic translation of English text to phonetics .ds [J IEEE Trans Acoustics, Speech and Signal Processing .ds [V ASSP-24 .ds [N 6 .ds [P 446-459 .nr [P 1 .ds [O December .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Kooi-1978-4 .]- .ds [A Kooi, R. .as [A " and Lim, W.C. 
.ds [D 1978 .ds [T An on-line minicomputer-based system for reading printed text aloud .ds [J IEEE Trans Systems, Man and Cybernetics .ds [V SMC-8 .ds [P 57-62 .nr [P 1 .ds [O January .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Umeda-1975-5 .]- .ds [A Umeda, N. .as [A " and Teranishi, R. .ds [D 1975 .ds [K * .ds [T The parsing program for automatic text-to-speech synthesis developed at the Electrotechnical Laboratory in 1968 .ds [J IEEE Trans Acoustics, Speech and Signal Processing .ds [V ASSP-23 .ds [N 2 .ds [P 183-188 .nr [P 1 .ds [O April .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .\"Umeda-1976-6 .]- .ds [A Umeda, N. .ds [D 1976 .ds [K * .ds [T Linguistic rules for text-to-speech synthesis .ds [J Proc IEEE .ds [V 64 .ds [N 4 .ds [P 443-451 .nr [P 1 .ds [O April .nr [T 0 .nr [A 1 .nr [O 0 .][ 1 journal-article .in+2n .in-2n .LE "nn" .EQ delim $$ .EN .CH "10 DESIGNING THE MAN-COMPUTER DIALOGUE" .ds RT "The man-computer dialogue .ds CX "Principles of computer speech .pp Interactive computers are being used more and more by non-specialist people without much previous computer experience. As processing costs continue to decline, the overall expense of providing highly interactive systems becomes increasingly dominated by terminal and communications equipment. Taken together, these two factors highlight the need for easy-to-use, low-bandwidth interactive terminals that make maximum use of the existing telephone network for remote access. .pp Speech output can provide versatile feedback from a computer at very low cost in distribution and terminal equipment. It is attractive from several points of view. Terminals \(em telephones \(em are invariably in place already. People without experience of computers are accustomed to their use, and are not intimidated by them. The telephone network is cheap to use and extends all over the world. The touch-tone keypad (or a portable tone generator) provides a complementary data input device which will do for many purposes until the technology of speech recognition becomes better developed and more widespread. Indeed, many applications \(em especially information retrieval ones \(em need a much smaller bandwidth from user to computer than in the reverse direction, and voice output combined with restricted keypad entry provides a good match to their requirements. .pp There are, however, severe problems in implementing natural and useful interactive systems using speech output. The eye can absorb information at a far greater rate than can the ear. You can scan a page of text in a way which has no analogy in auditory terms. Even so, it is difficult to design a dialogue which allows you to search computer output visually at high speed. In practice, scanning a new report is often better done at your desk with a printed copy than at a computer terminal with a viewing program (although this is likely to change in the near future). .pp With speech, the problem of organizing output becomes even harder. Most of the information we learn using our ears is presented in a conversational way, either in face-to-face discussions or over the telephone. Verbal but non-conversational presentations, as in the university lecture theatre, are known to be a rather inefficient way of transmitting information. 
The degree of interaction is extremely high even in a telephone conversation, and communication relies heavily on speech gestures such as hesitations, grunts, and pauses; on prosodic features such as intonation, pitch range, tempo, and voice quality; and on conversational gambits such as interruption and long silence. I emphasized in the last two chapters the rudimentary state of knowledge about how to synthesize prosodic features, and the situation is even worse for the other, paralinguistic, phenomena. .pp There is also a very special problem with voice output, namely, the transient nature of the speech signal. If you miss an utterance, it's gone. With a visual display unit, at least the last few interactions usually remain available. Even then, it is not uncommon to look up beyond the top of the screen and wish that more of the history was still visible! This obviously places a premium on a voice response system's ability to repeat utterances. Moreover, the dialogue designer must do his utmost to ensure that the user is always aware of the current state of the interaction, for there is no opportunity to refresh the memory by glancing at earlier entries and responses. .pp There are two separate aspects to the man-computer interface in a voice response system. The first is the relationship between the system and the end user, that is, the "consumer" of the synthesized dialogue. The second is the relationship between the system and the applications programmer who creates the dialogue. These are treated separately in the next two sections. We will have more to say about the former aspect, for it is ultimately more important to more people. But the applications programmer's view is important, too; for without him no systems would exist! The technical difficulties in creating synthetic dialogues for the majority of voice systems probably explain why speech output technology is still greatly under-used. Finally we look at techniques for using small keypads such as those on touch-tone telephones, for they are an essential part of many voice response systems. .sh "10.1 Programming principles for natural interaction" .pp Special attention must be paid to the details of the man-machine interface in speech-output systems. This section summarizes experience of human factors considerations gained in developing the remote telephone enquiry service described in Chapter 1 (Witten and Madams, 1977), which employs an ordinary touch-tone keypad for input in conjunction with synthetic voice response. .[ Witten Madams 1977 Telephone Enquiry Service .] Most of the principles which emerged were the result of natural evolution of the system, and were not clear at the outset. Basically, they stem from the fact that speech is both more intrusive and more ephemeral than writing, and so they are applicable in general to speech output information retrieval systems with keyboard or even voice input. Be warned, however, that they are based upon casual observation and speculation rather than empirical research. There is a desperate need for proper studies of user psychology in speech systems. .rh "Echoing." Most alphanumeric input peripherals echo on a character-by-character basis. Although one can expect quite a high proportion of mistakes with unconventional keyboards, especially when entering alphabetic data on a basically numeric keypad, audio character echoing is distracting and annoying. If you type "123" and the computer echoes .LB .NI "one ... two ...
three" .LE after the individual key-presses, it is liable to divert your attention, for voice output is much more intrusive than a purely visual "echo". .pp Instead, an immediate response to a completed input line is preferable. This response can take the form or a reply to a query, or, if successive data items are being typed, confirmation of the data entered. In the latter case, it is helpful if the information can be generated in the same way that the user himself would be likely to verbalize it. Thus, for example, when entering numbers: .LB .nr x0 \w'COMPUTER:' .nr x1 \w'USER:' .NI USER:\h'\n(x0u-\n(x1u' "123#" (# is the end-of-line character) .NI COMPUTER: "One hundred and twenty-three." .LE For a query which requires lengthy processing, the input should be repeated in a neat, meaningful format to give the user a chance to abort the request. .rh "Retracting actions." Because commands are entered directly without explicit confirmation, it must always be easy for the user to revoke his actions. The utility of an "undo" command is now commonly recognized for any interactive system, and it becomes even more important in speech systems because it is easier for the user to lose his place in the dialogue and so make errors. .rh "Interrupting." A command which interrupts output and returns to a known state should be recognized at every level of the system. It is essential that voice output be terminated immediately, rather than at the end of the utterance. We do not want the user to live in fear of the system embarking on a long, boring monologue that is impossible to interrupt! Again, the same is true of interactive dialogues which do not use speech, but becomes particularly important with voice response because it takes longer to transmit information. .rh "Forestalling prompts." Computer-generated prompts must be explicit and frequent enough to allow new users to understand what they are expected to do. Experienced users will "type ahead" quite naturally, and the system should suppress unnecessary prompts under these conditions by inspecting the input buffer before prompting. This allows the user to concatenate frequently-used commands into chunks whose size is entirely at his own discretion. .pp With the above-mentioned telephone enquiry service, for example, it was found that people often took advantage of the prompt-suppression feature to enter their user number, password, and required service number as a single keying sequence. As you becomes familiar with a service you quickly and easily learn to forestall expected prompts by typing ahead. This provides a very natural way for the system to adapt itself automatically to the experience of the user. New users will naturally wait to be prompted, and proceed through the dialogue at a slower and more relaxed pace. .pp Suppressing unnecessary prompts is a good idea in any interactive system, whether or not it uses the medium of speech \(em although it is hardly ever done in conventional systems. It is particularly important with speech, however, because an unexpected or unwanted prompt is quite distracting, and it is not so easy to ignore it as it is with a visual display. Furthermore, speech messages usually take longer to present than displayed ones, so that the user is distracted for more time. .rh "Information units." Lengthy computer voice responses are inappropriate for conveying information, because attention wanders if one is not actively involved in the conversation. 
A sequential exchange of terse messages, each designed to dispense one small unit of information, forces the user to take a meaningful part in the dialogue. It has other advantages, too, allowing a higher degree of input-dependent branching, and permitting rapid recovery from errors. .pp The following extract from the "Acidosis program", an audio response system designed to help physicians to diagnose acidosis, is a good example of what .ul not to do. .LB "(Chime) A VALUE OF SIX-POINT-ZERO-ZERO HAS BEEN ENTERED FOR PH. THIS VALUE IS IMPOSSIBLE. TO CONTINUE THE PROGRAM, ENTER A NEW VALUE FOR PH IN THE RANGE BETWEEN SIX-POINT-SIX AND EIGHT-POINT-ZERO (beep dah beep-beep)" (Smith and Goodwin, 1970). .[ Smith Goodwin 1970 .] .LE The use of extraneous noises (for example, a "chime" heralds an error message, and a "beep dah beep-beep" requests data input) was thought necessary in the Acidosis program to keep the user awake and help him with the format of the interaction. Rather than a long monologue like this, it seems much better to design a sequential interchange of terse messages, so that the caller can be guided into a state where he can rectify his error. For example, .LB .nf .ne11 .nr x0 \w'COMPUTER:' .nr x1 \w'CALLER:' CALLER:\h'\n(x0u-\n(x1u' "6*00#" COMPUTER: "Entry out of range" CALLER:\h'\n(x0u-\n(x1u' "6*00#" (persists) COMPUTER: "The minimum acceptable pH value is 6.6" CALLER:\h'\n(x0u-\n(x1u' "9*03#" COMPUTER: "The maximum acceptable pH value is 8.0" .fi .LE This dialogue allows a rapid exit from the error situation in the likely event that the entry has simply been mis-typed. If the error persists, the caller is given just one piece of information at a time, and forced to continue to play an active role in the interaction. .rh "Input timeouts." In general, input timeouts are dangerous, because they introduce apparent acausality in the system as seen by the user. A case has been reported where a user became "highly agitated and refused to go near the terminal again after her first timed-out prompt. She had been quietly thinking what to do and the terminal suddenly interjecting and making its own suggestions was just too much for her" (Gaines and Facey, 1975). .[ Gaines Facey 1975 .] .pp However, voice response systems lack the satisfying visual feedback of end-of-line on termination of an entry. Hence a timed-out reminder is appropriate if a delay occurs after some characters have been entered. This requires the operating system to support a character-by-character mode of input, rather than the usual line-by-line mode. .rh "Repeat requests." Any voice response system must support a universal "repeat last utterance" command, because old output does not remain visible. A fairly sophisticated facility is desirable, as repeat requests are very frequent in practice. They may be due to a simple inability to understand a response, to forgetting what was said, or to distraction of attention \(em which is especially common with office terminals. .pp In the telephone enquiry service two distinct commands were employed, one to repeat the last utterance in case of misrecognition, and the other to summarize the current state of the interaction in case of distraction. For the former, it is essential to avoid simply regenerating an utterance identical with the last. Some variation of intonation and rhythm is needed to prevent an annoying, stereotyped response. A second consecutive repeat request should trigger a paraphrased reply.
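.pp
The bookkeeping behind these two commands is simple enough to sketch. The fragment below is an illustration only, and not code from the telephone enquiry service; the speak routine and its rate and pitch_range arguments are hypothetical stand-ins for whatever prosodic controls the synthesizer actually provides.
.LB
.nf
import random

def speak(text, rate=1.0, pitch_range=1.0):
    # Stand-in for the real synthesis interface; "rate" and "pitch_range"
    # are hypothetical prosody controls, used here only to vary successive
    # renderings of the same words.
    print("[rate %.2f, range %.2f] %s" % (rate, pitch_range, text))

class Dialogue:
    def __init__(self):
        self.last = None          # words of the most recent utterance
        self.paraphrase = None    # alternative wording, if the designer gave one
        self.repeats = 0          # consecutive repeat requests so far
        self.history = []         # last few transactions, for the summary command

    def say(self, text, paraphrase=None):
        self.last, self.paraphrase, self.repeats = text, paraphrase, 0
        self.history = (self.history + [text])[-3:]
        speak(text)

    def repeat_last(self):
        # First repeat: the same words with slightly varied prosody.
        # Second consecutive repeat: a paraphrase, if one is available.
        self.repeats = self.repeats + 1
        if self.repeats >= 2 and self.paraphrase:
            speak(self.paraphrase)
        else:
            speak(self.last, rate=random.uniform(0.95, 1.05),
                  pitch_range=random.uniform(0.9, 1.1))

    def where_am_i(self):
        # The second command: summarize the current state of the interaction
        # from a (crude) model of the user's recent transactions.
        for item in self.history:
            speak(item)

d = Dialogue()
d.say("Entry out of range",
      paraphrase="The value you keyed is outside the acceptable range")
d.repeat_last()   # same words, varied prosody
d.repeat_last()   # paraphrased reply
.fi
.LE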
An error recovery sequence could be used which presented the misunderstood information in a different way with more interaction, but experience indicates that this is of minor importance, especially if information units are kept small anyway. To summarize the current state of the interaction in response to the second type of repeat command necessitates the system maintaining a model of the user. Even a poor model, like a record of his last few transactions and their results, is well worth having. .rh "Varied speech." Synthetic speech is usually rather dreary to listen to. Successive utterances with identical intonations should be carefully avoided. Small changes in speaking rate, pitch range, and mean pitch level, all serve to add variety. Unfortunately, little is known at present about the role of intonation in interactive dialogue, although this is an active research area and new developments can be expected (for a detailed report of a recent research project relevant to this topic see Brown .ul et al, 1980). .[ Brown Currie Kenworthy 1980 .] However, even random variations in certain parameters of the pitch contour are useful to relieve the tedium of repetitive intonation patterns. .sh "10.2 The applications programming environment" .pp The comments in the last section are aimed at the applications programmer who is designing the dialogue and constructing the interactive system. But what kind of environment should .ul he be given to assist with this work? .pp The best help the applications programmer can have is a speech generation method which makes it easy for him to enter new utterances and modify them on-line in cut-and-try attempts to render the man-machine dialogue as natural as possible. This is perhaps the most important advantage of synthesizing speech by rule from a textual representation. If encoded versions of natural utterances are stored, it becomes quite difficult to make minor modifications to the dialogue in the light of experience with it, for a recording session must be set up to acquire new utterances. This is especially true if more than one voice is used, or if the voice belongs to a person who cannot be recalled quickly by the programmer to augment the utterance library. Even if it is his own voice there will still be delays, for recording speech is a real-time job which usually needs a stand-alone processor, and if data compression is used a substantial amount of computation will be needed before the utterance is in a useable form. .pp The broad phonetic input required by segmental speech synthesis-by-rule systems is quite suitable for utterance representation. Utterances can be entered quickly from a standard computer terminal, and edited as text files. Programmers must acquire skill in phonetic transcription, but this is a small inconvenience. The art is easily learned in an interactive situation where the effect of modifications to the transcription can be heard immediately. If allophones must be represented explicitly in the input then the programmer's task becomes considerably more complicated because of the combinatorial explosion in trial-and-error modifications. .pp Plain text input is also quite suitable. A significant rate of error is tolerable if immediate audio feedback of the result is available, so that the operator can adjust his text to suit the pronunciation idiosyncrasies of the program. 
But it is acceptable, and indeed preferable, if prosodic features are represented explicitly in the input rather than being assigned automatically by a computer program. .pp The application of voice response to interactive computer dialogue is quite different to the problem of reading aloud from text. We have seen that a major concern with reading machines is how to glean information about intonation, rhythm, emphasis, tone of voice, and so on, from an input of ordinary English text. The significant problems of semantic processing, utilization of pragmatic knowledge, and syntactic analysis do not, fortunately, arise in interactive information retrieval systems. In these, the end user is communicating with a program which has been created by a person who knows what he wants it to say. Thus the major difficulty is in .ul describing the prosodic features rather than .ul deriving them from text. .pp Speech synthesis by rule is a subsidiary process to the main interactive procedure. It would be unwise to allow the updating of resonance parameter tracks to be interrupted by other calls on the system, and so the synthesis process needs to be executed in real time. If a stand-alone processor is used for the interactive dialogue, it may be able to handle the synthesis rules as well. In this case the speech-by-rule program could be a library procedure, if the system is implemented in a compiled language. An interesting alternative with an interpretive-language implementation, such as Basic, is to alter the language interpreter to add a new command, "speak", which simply transfers a string representing an utterance to an asynchronous process which synthesizes it. However, there must be some way for an interpreted program to abort the current synthesis in the event of an interrupt signal from the user. .pp If the main computer system is time-shared, the synthesis-by-rule procedure is best executed by an independent processor. For example, a 16-bit microcomputer controlling a hardware formant synthesizer has been used to run the ISP system in real time without too much difficulty (Witten and Abbess, 1979). .[ Witten Abbess 1979 .] An important task is to define an interface between the two which allows the main process to control relevant aspects of the prosody of the speech in a way which is appropriate to the state of the interaction, without having to bother about such things as matching the intonation contour to the utterance and the details of syllable rhythm. Halliday's notation appears to be quite suitable for this purpose. .pp If there is only one synthesizer on the system, there will be no difficulty in addressing it. One way of dealing with multiple synthesizers is to treat them as assignable devices in the same way that non-spooling peripherals are in many operating systems. Notice that the data rate to the synthesizer is quite low if the utterance is represented as text with prosodic markers, and can easily be handled by a low-speed asynchronous serial line. .pp The Votrax ML-I synthesizer which is discussed in the next chapter has an interface which interposes it between a visual display unit and the serial port that connects it to the computer. The VDU terminal can be used quite normally, except that a special sequence of two control characters will cause Votrax to intercept the following message up to another control character, and interpret it as speech.
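.pp
The programming convention this interface invites is easily sketched. In the fragment below the control sequences are placeholders, not the actual ML-I codes, and the two routines merely illustrate the idea of sending ordinary text and bracketed speech down the same serial line.
.LB
.nf
import io

# Hypothetical control sequences: the actual ML-I control characters are not
# reproduced here, and these placeholders merely illustrate the idea.
SPEECH_ON = bytes([1, 2])    # two control characters that open a spoken message
SPEECH_OFF = bytes([3])      # a further control character that closes it

def display(line, text):
    # Ordinary characters pass straight through to the VDU screen.
    line.write(text.encode("ascii"))

def speak(line, segments):
    # Characters bracketed by the control sequences are intercepted by the
    # synthesizer and interpreted as sound segments instead of being displayed.
    line.write(SPEECH_ON + segments.encode("ascii") + SPEECH_OFF)

# Usage with a dummy serial line standing in for the real one.
line = io.BytesIO()
display(line, "Entry out of range")
speak(line, "...")   # the ASCII sound-segment codes would go here
.fi
.LE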
The fact that the characters which specify the spoken message do not appear on the VDU screen means that the operation is invisible to the user. However, this transparency can be inhibited by a switch on the synthesizer to allow visual checking of the sound-segment character sequence. .pp Votrax buffers up to 64 sound segments, which is sufficient to generate isolated spoken messages. For longer passages, it can be synchronized with the constant-rate serial output using the modem control lines of the serial interface, together with appropriate device-driving software. .pp This is a particularly convenient interfacing technique in cases when the synthesizer should always be associated with a certain terminal. As an example of how it can be used, one can arrange files each of whose lines contain a printed message, together with its Votrax equivalent bracketed by the appropriate control characters. When such a file is listed, or examined with an editor program, the lines appear simultaneously in spoken and typed English. .pp If a phonetic representation is used for utterances, with real-time synthesis using a separate process (or processor), it is easy for the programmer to fiddle about with the interactive dialogue to get it feeling right. For him, each utterance is just a textual string which can be stored as a string constant within his program just as a VDU prompt would be. He can edit it as part of his program, and "print" it to the speech synthesis device to hear it. There are no more technical problems to developing an interactive dialogue with speech output than there are for a conventional interactive program. Of course, there are more human problems, and the points discussed in the last section should always be borne in mind. .sh "10.3 Using the keypad" .pp One of the greatest advantages of speech output from computers is the ubiquity of the telephone network and the possibility of using it without the need for special equipment at the terminal. The requirement for input as well as output obviously presents something of a problem because of the restricted nature of the telephone keypad. .pp Figure 10.1 shows the layout of the keypad. .FC "Figure 10.1" Signalling is achieved by dual-frequency tones. For example, if key 7 is pressed, sinusoidal components at 852\ Hz and 1209\ Hz are transmitted down the line. During the process of dialling these are received by the telephone exchange equipment, which assembles the digits that form a number and attempts to route the call appropriately. Once a connection is made, either party is free to press keys if desired and the signals will be transmitted to the other end, where they can be decoded by simple electronic circuits. .pp Dial telephones signal with closely-spaced dial pulses. One pulse is generated for a "1", two for a "2", and so on. (Obviously, ten pulses are generated for a "0", rather than none!) Unfortunately, once the connection is made it is difficult to signal with dial pulses. They cannot be decoded reliably at the other end because the telephone network is not designed to transmit such low frequencies. However, hand-held tone generators can be purchased for use with dial telephones. Although these are undeniably extra equipment, and one purpose of using speech output is to avoid this, they are very cheap and portable compared with other computer terminal equipment. .pp The small number of keys on the telephone pad makes it rather difficult to use for communicating with computers. 
Provision is made for 16 keys, but only 12 are implemented \(em the others may be used for some military purposes. Of course, if a separate tone generator is used then advantage can be taken of the extra keys, but this will introduce incompatibility with those who use unmodified touch-tone phones. More sophisticated terminals are available which extend the keypad \(em such as the Displayphone of Northern Telecommunications. However, they are designed as a complete communications terminal and contain their own visual display as well. .rh "Keying alphabetic data." Figure 10.2 shows the near-universal scheme for overlaying alphabetic letters on to the telephone keypad. .FC "Figure 10.2" Since more than one symbol occupies each key, it is obviously necessary to have multiple keystrokes per character if the input sequence is to be decodable as a string of letters. One way of doing this is to depress the appropriate button the number of times corresponding to the position of the letter on it. For example, to enter the letter "L" the user would key the "5" button three times in rapid succession. Keying rhythm must be used to distinguish the four entries "J\ J\ J", "J\ K", "K\ J", and "L", unless one of the bottom three buttons is used as a separator. A different method is to use "*", "0", and "#" as shift keys to indicate whether the first, second, or third letter on a key is intended. Then "#5" would represent "L". Alternatively, the shift could follow the key instead of preceding it, so that "5#" represented "L". .pp If numeric as well as alphabetic information may be entered, a mode-shift operation is commonly used to switch between numeric and alphabetic modes. .pp The relative merits of these three methods, multiple depressions, shift key prefix, and shift key suffix, have been investigated experimentally (Kramer, 1970). .[ Kramer 1970 .] The results were rather inconclusive. The first method seemed to be slightly inferior in terms of user accuracy. It seemed that preceding rather than following shifts gave higher accuracy, although this is perhaps rather counter-intuitive and may have been fortuitous. The most useful result from the experiments was that users exhibited significant learning behaviour, and a training period of at least two hours was recommended. Operators were found able to key at rates of at least three to four characters per second, and faster with practice. .pp If a greater range of characters must be represented then the coding problem becomes more complex. Figure 10.3 shows a keypad which can be used for entry of the full 64-character standard upper-case ASCII alphabet (Shew, 1975). .[ Shew 1975 .] .FC "Figure 10.3" The system is intended for remote vocabulary updating in a phonetically-based speech synthesis system. There are three modes of operation: numeric, alphabetic, and symbolic. These are entered by "##", "**", and "*0" respectively. Two function modes, signalled by "#0" and "#*", allow some rudimentary line-editing and monitor facilities to be incorporated. Line-editing commands include character and line delete, and two kinds of read-back commands \(em one tries to pronounce the words in a line and the other spells out the characters. The monitor commands allow the user to repeat the effect of the last input line as though he had entered it again, to order the system to read back the last complete output line, and to query time and system status. .rh "Incomplete keying of alphanumeric data." 
It is obviously going to be rather difficult for the operator to key alphanumeric information unambiguously on a 12-key pad. In the description of the telephone enquiry service in Chapter 1, it was mentioned that single-key entry can be useful for alphanumeric data if the ambiguity can be resolved by the computer. If a multiple-character entry is known to refer to an item on a given list, the characters can be keyed directly according to the coding scheme of Figure 10.2. .pp Under most circumstances no ambiguity will arise. For example, Table 10.1 shows the keystrokes that would be entered for the first 50 5-letter words in an English dictionary. Only two clashes occur \(em between " adore" and "afore", and "agate" and "agave". .RF .nr x2 \w'abeam 'u .nr x3 \w'00000# 'u .nr x0 \n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\n(x3u+\n(x2u+\w'00000#'u .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u +\n(x3u +\n(x2u \l'\n(x0u\(ul' .sp aback 22225# abide 22433# adage 23243# adore 23673# after 23837# abaft 22238# abode 22633# adapt 23278# adorn 23676# again 24246# abase 22273# abort 22678# adder 23337# adult 23858# agape 24273# abash 22274# about 22688# addle 23353# adust 23878# agate 24283# abate 22283# above 22683# adept 23378# aeger 23437# agave 24283# abbey 22239# abuse 22873# adieu 23438# aegis 23447# agent 24368# abbot 22268# abyss 22977# admit 23648# aerie 23743# agile 24453# abeam 22326# acorn 22676# admix 23649# affix 23349# aglet 24538# abele 22353# acrid 22743# adobe 23623# afoot 23668# agony 24669# abhor 22467# actor 22867# adopt 23678# afore 23673# agree 24733# \l'\n(x0u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 10.1 Keying equivalents of some words" As a more extensive example, in a dictionary of 24,500 words, just under 2,000 ambiguities (8% of words) were discovered. Such ambiguities would have to be resolved interactively by the system explaining its dilemma, and asking the user for a choice. Notice incidentally that although the keyed sequences do not have the same lexicographic order as the words, no extra cost will be associated with the table-searching operation if the dictionary is stored in inverted form, with each legal number pointing to its English equivalent or equivalents. .pp A command language syntax is also a powerful way of disambiguating keystrokes entered. Figure 10.4 shows the keypad layout for a telephone voice calculator (Newhouse and Sibley, 1969). .[ Newhouse Sibley 1969 .] .FC "Figure 10.4" This calculator provides the standard arithmetic operators, ten numeric registers, a range of pre-defined mathematical functions, and even the ability for a user to enter his own functions over the telephone. The number representation is fixed-point, with user control (through a system function) over the precision. Input of numbers is free format. .pp Despite the power of the calculator language, the dialogue is defined so that each keystroke is unique in context and never has to be disambiguated explicitly by the user. Table 10.2 summarizes the command language syntax in an informal and rather heterogeneous notation. 
.RF .nr x0 1.3i+1.7i+\w'some functions do not need the part'u .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta 1.3i +1.7i \l'\n(x0u\(ul' construct definition explanation \l'\n(x0u\(ul' .sp a sequence of s followed by a call to the system function \fIE X I T\fR .sp OR OR OR OR OR OR OR OR OR OR OR .sp + # OR + # .sp similar to .sp OR \fIregister\fR .sp a sequence of keystrokes like 1 . 2 3 4 or 1 2 3 . 4 or 1 2 3 4 .sp \fIfunction\fR # # some functions do not need the part .sp a sequence of keystrokes like \fIS I N\fR or \fIE X I T\fR or \fIM Y F U N C\fR .sp \fIclear register\fR # clears one of the 10 registers .sp \fIerase\fR # undoes the effect of the last operation .sp \fIanswer register\fR # reads the contents of a register .sp these provide "repeat" facilities .sp aborts the current utterance \l'\n(x0u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 10.2 Syntax for a telephone calculator" A calculation is a sequence of operations followed by an EXIT function call. There are twelve different operations, one for each button on the keypad. Actually, two of them \(em .ul cancel and .ul function \(em share the same key so that "#" can be reserved for use as a separator; but the context ensures that they cannot be confused by the system. .pp Six of the operations give control over the dialogue. There are three different "repeat" commands; a command (called .ul erase\c ) which undoes the effect of the last operation; one which reads out the value of a register; and one which aborts the current utterance. Four more commands provide the basic arithmetic operations of add, subtract, multiply, and divide. The operands of these may be keyed literal numbers, or register values, or function calls. A further command clears a register. .pp It is through functions that the extensibility of the language is achieved. A function has a name (like SIN, EXIT, MYFUNC) which is keyed with an appropriate single-key-per-character sequence (namely 746, 3948, 693862 respectively). One function, DEFINE, allows new ones to be entered. Another, LOOP, repeats sequences of operations. TEST incorporates arithmetic testing. The details of these are not important: what is interesting is the evident power of the calculator. .pp For example, the keying sequence .LB .NI 5 # 1 1 2 3 # 2 1 . 2 # 9 # 6 # 2 1 . 4 # .LE would be decoded as .LB .NI .ul clear\c + 123 \- 1.2 \c .ul display erase\c \- 1.4. .LE One of the difficulties with such a tight syntax is that almost any sequence will be intepreted as a valid calculation \(em syntax errors are nearly impossible. Thus a small mistake by the user can have a catastrophic effect on the calculation. Here, however, speech output gives an advantage over conventional character-by-character echoing on visual displays. It is quite adequate to echo syntactic units as they are decoded, instead of echoing keys as they are entered. It was suggested earlier in this chapter that confirmation of entry should be generated in the same way that the user would be likely to verbalize it himself. Thus the synthetic voice could respond to the above keying sequence as shown in the second line, except that the .ul display command would also state the result (and possibly summarize the calculation so far). Numbers could be verbalized as "one hundred and twenty-three" instead of as "one ... two ... three". (Note, however, that this will make it necessary to await the "#" terminator after numbers and function names before they can be echoed.) 
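.pp
To round off this discussion of keypad input, here is a small sketch of the single-key-per-character coding of Figure 10.2 and the kind of clash detection that Table 10.1 illustrates. It assumes the usual letter assignments, with Q and Z absent from the pad, and it is an illustration rather than part of any of the systems cited above.
.LB
.nf
from collections import defaultdict

# Letter-to-key assignments of the usual overlay (Figure 10.2); Q and Z do
# not appear on the classic touch-tone pad.
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "prs", "8": "tuv", "9": "wxy"}
LETTER = {c: k for k, letters in KEYS.items() for c in letters}

def encode(word):
    # Digit sequence keyed for a word, e.g. "adore" -> "23673" (plus the
    # "#" terminator when actually entered).
    return "".join(LETTER[c] for c in word.lower())

def clashes(vocabulary):
    # Invert the dictionary: group words by keying sequence; any group with
    # more than one member must be resolved interactively.
    inverted = defaultdict(list)
    for word in vocabulary:
        inverted[encode(word)].append(word)
    return {k: words for k, words in inverted.items() if len(words) > 1}

print(encode("aback"))                                    # 22225
print(clashes(["adore", "afore", "agate", "agave", "agent"]))
# {'23673': ['adore', 'afore'], '24283': ['agate', 'agave']}
.fi
.LE
Storing the word list in this inverted form, keyed by digit sequence, is what makes the dictionary look-up itself cost nothing extra.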
.sh "10.4 References" .LB "nnnn" .[ $LIST$ .] .LE "nnnn" .sh "10.5 Further reading" .pp There are no books which relate techniques of man-computer dialogue to speech interaction. The best I can do is to guide you to some of the standard works on interactive techniques. .LB "nn" .\"Gilb-1977-1 .]- .ds [A Gilb, T. .as [A " and Weinberg, G.M. .ds [D 1977 .ds [T Humanized input .ds [I Winthrop .ds [C Cambridge, Massachusetts .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n This book is subtitled "techniques for reliable keyed input", and considers most aspects of the problem of data entry by professional key operators. .in-2n .\"Martin-1973-2 .]- .ds [A Martin, J. .ds [D 1973 .ds [T Design of man-computer dialogues .ds [I Prentice-Hall .ds [C Englewood Cliffs, New Jersey .nr [T 0 .nr [A 1 .nr [O 0 .][ 2 book .in+2n Martin concerns himself with all aspects of man-computer dialogue, and the book even contains a short chapter on the use of voice response systems. .in-2n .\"Smith-1980-3 .]- .ds [A Smith, H.T. .as [A " and Green, T.R.G.(Editors) .ds [D 1980 .ds [T Human interaction with computers .ds [I Academic Press .ds [C London .nr [T 0 .nr [A 0 .nr [O 0 .][ 2 book .in+2n A recent collection of contributions on man-computer systems and programming research. .in-2n .LE "nn" .EQ delim $$ .EN .CH "11 COMMERCIAL SPEECH OUTPUT DEVICES" .ds RT "Commercial speech output devices .ds CX "Principles of computer speech .pp This chapter takes a look at four speech output peripherals that are available today. It is risky in a book of this nature to descend so close to the technology as to discuss particular examples of commercial products, for such information becomes dated very quickly. Nevertheless, having covered the principles of various types of speech synthesizer, and the methods of driving them from widely differing utterance representations, it seems worthwhile to see how these principles are embodied in a few products actually on the market. .pp Developments in electronic speech devices are moving so fast that it is hard to keep up with them, and the newest technology today will undoubtedly be superseded next year. Hence I have not tried to choose examples from the very latest technology. Instead, this chapter discusses synthesizers which exemplify rather different principles and architectures, in order to give an idea of the range of options which face the system designer. .pp Three of the devices are landmarks in the commercial adoption of speech technology, and have stood the test of time. Votrax was introduced in the early 1970's, and has been re-implemented several times since in an attempt to cover different market sectors. The Computalker appeared in 1976. It was aimed primarily at the burgeoning computer hobbies market. One of its most far-reaching effects was to stimulate the interest of hobbyists, always eager for new low-cost peripherals, in speech synthesis; and so provide a useful new source of experimentation and expertise which will undoubtedly help this heretofore rather esoteric discipline to mature. Computalker is certainly the longest-lived and probably still the most popular hobbyist's speech synthesizer. The Texas Instruments speech synthesis chip brought speech output technology to the consumer. It was the first single-chip speech synthesizer, and is still the biggest seller. It forms the heart of the "Speak 'n Spell" talking toy which appeared in toyshops in the summer of 1978. 
Although talking calculators had existed several years before, they were exotic gadgets rather than household toys. .sh "11.1 Formant synthesizer" .pp The Computalker is a straightforward implementation of a serial formant synthesizer. A block diagram of it is shown in Figure 11.1. .FC "Figure 11.1" In the centre is the main vocal tract path, with three formant filters whose resonant frequencies can be controlled individually. A separate nasal branch in parallel with the oral one is provided, with a nasal formant of fixed frequency. It is less important to allow for variation of the nasal formant frequency than it is for the oral ones, because the size and shape of the nasal tract is relatively fixed. However, it is essential to control the nasal amplitude, in particular to turn it off during non-nasal sounds. Computalker provides independent oral and nasal amplitude parameters. .pp Unvoiced excitation can be passed through the main vocal tract through the aspiration amplitude control AH. In practice, the voicing amplitudes AV and AN will probably always be zero when AH is non-zero, for physiological constraints prohibit simultaneous voicing and aspiration. A second unvoiced excitation path passes through a fricative formant filter whose resonant frequency can be varied, and has its amplitude independently controlled by AF. .rh "Control parameters." Table 11.1 summarizes the nine parameters which drive Computalker. .RF .nr x0 \w'address0'+\w'fundamental frequency of voicing00'+\w'0 bits0'+\w'logarithmic00'+\w'0000\-00000 Hz' .nr x1 (\n(.l-\n(x0)/2 .in \n(x1u .ta \w'000'u \w'address0'u +\w'fundamental frequency of voicing00'u +\w'0 bits0'u +\w'logarithmic00'u address meaning width \0\0\0range \l'\n(x0u\(ul' .sp \00 AV amplitude of voicing 8 bits \01 AN nasal amplitude 8 bits \02 AH amplitude of aspiration 8 bits \03 AF amplitude of frication 8 bits \04 FV fundamental frequency of voicing 8 bits logarithmic \0\075\-\0\0470 Hz \05 F1 formant 1 resonant frequency 8 bits logarithmic \0170\-\01450 Hz \06 F2 formant 2 resonant frequency 8 bits logarithmic \0520\-\04400 Hz \07 F3 formant 3 resonant frequency 8 bits logarithmic 1700\-\05500 Hz \08 FF fricative resonant frequency 8 bits logarithmic 1700\-14000 Hz \09 not used 10 not used 11 not used 12 not used 13 not used 14 not used 15 SW audio on-off switch 1 bit \l'\n(x0u\(ul' .in 0 .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .FG "Table 11.1 Computalker control parameters" Four of them control amplitudes, while the others control frequencies. In the latter case the parameter value is logarithmically related to the actual frequency of the excitation (FV) or resonance (F1, F2, F3, FF). The ranges over which each frequency can be controlled is shown in the Table. An independent calibration of one particular Computalker has shown that the logarithmic specifications are met remarkably well. .pp Each parameter is specified to Computalker as an 8-bit number. Parameters are addressed by a 4-bit code, and so a total of 12 bits is transferred in parallel to Computalker from the computer for each parameter update. Parameters 9 to 14 are unassigned ("reserved for future expansion" is the official phrase), and the last parameter, SW, governs the position of an audio on-off switch. .pp Computalker does not contain a clock that is accessible to the user, and so the timing of parameter updates is entirely up to the host computer. Typically, a 10\ msec interval between frames is used, with interrupts generated by a separate timer. 
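.pp
As a concrete sketch of this update cycle, the following fragment assembles one frame of parameter transfers. The 4-bit addresses follow Table 11.1, but the placement of the address in the high four bits of the 12-bit word, the out routine, and the particular parameter values are all assumptions made purely for illustration.
.LB
.nf
# One 10 msec Computalker frame, as a sketch.  Each transfer is a 12-bit
# word: a 4-bit parameter address (Table 11.1) plus an 8-bit value.
ADDRESS = {"AV": 0, "AN": 1, "AH": 2, "AF": 3, "FV": 4,
           "F1": 5, "F2": 6, "F3": 7, "FF": 8, "SW": 15}

def out(word):
    print(format(word, "012b"))      # stand-in for the parallel transfer

def send_frame(frame):
    # Send an 8-bit value for each of the nine control parameters.
    for name in ("AV", "AN", "AH", "AF", "FV", "F1", "F2", "F3", "FF"):
        out((ADDRESS[name] << 8) | (frame[name] & 0xFF))

# An illustrative voiced frame: voicing on, formants set, noise sources off.
send_frame({"AV": 200, "AN": 0, "AH": 0, "AF": 0,
            "FV": 128, "F1": 90, "F2": 150, "F3": 200, "FF": 0})
.fi
.LE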
In fact the frame interval can be anywhere between 2\ msec and 50\ msec, and can be changed to alter the rate of speaking. However, it is rather naive to view fast speech as slow speech speeded up by a linear time compression, for in human speech production the rhythm changes and elisions occur in a rather more subtle way. Thus it is not particularly useful to be able to alter the frame rate. .pp At each interrupt, the host computer transfers values for all of the nine parameters to Computalker, a total of 108 data bits. In theory, perhaps, it is only necessary to transmit those parameters whose values have changed; but in practice all of them should be updated regardless. This is because the parameters are stored for the duration of the frame in analogue sample-and-hold devices. Essentially, the parameter value is represented as the charge on a capacitor. In time \(em and it takes only a short time \(em the values drift. Although the drift over 10\ msec is insignificant, it becomes very noticeable over longer time periods. If parameters are not updated at all, the result is a "whooosh" sound up to maximum amplitude, in a period of a second or two. Hence it is essential that Computalker be serviced by the computer regularly, to update all its parameters. The audio on-off switch is provided so that the computer can turn off the sound directly if another program, which does not use the device, is to be run. .rh "Filter implementation." It is hard to get definite information on the implementation of Computalker. Because it is a commercial device, circuit diagrams are not published. It is certainly an analogue rather than a digital implementation. The designer suggests that a configuration like that of Figure 11.2 is used for the formant filters (Rice, 1976). .[ Rice 1976 Byte .] .FC "Figure 11.2" Control is obtained over the resonant frequency by varying the resistance at the bottom in sympathy with the parameter value. The middle two operational amplifiers can be modelled by a resistance $-R/k$ in the forward path, where k is the digital control value. This gives the circuit in Figure 11.3, which can be analysed to obtain the transfer function .LB .EQ - ~ k over {R~R sub 1 C sub 2 C sub 3} ~ . ~ {R sub 2 C sub 2 ~s ~+~1} over { s sup 2 ~+~~ ( 1 over {R sub 3 C sub 3} ~+~ {k R sub 2} over {R~R sub 1 C sub 3})~s ~~+~ k over {R~R sub 1 C sub 2 C sub 3}} ~ . .EN .LE .FC "Figure 11.3" .pp This expression has a DC gain of \-1, and the denominator is similar to those of the analogue formant resonators discussed in Chapter 5. However, unlike them the transfer function has a numerator which creates a zero at .LB .EQ s~~=~~-~ 1 over {R sub 2 C sub 2} ~ . .EN .LE If $R sub 2 C sub 2$ is sufficiently small, this zero will have negligible effect at audio frequencies, and the filter has the following parameters: .LB centre frequency: $~ mark 1 over {2 pi}~~( k over {R~R sub 1 C sub 2 C sub 3} ~ ) sup 1/2$ Hz .sp bandwidth:$lineup 1 over {2 pi}~~( 1 over {R sub 3 C sub 3}~+~ {k R sub 2} over {R~R sub 1 C sub 3} ~ )$ Hz. .LE .pp Note first that the centre frequency is proportional to the square root of the control value $k$. Hence a non-linear transformation must be implemented on the control signal, after D/A conversion, to achieve the required logarithmic relationship between parameter value and resonant frequency. The formant bandwidth is not constant, as it should be (see Chapter 5), but depends upon the control value $k$. 
This dependency can be minimized by selecting component values such that .LB .EQ {k R sub 2} over {R~R sub 1 C sub 3}~~<<~~1 over {R sub 3 C sub 3} .EN .LE for the largest value of $k$ which can occur. Then the bandwidth is solely determined by the time constant $R sub 3 C sub 3$. .pp The existence of the zero can be exploited for the fricative resonance. This should have zero DC gain, and so the component values for the fricative filter should make the time-constant $R sub 2 C sub 2$ large enough to place the zero sufficiently near the frequency origin. .rh "Market orientation." As mentioned above, Computalker is designed for the computer hobbies market. Figure 11.4 shows a photograph of the device. .FC "Figure 11.4" It plugs into the S\-100 bus which has been a .ul de facto standard for hobbyists for several years, and has recently been adopted as a standard by the Institute of Electrical and Electronic Engineers. This makes it immediately accessible to many microcomputer systems. .pp An inexpensive synthesis-by-rule program, which runs on the popular 8080 microprocessor, is available to drive Computalker. The input is coded in a machine-readable version of the standard phonetic alphabet, similar to that which was introduced in Chapter 2 (Table 2.1). Stress digits may appear in the transcription, and the program caters for five levels of stress. The punctuation mark at the end of an utterance has some effect on pitch. The program is perhaps remarkable in that it occupies only 6\ Kbyte of storage (including phoneme tables), and runs on an 8-bit microprocessor (but not in real time). It is, however, .ul un\c remarkable in that it produces rather poor speech. According to a demonstration cassette, "most people find the speech to be readily intelligible, especially after a little practice listening to it," but this seems extremely optimistic. It also cunningly insinuates that if you don't understand it, you yourself may share the blame with the synthesizer \(em after all, .ul most people do! Nevertheless, Computalker has made synthetic speech accessible to a large number of home computer users. .sh "11.2 Sound-segment synthesizer" .pp Votrax was the first fully commercial speech synthesizer, and at the time of writing is still the only off-the-shelf speech output peripheral (as distinct from reading machine) which is aimed specifically at synthesis-by-rule rather than storage of parameter tracks extracted from natural utterances. Figure 11.5 shows a photograph of the Votrax ML-I. .FC "Figure 11.5" .pp Votrax accepts as input a string of codes representing sound segments, each with additional bits to control the duration and pitch of the segment. In the earlier versions (eg model VS-6) there are 63 sound segments, specified by a 6-bit code, and two further bits accompany each segment to provide a 4-level control over pitch. Four pitch levels are quite inadequate to generate acceptable intonation contours for anything but isolated words spoken in citation form. However, a later model (ML-I) uses an 8-level pitch specification, as well as a 4-level duration qualifier, associated with each sound segment. It provides a vocabulary of 80 sound segments, together with an additional code which allows local amplitude modifications and extra duration alterations to following segments. A further, low-cost model (VS-K) is now available which plugs in to the S\-100 bus, and is aimed primarily at computer hobbyists. 
It provides no pitch control at all and is therefore quite unsuited to serious voice response applications. The device has recently been packaged as an LSI circuit (model SC\-01), using analogue switched-capacitor filter technology. .pp One point where the ML-I scores favourably over other speech synthesis peripherals is the remarkably convenient engineering of its computer interface, which was outlined in the previous chapter. .pp The internal workings of Votrax are not divulged by the manufacturer. Figure 11.6 shows a block diagram at the level of detail that they supply. .FC "Figure 11.6" It seems to be essentially a formant synthesizer with analogue function generators and parameter smoothing circuits that provide transitions between sound segments. .rh "Sound segments." The 80 segments of the high-range ML-I model are summarized in Table 11.2. .FC "Table 11.2" They are divided into phoneme classes according to the classification discussed in Chapter 2. The segments break down into the following categories. (Numbers in parentheses are the corresponding figures for VS-6.) .LB "00 (00) " .NI "00 (00) " 11 (11) vowel sounds which are representative of the phonological vowel classes for English .NI "00 (00) " \09 \0(7) vowel allophones, with slightly different sound qualities from the above .NI "00 (00) " 20 (15) segments whose sound qualities are identical to the segments above, but with different durations .NI "00 (00) " 22 (22) consonant sounds which are representative of the phonological consonant classes for English .NI "00 (00) " 11 \0(6) consonant allophones .NI "00 (00) " \04 \0(0) segments to be used in conjunction with unvoiced plosives to increase their aspiration .NI "00 (00) " \02 \0(2) silent segments, with different pause durations .NI "00 (00) " \01 \0(0) very short silent segment (about 5\ msec). .LE "00 (00) " Somewhat under half of the 80 elements can be put into one-to-one correspondence with the phonemes of English; the rest are either allophonic variations or additional sounds which can sensibly be combined with certain phonemes in certain contexts. The Votrax literature, and consequently Votrax users, persists in calling all elements "phonemes", and this can cause considerable confusion. I prefer to use the term "sound segment" instead, reserving "phoneme" for its proper linguistic use. .pp The rules which Votrax uses for transitions between sound segments are not made public by the manufacturer, and are embedded in encapsulated circuits in the hardware. They are clearly very crude. The key to successful encoding of utterances is to use the many non-phonemic segments in an appropriate way as transitions between the main segments which represent phonetic classes. This is a tricky process, and I have heard of one commercial establishment giving up in despair at the extreme difficulty of generating the utterances it wanted. It probably explains the proliferation of letter-to-sound rules for Votrax which have been developed in research laboratories (Colby .ul et al, 1978; Elovitz .ul et al, 1976; McIlroy, 1974; Sherwood, 1978). .[ Colby Christinaz Graham 1978 .] .[ Elovitz 1976 IEEE Trans Acoustics Speech and Signal Processing .] .[ McIlroy 1974 .] .[ Sherwood 1978 .] Nevertheless, with luck, skill, and especially persistence, excellent results can be obtained. The ML-I manual (Votrax, 1976) contains a list of about 625 words and short phrases, and they are usually clearly recognizable. .[ Votrax 1976 .] .rh "Duration and pitch qualifiers." 
Each sound segment has a different duration. Table 11.2 shows the measured duration of the segments, although no calibration data is given by Votrax. As mentioned earlier, a 2-bit number accompanies each segment to modify its duration, and this was set to 3 (least duration) for the measurements. The qualifier has a multiplicative effect, shown in Table 11.3. .RF .nr x1 (\w'rate qualifier'/2) .nr x2 (\w'in Table 11.2 by'/2) .nr x0 \n(x1+2i+\w'00'+\n(x2 .nr x3 (\n(.l-\n(x0)/2 .in \n(x3u .ta \n(x1u +2i \l'\n(x0u\(ul' .sp .nr x2 (\w'multiply duration'/2) rate qualifier \0\0\h'-\n(x2u'multiply duration .nr x2 (\w'in Table 11.2 by'/2) \0\0\h'-\n(x2u'in Table 11.2 by \l'\n(x0u\(ul' .sp 3 1.00 2 1.11 1 1.22 0 1.35 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 11.3 Effect of the 2-bit per-segment rate qualifier" .pp As well as the 2-bit rate qualifier, each sound segment is accompanied by a 3-bit pitch specification. This provides a linear control over fundamental frequency, and Table 11.4 shows the measured values. .RF .nr x1 (\w'pitch specifier'/2) .nr x2 (\w'pitch (Hz)'/2) .nr x0 \n(x1+1.5i+\n(x2 .nr x3 (\n(.l-\n(x0)/2 .in \n(x3u .ta \n(x1u +1.5i \l'\n(x0u\(ul' .sp pitch specifier \h'-\n(x2u'pitch (Hz) \l'\n(x0u\(ul' .sp 0 \057.5 1 \064.1 2 \069.4 3 \075.8 4 \080.6 5 \087.7 6 \094.3 7 100.0 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 11.4 Effect of the 3-bit per-segment pitch specifier" The quantization interval varies from one to two semitones. Votrax interpolates pitch from phoneme to phoneme in a highly satisfactory manner, and this permits surprisingly sophisticated intonation contours to be generated considering the crude 8-level quantization. .pp The notation in which the Votrax manual defines utterances gives duration qualifiers and pitch specifications as digits preceding the sound segment, and separated from it by a slash (/). Thus, for example, .LB 14/THV .LE defines the sound segment THV with duration qualifier 1 (multiplies the 70\ msec duration of Table 11.2 by 1.22 \(em from Table 11.3 \(em to give 85\ msec) and pitch specification 4 (81 Hz). This representation of a segment is transformed into two ASCII characters before transmission to the synthesizer. .rh "Converting a phonetic transcription to sound segments." It would be useful to have a computer procedure to produce a specification for an utterance in terms of Votrax sound segments from a standard phonetic transcription. This could remove much of the tedium from utterance preparation by incorporating the contextual rules given in the Votrax manual. Starting with a phonetic transcription, each phoneme should be converted to its default Votrax representative. The resulting "wide" Votrax transcription must be transformed into a "narrow" one by application of contextual rules. Separate rules are needed for .LB .NP vowel clusters (diphthongs) .NP vowel transitions (ie consonant-vowel and vowel-consonant, where the vowel segment is altered) .NP intervocalic consonants .NP consonant transitions (ie consonant-vowel and vowel-consonant, where the consonant segment is altered) .NP consonant clusters .NP stressed-syllable effects .NP utterance-final effects. .LE Stressed-syllable effects (which include extra aspiration for unvoiced stops beginning stressed syllables) can be applied only if stress markers are included in the phonetic transcription. 
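.pp
Before turning to the contextual rules themselves, the qualifier arithmetic just described can be made concrete with a short sketch. It simply picks the rate qualifier and pitch code nearest to a target duration and pitch, reproducing the 14/THV example; choosing the nearest values in this way is my own simplification, not necessarily what any particular driving program does.
.LB
.nf
# Per-segment qualifier arithmetic from Tables 11.3 and 11.4.  Only THV's
# nominal duration is shown; the full set would come from Table 11.2.
RATE_FACTOR = {3: 1.00, 2: 1.11, 1: 1.22, 0: 1.35}              # Table 11.3
PITCH_HZ = [57.5, 64.1, 69.4, 75.8, 80.6, 87.7, 94.3, 100.0]    # Table 11.4
NOMINAL_MSEC = {"THV": 70}                                      # from Table 11.2

def qualify(segment, target_msec, target_hz):
    # Choose the rate qualifier and pitch code nearest to the targets and
    # write them in the manual's "dp/SEGMENT" notation.
    base = NOMINAL_MSEC[segment]
    rate = min(RATE_FACTOR, key=lambda q: abs(base * RATE_FACTOR[q] - target_msec))
    pitch = min(range(8), key=lambda p: abs(PITCH_HZ[p] - target_hz))
    return "%d%d/%s" % (rate, pitch, segment)

print(qualify("THV", 85, 81))    # 14/THV: 70 msec x 1.22 = 85 msec, at 80.6 Hz
.fi
.LE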
.pp To specify a rule, it is necessary to give a .ul matching part and a .ul context, which define at what points in an utterance it is applicable, and a .ul replacement part which is used to replace the matching part. The context can be specified in mathematical set notation using curly brackets. For example, .LB {G SH W K} OO IU OO .LE states that the matching part OO is replaced by IU OO, after a G, SH, W, or K. In fact, allophonic variations of each sound segment should also be accepted as valid context, so this rule will also replace OO after .G, CH, .W, .K, or .X1 (Table 11.2 gives allophones of each segment). .pp Table 11.5 gives some rules that have been used for this purpose. .FC "Table 11.5" They were derived from careful study of the hints given in the ML-I manual (Votrax, 1976). .[ Votrax 1976 .] Classes such as "voiced" and "stop-consonant" in the context specify sets of sound segments in the obvious way. The beginning of a stressed syllable is marked in the input by ".syll". Parentheses in the replacement part have a significance which is explained in the next section. .rh "Handling prosodic features." We know from Chapter 8 the vital importance of prosodic features in synthesizing lifelike speech. To allow them to be assigned to Votrax utterances, an intermediate output from a prosodic analysis program like ISP can be used. For example, .LB 1 \c .ul dh i s i z /*d zh aa k s /h aa u s; .LE which specifies "this is Jack's house" in a declarative intonation with emphasis on the "Jack's", can be intercepted in the following form: .LB \&.syll .ul dh\c \ 50\ (0\ 110) .ul i\c \ 60 .ul s\c \ 90\ (0\ 99) .ul i\c \ 60 .ul z\c \ 60\ (50\ 110) \&.syll .ul d\c \ 50\ (0\ 110) .ul zh\c \ 50 .ul aa\c \ 90 .ul k\c \ 120\ (10\ 90) .ul s\c \ 90 \&.syll .ul h\c \ 60 .ul aa\c \ 140 .ul u\c \ 60 .ul s\c \ 140 ^\ 50\ (40\ 70) . .LE Syllable boundaries, pitches, and durations have been assigned by the procedures given earlier (Chapter 8). A number always follows each phoneme to specify its duration (in msec). Pairs of numbers in parentheses define a pitch specification at some point during the preceding phoneme: the first number of the pair defines the time offset of the specification from the beginning of the phoneme, while the second gives the pitch itself (in Hz). This form of utterance specification can then be passed to a Votrax conversion procedure. .pp The phonetic transcription is converted to Votrax sound segments using the method described above. The "wide" Votrax transcription is .LB \&.syll THV I S I Z .syll D ZH AE K S .syll H AE OO S PA0 ; .LE which is transformed to the following "narrow" one according to the rules of Table 11.5: .LB \&.syll THV I S I Z .syll D J (AE EH3) K S .syll H1 (AH1 .UH2) (O U) S PA0 . .LE The duration and pitch specifications are preserved by the transformation in their original positions in the string, although they are not shown above. The next stage uses them to expand the transcription by adjusting the segments to have durations as close as possible to the specifications, and computing pitch numbers to be associated with each phoneme. .pp Correct duration-expansion can, in general, require a great amount of computation. Associated with each sound segment is a set of elements with the same sound quality but different durations, formed by attaching each of the four duration qualifiers of Table 11.3 to the segment and any others which are sound-equivalents to it. 
For example, the segment Z has the duration-set .LB {3/Z 2/Z 1/Z 0/Z} .LE with durations .LB { 70 78 85 95} .LE msec respectively, where the initial numerals denote the duration qualifier. The segment I has the much larger duration-set .LB {3/I2 2/I2 1/I2 0/I2 3/I1 2/I1 1/I1 0/I1 3/I 2/I 1/I 0/I} .LE with durations .LB { 58 64 71 78 83 92 101 112 118 131 144 159}, .LE because segments I1 and I2 are sound-equivalents to it. Duration assignment is a matter of selecting elements from the duration-set whose total duration is as close as possible to that desired for the segment. It happens that Votrax deals sensibly with concatenations of more than one identical plosive, suppressing the stop burst on all but the last. Although the general problem of approximating durations in this way is computationally demanding, a simple recursive exhaustive search works in a reasonable amount of time because the desired duration is usually not very much greater than the longest member of the duration-set, and so the search terminates quite quickly. .pp At this point, the role of the parentheses which appear on the right-hand side of Table 11.5 becomes apparent. Because durations are only associated with the input phonemes, which may each be expanded into several Votrax segments, it is necessary to keep track of the segments which have descended from a single phoneme. Target durations are simply spread equally across any parenthesized groups to which they apply. .pp Having expanded durations, mapping pitches on to the sound segments is a simple matter. The ISP system for formant synthesizers (Chapters 7 and 8) uses linear interpolation between pitch specifications, and the frequency which results for each sound segment needs to be converted to a Votrax specification using the information in Table 11.4. .pp After applying these procedures to the example utterance, it becomes .LB 14/THV 14/I1 03/S 14/I1 04/Z 04/D 04/J 33/AE 33/EH3 \c 02/K 02/K 02/S 02/H1 01/AH2 01/.UH2 31/O2 31/U1 01/S \c 10/S 30/PA0 30/PA0 . .LE In several places, shorter sound-equivalents have been substituted (I1 for I, AH2 for AH1, O2 for O, and U1 for U), while doubling-up also occurs (in the K, S, and PA0 segments). .pp The speech which results from the use of these procedures with the Votrax synthesizer sounds remarkably similar to that generated by the ISP system which uses parametrically-controlled synthesizers. Formal evaluation experiments have not been undertaken, but it seems clear from careful listening that it would be rather difficult, and probably pointless, to evaluate the Votrax conversion algorithm, for the outcome would be completely dominated by the success of the original pitch and rhythm assignment procedures. .sh "11.3 Linear predictive synthesizer" .pp The first single-chip speech synthesizer was introduced by Texas Instruments (TI) in the summer of 1978 (Wiggins and Brantingham, 1978). .[ Wiggins Brantingham 1978 .] It was a remarkable development, combining recent advances in signal processing with the very latest in VLSI technology. Packaged in the Speak 'n Spell toy (Figure 11.7), it was a striking demonstration of imagination and prowess in integrated electronics. .FC "Figure 11.7" It gave TI a long lead over its competitors and surprised many experts in the speech field. .EQ delim @@ .EN Overnight, it seemed, digital speech technology had descended from research laboratories with their expensive and specialized equipment into a $50.00 consumer item. 
.pp At this point, the role of the parentheses which appear on the right-hand side of Table 11.5 becomes apparent. Because durations are only associated with the input phonemes, which may each be expanded into several Votrax segments, it is necessary to keep track of the segments which have descended from a single phoneme. Target durations are simply spread equally across any parenthesized groups to which they apply. .pp Once durations have been expanded, mapping pitches onto the sound segments is a simple matter. The ISP system for formant synthesizers (Chapters 7 and 8) uses linear interpolation between pitch specifications, and the frequency which results for each sound segment needs to be converted to a Votrax specification using the information in Table 11.4. .pp When these procedures are applied to the example utterance, it becomes .LB 14/THV 14/I1 03/S 14/I1 04/Z 04/D 04/J 33/AE 33/EH3 \c 02/K 02/K 02/S 02/H1 01/AH2 01/.UH2 31/O2 31/U1 01/S \c 10/S 30/PA0 30/PA0 . .LE In several places, shorter sound-equivalents have been substituted (I1 for I, AH2 for AH1, O2 for O, and U1 for U), while doubling-up also occurs (in the K, S, and PA0 segments). .pp The speech which results from the use of these procedures with the Votrax synthesizer sounds remarkably similar to that generated by the ISP system which uses parametrically-controlled synthesizers. Formal evaluation experiments have not been undertaken, but it seems clear from careful listening that it would be rather difficult, and probably pointless, to evaluate the Votrax conversion algorithm, for the outcome would be completely dominated by the success of the original pitch and rhythm assignment procedures. .sh "11.3 Linear predictive synthesizer" .pp The first single-chip speech synthesizer was introduced by Texas Instruments (TI) in the summer of 1978 (Wiggins and Brantingham, 1978). .[ Wiggins Brantingham 1978 .] It was a remarkable development, combining recent advances in signal processing with the very latest in VLSI technology. Packaged in the Speak 'n Spell toy (Figure 11.7), it was a striking demonstration of imagination and prowess in integrated electronics. .FC "Figure 11.7" It gave TI a long lead over its competitors and surprised many experts in the speech field. .EQ delim @@ .EN Overnight, it seemed, digital speech technology had descended from research laboratories with their expensive and specialized equipment into a $50.00 consumer item. .EQ delim $$ .EN Naturally TI did not sell the chip separately but only as part of their mass-market product; nor would they make available information on how to drive it directly. Only recently, when other similar devices appeared on the market, did they unbundle the package and sell the chip. .rh "The Speak 'n Spell toy." The TI chip (TMC0280) uses the linear predictive method of synthesis, primarily because of the ease of the speech analysis procedure and the known high quality at low data rates. Speech researchers, incidentally, sometimes scoff at what they perceive to be the poor quality of the toy's speech; but considering the data rate used (which averages 1200 bits per second of speech) it is remarkably good. Anyway, I have never heard a child complain! \(em although it is not uncommon to misunderstand a word. Two 128\ Kbit read-only memories are used in the toy to hold data for about 330 words and phrases \(em lasting between 3 and 4 minutes \(em of speech. At the time (mid-1978) these memories were the largest that were available in the industry. The data flow and user dialogue are handled by a microprocessor, which is the fourth LSI circuit in the photograph of Figure 11.8. .FC "Figure 11.8" .pp A schematic diagram of the toy is given in Figure 11.9. .FC "Figure 11.9" It has a small display which shows upper-case letters. (Some teachers of spelling hold that the lack of lower case destroys any educational value that the toy may have.) It has a full 26-key alphanumeric keyboard with 14 additional control keys. (This is the toy's Achilles' heel, for the keys fall out after extended use. More recent toys from TI use an improved keyboard.) The keyboard is laid out alphabetically instead of in QWERTY order, possibly missing an opportunity to teach kids to type as well as spell. An internal connector permits vocabulary expansion with up to 14 more read-only memory chips. Controlling the toy is a 4-bit microprocessor (a modified TMS1000). However, the synthesizer chip does not receive data from the processor. During speech, it accesses the memory directly and only returns control to the processor when an end-of-phrase marker is found in the data stream. Meanwhile the processor is idle, and cannot even be interrupted from the keyboard. Moreover, in one operational mode ("say-it") the toy embarks upon a long monologue and remains deaf to the keyboard \(em it cannot even be turned off. Any three-year-old will quickly discover that a sharp slap solves the problem! A useful feature is that the device switches itself off if unused for more than a few minutes. A fascinating account of the development of the toy from the point of view of product design and market assessment has been published (Frantz and Wiggins, 1981). .[ Frantz Wiggins 1981 .] .rh "Control parameters." The lattice filtering method of linear predictive synthesis (see Chapter 6) was selected because of its good stability properties and guaranteed performance with small word sizes. The lattice has 10 stages. All the control parameters are represented as 10-bit fixed-point numbers, and the lattice operates with an internal precision of 14 bits (including sign). .pp There are twelve parameters for the device: ten reflection coefficients, energy, and pitch. These are updated every 20\ msec. However, if 10-bit values were stored for each, a data rate of 120 bits every 20\ msec, or 6\ Kbit/s, would be needed.
This would reduce the capacity of the two read-only memory chips to well under a minute of speech \(em perhaps 65 words and phrases. But one of the desirable properties of the reflection coefficients which drive the lattice filter is that they are amenable to quantization. A non-linear quantization scheme is used, with the parameter data addressing an on-chip quantization table to yield a 10-bit coefficient. .pp Table 11.6 shows the number of bits devoted to each parameter. .RF .in+0.3i .ta \w'repeat flag00'u +1.3i +0.8i .nr x0 \w'repeat flag00'+1.3i+\w'00'+(\w'size (10-bit words)'/2) \l'\n(x0u\(ul' .nr x1 (\w'bits'/2) .nr x2 (\w'quantization table'/2) .nr x3 0.2m parameter \0\h'-\n(x1u'bits \0\0\h'-\n(x2u'quantization table .nr x2 (\w'size (10-bit words)'/2) \0\0\h'-\n(x2u'size (10-bit words) \l'\n(x0u\(ul' .sp energy \04 \016 \v'\n(x3u'_\v'-\n(x3u'\z4\v'\n(x3u'_\v'-\n(x3u' energy=0 means 4-bit frame pitch \05 \032 repeat flag \01 \0\(em \z1\v'\n(x3u'_\v'-\n(x3u'\z0\v'\n(x3u'_\v'-\n(x3u' repeat flag =1 means 10-bit frame k1 \05 \032 k2 \05 \032 k3 \04 \016 k4 \04 \016 \z2\v'\n(x3u'_\v'-\n(x3u'\z8\v'\n(x3u'_\v'-\n(x3u' pitch=0 (unvoiced) means 28-bit frame k5 \04 \016 k6 \04 \016 k7 \04 \016 k8 \03 \0\08 k9 \03 \0\08 k10 \03 \0\08 \z4\v'\n(x3u'_\v'-\n(x3u'\z9\v'\n(x3u'_\v'-\n(x3u' otherwise 49-bit frame __ ___ .sp 49 bits 216 words \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in-0.3i .FG "Table 11.6 Bit allocation for Speak 'n Spell chip" There are 4 bits for energy, and 5 bits for pitch and the first two reflection coefficients. Thereafter the number of bits allocated to reflection coefficients decreases steadily, for higher coefficients are less important for intelligibility than lower ones. (Note that using a 10-stage filter is tantamount to allocating .ul no bits to coefficients higher than the tenth.) With a 1-bit "repeat" flag, whose role is discussed shortly, the frame size becomes 49 bits. Updated every 20\ msec, this gives a data rate of just under 2.5\ Kbit/s. .pp The parameters are expanded into 10-bit numbers by a separate quantization table for each one. For example, the five pitch bits address a 32-word look-up table which returns a 10-bit value. The transformation is logarithmic in this case, the lowest pitch being around 50 Hz and the highest 190 Hz. As shown in Table 11.6, a total of 216 10-bit words suffices to hold all twelve quantization tables; and they are implemented on the synthesizer chip. To provide further smoothing of the control parameters, they are interpolated linearly from one frame to the next at eight points within the frame. .pp The raw data rate of 2.5\ Kbit/s is reduced to an average of 1200\ bit/s by further coding techniques. Firstly, if the energy parameter is zero the frame is silent, and no more parameters are transmitted (4-bit frame). Secondly, if the "repeat" flag is 1 all reflection coefficients are held over from the previous frame, giving a constant filter but with the ability to vary amplitude and pitch (10-bit frame). Finally, if the frame is unvoiced (signalled by the pitch value being zero) only four reflection coefficients are transmitted, because the ear is relatively insensitive to spectral detail in unvoiced speech (28-bit frame). The end of the utterance is signalled by the energy bits all being 1. .rh "Chip organization." The configuration of the lattice filter is shown in Figure 11.10. 
.FC "Figure 11.10" The "two-multiplier" structure (Chapter 6) is used, so the 10-stage filter requires 19 multiplications and 19 additions per speech sample. (The last operation in the reverse path at the bottom is not needed.) Since a 10\ kHz sample rate is used, just 100\ $mu$sec are available for each speech sample. A single 5\ $mu$sec adder and a pipelined multiplier are implemented on the chip, and multiplexed among the 19 operations. The latter begins a new multiplication every 5\ $mu$sec, and finishes it 40\ $mu$sec later. These times are within the capability of p-channel MOS technology, allowing the chip to be produced at low cost. The time slot for the 20'th, unnecessary, filter multiplication is used for an overall gain adjustment. .pp The final analogue signal is produced by an 8-bit on-chip D/A converter which drives a 200 milliwatt speaker through an impedance-matching transformer. These constitute the necessary analogue low-pass desampling filter. .pp Figure 11.11 summarizes the organization of the synthesis chip. .FC "Figure 11.11" Serial data enters directly from the read-only memories, although a control signal from the processor begins synthesis and another signal is returned to it upon termination. The data is decoded into individual parameters, which are used to address the quantization tables to generate the full 10-bit parameter values. These are interpolated from one frame to the next. The lower part of the Figure shows the speech generation subsystem. An excitation waveform for voiced speech is stored in read-only memory and read out repeatedly at a rate determined by the pitch. The source for unvoiced sounds is hard-limited noise provided by a digital pseudo-random bit generator. The sound source that is used depends on whether the pitch value is zero or not: notice that this precludes mixed excitation for voiced fricatives (and the sound is noticeably poor in words like "zee"). A gain multiplication is performed before the signal is passed through the lattice synthesis filter, described earlier. .sh "11.4 Programmable signal processors" .pp The TI chip has a fixed architecture, and is destined forever to implement the same vocal tract model \(em a 10'th order lattice filter. A more recent device, the Programmable Digital Signal Processor (Caldwell, 1980) from Telesensory Systems allows more flexibility in the type of model. .[ Caldwell 1980 .] It can serve as a digital formant synthesizer or a linear predictive synthesizer, and the order of model (number of formants, in the former case) can be changed. .pp Before describing the PDSP, it is worth looking at an earlier microprocessor which was designed for digital signal processing. Some industry observers have said that this processor, the Intel 2920, is to the analogue design engineer what the first microprocessor was to the random logic engineer way back in the mists of time (early 1970's). .rh "The 'analogue microprocessor'." The 2920 is a digital microprocessor. However, it contains an on-chip D/A converter, which can be used in successive approximation fashion for A/D conversion under program control, and its architecture is designed to aid digital signal processing calculations. Although the precision of conversion is 9 bits, internal arithmetic is done with 25 bits to accomodate the accumulation of round-off errors in arithmetic operations. An on-chip programmable read-only memory holds a 192-instruction program, which is executed in sequence with no program jumps allowed. 
.sh "11.4 Programmable signal processors" .pp The TI chip has a fixed architecture, and is destined forever to implement the same vocal tract model \(em a 10'th order lattice filter. A more recent device, the Programmable Digital Signal Processor (Caldwell, 1980) from Telesensory Systems allows more flexibility in the type of model. .[ Caldwell 1980 .] It can serve as a digital formant synthesizer or a linear predictive synthesizer, and the order of model (number of formants, in the former case) can be changed. .pp Before describing the PDSP, it is worth looking at an earlier microprocessor which was designed for digital signal processing. Some industry observers have said that this processor, the Intel 2920, is to the analogue design engineer what the first microprocessor was to the random logic engineer way back in the mists of time (early 1970's). .rh "The 'analogue microprocessor'." The 2920 is a digital microprocessor. However, it contains an on-chip D/A converter, which can be used in successive approximation fashion for A/D conversion under program control, and its architecture is designed to aid digital signal processing calculations. Although the precision of conversion is 9 bits, internal arithmetic is done with 25 bits to accommodate the accumulation of round-off errors in arithmetic operations. An on-chip programmable read-only memory holds a 192-instruction program, which is executed in sequence with no program jumps allowed. This ensures that each pass through the program takes the same time, so that the analogue waveform is regularly sampled and processed. .pp The device is implemented in n-channel MOS technology, which makes it slightly faster than the pMOS Speak 'n Spell chip. At its fastest operating speed each instruction takes 400 nsec. The 192-instruction program therefore executes in 76.8\ $mu$sec, corresponding to a sampling rate of about 13\ kHz. Thus the processor can handle signals with a bandwidth of 6.5\ kHz \(em ample for high-quality speech. However, a special EOP (end of program) instruction is provided which causes an immediate jump back to the beginning. Hence if the program occupies less than 192 instructions, faster sampling rates can be used. For example, a single second-order formant resonance requires only 14 instructions and so can be executed at over 150\ kHz. .pp Despite this speed, the 2920 is only marginally capable of synthesizing speech. Table 11.7 gives approximate numbers of instructions needed to do some subtasks for speech generation (Hoff and Li, 1980). .[ Hoff Li 1980 Software makes a big talker .] .RF .nr x0 \w'parameter entry and data distribution0000'+\w'00000' .nr x1 \w'instructions' .nr x2 (\n(.l-\n(x0)/2 .in \n(x2u .ta \w'parameter entry and data distribution0000'u \l'\n(x0u\(ul' .sp task \0\0\0\0\0\h'-\n(x1u'instructions \l'\n(x0u\(ul' .sp parameter entry and data distribution 35\-40 glottal pulse generation \0\0\08 noise generation \0\0\011 lattice section \0\0\020 formant filter \0\0\014 \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .in 0 .FG "Table 11.7 2920 instruction counts for typical speech subsystems" The parameter entry and data distribution procedure collects 10 8-bit parameters from a serial input stream, at a frame rate of 100 frames/s. The parameter data rate is 8\ Kbit/s, and the routine assumes that the 2920 performs each complete cycle in 125\ $mu$sec to generate sampled speech at 8\ kHz. Therefore one bit of parameter data is accepted on every cycle. The glottal pulse program generates an asymmetrical triangular waveform (Chapter 5), while the noise generator uses a 17-bit pseudo-random feedback shift register. About 30% of the 192-instruction program memory is consumed by these essential tasks. A two-multiplier lattice section takes 20 instructions, and so only six sections can fit into the remaining program space. It may be possible to use two 2920's to implement a complete 10 or 12'th order lattice, but the results of the first stage must be passed to the second by transmitting analogue or digital data between the analogue ports of the two devices \(em not a terribly satisfactory method. .pp Since a formant filter occupies only 14 instructions, up to nine of them would fit in the program space left after the above-mentioned essential subsystems. Although other necessary house-keeping tasks may reduce this number substantially, it does seem possible to implement a formant synthesizer on a single 2920.
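.pp
For concreteness, here is what the formant filter entry of Table 11.7 amounts to when written out in C: a second-order resonator computed once per sample, with coefficients derived from a centre frequency and bandwidth in the standard way.
The 500\ Hz formant, 60\ Hz bandwidth and impulse excitation are illustrative values only, and the 14 instructions on the 2920 buy roughly this much arithmetic together with the associated data movement.
.LB
.nf
/*
 * A second-order digital formant resonator of the kind counted in
 * Table 11.7: y[n] = A*x[n] + B*y[n-1] + C*y[n-2].  The 500 Hz centre
 * frequency, 60 Hz bandwidth and impulse input are illustrative values,
 * not taken from any particular synthesizer.
 */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979

int main(void)
{
    double fs = 8000.0;            /* sampling rate, Hz                */
    double freq = 500.0;           /* formant centre frequency, Hz     */
    double bw = 60.0;              /* formant bandwidth, Hz            */
    double r = exp(-PI * bw / fs); /* pole radius                      */
    double C = -r * r;
    double B = 2.0 * r * cos(2.0 * PI * freq / fs);
    double A = 1.0 - B - C;        /* unity gain at zero frequency     */
    double y1 = 0.0, y2 = 0.0;     /* y[n-1] and y[n-2]                */
    int n;

    for (n = 0; n < 40; n++) {
        double x = (n == 0) ? 1.0 : 0.0;   /* unit impulse excitation  */
        double y = A * x + B * y1 + C * y2;
        y2 = y1;
        y1 = y;
        printf("%8.5f", y);
        puts("");
    }
    return 0;
}
.fi
.LE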
.FC "Figure 11.12" .pp The control unit accepts parameter data from a host computer, one byte at a time. The data is temporarily held in buffer memory before being serialized and passed to the arithmetic unit. Notice that for the 2920 we assumed that parameters were presented to the chip already serialized and precisely timed: the PDSP control unit effectively releases the host from this high-speed real-time operation. But it does more. It generates both a voiced and an unvoiced excitation source and passes them to the arithmetic unit, to relieve the latter of the general-purpose programming required for both these tasks and allow its instruction set to be highly specialized for digital filtering. .pp The arithmetic unit has rather a peculiar structure. It accomodates only 16 program steps and can execute the full 16-instruction program at a rate of 10\ kHz. The internal word-length is 18 bits, but coefficients and the digital output are only 10 bits. Each instruction can accomplish quite a lot of work. Figure 11.13 shows that there are four separate blocks of store in addition to the program memory. .FC "Figure 11.13" One location of each block is automatically associated with each program step. Thus on instruction 2, for example, two 18-bit scratchpad registers MA(2) and MB(2), and two 10-bit coefficient registers A1(2) and A2(2), are accessible. In addition five general registers, curiously numbered R1, R2, R5, R6, R7, are available to every program step. .pp Each instruction has five fields. A single instruction loads all the general registers and simultaneously performs two multiplications and up to three additions. The fields specify exactly which operands are involved in these operations. .pp The instructions of the PDSP arithmetic unit are really very powerful. For example, a second-order digital formant resonator requires only two program steps. A two-multiplier lattice stage needs only one step, and a complete 12-stage lattice filter can be implemented in the 16 steps available. An important feature of the architecture is that it is quite easy to incorporate more than one arithmetic unit into a system, with a single control unit. Intermediate data can be transferred digitally between arithmetic units since the D/A converter is off-chip. A four-multiplier normalized lattice (Chapter 6) with 12 stages can be implemented on two arithmetic units, as can a lattice filter which incorporates zeros as well as poles, and a complex series/parallel formant synthesizer with a total of 12 resonators whose centre frequencies and bandwidths can be controlled independently (Klatt, 1980). .[ Klatt 1980 .] .pp How this device will fare in actual commercial products is yet to be seen. It is certainly much more sophisticated than the TI Speak 'n Spell chip, and a complete system will necessitate a much higher chip count and consequently more expense. Telesensory Systems are committed to producing a text-to-speech system based upon it for use both in a reading machine for the blind and as a text-input speech-output computer peripheral. .sh "11.5 References" .LB "nnnn" .[ $LIST$ .] 
.LE "nnnn" .bp .ev2 .ta \w'\fIsilence\fR 'u +\w'.EH100'u +\w'(used to change amplitude and duration)00'u +\w'00000000000test word'u .nr x0 \w'\fIsilence\fR '+\w'.EH100'+\w'(used to change amplitude and duration)00'+\w'00000000000test word' \l'\n(x0u\(ul' .sp .nr x1 (\w'Votrax'/2) .nr x2 (\w'duration (msec)'/2) .nr x3 \w'test word' \h'-\n(x1u'Votrax \0\h'-\n(x2u'duration (msec) \h'-\n(x3u'test word \l'\n(x0u\(ul' .sp .nr x3 \w'hid' \fIi\fR I 118 \h'-\n(x3u'hid I1 (sound equivalent of I) \083 I2 (sound equivalent of I) \058 I3 (allophone of I) \058 .I3 (sound equivalent of I3) \083 AY (allophone of I) \065 .nr x3 \w'head' \fIe\fR EH 118 \h'-\n(x3u'head EH1 (sound equivalent of EH) \070 EH2 (sound equivalent of EH) \060 EH3 (allophone of EH) \060 .EH2 (sound equivalent of EH3) \070 A1 (allophone of EH) 100 A2 (sound equivalent of A1) \095 .nr x3 \w'had' \fIaa\fR AE 100 \h'-\n(x3u'had AE1 (sound equivalent of AE) 100 .nr x3 \w'hod' \fIo\fR AW 235 \h'-\n(x3u'hod AW2 (sound equivalent of AW) \090 AW1 (allophone of AW) 143 .nr x3 \w'hood' \fIu\fR OO 178 \h'-\n(x3u'hood OO1 (sound equivalent of OO) 103 IU (allophone of OO) \063 .nr x3 \w'hud' \fIa\fR UH 103 \h'-\n(x3u'hud UH1 (sound equivalent of UH) \095 UH2 (sound equivalent of UH) \050 UH3 (allophone of UH) \070 .UH3 (sound equivalent of UH3) 103 .UH2 (allophone of UH) \060 .nr x3 \w'hard' \fIar\fR AH1 143 \h'-\n(x3u'hard AH2 (sound equivalent of AH1) \070 .nr x3 \w'hawed' \fIaw\fR O 178 \h'-\n(x3u'hawed O1 (sound equivalent of O) 118 O2 (sound equivalent of O) \083 .O (allophone of O) 178 .O1 (sound equivalent of .O) 123 .O2 (sound equivalent of .O) \090 .nr x3 \w'who d' \fIuu\fR U 178 \h'-\n(x3u'who'd U1 (sound equivalent of U) \090 .nr x3 \w'heard' \fIer\fR ER 143 \h'-\n(x3u'heard .nr x3 \w'heed' \fIee\fR E 178 \h'-\n(x3u'heed E1 (sound equivalent of E) 118 \fIr\fR R \090 .R (allophone of R) \050 \fIw\fR W \083 .W (allophone of W) \083 \l'\n(x0u\(ul' .sp3 .ce Table 11.2 Votrax sound segments and their durations .bp \l'\n(x0u\(ul' .sp .nr x1 (\w'Votrax'/2) .nr x2 (\w'duration (msec)'/2) .nr x3 \w'test word' \h'-\n(1u'Votrax \0\h'-\n(x2u'duration (msec) \h'-\n(x3u'test word \l'\n(x0u\(ul' .sp \fIl\fR L 105 L1 (allophone of L) 105 \fIy\fR Y 103 Y1 (allophone of Y) \083 \fIm\fR M 105 \fIb\fR B \070 \fIp\fR P 100 .PH (aspiration burst for use with P) \088 \fIn\fR N \083 \fId\fR D \050 .D (allophone of D) \053 \fIt\fR T \090 DT (allophone of T) \050 .S (aspiration burst for use with T) \070 \fIng\fR NG 120 \fIg\fR G \075 .G (allophone of G) \075 \fIk\fR K \075 .K (allophone of K) \080 .X1 (aspiration burst for use with K) \068 \fIs\fR S \090 \fIz\fR Z \070 \fIsh\fR SH 118 CH (allophone of SH) \055 \fIzh\fR ZH \090 J (allophone of ZH) \050 \fIf\fR F 100 \fIv\fR V \070 \fIth\fR TH \070 \fIdh\fR THV \070 \fIh\fR H \070 H1 (allophone of H) \070 .H1 (allophone of H) \048 \fIsilence\fR PA0 \045 PA1 175 .PA1 \0\05 .PA2 (used to change amplitude and duration) \0\0\- \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .sp3 .ce Table 11.2 (continued) .bp .ta 0.8i +2.6i +\w'(AH1 .UH2) (O U)000'u .nr x0 0.8i+2.6i+\w'(AH1 .UH2) (O U)000'+\w'; i uh \- here' \l'\n(x0u\(ul' .sp vowel clusters EH I A1 AY ; e i \- hey UH OO O U ; uh u \- ho AE I (AH1 EH3) I ; aa i \- hi AE OO (AH1 .UH2) (O U) ; aa u \- how AW I (O UH) E ; o i \- hoi I UH E I ; i uh \- here EH UH (EH A1) EH ; e uh \- hair OO UH OO UH ; u uh \- poor Y U Y1 (IU U) .sp vowel transitions {F M B P} O (.O1 O) {L R} EH (EH3 EH) {B K T D R} UH (UH3 UH) {T D} A1 (EH3 A1) 
{T D} AW (AH1 AW) {W} I (I3 I) {G SH W K} OO (IU OO) AY {K G T D} (AY Y) E {M T} (E Y) I {M T} (I Y) E {L} (I3 UH) EH {R N S D T} (EH EH3) I {R T} (I I3) AE {S N} (AE EH) AE {K} (AE EH3) A1 {R} (A1 EH1) AH1 {R P K} (AH1 UH) AH1 {ZH} (AH1 EH3) .sp intervocalics {voiced} T {voiced} DT .sp consonant transitions L {EH} L1 H {U OO IU} H1 \l'\n(x0u\(ul' .sp3 .ce Table 11.5 Contextual rules for Votrax sound segments .bp \l'\n(x0u\(ul' .sp consonant clusters B {stop-consonant} (B PA0) P {stop-consonant} (P PA0) D {stop-consonant} (D PA0) T {stop-consonant} (T PA0) DT {stop-consonant} (T PA0) G {stop-consonant} (G PA0) K {stop-consonant} (K PA0) {D T} R (.X1 R) K R .K (.X1 R) {consonant} R .R {consonant} L L1 K W .K .W D ZH D J T SH T CH .sp initial effects {.syll} P {vowel} (P .PH) {.syll} K {vowel} (K .H1) {.syll} T {vowel} (T .S) {.syll} L L1 {.syll} H {U OO O AW AH1} H1 .sp terminal effects E {PA0} (E Y) \l'\n(x0u\(ul' .ta 0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i +0.8i .sp3 .ce Table 11.5 (continued) .ev