
AUTOMATIC DETECTION OF PHONE-LEVEL MISPRONUNCIATION

FOR LANGUAGE LEARNING

Horacio Franco, Leonardo Neumeyer, María Ramos, and Harry Bratt

SRI International

Speech Technology and Research Laboratory

Menlo Park, CA, 94025, USA

ABSTRACT

We are interested in automatically detecting specific phone segments that have been mispronounced by a nonnative student of a foreign language. The phone-level information allows a language instruction system to provide the student with feedback about specific pronunciation mistakes. Two approaches were evaluated. In the first approach, log-posterior probability-based scores [1] are computed for each phone segment; these probabilities are based on acoustic models of native speech. The second approach uses a phonetically labeled nonnative speech database to train two different acoustic models for each phone: one model is trained with the acceptable, or correct, native-like pronunciations, while the other is trained with the incorrect, strongly nonnative pronunciations. For each phone segment, a log-likelihood ratio score is computed using the incorrect and correct pronunciation models. Either type of score is compared with a phone-dependent threshold to detect a mispronunciation. The performance of both approaches was evaluated on a phonetically transcribed database of 130,000 phones uttered in continuous speech sentences by 206 nonnative speakers.

1. INTRODUCTION

Computer-based language instruction systems can potentially offer advantages over traditional methods, especially in areas such as pronunciation training, which often requires the teacher's full attention to a single student. If the computer could provide the type of feedback that a pronunciation teacher provides, it would be a much cheaper alternative, accessible at any time and in any place, and certainly tireless.

Recent advances in research on automatic pronunciation scoring [1],[2] allow us to obtain pronunciation quality ratings for sentences or groups of sentences, with arbitrary text, with grading consistency similar to that of an expert teacher. While pronunciation scoring is essential in systems designed for automatic evaluation, a score or number represents only part of the desired feedback for language instruction. In the classroom, a human teacher can point to specific problems in producing the new sounds, and can give specific directions to lessen the most salient pronunciation problems. Our current efforts focus on the introduction of detailed feedback on specific pronunciation problems to help correct or improve pronunciation.

Native and nonnative pronunciations differ in many dimensions. For example, at the phone-segment level, there are differences in phonetic features, which lie on a continuum of possible values between L1 and L2 [3]. There are also prosodic elements, such as stress, duration, timing, pauses, and intonation, which are crucial to native-like pronunciation [4], although in this work we focus on segmental pronunciation aspects.

To provide useful feedback at the phone-segment level we need to reliably detect whether a phone is native-like or nonnative, and, ideally, to evaluate how close it is to the native phone production along its different phonetic features. Recently, the use of posterior scores was extended to evaluate the pronunciation quality of specific phone segments [5] as well as to detect phone mispronunciations [6],[7]. An alternative approach [8] used hidden Markov models (HMMs) with two alternative pronunciations per phone, one trained on native speech and the other on strongly nonnative speech. Mispronunciations were detected from the phone backtrace when the nonnative phone alternative was chosen.

The recent availability of a large nonnative speech database [9] with detailed phone-level transcriptions allowed us to accurately extend and evaluate our phone mispronunciation detection strategies. We investigated two different methods for the detection of phone-level mispronunciations (rather than scoring the pronunciation quality of phone segments). In the first method, posterior scores [1] are computed for each phone segment; these probabilities are based on acoustic models of native speech. The second method uses the phonetically transcribed nonnative speech database to train two different acoustic models for each phone: one model is trained with the acceptable, or correct, native-like pronunciations, while the other is trained with the incorrect, strongly nonnative pronunciations. For each phone segment, a log-likelihood ratio score is computed using the incorrect and correct pronunciation models. Either type of score is compared with a phone-dependent threshold to detect a mispronunciation. Both methods were evaluated over a phonetically transcribed database of 130,000 phones uttered by 206 nonnative speakers of Latin American Spanish.

2. DATABASE DESCRIPTION

The collection of phone-level pronunciation data is one of the most challenging tasks necessary to build and evaluate a system that can give detailed feedback on specific phone-level pronunciation problems. For this study we collected a Latin American Spanish speech database [9] that included recordings from native and nonnative speakers. A group of expert phoneticians transcribed part of the nonnative speech data to be used for development of the mispronunciation detection algorithms. This effort involved obtaining detailed phone-level information for approximately 130,000 phones uttered in continuous speech sentences. These sentences were produced by 206 nonnative speakers whose native language was American English. Their levels of proficiency were varied, and an attempt was made to balance the number of speakers by level of proficiency as well as by gender. For this study, the detailed phone-level transcriptions were collapsed into two categories: native-like and nonnative pronunciations. In this way we conveyed the judgment of the nativeness of each phone occurrence.

Four native Spanish-speaking phoneticians provided the detailed phonetic transcriptions for 2550 sentences, totaling 130,000 phones, that were randomly divided among transcribers. An additional 160 sentences, the common pool, were transcribed by all four phoneticians to assess the consistency with which humans could achieve this task. In [9] it was concluded that not all the phone classes could be transcribed consistently. The most reliable to transcribe were the approximants /β/, /δ/, and /γ/; surprisingly, some of the phones that were expected to be good predictors of nonnativeness, such as voiceless stops, most vowels, and /l/ and /r/, did not have good consistency across all the transcribers.

3. DETECTION OF MISPRONUNCIATION

The mispronunciation detection is planned to be integrated into our existing pronunciation scoring paradigm [2],[1], which uses an HMM speech recognizer to generate phonetic segmentations of the student's speech. The recognizer models can be trained using a database of native speakers or a combination of native and nonnative speakers. From the alignments, different pronunciation scores can be generated for each phonetic segment, using different types of models. The scores of the different phonetic segments are combined and calibrated to obtain the closest match to the judgment of an expert human rater. We typically assume that the interaction between the student and the computer has been designed to be error-free. In this case the phonetic segmentation can be obtained by computing a forced alignment using the known prompted text and the pronunciation dictionary.

The goal of the phone-level mispronunciation detection is to add a judgment of nativeness for each phonetic segment defined in the forced alignment.
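To make this first step concrete, below is a minimal sketch of forced alignment by dynamic programming, assuming per-frame log-likelihoods have already been computed for each phone model. It is illustrative only: the function name, array layout, and the single-state-per-phone simplification are ours, whereas the HMM recognizer used in this work has its own model topology and transition probabilities.

```python
import numpy as np

def forced_align(loglik, phone_seq):
    """Toy Viterbi forced alignment.

    loglik: (T, P) array of per-frame log-likelihoods for each phone model.
    phone_seq: list of phone indices (the known canonical transcription).
    Returns one (start_frame, end_frame) segment per phone; each phone
    occupies at least one frame.
    """
    T, _ = loglik.shape
    N = len(phone_seq)
    NEG = -np.inf
    # score[t, n]: best log score with frame t assigned to phone n,
    # all earlier phones already covered by earlier frames.
    score = np.full((T, N), NEG)
    back = np.zeros((T, N), dtype=int)  # 0 = stay in phone n, 1 = entered from n-1
    score[0, 0] = loglik[0, phone_seq[0]]
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            enter = score[t - 1, n - 1] if n > 0 else NEG
            if enter > stay:
                score[t, n], back[t, n] = enter, 1
            else:
                score[t, n], back[t, n] = stay, 0
            score[t, n] += loglik[t, phone_seq[n]]
    # Backtrace from the last phone at the last frame.
    segs, n, end = [], N - 1, T - 1
    for t in range(T - 1, 0, -1):
        if back[t, n]:
            segs.append((t, end))
            end, n = t - 1, n - 1
    segs.append((0, end))
    return segs[::-1]
```

With T frames and N phones the alignment runs in O(TN); the resulting segment boundaries define the phone segments that are scored in Section 3.3.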

3.1 Definition of the Mispronunciation Labels

To evaluate, as well as to train, the models used in the mispronunciation detection algorithms, we need to define for each phone segment whether or not it was pronounced in a native-like manner. To this end, we define the canonical transcription as the sequence of "expected" phones; this phone string is obtained from the recognizer forced alignment.

To assess what was actually uttered, we associated the canonical transcription with the transcriber's phone string by applying a dynamic programming (DP) alignment of the two strings. The distance between phone labels was based on the actual acoustic distance between phone classes. This type of distance allowed us to disambiguate phone insertions and deletions in the mapping of the strings. Then, each phone in the canonical transcription was assigned a label "correct" or "mispronounced", depending on whether or not the transcriptions from the phoneticians were the same as the canonical transcription. Phone segments labeled as "correct" correspond to a native-like phone. Phone segments labeled as "mispronounced" may correspond to a nonnative version of the same phone, a different phone, or a deletion of the phone. By our definition, phone insertions induce a "mispronounced" label for the canonical phone to which they are mapped.
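A sketch of this labeling procedure is shown below. It assumes the acoustic distances between phone classes are available as a lookup table; the paper does not give its exact costs, so `dist`, the insertion/deletion penalty, and the tie-breaking order here are illustrative.

```python
import numpy as np

def label_canonical(canon, heard, dist, ins_del=1.0):
    """Label each canonical phone "correct" or "mispronounced" by DP
    alignment against the transcriber's phone string."""
    def sub_cost(a, b):
        # Acoustic distance between phone classes; 0 for identical labels.
        return 0.0 if a == b else dist.get((a, b), 1.0)

    n, m = len(canon), len(heard)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * ins_del   # deletions of canonical phones
    D[0, :] = np.arange(m + 1) * ins_del   # insertions by the speaker
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j - 1] + sub_cost(canon[i - 1], heard[j - 1]),
                          D[i - 1, j] + ins_del,
                          D[i, j - 1] + ins_del)
    labels = ["correct"] * n
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i, j] == D[i - 1, j - 1] + sub_cost(canon[i - 1], heard[j - 1]):
            if canon[i - 1] != heard[j - 1]:
                labels[i - 1] = "mispronounced"    # substituted phone
            i, j = i - 1, j - 1
        elif i > 0 and D[i, j] == D[i - 1, j] + ins_del:
            labels[i - 1] = "mispronounced"        # deleted phone
            i -= 1
        else:                                      # insertion by the speaker:
            labels[max(i - 1, 0)] = "mispronounced"  # mark the canonical phone it maps to
            j -= 1
    return labels
```

For example, `label_canonical(["b", "a", "t"], ["b", "e", "t"], dist={})` marks the second canonical phone as mispronounced and the other two as correct.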

Informal analysis has shown that the recognizer forced alignments are robust to the variability of the nonnative pronunciations. On the other hand, phone insertions and deletions may induce some minor alignment errors. This problem could be alleviated by adding more alternative pronunciations to the pronunciation dictionary in order to model the most common deletions and insertions.

3.2 Human Detection of Mispronunciations

To assess an overall measure of consistency across transcribers, we aligned the transcription from each phonetician with the canonical transcription as described above. From these alignments we derived the sequence of "correct" and "mispronounced" labels for the phones of each sentence. We then compared the sequences of labels between every pair of raters by counting the percent of the time that they disagree. The average across all the pairs of raters is an estimate of the mean disagreement between human raters. The resulting value of 19.8% can be considered a lower bound on the average detection error that an automatic detection system may achieve.
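As a sketch, this disagreement statistic can be computed as below from the per-phone label sequences of each rater over the common pool (the function and variable names are illustrative):

```python
from itertools import combinations

def mean_pairwise_disagreement(rater_labels):
    """rater_labels: one equal-length list of "correct"/"mispronounced"
    labels per rater, aligned to the same canonical phones.
    Returns the mean percent disagreement over all rater pairs."""
    rates = [100.0 * sum(x != y for x, y in zip(a, b)) / len(a)
             for a, b in combinations(rater_labels, 2)]
    return sum(rates) / len(rates)
```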

3.3 Mispronunciation Detection Methods

Two approaches were evaluated and compared. Both assume that a phonetic segmentation of the utterance has been obtained in a first step by using the Viterbi algorithm and the known transcription.

In the first approach, previously developed log-posterior probability-based scores [1] are computed for each phone segment with canonic label. For each frame, the posterior probability $P(q_i \mid y_t)$ of the phone $q_i$ given the observation vector $y_t$ is

$$ P(q_i \mid y_t) = \frac{p(y_t \mid q_i)\,P(q_i)}{\sum_{j=1}^{M} p(y_t \mid q_j)\,P(q_j)} \qquad (1) $$

The sum over $j$ runs over a set of context-independent models for all phone classes, and $P(q_i)$ is the prior probability of the phone class. The posterior score $\rho(q_i)$ for the phone segment $q_i$ is defined as

$$ \rho(q_i) = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \log P(q_i \mid y_t) \qquad (2) $$

where $d$ is the frame duration of the phone and $t_0$ is the starting frame index of the phone segment. The class-conditional phone distributions $p(y_t \mid q_i)$ used to compute the posterior probabilities are Gaussian mixture models that have been trained with a large database of native speech. For a mispronunciation to be detected, the phone posterior score $\rho(q_i)$ must be below a threshold predetermined for each phone class.

The second approach uses the phonetically labeled nonnative database to train two different Gaussian mixture models for each phone class: one model is trained with the "correct", native-like pronunciations of a phone, while the other model is trained with the "mispronounced" or nonnative pronunciations of the same phone. A four-way jackknifing procedure was used to train and evaluate this approach on the same phonetically transcribed nonnative database. There were no common speakers across the evaluation and training sets. The number of Gaussians per model was proportional to the amount of training data for each model, ranging from 200 to 1.
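The first approach's score computation, equations (1) and (2), can be sketched as follows. We assume the native acoustic models are available as per-phone Gaussian mixture models exposing an sklearn-style score_samples method that returns per-frame log-likelihoods (the actual system uses its own HMM-based models), and names such as native_gmms are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def phone_posterior_score(frames, i, native_gmms, priors):
    """Log-posterior score rho(q_i) of equations (1)-(2) for one segment.

    frames: (d, D) array of observation vectors y_t for the segment.
    i: index of the canonical phone q_i.
    native_gmms: per-phone models trained on native speech.
    priors: array of prior probabilities P(q_j)."""
    # log p(y_t | q_j) + log P(q_j) for every class j and frame t: shape (d, M).
    log_joint = np.stack([g.score_samples(frames) + np.log(p)
                          for g, p in zip(native_gmms, priors)], axis=1)
    # Equation (1) in the log domain: normalize over all phone classes.
    log_post = log_joint[:, i] - logsumexp(log_joint, axis=1)
    # Equation (2): average the frame log-posteriors over the d frames.
    return float(log_post.mean())

def posterior_detect(frames, i, native_gmms, priors, threshold):
    """Flag a mispronunciation when the score falls below the
    threshold predetermined for this phone class."""
    return phone_posterior_score(frames, i, native_gmms, priors) < threshold
```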

In the evaluation phase, for each phone segment $q_i$, a length-normalized log-likelihood ratio score $\mathrm{LLR}(q_i)$ is computed by using the "mispronounced" and "correct" pronunciation models $\lambda_M$ and $\lambda_C$, respectively:

$$ \mathrm{LLR}(q_i) = \frac{1}{d} \sum_{t=t_0}^{t_0+d-1} \left[ \log p(y_t \mid q_i, \lambda_M) - \log p(y_t \mid q_i, \lambda_C) \right] \qquad (3) $$

The normalization by $d$ allows definition of unique thresholds for the LLR for each phone class, independent of the lengths of the segments. A mispronunciation is detected when the LLR is above a predetermined threshold, specific to each phone. For both detection approaches, and for each phone class, different types of performance measures were computed for a wide range of thresholds, receiver operating characteristic (ROC) curves were obtained, and optimal thresholds were determined for the points of equal error rate (EER).
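Equation (3) and its threshold test reduce to a few lines given the two per-phone models; this sketch again assumes sklearn-style GMMs standing in for $\lambda_M$ and $\lambda_C$:

```python
def llr_score(frames, gmm_mis, gmm_cor):
    """Length-normalized LLR of equation (3) for one phone segment.
    gmm_mis, gmm_cor: the "mispronounced" (lambda_M) and "correct"
    (lambda_C) models for this phone class."""
    # Mean over frames of log p(y_t|q_i, lambda_M) - log p(y_t|q_i, lambda_C).
    return float((gmm_mis.score_samples(frames)
                  - gmm_cor.score_samples(frames)).mean())

def llr_detect(frames, gmm_mis, gmm_cor, threshold):
    """A mispronunciation is detected when the LLR exceeds the
    phone-specific threshold."""
    return llr_score(frames, gmm_mis, gmm_cor) > threshold
```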

4. EXPERIMENTS

The acoustic models used to generate the phonetic alignments and produce the posterior scores were gender-independent Genonic Gaussian mixture models, as introduced in [10]. They were trained using a gender-balanced database of 142 native Latin American Spanish speakers, totaling about 32,000 sentences. Given the alignments, the detection of mispronunciation is reduced to a binary decision problem, as the phone class is given by the alignments. Consequently, the mispronunciation detection performance can be studied for each phone class independently. The reasons to evaluate the performance for each phone class are that (1) the distributions of machine scores corresponding to different phone classes have different statistics, so independent thresholds must be used for each phone class, (2) the percent of "correct" and "mispronounced" reference labels is different for each phone class, and (3) the complexity of the acoustic model may be different for each phone class.

The performance of the mispronunciation detection algorithms was evaluated as a function of the threshold, for each phone class. For each threshold we obtained the machine-produced labels "correct" (C) or "mispronounced" (M) for each phone utterance. Then, we compared the machine labels with the labels obtained from the human phoneticians. Two types of performance measures were computed for each threshold: error measures and correlation measures.

The error measures were the total error, which is the percent of cases where the machine label and the human label differ; the probability of false detection, estimated as the percent of cases where a phone utterance is labeled by the machine as incorrect when it was in fact correct; and the probability of missing a target, that is, the probability that the machine labeled a phone utterance as correct when it was in fact incorrect.

To compute the correlation measures we first converted the C and M labels to numeric values 0 and 1, respectively. Then the 0-1 strings from machine and human judgments for each phone were correlated using the standard correlation coefficient, as well as the cross correlation measure (or cosine distance) used in [7].
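Under those 0/1 encodings, the error and correlation measures of this section can be sketched as below (our own helper, not code from the paper):

```python
import numpy as np

def detection_measures(machine, human):
    """machine, human: 0/1 arrays per phone utterance
    (0 = "correct" (C), 1 = "mispronounced" (M))."""
    machine, human = np.asarray(machine), np.asarray(human)
    total_error = np.mean(machine != human)          # labels differ
    p_false = np.mean(machine[human == 0] == 1)      # flagged, but was correct
    p_miss = np.mean(machine[human == 1] == 0)       # passed, but was incorrect
    r = np.corrcoef(machine, human)[0, 1]            # standard correlation
    # Cross correlation (cosine) measure as used in [7].
    cosine = machine @ human / np.sqrt((machine @ machine) * (human @ human))
    return total_error, p_false, p_miss, r, cosine
```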

One important issue we found is that for many phone classes the number of phone utterances labeled "mispronounced" by the phoneticians was much smaller than the number labeled "correct". The probability of error for those phone classes was then very biased by the priors. This bias, combined with the fact that the distributions of machine scores for "correct" and for "mispronounced" phones had significant overlap for some phone classes, meant that in some cases the probability of error could be relatively low when we would just be classifying every phone utterance as correct, regardless of its machine score. For that reason the minimum error point was not a good indicator of how well we can actually detect a mispronunciation. In addition, in comparing detection performance across phone classes, the measure should not be affected by the priors of the labels. Consequently, we evaluated the mispronunciation detection performance by computing the ROC curve and finding the points of EER, where the probability of false detection is equal to the probability of missing a target. This error measure is independent of the actual priors for the C or M labels, but results in a higher total error rate when the priors are skewed.
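A sketch of the EER search follows; it sweeps the observed scores as candidate thresholds and picks the point where false detections and misses balance (a production ROC implementation would interpolate between operating points):

```python
import numpy as np

def eer_threshold(scores, human):
    """scores: per-utterance machine scores for one phone class, oriented
    so that higher means more likely mispronounced (as with the LLR);
    human: 0/1 reference labels. Returns (threshold, EER)."""
    scores, human = np.asarray(scores), np.asarray(human)
    best = (None, np.inf, None)                 # (threshold, |gap|, EER)
    for t in np.unique(scores):
        pred = scores > t
        p_false = np.mean(pred[human == 0])     # false detections
        p_miss = np.mean(~pred[human == 1])     # missed targets
        gap = abs(p_false - p_miss)
        if gap < best[1]:
            best = (float(t), gap, (p_false + p_miss) / 2.0)
    threshold, _, eer = best
    return threshold, eer
```

Because it balances the two conditional error rates, the EER found this way is unaffected by how skewed the C/M priors are, which is exactly the property argued for above.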

To some degree, a similar but complementary effect was observed for the cross correlation measure; that is, for the phone classes with relatively high detection error rates, relatively high cross correlation values could be obtained by just labeling every phone utterance as mispronounced.

In Table 1 we show the EER, the correlation coefficient, and the cross correlation measure for each phone class and for both detection methods. Weighted averages over all the phones are also shown. The phones whose nativeness or mispronunciation were detected most reliably were the approximants /β/, /δ/, and /γ/, the voiced stops /b/ and /d/, the fricative /x/, and the semivowel /w/. These phone classes agree well with those found to be the most consistent across different transcribers [9]. The LLR method performed better than the posterior-based method for almost all phone classes. The lower performance for the nasals /m/ and /?/ could be explained by the very few training examples available for the mispronounced versions of those phones. The reduction in error rate was not uniform across phone classes. The advantage of the LLR method over the posterior method is more significant if we look only at the phone classes with the lowest detection error. On average, the EER had a relative reduction of 33% for the seven most reliably detected phone classes referred to above, when going from posterior-based to LLR-based detection. Acceptable levels of the correlation coefficients were also found for that set of phones. The overall weighted average of the phone mispronunciation detection EER was 35.5% when log-posterior scores were used, while a 32.3% EER was obtained with the LLR method. If, instead of the EER, we take the minimum total detection error for each phone class and compute the average error weighted by the number of examples in each class, the resulting minimum average error is 21.3% for the posterior-based method and 19.4% for the LLR-based method. This minimum average error can be compared with the transcribers' percent of pairwise disagreement reported in Section 3.2, as both take into account the actual priors of the evaluation data. The closeness of the human-machine and the human-human average errors suggests that the accuracy of the LLR-based detection method is bounded by the consistency of the human transcriptions.

5. DISCUSSION AND CONCLUSIONS

We studied two mispronunciation detection algorithms. One algorithm is based on posterior probability scores computed using models of native speech, and the other is based on models trained on actual nonnative speech, including both "correct" and "mispronounced" phone utterances.

An important advantage of the posterior-based method is that the native models can be applied to detect mispronunciation errors for any type of nonnative accent. The LLR-based method needs to be trained with specific examples of the target nonnative user population. Experimental results show that the LLR-based system has better overall performance than the posterior-based method. The improvement is particularly significant for the phone classes with the highest consistency across transcribers.

The results also suggest that the reported performance of the system might have been limited by the accuracy and consistency of the transcriptions. This is suggested by (1) the agreement between the phone classes most consistently transcribed by humans and those best recognized by the machine, (2) the similar level of average error rate between pairs of humans on one hand and between machine and humans on the other hand, and (3) the fact that the level of improvement, when using the LLR method, is more significant on the most consistent phone classes.

[Table 1: EER, correlation coefficient, and cross correlation at the phone level for the two detection methods studied. Weighted averages of the various scoring measures are shown at the bottom.]

The results identified the set of phones for which mispronunciation can be detected reliably. These mostly coincide with the phones that the phoneticians were able to transcribe most consistently. The overall error rate of the best system was 19.4%, which was similar to an estimate of pairwise human disagreement on the same task.

6. ACKNOWLEDGMENTS

Special thanks to Mitchel Weintraub and Françoise Beaufays for valuable help with the model training software. We gratefully acknowledge support from the U.S. Government under the Technology Reinvestment Program (TRP). The views expressed in this material do not necessarily reflect those of the Government.

REFERENCES

[1] H. Franco, L. Neumeyer, and Y. Kim (1997), Automatic Pronunciation Scoring for Language Instruction, Proc. of ICASSP 97, pp. 1471-1474, Munich.

[2] L. Neumeyer, H. Franco, M. Weintraub, and P. Price (1996), Automatic Text-Independent Pronunciation Scoring of Foreign Language Student Speech, Proc. of ICSLP 96, pp. 1457-1460, Philadelphia, Pennsylvania.

[3] J. Flege (1980), Phonetic Approximation in Second Language Acquisition, Language Learning, Vol. 30, No. 1, pp. 117-134.

[4] M. Eskenazi (1996), Detection of Foreign Speakers' Pronunciation Errors for Second Language Training - Preliminary Results, Proc. of ICSLP 96, pp. 1465-1468.

[5] Y. Kim, H. Franco, and L. Neumeyer (1997), Automatic Pronunciation Scoring of Specific Phone Segments for Language Instruction, Proc. of EUROSPEECH 97, pp. 649-652, Rhodes.

[6] S. Witt and S. Young (1997), Language Learning Based on Non-Native Speech Recognition, Proc. of EUROSPEECH 97, pp. 633-636, Rhodes.

[7] S. Witt and S. Young (1998), Performance Measures for Phone-Level Pronunciation Teaching in CALL, Proc. of the Workshop on Speech Technology in Language Learning, pp. 99-102, Marholmen, Sweden.

[8] O. Ronen, L. Neumeyer, and H. Franco (1997), Automatic Detection of Mispronunciation for Language Instruction, Proc. of EUROSPEECH 97, pp. 645-648, Rhodes.

[9] H. Bratt, L. Neumeyer, E. Shriberg, and H. Franco (1998), Collection and Detailed Transcription of a Speech Database for Development of Language Learning Technologies, Proc. of ICSLP 98, pp. 1539-1542, Sydney, Australia.

[10] V. Digalakis and H. Murveit (1994), GENONES: Optimizing the Degree of Mixture Tying in a Large Vocabulary Hidden Markov Model Based Speech Recognizer, Proc. of ICASSP 94, pp. I537-I540.
