
Audio and Speech Processing for Data Mining


Zheng-Hua Tan

Aalborg University, Denmark


INTRODUCTION

The explosive increase in computing power, network bandwidth and storage capacity has greatly facilitated the production, transmission and storage of multimedia data. Compared with alphanumeric databases, non-text media such as audio, image and video are different in that they are unstructured by nature and, although they contain rich information, they are not nearly as expressive from the viewpoint of a contemporary computer.

As a consequence, an overwhelming amount of data is created and then left unstructured and inaccessible, boosting the desire for efficient content management of these data. This has become a driving force of multimedia research and development and has led to a new field termed multimedia data mining. While text mining is relatively mature, mining information from non-text media is still in its infancy, but it holds much promise for the future.

In general, data mining is the process of applying analytical approaches to large data sets to discover implicit, previously unknown, and potentially useful information. This process often involves three steps: data preprocessing, data mining and postprocessing (Tan, Steinbach, & Kumar, 2005). The first step transforms the raw data into a format more suitable for subsequent data mining, the second step conducts the actual mining, and the last step validates and interprets the mining results.

Data preprocessing is a broad area and is the part of data mining in which the essential techniques depend heavily on data types. Unlike textual data, which are typically based on a written language, image, video and some audio data are inherently non-linguistic.

Speech as a spoken language lies in between and often provides valuable information about the subjects, topics and concepts of multimedia content (Lee & Chen, 2005). The linguistic nature of speech makes information extraction from speech less complicated yet more precise and accurate than from image and video. This fact motivates content-based speech analysis for multimedia data mining and retrieval, where audio and speech processing is a key enabling technology (Ohtsuki, Bessho, Matsuo, Matsunaga, & Kayashi, 2006). Progress in this area can impact numerous business and government applications (Gilbert, Moore, & Zweig, 2005). Examples are discovering patterns and generating alarms for intelligence organizations as well as for call centers, analyzing customer preferences, and searching through vast audio warehouses.

BACKGROUND

With the enormous, ever-increasing amount of audio data (including speech), the challenge now and in the future is to explore new methods for accessing and mining these data. Due to the unstructured nature of audio, audio files must be annotated with structured metadata to facilitate data mining. Although manually labeled metadata assist to some extent in activities such as categorizing audio files, they are insufficient on their own for more sophisticated applications like data mining. Manual transcription is also expensive and in many cases outright impossible. Consequently, automatic metadata generation relying on advanced processing technologies is required so that more thorough annotation and transcription can be provided.

Technologies for this purpose include audio diarization and automatic speech recognition. Audio diarization aims at annotating audio data through segmentation, classification and clustering, while speech recognition is deployed to transcribe speech. In addition to these there is event detection, such as applause detection in sports recordings. After audio is transformed into various symbolic streams, data mining techniques can be applied to the streams to find patterns and associations, and information retrieval techniques can be applied for indexing, search and retrieval. The procedure is analogous to video data mining and retrieval (Zhu, Wu, Elmagarmid, Feng, & Wu, 2005; Oh, Lee, & Hwang, 2005).


Diarization is the necessary first stage in recognizing speech mingled with other audio and is an important field in its own right. State-of-the-art systems have achieved a speaker diarization error of less than 7% for broadcast news shows (Tranter & Reynolds, 2006).

A recent, notable research project on speech transcription is the Effective Affordable Reusable Speech-To-Text (EARS) program (Chen, Kingsbury, Mangu, Povey, Saon, Soltau, & Zweig, 2006). The EARS program focuses on automatically transcribing natural, unconstrained human-human speech from broadcasts and telephone conversations in multiple languages.

The primary goal is to generate rich and accurate transcription, both to enable computers to better detect, extract, summarize, and translate important information embedded in the speech and to enable humans to understand the speech content by reading transcripts instead of listening to audio signals. To date, accuracies for broadcast news and conversational telephone speech are approximately 90% and 85%, respectively.

For read or dictated speech, recognition accuracy is much higher and, depending on the configuration, can reach as high as 99% for large-vocabulary tasks.

Progress in audio classification and categorization is also encouraging. In a task of classifying 198 sounds into 16 classes, Lin, Chen, Truong, and Chang (2005) achieved an accuracy of 97%, and performance reached 100% when the top two matches were considered. The 16 sound classes are alto-trombone, animals, bells, cello-bowed, crowds, female, laughter, machines, male, oboe, percussion, telephone, tubular-bells, violin-bowed, violin-pizz and water.
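
As a hedged illustration of this kind of classification setup, the sketch below trains a support vector machine on pre-extracted feature vectors and reports Top-1 and Top-2 accuracy. The random feature matrix, its dimensionality and the scikit-learn configuration are assumptions for illustration only, not the wavelet features or exact setup used by Lin et al. (2005).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips, n_features, n_classes = 198, 24, 16

# Stand-in feature matrix and labels; a real system would extract wavelet or
# cepstral statistics from each audio clip instead.
X = rng.normal(size=(n_clips, n_features))
y = rng.integers(0, n_classes, size=n_clips)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)
top1 = np.mean(clf.classes_[proba.argmax(axis=1)] == y_te)
top2_idx = np.argsort(proba, axis=1)[:, -2:]          # two most probable classes
top2 = np.mean([label in clf.classes_[idx] for idx, label in zip(top2_idx, y_te)])
print(f"Top-1 accuracy: {top1:.2f}  Top-2 accuracy: {top2:.2f}")
```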

Technologies at this level are highly attractive for many speech data mining applications. The question we ask here is: what is speech data mining? In fact, there are areas close to or even overlapping with it, such as spoken document retrieval for search and retrieval (Hansen, Huang, Zhou, Seadle, Deller, Gurijala, Kurimo, & Angkititrakul, 2005). At this early stage of research, however, the community has shown no clear intention to separate them. The same happened with text data mining (Hearst, 1999).

In this chapter we define speech data mining as the nontrivial extraction of hidden and useful information from masses of speech data; the same applies to audio data mining. Interesting information includes trends, anomalies and associations, and the purpose is primarily decision making. An example is mining spoken dialogs to generate alerts.

MAIN FOCUS

In this section we discuss some key topics within or related to speech data mining. We cover audio diarization, robust speech recognition, speech data mining and spoken document retrieval. Spoken document retrieval is included because the subject is closely related to speech data mining, and the two draw on each other by sharing many common preprocessing techniques.

Audio Diarization

Audio diarization aims to automatically segment an audio recording into homogeneous regions. Diarization first segments and categorizes audio as speech and non-speech. Non-speech is a general category covering silence, music, background noise, channel conditions and so on. Speech segments are further annotated through speaker diarization, which is the current focus in audio diarization. Speaker diarization, also known as "Who Spoke When" or speaker segmentation and clustering, partitions a speech stream into uniform segments according to speaker identity.
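
The sketch below illustrates only the very first of these steps, speech/non-speech segmentation, using a crude short-time energy threshold. The frame sizes and threshold are assumed values, and real diarization systems rely on trained statistical classifiers over cepstral features rather than a fixed energy rule.

```python
import numpy as np

def speech_activity_mask(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Label each analysis frame True (speech) or False (non-speech)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame) // hop)
    peak = np.max(np.abs(signal)) + 1e-12
    mask = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        chunk = signal[i * hop: i * hop + frame]
        rms = np.sqrt(np.mean(chunk ** 2))
        # Frame energy relative to the recording peak, in dB
        mask[i] = 20 * np.log10(rms / peak + 1e-12) > threshold_db
    return mask

# Example: quiet noise followed by a louder tone standing in for speech.
sr = 16000
rng = np.random.default_rng(0)
audio = np.concatenate([0.001 * rng.standard_normal(sr),
                        0.2 * np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)])
mask = speech_activity_mask(audio, sr)
print(f"{mask.mean():.2f} of frames labelled as speech")   # roughly 0.5
```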

A typical diarization system comprises such components as speech activity detection, change detection, gender and bandwidth identification, speaker segmentation, speaker clustering, and iterative re-segmentation or boundary refinement (Tranter & Reynolds, 2006).

Two notable techniques applied in this area are the Gaussian mixture model (GMM) and the Bayesian information criterion (BIC), both of which are deployed throughout the diarization process. The performance of speaker diarization is often measured by the diarization error rate, which is the sum of the speaker error, missed speaker and false alarm speaker rates.
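
As a minimal sketch of the error measure just described, the code below computes a frame-level diarization error rate from reference and hypothesis speaker labels. It assumes the speaker names of the two streams are already mapped onto each other, whereas standard scoring tools perform an optimal speaker mapping and apply a collar around segment boundaries.

```python
import numpy as np

def diarization_error_rate(reference, hypothesis):
    """Frame-level speaker labels per stream; '' marks a non-speech frame."""
    reference = np.asarray(reference)
    hypothesis = np.asarray(hypothesis)
    ref_speech = reference != ""
    hyp_speech = hypothesis != ""
    missed = np.sum(ref_speech & ~hyp_speech)              # speech labelled non-speech
    false_alarm = np.sum(~ref_speech & hyp_speech)         # non-speech labelled speech
    speaker_error = np.sum(ref_speech & hyp_speech & (reference != hypothesis))
    return (missed + false_alarm + speaker_error) / max(np.sum(ref_speech), 1)

ref = ["A", "A", "A", "", "B", "B", "B", "B"]
hyp = ["A", "A", "B", "", "", "B", "B", "A"]
print(f"DER = {diarization_error_rate(ref, hyp):.2f}")   # 3 errors over 7 speech frames
```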

Diarization is an important step for further processing such as audio classification (Lu, Zhang, & Li, 2003), audio clustering (Sundaram & Narayanan, 2007), and speech recognition.

Robust Speech Recognition

Speech recognition is the process of converting a speech signal into a word sequence. Modern speech recognition systems are firmly based on the principles of statistical pattern recognition, in particular the use of hidden Markov models (HMMs). The objective is to find the most likely sequence of words Ŵ, given the observation data Y, which are feature vectors extracted from an utterance. This is achieved through the following Bayesian decision rule:

Ŵ = argmax_W P(W) P(Y|W)

where P(W) is the a priori probability of observing some specified word sequence W and is given by a language model, for example tri-grams, and P(Y|W) is the probability of observing speech data Y given word sequence W and is determined by an acoustic model, typically HMMs.
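
The toy example below applies this decision rule to an N-best list rather than a full HMM search: each candidate word sequence is scored by the sum of its acoustic and language-model log-probabilities and the best one is returned. The hypotheses and scores are invented for illustration; in a real recognizer they come from the acoustic and language models.

```python
import math

# (word sequence, acoustic log-likelihood log P(Y|W), language-model log-prob log P(W))
# All scores below are made-up numbers for illustration.
nbest = [
    ("recognise speech",   -120.3, math.log(4.0e-6)),
    ("wreck a nice beach", -118.9, math.log(1.0e-8)),
    ("recognise peach",    -121.0, math.log(2.0e-7)),
]

def decode(nbest, lm_weight=1.0):
    """Pick the hypothesis maximising log P(Y|W) + lm_weight * log P(W)."""
    # A language-model scaling factor is common in practice; 1.0 is the plain rule.
    return max(nbest, key=lambda h: h[1] + lm_weight * h[2])[0]

print("Decoded:", decode(nbest))   # "recognise speech"
```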

HMMs are trained on a collection of acoustic data to characterize the distributions of selected speech units. The distributions estimated on training data, however, may not represent those in test data.

Variations such as background noise will introduce mismatches between training and test conditions, leading to severe performance degradation (Gong, 1995).

Robustness strategies are therefore needed to reduce these mismatches. This is a significant challenge posed by varying recording conditions, speaker variations and dialect divergences. The challenge is even greater in the context of speech data mining, where speech is often recorded under less control and exhibits more unpredictable variation. Here we put an emphasis on robustness against noise.

Noise robustness can be improved through feature-based compensation, model-based compensation, or a combination of the two. Feature compensation is achieved through three means: feature enhancement, distribution normalization and noise-robust feature extraction. Feature enhancement attempts to clean noise-corrupted features, as in spectral subtraction. Distribution normalization reduces the distribution mismatches between training and test speech; cepstral mean subtraction and variance normalization are good examples. Noise-robust feature extraction includes improved mel-frequency cepstral coefficients and completely new features. Two classes of model-domain methods are model adaptation and multi-condition training (Xu, Tan, Dalsgaard, & Lindberg, 2006).
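
As a small example of the distribution normalization just mentioned, the sketch below implements per-utterance cepstral mean and variance normalization (CMVN): each feature dimension is shifted to zero mean and scaled to unit variance, which removes stationary channel effects. The random matrix merely stands in for real MFCC features.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Normalise each cepstral dimension of one utterance to zero mean, unit variance."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

# Random stand-in for a (300 frames x 13 coefficients) MFCC matrix with a channel offset.
mfcc = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=(300, 13))
normalised = cmvn(mfcc)
print(normalised.mean(axis=0).round(3)[:3])   # ~[0. 0. 0.]
print(normalised.std(axis=0).round(3)[:3])    # ~[1. 1. 1.]
```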

Speech enhancement unavoidably introduces uncertainties, and these uncertainties can be exploited in the HMM decoding process to improve its performance.

Uncertainty decoding is one such approach, in which the uncertainty of features introduced by the background noise is incorporated into the decoding process through a modified Bayesian decision rule (Liao & Gales, 2005).

This is an elegant compromise between feature-based and model-based compensation and is an interesting addition to the category of joint feature- and model-domain compensation, which contains well-known techniques such as missing data and weighted Viterbi decoding.

Another recent research focus is robustness against transmission errors and packet losses for speech recognition over communication networks (Tan, Dalsgaard, & Lindberg, 2007). This becomes important as more and more speech traffic passes through networks.

Speech Data Mining

Speech data mining relies on audio diarization, speech recognition and event detection to generate data descriptions, and then applies machine learning techniques to find patterns, trends, and associations.

The simplest way is to use text mining tools on speech transcriptions. Different from written text, however, the textual transcription of speech is inevitably erroneous and lacks formatting such as punctuation marks. Speech, in particular spontaneous speech, furthermore contains hesitations, repairs, repetitions, and partial words. On the other hand, speech is an information-rich medium carrying such information as language, text, meaning, speaker identity and emotion. This characteristic gives speech a high potential for data mining, and techniques for extracting the various types of information embedded in speech have undergone substantial development in recent years.

Data mining can be applied to various aspects of speech. As an example, large-scale spoken dialog systems receive millions of calls every year and generate terabytes of log and audio data. Dialog mining has been successfully applied to these data to generate alerts (Douglas, Agarwal, Alonso, Bell, Gilbert, Swayne, & Volinsky, 2005). This is done by labeling calls based on subsequent outcomes, extracting features from the dialog and speech, and then finding patterns. Other interesting work includes semantic data mining of speech utterances and data mining for recognition error detection.
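
A hedged sketch of this dialog-mining workflow is given below: historical calls are labeled by outcome, a few numeric features are derived from each dialog, and a classifier flags new calls for alerting. The features, data and choice of logistic regression are illustrative assumptions, not the actual method of Douglas et al. (2005).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-call features: [number of dialog turns, fraction of silence, repeated-query count].
# Labels: 1 = call ended in a bad outcome worth alerting on. All data are made up.
X = np.array([
    [12, 0.10, 0], [35, 0.30, 4], [ 8, 0.05, 0],
    [40, 0.45, 6], [15, 0.12, 1], [33, 0.38, 5],
], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

new_call = np.array([[30.0, 0.35, 3.0]])
if model.predict(new_call)[0] == 1:
    print("ALERT: call matches the pattern of problematic dialogs")
```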

Whether speech summarization is also considered under this umbrella is a matter of debate, but it is nevertheless worthwhile to mention it here. Speech summarization is the generation of short text summaries of speech (Koumpis & Renals, 2005). An intuitive approach is to apply text-based methods to the speech transcription, while more sophisticated approaches combine prosodic, acoustic and language information with the textual transcription.

Spoken Document Retrieval

Spoken document retrieval is turned into a text information retrieval task by using a large-vocabulary continuous speech recognition (LVCSR) system to generate a textual transcription. This approach has shown good performance for high-quality, closed-domain corpora, for example broadcast news, where a moderate word error rate can be achieved. When the word error rate is below one quarter, spoken document retrieval systems can achieve retrieval accuracy similar to that obtained with human reference transcriptions, because query terms are often long words that are easy to recognize and are often repeated several times in the spoken documents. However, the LVCSR approach has two inherent deficiencies: vocabulary limitations and recognition errors. First, this type of system can only process words within the predefined vocabulary of the recognizer. If any out-of-vocabulary words appear in the spoken documents or in the queries, the system cannot deal with them.

Secondly, speech recognition is error-prone, with error rates typically ranging from five percent for clean, closed-domain speech to as much as 50 percent for spontaneous, noisy speech, and any errors made in the recognition phase cannot be recovered later on, as speech recognition is an irreversible process. To address the problem of high word error rates, recognition lattices can be used as an alternative basis for indexing and search (Chelba, Silva, & Acero, 2007). The key issue in this field is indexing or, more generally, the combination of automatic speech recognition and information retrieval technologies for optimum overall performance.
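
The toy sketch below shows the text-retrieval end of this pipeline: recognizer output (hand-written stand-in transcripts here) is placed in an inverted index and queried with a simple boolean search. Lattice-based indexing, which retains alternative hypotheses to soften recognition errors, is not shown.

```python
from collections import defaultdict

# Stand-in transcripts; in practice these would be LVCSR output for each spoken document.
transcripts = {
    "doc1": "the prime minister spoke about the new budget",
    "doc2": "weather forecast heavy rain expected over the weekend",
    "doc3": "the budget debate continued in parliament today",
}

index = defaultdict(set)                     # term -> set of document ids
for doc_id, text in transcripts.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return documents containing every query term (simple boolean AND)."""
    terms = query.split()
    hits = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*hits)) if hits else []

print(search("budget"))             # ['doc1', 'doc3']
print(search("budget parliament"))  # ['doc3']
```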

Search engines have recently enjoyed great success in searching and retrieving text documents. Nevertheless, the World Wide Web remains largely silent, with only primitive audio search that mostly relies on surrounding text and editorial metadata. Content-based search for audio material, more specifically spoken archives, is still an uncommon practice, and the gap between audio search and text search performance is significant.

Spoken document retrieval is identified as one of the key elements of next-generation search engines.

FUTURE TRENDS

The success of speech data mining depends highly on progress in audio and speech processing. While the techniques have already shown good potential for data mining applications, further advances are called for.

To be applied to gigantic amounts of data collected under diverse conditions, faster and more robust speech recognition systems are required.

At present the hidden Markov model is the dominant approach to automatic speech recognition. A recent trend worth noting is the revisiting of template-based speech recognition, an aged, almost obsolete paradigm. It has now been placed in a different perspective and framework and is gathering momentum. In addition to this are knowledge-based approaches and approaches oriented towards cognitive science.

In connection with diarization, it is foreseen that studies will widen in scope to cover larger and broader databases and to gather more diverse information, including emotion and speaker identities and characteristics.

Data mining is often used for surveillance applications, where real-time or near real-time operation is mandatory, while at present many systems and approaches work in batch mode or far from real time.

Another topic under active research is music data mining for detecting melodic or harmonic patterns in music data. The main focuses are feature extraction, similarity measures, categorization and clustering, pattern extraction and knowledge representation.

Multimedia data in general contain several different types of media. Information fusion from multiple media has been less investigated and should be given more attention in the future.

CONCLUSION

This chapter reviews audio and speech processing technologies for data mining, with an emphasis on speech data mining. The status and the challenges of various related techniques are discussed. The underpinning techniques for mining audio and speech are audio diarization, robust speech recognition, and audio classification and categorization. These techniques have reached a level where highly attractive data mining applications can be deployed for the purposes of prediction and knowledge discovery. The field of audio and speech data mining is still in its infancy, but it has received attention in security, commercial and academic fields.

REFERENCES

Chelba, C., Silva, J., & Acero, A. (2007). Soft indexing of speech content for search in spoken documents. Computer Speech and Language, 21(3), 458-478.

Chen, S.F., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Soltau, H., & Zweig, G. (2006). Advances in speech transcription at IBM under the DARPA EARS program. IEEE Transactions on Audio, Speech and Language Processing, 14(5), 1596-1608.

Douglas, S., Agarwal, D., Alonso, T., Bell, R., Gilbert, M., Swayne, D.F., & Volinsky, C. (2005). Mining customer care dialogs for "daily news". IEEE Transactions on Speech and Audio Processing, 13(5), 652-660.

Gilbert, M., Moore, R.K., & Zweig, G. (2005). Editorial – introduction to the special issue on data mining of speech, audio, and dialog. IEEE Transactions on Speech and Audio Processing, 13(5), 633-634.

Gong, Y. (1995). Speech recognition in noisy environments: a survey. Speech Communication, 16(3), 261-291.

Hansen, J.H.L., Huang, R., Zhou, B., Seadle, M., Deller, J.R., Gurijala, A.R., Kurimo, M., & Angkititrakul, P. (2005). SpeechFind: advances in spoken document retrieval for a national gallery of the spoken word. IEEE Transactions on Speech and Audio Processing, 13(5), 712-730.

Hearst, M.A. (1999). Untangling text data mining. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 3-10.

Koumpis, K., & Renals, S. (2005). Automatic summarization of voicemail messages using lexical and prosodic features. ACM Transactions on Speech and Language Processing, 2(1), 1-24.

Lee, L.-S., & Chen, B. (2005). Spoken document understanding and organization. IEEE Signal Processing Magazine, 22(5), 42-60.

Liao, H., & Gales, M.J.F. (2005). Joint uncertainty decoding for noise robust speech recognition. In Proceedings of INTERSPEECH 2005, 3129-3132.

Lin, C.-C., Chen, S.-H., Truong, T.-K., & Chang, Y. (2005). Audio classification and categorization based on wavelets and support vector machine. IEEE Transactions on Speech and Audio Processing, 13(5), 644-651.

Lu, L., Zhang, H.-J., & Li, S. (2003). Content-based audio classification and segmentation by using support vector machines. ACM Multimedia Systems Journal, 8(6), 482-492.

Oh, J.H., Lee, J., & Hwang, S. (2005). Video data mining. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining. Idea Group Inc. and IRM Press.

Ohtsuki, K., Bessho, K., Matsuo, Y., Matsunaga, S., & Kayashi, Y. (2006). Automatic multimedia indexing. IEEE Signal Processing Magazine, 23(2), 69-78.

Sundaram, S., & Narayanan, S. (2007). Analysis of audio clustering using word descriptions. In Proceedings of the 32nd International Conference on Acoustics, Speech, and Signal Processing.

Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Addison-Wesley.

Tan, Z.-H., Dalsgaard, P., & Lindberg, B. (2007). Exploiting temporal correlation of speech for error-robust and bandwidth-flexible distributed speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 15(4), 1391-1403.

Tranter, S., & Reynolds, D. (2006). An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech and Language Processing, 14(5), 1557-1565.

Xu, H., Tan, Z.-H., Dalsgaard, P., & Lindberg, B. (2006).
