Automation of the DTW Process

The aim of this study is to replace the one-on-one human facilitator in the feedback loop (Figure 14, adaptive learning) with an automated system based on a neural network pattern classifier. Further, our basic aim is no longer to produce text output from speech input, but rather images. Cast as a pattern recognition problem, artificial neural networks suddenly become relevant again, whereas for conventional speech recognition they have been largely superseded by hidden Markov models.

A 2002 pilot study centred on the Macintosh-based proof-of-concept system shown in Figure 15. It comprised: (a) a voice input pre-processor (microphone, sound card, and noise filter); (b) a fast Fourier transform package, which converted sampled words to frequencies; and (c) an ANN pattern classifier, whose output was the 1-of-n “best match” from the reference word look-up table. We hasten to add that this reference vocabulary was kept very small in this first instance.
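To make the data flow concrete, the following Python sketch traces stages (b) and (c) of this pipeline: an FFT-derived feature vector feeds a small multi-layer perceptron, and the largest output selects the 1-of-n best match. The vocabulary, the binning scheme, and the weight names (W1, b1, W2, b2) are illustrative assumptions, not details recovered from the original system.

```python
import numpy as np

# Hypothetical reference word look-up table (the real vocabulary was
# likewise kept very small in the first instance).
VOCABULARY = ["water", "fire", "home", "family"]

def fft_features(samples: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Stage (b): convert one sampled word to a fixed-length frequency vector."""
    spectrum = np.abs(np.fft.rfft(samples))
    # Pool the magnitude spectrum into n_bins coarse frequency bands.
    bands = np.array_split(spectrum, n_bins)
    return np.array([band.mean() for band in bands])

def classify(features: np.ndarray, W1, b1, W2, b2) -> str:
    """Stage (c): one-hidden-layer MLP giving the 1-of-n 'best match'.

    W1, b1, W2, b2 are assumed to have been trained offline by
    back-propagation; they are parameters here, not real trained weights.
    """
    hidden = np.tanh(features @ W1 + b1)        # hidden layer
    logits = hidden @ W2 + b2                   # one output unit per word
    return VOCABULARY[int(np.argmax(logits))]   # 1-of-n decision
```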

Figure 15. Original Macintosh-based system (2002)


Figure 16. Apple Macintosh G4 screen dumps of Talk-Write software (top: user manual; bottom: input)

By the end of this 12-month inaugural study, whilst each of these three sub-systems showed some success individually, overall system performance fell short of expectations.

A second system was developed the following year. A screen dump from the Apple Macintosh G4 is shown in Figure 16, from which we see integration of IBM ViaVoice®, as well as support for additional input devices, namely scanner, graphics tablet, and mouse. These latter devices are needed in order to augment speech input. More specifically, users are able to input their own drawings (either pre-prepared or created anew via the tablet or mouse) to complement their oral stories.

As a first approximation to speech recognition for literacy, images could simply be linked on a one-to-one basis with words in the inbuilt vocabulary look-up table, whether that table belongs to the Mac OS X™ inbuilt speech library or to a third-party software package such as Dragon NaturallySpeaking™ or IBM ViaVoice® (the latter is shown in Figure 16). Ultimately, however, we are aiming to do this the other way around: to produce image output from speech input, then link the former on a one-to-one basis to text. Over time the user begins to associate (internalise) these words and images as part of the DTW process.
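A minimal sketch of this first approximation, assuming a plain dictionary serves as the look-up table; the words and file names are hypothetical. Inverting the mapping gives the direction DTW ultimately aims for: speech selects an image, which is then linked to text.

```python
# Hypothetical one-to-one look-up table: recognised word -> image file.
WORD_TO_IMAGE = {
    "water": "images/water.png",
    "fire": "images/fire.png",
    "home": "images/home.png",
}

# The inverted direction: an image, once produced, links back to its word.
IMAGE_TO_WORD = {image: word for word, image in WORD_TO_IMAGE.items()}

def respond(recognised_word: str) -> tuple[str, str]:
    """Return (image, text): the image is shown first, then tied to text."""
    image = WORD_TO_IMAGE[recognised_word]
    return image, IMAGE_TO_WORD[image]
```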

Other system features critical to producing an automated DTW “engine” are as follows (a sketch of one possible index follows the list):

1. storage of speech input in a form easily indexed and retrieved as needed; and
2. synchronised playback of keywords/phrases in the speaker’s own voice, rather than in the unrealistic styles used in commercial speech synthesis packages.
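One possible shape for such an engine, sketched below under the assumption that each keyword is indexed as a time-stamped segment of the speaker’s own recording; the Utterance structure and file layout are ours, for illustration only.

```python
import wave
from dataclasses import dataclass

@dataclass
class Utterance:
    """One keyword/phrase, indexed back into the speaker's own recording."""
    keyword: str
    wav_path: str    # original recording, in the speaker's own voice
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds

# Feature 1: speech input stored in a form easily indexed and retrieved.
index: dict[str, Utterance] = {}

def store(utt: Utterance) -> None:
    index[utt.keyword] = utt

def retrieve_frames(keyword: str) -> bytes:
    """Feature 2: fetch raw audio frames for playback in the narrator's voice."""
    utt = index[keyword]
    with wave.open(utt.wav_path, "rb") as wav:
        rate = wav.getframerate()
        wav.setpos(int(utt.start_s * rate))
        return wav.readframes(int((utt.end_s - utt.start_s) * rate))
```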

Up to the present time, an unrealistically small reference vocabulary has been used; obviously this would need to be expanded significantly before a production version could be released into the marketplace. More to the point, we have yet to determine just what constitutes a “minimum yet sufficient” vocabulary to enable users to tell their stories (and no doubt this will vary considerably from user to user).

CONCLUSION

This work-in-progress has thrown up numerous exciting possibilities for future investigation. Apart from the system issues outlined above, there is much experimentation that could be performed to determine optimum pattern recognition configurations (to date, only simple, naïve multi-layer perceptron/back-propagation neural networks have been used). Likewise, we have yet to benchmark ANNs against alternative pattern classifier approaches.
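The benchmarking we have in mind might look like the following sketch, which compares a back-propagation MLP against a support vector machine on labelled feature vectors; scikit-learn and the placeholder data are illustrative assumptions, not part of the system described above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder data standing in for real FFT feature vectors and word labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))      # one 64-bin feature vector per utterance
y = rng.integers(0, 4, size=200)    # 1-of-n word labels

classifiers = {
    "MLP (back-propagation)": MLPClassifier(hidden_layer_sizes=(32,),
                                            max_iter=2000),
    "Support vector machine": SVC(),
}

for name, clf in classifiers.items():
    # 5-fold cross-validated accuracy gives a like-for-like comparison.
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f} ± {scores.std():.2f}")
```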

The future possibilities and applications of draw-talk-write are limited only by our fears and lack of perceived safety. For example, “literary dramaturgy” has recently enabled people to consider and experiment with public writing processes alongside literacy-inefficient people. DTW provides rich potential for minorities to voice, witness, and be heard by audiences who demand text and belittle those who have not mastered it.

What we need to assist us in our endeavours is technology that can convert voice to text, synchronise playback in the voice of the narrator, and produce the accompanying images: in other words, an intelligent system which incorporates word recognition, but which is configured in a manner that enables computer-illiterate people to utilise the system. The computer system thus needs to respond to the user, rather than constrain people because they cannot meet the demands or limitations of the machine.

Lastly, successful automation of DTW on a computer platform would have far-reaching consequences beyond the specific (text-illiterate) section of the population of interest in the present study. Indeed, any community possessing a strong oral (story-telling) tradition could stand to benefit from this technology. Moreover, since the system output is images rather than text, it would have universal appeal.

ACKNOWLEDGMENTS

Financial support for this project from the Apple University Development Fund (AUDF) is gratefully acknowledged. The contributions of the following colleagues are likewise much appreciated: Kim Draisma, Ernie Blackmore, Marion Worthy, David Welsh, Frances Laneyrie, “the Dancer” and “the Prisoner” (fellow DTW travellers); Brian Pinch, Sunil Hargunani, Rishit Lalseta, Benjamin Nicholson, Riaz Hasan, Brij Rathore, and Phillip Dawson (software developers); and Professor Michael Wagner (proofreading and comments).


ENDNOTES

1 Two well-known examples of ANNs learning unexpected input-output associations are (a) sunny versus cloudy days instead of images of forests with and without Army tanks, and (b) photographs of ‘males’ versus ‘females’ being classified not on the basis of gender, but rather on the amount of white space between the tops of their heads and the top of the photograph [ref. The Dream Machine, Episode 4, BBC, 1991].

2 Technical and Further Education: technical/vocational post-secondary colleges.

APPENDIX: AUTOMATIC SPEECH