Acoustic features of impaired articulation due to amyotrophic lateral sclerosis

N/A
N/A
Protected

Academic year: 2021

Partager "Acoustic features of impaired articulation due to amyotrophic lateral sclerosis"

Copied!
227
0
0

Texte intégral

Acoustic Features of Impaired Articulation Due to Amyotrophic Lateral Sclerosis

by

Rachelle L. Horwitz-Martin

B.S., Biomedical Engineering and Electrical & Computer Engineering, Worcester Polytechnic Institute, 2008

S.M., Electrical Engineering and Computer Science, MIT, 2014

Submitted to the Division of Health Sciences and Technology in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Health Sciences and Technology at the Massachusetts Institute of Technology

September 2017

© 2017 Rachelle L. Horwitz-Martin. All Rights Reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author:

Signature redacted

Harvard-MIT Division of Health Sciences and Technology
August 28, 2017

Certified by:

Signature redacted

Thomas F. Quatieri, Sc.D.
Faculty Affiliate of the Division of Health Sciences and Technology
Senior Member of Technical Staff, MIT Lincoln Laboratory
Thesis Co-Supervisor

Certified by:

Signature redacted

Jordan R. Green, Ph.D., CCC-SLP
Professor, Communication Sciences & Disorders
Massachusetts General Hospital Institute of Health Professions
Thesis Co-Supervisor

Accepted by:

Emery Brown, M.D., Ph.D.
Professor of Computational Neuroscience and Health Sciences and Technology
Director, Harvard-MIT Division of Health Sciences and Technology


Acoustic Features of Impaired Articulation Due to Amyotrophic Lateral Sclerosis

by Rachelle Horwitz-Martin

Submitted to the Division of Health Sciences and Technology in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Abstract

Progressive bulbar motor deterioration resulting from amyotrophic lateral sclerosis (ALS) leads to speech impairment. Despite the devastating consequences of speech impairment to life quality, few options are available to objectively assess speech motor involvement. The overarching goal of this research was to derive objective measures of speech acoustics that can be used to support clinical decision making. To achieve this goal, we obtained 121 speech samples from 33 patients with ALS who repeated the phrase "Buy Bobby a puppy" five times in succession. In total, 342 acoustic features were semi-automatically extracted from each speech recording. Pearson correlations were computed between each feature and three metrics of overall speech severity: sentence intelligibility, speaking rate, and communication efficiency. The findings were grounded within a physiologic framework where acoustic features were grouped into one of three domains that, when combined, were hypothesized to broadly characterize articulatory performance: articulatory specification, articulatory coupling, and articulatory consistency. To obtain the most accurate prediction of ALS severity with the features we extracted, we compared two machine learning algorithms: linear regression and random forest. In shuffle-split cross-validation, the strongest mean Pearson correlations we obtained between actual and predicted intelligibility, speaking rate, and communication efficiency were 0.67, 0.74, and 0.77, respectively (SD=0.077, 0.050, and 0.059, respectively). Of the three domains, the specificity features were the most strongly associated with intelligibility impairments (mean r=0.68), and coupling was the most strongly associated with slower speaking rate (mean r=0.73). Specificity and coupling yielded similar performances in communication efficiency prediction.

Other contributions of this thesis are that it is the first to implement a framework of dysarthric speech in terms of three domains: specification, coupling, and consistency; the first to validate automated formant tracking in dysarthric speech; and the first to perform an in-depth investigation into physiologically-inspired acoustic features that describe articulatory impairments of patients with ALS. Novel findings include the presence of abnormal formant coupling patterns, which may suggest greater tongue-jaw coupling, in patients with more severe dysarthria due to ALS. Areas of future research involve further feature discovery, improved analysis methods, and a deeper understanding of relations to articulatory kinematics.

Thesis Co-Supervisor: Thomas F. Quatieri


Title: Faculty Affiliate of the Division of Health Sciences and Technology
Senior Member of Technical Staff, Lincoln Laboratory
Massachusetts Institute of Technology

Thesis Co-Supervisor: Jordan R. Green

Title: Professor, Communication Sciences and Disorders


Acknowledgments

First and foremost, I would like to thank my advisors, Tom Quatieri and Jordan Green. Both of you have been patient with me throughout this process, and I greatly appreciate the amount of time you have spent advising me. Tom, your attention to detail is astounding. The comments you have made on thesis and paper drafts have been tremendously helpful. I'd also like to thank you for your ability to keep me on the main path of my thesis, especially when I would wander off to explore a minute detail. Jordan, the vast amount of knowledge you have about speech, ALS, and statistics is truly extraordinary. I greatly appreciate your guidance throughout these past 2.5 years, particularly on the interpretations of our results, as well as your mentorship in general. The positive reinforcement was extremely helpful!

Jim Glass and Satra Ghosh, thank you for your guidance throughout this process. I do not want to know what would have happened if the two of you had not convinced me that my initial thesis topic was unfeasible, given the current state of speech synthesis and warping capabilities. In addition to your guidance when selecting a thesis topic, I want to thank you for sharing your expertise in machine learning, signal processing, and speech science.

I would particularly like to thank the following individuals from Lincoln Labs for sharing their technical and professional expertise: Jim Williamson, Adam Lammert, Daryush Mehta, and Greg Ciccarelli. Jim, thank you for your guidance in applying many of the features you developed, and in attempting to understand the meaning behind some of the coordination features. Adam, I enjoyed our numerous discussions about articulatory gestures and the relationship between articulator movement and formant trajectories. Daryush, thank you for providing KARMA support and for your guidance and mentorship in general. Greg, thank you for your advice and cheerleading, particularly over the past 3 months.

Other individuals at Lincoln Labs I would also like to thank for their technical expertise are Brian Helfer, Bea Yu, Elizabeth Godoy, Rajmonda Caceres, Vivek Varshney, Nick Malyska, Bob Dunn, Doug Sturim, Chris Smalt, Shakti Davis, and Mike Brandstein. I would also like to thank Ed Wack and Jeff Palmer, the Group 48 group leaders, for allowing me to complete my thesis at Lincoln Labs; Darrell Ricke, for mentoring me, as well as acting as the token Midwesterner; and Joey Perricone and Ray Trebicka, for their comments about the clarity of various sections of this thesis.


I am grateful to the following individuals at Mass. General Institute of Health Professions: Brian Richburg, Panying Rong, Kristen Allison, Claire Cordella, and Meg Simione. Brian, Panying, Kristen, and Meg, thank you for answering all of my questions about the database and other random questions as they arose. Claire, thank you for collecting the data that became a chapter in my thesis. Good luck finishing your thesis; I know you can do it! You have a wonderful advisor to provide guidance when you run into obstacles.

Alex Craciun and Tom Bäckström, thank you for discussing the estimated spectral amplitude shift features with me. I have enjoyed working with you, and I look forward to continued discussions.

I would also like to thank Julie Greenberg, for whom I TA'ed 6.555J. TAing was a valuable experience for me, and was often far more difficult than I had expected! Julie, I'd also like to thank you for your patience and kindness during a difficult time in my life.

Other individuals I would like to thank are all of my professors from SHBT and EECS, as well as Megan Fuller, Samiya Alkhairy, and my classmates: Nate Zuk, Jordan Whitlock, Koeun Lim, and Sonam Dilwali. I have learned a tremendous amount from all of you.

Mom, Dad, Josh, Therese, Bill, Phill, Tyler, and Justin: thank you for all of your love, support, and understanding, especially when I miss birthday dinners, hikes, home renovation projects, and other family events. Mom, it was absolutely thrilling to be able to converse with you about what I was learning. You're the only person I have known pre-MIT with whom I can use terms such as "formant," "dysarthria," and "spirantization" without needing to provide definitions.

Rob, in case you don't already know, you're the best in the world, and I love you more than anything. I couldn't have asked for a more compassionate, supportive partner throughout this process, and I can't imagine finishing without you and Piper. Thank you for forcing me to take breaks every so often, even if it was just to watch Piper begging for chicken with her disordered voice. I can't wait to start the next chapter of our lives together.

Lastly, I would like to thank all of the patients and their families who participated in the data collection. I believe that one of the kindest acts you can perform for humanity is to participate as a subject in research studies, especially if you will not be able to reap the rewards from the knowledge gained by your participation. To those who have passed, may your memory be a blessing.

This research was funded by NIDCD grant T32 DC000038, Air Force Contract FA8721-05-C-0002, and NIH Grants R01 DC009890 and R01 DC0135470.


Contents

Abstract
Acknowledgments
List of Figures
Glossary/Notational Conventions

1 Introduction
  1.1 Problem Statement
    1.1.1 Perceptual Cues to Assess Dysarthria Severity
    1.1.2 Intelligibility, Speaking Rate, and Communication Efficiency Assessments of Dysarthria Severity
  1.2 Approach
  1.3 Objectives
  1.4 Summary of Contributions
  1.5 Thesis Outline

2 Databases and Metrics
  2.1 Comprehensive ALS Database (Database-C)
    2.1.1 Sources of Variability
    2.1.2 Speech Material
    2.1.3 Metrics
  2.2 Database-121
    2.2.1 Inclusion Criteria and Demographic Statistics
    2.2.2 Speech Material
    2.2.3 Speech Impairment Metrics
  2.3 Conclusions

3 Formant Trajectory Extraction and Validation
  3.1 Semi-Automatic Extraction of Formant and Voicing Activity Detection Trajectories
    3.1.1 Pre-Processing of Speech Samples
    3.1.2 Segmentation
    3.1.3 Formant Extraction and SAD Calculation
    3.1.4 Post-Processing of Formant Trajectories and SAD
  3.2 Formant Trajectory Validation
    3.2.1 Selecting Sessions for Manual Annotation (Procedure; Limitations)
    3.2.2 Manual Annotation Methodology
    3.2.3 Results (By Sex; By Speech Impairment; Comparison to Deng et al.; Comparison of Basic Features)
  3.3 Conclusions and Future Work

4 Kinematic/Physiological Basis for Candidate Features: Acoustic & Articulatory Analysis in a Healthy Control
  4.1 Background
  4.2 Assumptions
  4.3 Methods
    4.3.1 Subject and Speech Material
    4.3.2 Data Collection
    4.3.3 Processing of Articulatory Kinematics Data
    4.3.4 Processing of Audio Data
  4.4 Results
  4.5 Conclusions and Future Work

5 Identifying Features for the Prediction of Speech Degradation
  5.1 Articulatory Specification Features
    5.1.1 nth Order Derivatives of Formant Trajectories (Description of Features; Results; Summary of the nth Order Derivative Features)
    5.1.2 Vowel Space (Description of Features; Results)
    5.1.3 Bandwidths (Description of Features; Results)
    5.1.4 Estimated Spectral Amplitude Shift (Description of Features; Results)
    5.1.5 Conclusions and Future Work
  5.2 Articulatory Coupling Features
    5.2.1 Zero-Lag Correlation & Covariance Features (Description of Features; Results)
    5.2.2 Unbiased Cross-Correlation and Cross-Covariance Features (Description of Features; Results)
    5.2.3 Instantaneous Cross-Correlation and Cross-Covariance Features (Description of Features; Results)
    5.2.4 Frequency-Domain Coupling Features (Description of Features; Results)
    5.2.5 Conclusions, Limitations, and Future Work
  5.3 Articulatory Consistency Features
    5.3.1 Description of Features
    5.3.2 Results
    5.3.3 Conclusions, Limitations, and Future Work
  5.4 Conclusions and Future Work

6 Predictions
  6.1 Relations Among Features
  6.2 Framework
    6.2.1 Shuffle-Split Cross-Validation
    6.2.2 Suitability of Linear Regression and Random Forest
    6.2.3 LR Framework
    6.2.4 RF Framework
    6.2.5 Post-Processing
    6.2.6 Assessing Prediction Accuracy
  6.3 Prediction of Intelligibility, Speaking Rate, and Communication Efficiency Using All Features
  6.4 Relative Feature Contributions to Intelligibility, Speaking Rate, and Communication Efficiency Prediction
    6.4.1 Tl(I) Prediction
    6.4.2 SR Prediction
    6.4.3 Tl(I)SR Prediction
  6.5 Importance of Specification, Coupling, and Consistency in ALS Prediction
    6.5.1 Method
    6.5.2 Results
  6.6 Conclusions and Future Work

7 Conclusions and Future Work
  7.1 Summary and Contributions
  7.2 Limitations
    7.2.1 Database and Speech Impairment Metrics (Speech Impairment Metrics; Effect of Fatigue on Subjects; Choice of Phrase)
    7.2.2 Signal Pre-Processing and Segmentation
    7.2.3 Formant Trajectory Estimation and Speech Activity Detection
    7.2.4 Formant Trajectory Validation
    7.2.5 Exploring the Physiological Basis for the Articulatory Features in Controls
    7.2.6 Feature Extraction
    7.2.7 Analysis of the Relations Between Each Feature and Intelligibility, Speaking Rate, and Communication Efficiency
    7.2.8 Prediction of Intelligibility, Speaking Rate, and Communication Efficiency
    7.2.9 Feature Importance
  7.3 Extensions
  7.4 Future Impact

A Derivation of LPC Spectral Shift Features
B Pearson Correlation and Statistical Covariance
C Relations Between Each Frequency-Domain Feature and Both Tl(I) and Tl(I)SR
D Relations Between the Magnitudes of the Eigenvalues of the Unbiased Cross-Correlation Matrix and Both Tl(I) and SR
E Pearson Correlations Between Each Consistency Feature and Each of the Three SIMs
F Random Forest
  F.1 Binary Regression Trees
  F.2 Bootstrap Aggregation of Binary Regression Trees
  F.3 Random Forest Algorithm
  F.4 Feature Importance Computed by Random Forest
G Predicting Speech Impairment Using Sex and Age
H Feature Importances: Means and Standard Deviations

Bibliography


List of Figures

1.1 The four speech subsystems: respiratory, phonatory, resonatory, and articulatory. Figure from [41].
1.2 The hypothesized relationship between phonetic specification and phonetic variability with respect to speech clarity, defined for dysarthric speech as intelligibility, speaking rate, and communication efficiency. From Mefferd and Green (2010) [77].
2.1 Histogram of the number of sessions attended by each subject, by gender. μ: mean; σ: standard deviation; γ: skewness.
2.2 Histograms of the number of sessions in various levels of IP, Tl(I), SR, ISR, and Tl(I)SR for three sex groups: both males and females (M & F), males only (M), and females only (F). Skewness (γ) is also provided.
2.3 IP versus Tl(I).
3.1 2D histogram of MA versus KARMA-estimated formant values, computed from both males and females. Color indicates the number of frames within a bin. Top row: F1; bottom row: F2. Horizontal axes: manually annotated formants; vertical axes: KARMA-estimated formants. Each column contains a different SAD that masks the formant estimates. R² and MAE values are also provided above each panel.
3.2 Same as Figure 3.1, but only includes data from males.
3.3 Same as Figures 3.1 and 3.2, but only includes data from females.
3.4 2D histograms of MA versus KARMA-estimated formant values for BBP repetitions in sessions during which the subject's Tl(I)SR was high.
3.5 2D histograms of MA versus KARMA-estimated formant values for BBP repetitions in sessions during which the subject's Tl(I)SR was low.
3.6 Flowchart depicting the method used to compare the features extracted from the MA and KARMA-estimated formant trajectories. F1, F2, and SAD denote the MA trajectories and SAD outputs; their hatted counterparts denote the KARMA estimates. X denotes the six basic features computed from the MA trajectories and SAD outputs, while the hatted X denotes the same features computed from the KARMA estimates.
3.7 Pearson correlations between each of the 12 "basic features" computed from both the MA (yellow) and KARMA-estimated (blue) formants. The height of each bar represents the Pearson correlation between a feature and a SIM. An asterisk indicates a Bonferroni-corrected statistically significant correlation (α = 0.05; α/(12 features × 2 types of formants × 3 SIMs) = 6.94 × 10^-4) between a feature and a SIM.
4.1 Illustration of the three parameters in the three-parameter model. The glottis is at the left end of the illustration; the mouth opening is on the right [112].
4.2 Contours of vowel articulation. Each contour represents the range of parameter values that produce the values indicated within each region. Each dot represents the mean value for a particular vowel. Modified from Stevens and House (1955) [112].
4.3 Values of the first three formants as a function of the three parameters. Modified from Stevens and House (1955) [112]. The horizontal axes represent the point of constriction from the glottis, d0, in cm; the vertical axes represent the formant frequencies in Hz; each panel represents a different value of r0; and the three families of curves within each panel represent the three formants [112].
4.4 Acoustic and articulatory plots for the 3rd repetition of the female control uttering BBP. Top: wideband spectrogram overlaid with formant trajectories as estimated by KARMA. Cyan: speech as indicated by SAD. Red: non-speech as indicated by SAD. Bottom: articulatory data collected simultaneously with acoustic data. Arpabet transcription [98] provided between the two plots.
4.5 Vowel quadrilateral for Standard American English. "Front" and "back" refer to the tongue's anterior/posterior position in the oral cavity, while "high" and "low" refer to the superior/inferior position of the tongue. Modified from [3].
4.6 F1 (top) and F2 (bottom) versus articulator position across all five BBP repetitions.
4.7 Frequency response of formant trajectories & articulatory movements.
5.1 Block diagram depicting each class of features within the articulatory specification domain. The numbers within the parentheses represent the number of features within each block.
5.2 Flowchart depicting the process used to calculate the MSCAVD features. k ∈ {1, 2}.
5.3 F2 trajectory, velocity of F2 trajectory, and acceleration of F2 trajectory collected from a male PALS with normal intelligibility and speaking rate (left), and degraded intelligibility and SR (right). For each of the displacement, velocity, and acceleration trajectories, both before and after speech impairment, the MAV, Std, and mCV of the trajectories are provided. Top row: spectrogram overlaid with F2 trajectory extracted via KARMA. Cyan: F2 trajectory classified as "speech" by the SAD; red: F2 classified as "non-speech" by the SAD. Second row: F2 velocity; third row: F2 acceleration. Red letters above first row: phonemes time-aligned to spectrogram.
5.4 The 18 MSCAVD features vs. Tl(I). Pearson correlations are provided above each plot. Top 3 rows: correlations between each F1 MSCAVD feature and Tl(I). Bottom 3 rows: correlations between each F2 MSCAVD feature and Tl(I).
5.5 The 18 MSCAVD features vs. SR.
5.6 The 18 MSCAVD features versus Tl(I)SR.
5.7 Bar plots of Pearson correlations between each feature and each of the three SIMs.
5.8 Scatter plots and Pearson correlations between the 95th percentile of formant trajectory speeds and each of the three metrics. Top row: scatter plots and correlations between the 95th percentile of F1 speed and each of the three metrics. Bottom row: the same for F2 speed.
5.9 Scatter plots and Pearson correlations between the two vowel space features and each of the three metrics. Top row: scatter plots and correlations between areaF1F2 and each of the three metrics; bottom row: scatter plots and correlations between Std(F2minusF1) and each of the three metrics.
5.10 Distribution of F1 and F2 mean BWs across all sessions. The vertical axis is the ratio of the number of BW estimates within a bin to the total number of BW estimates.
5.11 Distribution of F1 and F2 mean BWs across sessions from male speakers with Tl(I)SR values greater than 125. The vertical axis is the ratio of the number of BW estimates within a bin to the total number of BW estimates.
5.12 F1 and F2 trajectories and bandwidths superimposed on spectrogram, generated from repetitions in which the estimated mean F2 BW was unreasonably large. Each plot was generated using a different subject, session, and repetition. Cyan: formant trajectories at epochs classified as speech by the SAD; red: formant trajectories at times classified as non-speech by the SAD; transparent magenta: bandwidth. The BWs around non-speech (red) regions are ignored in the calculation of the mean BW.
5.13 F1 and F2 trajectories and bandwidths superimposed on spectrogram, generated from repetitions in which the estimated mean F2 BW was considered to be a reasonable value. Each plot was generated using a different subject, session, and repetition. The BWs around non-speech (red) regions are ignored in the calculation of the mean BW.
5.14 Top: mean B1s vs. Tl(I), SR, and Tl(I)SR. Bottom: mean B2s vs. Tl(I), SR, and Tl(I)SR.
5.15 Depiction of spectral amplitude shift within a small range of frequencies ki [19]. The red curve represents X(ki, t0), the spectral envelope of a formant over a small frequency range ki at time t0. The black curve represents X(ki, t0 + Δt), the spectral envelope of the formant over the same range of frequencies at time t0 + Δt. X(k0, t0) is a point along the curve X(ki, t0) at frequency k0; X(k0 + Δk, t0 + Δt) is a point along the X(ki, t0 + Δt) curve at frequency k0 + Δk. The diagonal lines depict the assumption that, over a small range of frequencies, a and b are constant. Retrieved from Craciun et al. (2017) [19].
5.16 LPC spectrogram (top row) and amplitude shift spectrogram (α; bottom row) for Subject 0149, before (left column) and after (right column) the precipitous declines in intelligibility, SR, and communication efficiency.
5.17 Scatter plots of the two amplitude shift features versus each of the three SIMs.
5.18 Block diagram depicting each class of features within the coupling domain. The numbers within the parentheses represent the number of features within each block.
5.19 F1 vs. F2 for all five BBP repetitions spoken by the healthy female subject in Chapter 4.
5.20 Comparison of different sliding window lengths for subject 0149 before severe speech degradation (left column) and after (right column). rF1F2[0](t) and sF1F2[0](t) are plotted in the second and third rows, respectively, and values are obtained only when the SAD declares a region to contain speech. Different colored lines represent the different window lengths.
5.21 Relations between each of the zero-lag features and Tl(I).
5.22 Relations between each of the zero-lag features and SR.
5.23 Relations between each of the zero-lag features and Tl(I)SR.
5.24 Method of extracting the unbiased correlation matrix. Panel A: F1 and F2 trajectories (top), z-scored F1 and F2 trajectories (bottom). Panel B: unbiased correlations between the z-scored F1 and F2 trajectories. In each subplot the stems are 10 ms apart. Panel C: Ru,F1F1 zoomed in between -500 ms and 500 ms, as depicted by the black box in the Ru,F1F1 subplot of Panel B. Red stems: samples (spaced 30 ms apart). Panel D: correlation matrix. Warmer colors represent greater values, and each "pixel" in the matrix corresponds to a single sample in Panel C. Panel E: structure of each of the blocks of the unbiased correlation matrix. The arrows from the subplots in Panel C to the unbiased correlation matrix in Panel D indicate the rows of the unbiased correlation matrix that are occupied by the samples.
5.25 Mean z-scored UXRM-EM (top) and UXSM-EM (bottom).
5.26 Relations between the magnitude of the jth eigenvalue, denoted by λj, of the unbiased cross-correlation matrix, versus Tl(I)SR. Each horizontal axis is the magnitude of λj of the unbiased cross-correlation matrix.
5.27 Relations between the magnitude of the jth eigenvalue, denoted by λj, of the unbiased cross-covariance matrix, versus Tl(I)SR. Each horizontal axis is the magnitude of λj of the unbiased cross-covariance matrix.
5.28 Mean of the z-scored singular values (SVs) of the instantaneous correlation matrix (top) and instantaneous covariance matrix (bottom).
5.29 Scatter plots of the SVs of the instantaneous cross-correlation matrices, versus Tl(I)SR. Each horizontal axis is the SV of the instantaneous cross-correlation matrices.
5.30 Scatter plots of the SVs of the instantaneous cross-covariance matrices, versus Tl(I)SR. Each horizontal axis is the SV of the instantaneous cross-covariance matrices.
5.31 Relations between each of the average Pow and XPow features with Tl(I)SR. A different frequency band (denoted by fi) is displayed in each column, and mappings between fi and the frequency bands are provided in Tables 5.13 and 5.14. Top row: average power in F1 features. Bottom row: average power in F2 features.
5.32 Relations between each of the XPow features with Tl(I)SR.
5.33 Relations between the coherence features and Tl(I)SR.
5.34 Block diagram depicting each class of features within the consistency domain. The numbers within the parentheses represent the number of features within each block.
5.35 Bar plots of the Pearson correlations between each (in)consistency feature and each of the three SIMs. For each plot, the horizontal axis is feature index. Recall that the CV is a measurement of variability; a negative correlation between a "consistency" feature and a SIM indicates that variability decreases as a SIM increases (i.e., consistency decreases as speech degrades). The yellow lines denote the Pearson correlations at which a p-value is statistically significant: 0.36.
6.1 Heat map of correlations between each of the 29,241 pairs of features. The color bar to the right of the figure maps each Pearson correlation value to a color. Each tick mark denotes the beginning of a class of features. "Spec. Other" includes the vowel space, bandwidth, and estimated spectral amplitude shift features; "zlr" and "zls" are the zero-lag correlation and zero-lag covariance features, respectively; "Ri,F1F2" and "Si,F1F2" are the instantaneous cross-correlation and cross-covariance features, respectively; "Ru,F1F2" and "Su,F1F2" are the unbiased cross-correlation and cross-covariance features, respectively; "Pow" represents the average power features, "XPow" represents the average cross-power features, and H represents the coherence features.
6.2 Two folds, i and j, of shuffle-split, depicting the process of randomly selecting training data and test data. X is the 121×342 matrix of the features for each session, and y is the 121×1 vector that contains one of the three speech impairment metrics: Tl(I), SR, or Tl(I)SR. Inside the X matrices for folds i and j, P denotes the subject, S denotes the session, and M denotes the feature. For example, the fourth row, third column of X represents the value of the third feature for the second subject's first session. Yellow boxes represent the data that is randomly selected to be test data for a particular fold; green represents training. The total number of folds is 100.
6.3 Top: percentage of variance explained by each principal component. Bottom: cumulative percentage of variance explained.
6.4 Flowchart depicting the framework within a cross-validation loop for LR.
6.5 Flowchart depicting the framework within a cross-validation loop for RF.
6.6 Top: prediction accuracy of LR. The color bars map to the number of test sessions within each bin over all 200 rounds of randomly sampling the actual and predicted test values from the 33 subjects. Bottom: histogram of RMSEs over all 100 folds of shuffle-split. The mean and standard deviation (Std) values reported were computed as described in Section 6.2.
6.7 Prediction accuracy of RF. The color bars map to the number of test sessions within each bin over all 200 rounds of randomly sampling the actual and predicted test values from the 33 subjects. Bottom: histogram of RMSEs over all 100 folds of shuffle-split. The mean and standard deviation (Std) values reported were computed as described in Section 6.2.
6.8 Feature importances for Tl(I). Feature indices 1-25 are specification features; feature indices 26-171 are coupling features; feature indices 172-342 are consistency features. For a complete mapping between each feature index and each feature, see Appendix H. "Zero-lag" refers to the zero-lag coupling and consistency features. "Inst. R" and "Inst. S": instantaneous cross-correlation and cross-covariance features, respectively. "Unb. R" and "Unb. S": unbiased cross-correlation and cross-covariance features, respectively. "Freq-Domain Coupl.": frequency-domain coupling features, which include features that describe the average power of F1 and F2, average cross-power between F1 and F2, and coherence, each within four frequency bands: 0.25-1 Hz, 1-2 Hz, 2-4 Hz, and 4-8 Hz. The height of each blue bar represents the mean feature importance for a feature; the red bars represent the standard deviation of the importance.
6.9 Feature importances for SR.
6.10 Feature importances for Tl(I)SR.
6.11 Prediction results obtained by computing Tl(I) (left plots), SR (middle plots), and Tl(I)SR (right plots) from only the 25 specification features. Top: two-dimensional histograms displaying the relations between the actual versus predicted values of Tl(I), SR, and Tl(I)SR, when we isolated the 25 specification features. The color bars are mapped to the number of sessions within a bin in the two-dimensional histogram. The mean Pearson correlation and mean root mean square error (RMSE) are displayed for each prediction. Bottom: histograms of the RMSEs computed between the actual and predicted values of Tl(I), SR, and Tl(I)SR.
6.12 Same as Figure 6.11 but with the 146 coupling features instead of the 25 specification features.
6.13 Same as Figures 6.11 and 6.12 but with the 171 consistency features.
B.1 Depiction of Pearson correlation r_xy versus statistical covariance s_xy. Top: F1 and F2 trajectories (blue: F1; red: F2). Second plot: Pearson correlation coefficient over the trajectories using a sliding 90-ms window. Third plot: standard deviations of F1 and F2. Fourth plot: the product of the standard deviations of F1 and F2, when the standard deviations are computed over a 90-ms window. Bottom plot: statistical covariance of the trajectories with the 90-ms window.
C.1 Pearson correlations between each of the PSD features with Tl(I). A different frequency band (denoted by fi) is displayed in each column. Top row: F1 PSD features. Bottom row: F2 PSD features.
C.2 Pearson correlations between each of the PSD features with SR. A different frequency band (denoted by fi) is displayed in each column. Top row: F1 PSD features. Bottom row: F2 PSD features.
C.3 Pearson correlations between each of the XPSD features with Tl(I). A different frequency band (denoted by fi) is displayed in each column. Features from the same XPSD matrix within frequency band i, Λ(Mi), are plotted in the same column. Top row: first eigenvalue of Mi, denoted by j in λj(Mi). Bottom row: second eigenvalue of Mi.
C.4 Pearson correlations between each of the XPSD features with SR. A different frequency band (denoted by fi) is displayed in each column. Features from the same XPSD matrix within frequency band i, Λ(Mi), are plotted in the same column. Top row: first eigenvalue of Mi, denoted by j in λj(Mi). Bottom row: second eigenvalue of Mi.
C.5 Pearson correlations between the coherence features and Tl(I).
C.6 Pearson correlations between the coherence features and SR.
D.1 Pearson correlations between the z-scored eigenvalue magnitudes of the unbiased cross-correlation matrix and Tl(I).
D.2 Pearson correlations between the z-scored eigenvalue magnitudes of the unbiased cross-correlation matrix and SR.
F.1 A binary decision tree for regression.
F.2 Random forest algorithm, from Hastie et al. (2008) [45].


Glossary/Notational Conventions

λ: Eigenvalue
μ: Mean
σ: Standard deviation
Accel: Acceleration
ALS: Amyotrophic Lateral Sclerosis
ALSFRS-R: ALS Functional Rating Scale - Revised [17]
areaF1F2: Area of F1-F2 space; used to measure vowel space
Bk: Bandwidth of formant k
BBP: "Buy Bobby a puppy."
BW: Bandwidth
C(·): Consistency of a quantity (i.e., for this thesis, the coefficient of variation of a quantity)
Corr: Correlation
Cov: Covariance
CV: Coefficient of variation, CV = σ/μ
Database-36: The set of acoustic BBP speech data and SIT scores from the 15 subjects (36 sessions) included in the comparison of the manually annotated formant and SAD trajectories versus the KARMA-estimated formant and SAD trajectories.
Database-121: The set of acoustic BBP speech data and SIT scores from 33 subjects (121 sessions), extracted from Database-C. KARMA is used to automatically extract formant and SAD trajectories from the BBP repetitions in this database.
Database-C: Comprehensive database that includes acoustic, kinematic, nasometric, and/or phonatory data from all 123 longitudinal subjects with ALS.
DME: Direct magnitude estimation [30,124]
E[·]: Expected value (i.e., mean)
EM: Magnitude of an eigenvalue
F1: First formant
F2: Second formant
F3: Third formant
TI: Transformed intelligibility, TI = 100^(IP/100)
IP: Intelligibility percentage (as measured by the SIT) [134]
ISR: Intelligible speaking rate, a measure of communication efficiency: ISR = IP × SR/100
KARMA: Kalman-based autoregressive moving average [81]
LMN: Lower motor neuron
LPC: Linear predictive coding
LR: Linear regression
MA: Manually annotated
MAE: Mean absolute error
mCV: Modified coefficient of variation, mCV = σ/MAV
MAV: Mean of the absolute value, MAV = E[|·|]
MSCAVD: MAV, Standard deviation, and modified Coefficient of variation of the Acceleration, Velocity, and Displacement of a formant trajectory
PALS: Patients with ALS
PCA: Principal component analysis
PD: Parkinson's disease
PowFk(fb): Average power in the Fk trajectory (k ∈ {1, 2}), computed over frequency band b
Ri,F1F2: Instantaneous correlation between F1 and F2
Ru,F1F2: Unbiased correlation between F1 and F2
RF: Random forest
RMSE: Root mean square error
RSS: Residual sum of squares
RWN: Repetition with noise
Si,F1F2: Instantaneous covariance between F1 and F2
Su,F1F2: Unbiased covariance between F1 and F2
SAD: Speech activity detector
SIM: Speech impairment metric (i.e., Tl(I), SR, or Tl(I)SR)
SIT: Sentence Intelligibility Test [134]
SLP: Speech-language pathologist
Speed95Fk: 95th percentile of Fk speed (k ∈ {1, 2})
SR: Speaking rate (as measured by the SIT) [134]
Std: Standard deviation
Std(F2minusF1): Standard deviation of F2 minus F1
SV: Singular value
Tl(I): Log-transformed intelligibility, Tl(I) = 50[-log10(101 - IP) + 2]
Tl(I)SR: Log-transformed intelligible speaking rate; a measure of communication efficiency: Tl(I)SR = Tl(I) × SR/100
TBI: Traumatic Brain Injury
UMN: Upper motor neuron
UPDRS: Unified Parkinson's Disease Rating Scale [87]
UXRM: Unbiased cross-correlation matrix
UXSM: Unbiased cross-covariance matrix
Vel: Velocity
XPow(fb): Average cross-power between the F1 and F2 trajectories over the frequency band indexed by b
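As a quick numerical illustration of the variability measures defined above (a minimal sketch with a made-up trace, not code from this thesis): the mCV replaces the mean with the MAV in the denominator, presumably to keep the ratio stable for signed quantities such as velocity, whose mean is near zero.

    import numpy as np

    v = np.array([-2.0, 1.0, 3.0, -1.0])    # made-up formant velocity trace
    mu, sigma, mav = v.mean(), v.std(), np.abs(v).mean()
    print("CV  =", sigma / mu)              # CV = sigma/mu; unstable when mu is near 0
    print("mCV =", sigma / mav)             # mCV = sigma/MAV, where MAV = E[|v|]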


Chapter 1

Introduction

ALS is a fatal neurodegenerative disease that primarily affects the nerves responsible for movement, called motor neurons. Objective, accurate assessments are needed for early ALS diagnosis, disease progression monitoring, and improvements to the efficacy of drug trials. Early disease detection is critical because it may allow patients and clinicians to take measures that slow the rate of progression. Moreover, in the future, if the disease is caught in the early stage, disease-modifying treatment could potentially reverse the disease processes. The monitoring of ALS severity is important because it may affect clinical decisions, and it may enable the patient and clinicians to better plan for various interventions.

To minimize patient discomfort and risk of complications, non-invasive markers of disease are desirable. Speech is an excellent basis for noninvasive markers of some neurological and psychiatric disorders because it requires complex interactions among many areas of the brain. With the brain driving the process, physiological production of speech involves air being pushed from the lungs, through vocal folds vibrating at the larynx, and into the vocal tract, which acts as a resonator. The acoustic output ultimately radiates from the lips. Thus, embedded in the speech signal is information about the control and functioning of the underlying production mechanisms. Many patients with neurological disorders such as Parkinson's disease, stroke, Traumatic Brain Injury (TBI), and Amyotrophic Lateral Sclerosis (ALS) exhibit changes in their speech.

When a patient with impaired neuromuscular function encounters difficulty in executing speech, he/she suffers from a motor speech disorder called dysarthria. Dysarthria may affect intelligibility, quality, and/or rate of speech to varying degrees. It may result from damage to upper motor neurons (UMNs), lower motor neurons (LMNs), or both. UMNs originate in either the motor cortex of the brain or in the brainstem and carry information to LMNs, while LMNs originate in the spinal cord or cranial nerve nuclei in the brainstem, and directly innervate muscle.

Dysarthria can affect at least one of the four speech subsystems shown in Figure 1.1, as well as prosody [41,72,73,97]. The respiratory subsystem refers to the lungs and respiratory musculature, the phonatory subsystem consists of the larynx and supporting musculature, the resonatory subsystem consists of the nasal cavity and velopharyngeal apparatus, and the articulatory subsystem consists of the lips, tongue, and jaw. Articulatory information, which refers to the movement of the tongue, lips, and jaw, can be found within consonants, as well as vocal tract resonances, which are also known as formants. Prosodic information, or information relating to the melody or rhythm of speech, may also be considered a separate type of information affected by dysarthria [22,113].

Figure 1.1: The four speech subsystems: respiratory, phonatory, resonatory, and articulatory. Figure from [41].

Although ALS is the primary cause of dysarthria in this thesis, the framework developed in this thesis is generalizable to dysarthria resulting from other neurological conditions. The dysarthria observed in patients with ALS (PALS) is often characterized by the auditory percept of weakness and/or spasticity [27]. Although the rate of progression and location of symptom onset vary among patients, most patients exhibit declines in the intelligibility and rate of their speech [138,141], as well as in communication efficiency [135], which is defined as the rate of intelligible words spoken within a given time. The decline can be rapid; the mean survival time is 3 to 5 years after onset [2]. The rapid disease progression in ALS patients allows for longitudinal data to be collected and analyzed, and for intra-subject comparisons to be made.

1.1 Problem Statement

One of the problems associated with dysarthria assessment is obtaining objective, accurate measurements of the severity of dysarthria. Currently, dysarthria severity is estimated using two methods: perceptual ratings and/or administering an intelligibility test.

(27)

Table 1.1: Representations of each number in the speech assessment portion of the ALSFRS-R.

4: Normal speech processes
3: Detectable speech disturbances
2: Intelligible with repeating
1: Speech combined with nonvocal communication
0: Loss of useful speech

1.1.1 Perceptual Cues to Assess Dysarthria Severity

There is no established set of cues that can be used to reliably assess the severity of dysarthria [41]. For ALS and Parkinson's disease (PD), the severity of each disease as a whole is assessed by the ALS Functional Rating Scale Revised (ALSFRS-R) [17] and the Unified Parkinson's Disease Rating Scale (UPDRS) [87], respectively. Both rating scales are multi-item instruments in which speech is only one of many items; in the ALSFRS-R, there are 12 items total, and each item is rated on a scale from 0 to 4, where 4 represents normal abilities. (Other items in the ALSFRS-R are salivation, swallowing, handwriting, cutting food/handling utensils, dressing and hygiene, turning in bed and adjusting bed clothes, walking, climbing stairs, difficulty breathing during physical exertion, difficulty breathing, difficulty sleeping due to shortness of breath, and use of mechanical ventilation support.) For speech, the representations of each number in the 0-4 rating scale are provided in Table 1.1. The subjectivity, coarseness, and lack of sensitivity to the beginning stages of ALS highlight the need for an improved severity assessment.

Although there is no standard set of characteristics used to assess the severity of dysarthria [41], more detailed auditory and visual cues have been used to diagnose or classify the type of dysarthria [20,21,27]. In the seminal work of Darley, Aronson, and Brown (1969) [20,21], judges used a 7-point scale to rate 38 characteristics of the speech of dysarthric patients with bulbar palsy (i.e., LMN lesions that affect speech), pseudobulbar palsy (i.e., UMN lesions that affect speech), cerebellar lesions, parkinsonism, dystonia (e.g., spasmodic dysphonia), chorea (e.g., Huntington's disease), and ALS. The 38 characteristics that appeared to capture the speech abnormalities were divided into seven domains: pitch, loudness, voice quality, resonance, respiration, prosody, and articulation, plus two global measures: intelligibility and "bizarreness." Since this seminal work, there have been modifications to the characteristics and rating scales [27], but many issues remain if the characteristics are to be used in dysarthria severity assessment: (1) there is disagreement on the definitions of the perceptual characteristics, (2) there is a lack of consensus regarding which perceptual characteristics should be used in the evaluation, and (3) even when perceptual features are operationally defined, clinicians may not agree on the severity rating [55,142]. With these problems, it is difficult to obtain a meaningful metric that represents the severity of a patient's speech deficits, which increases the difficulty of monitoring the progression of the disease.

1.1.2 Intelligibility, Speaking Rate, and Communication Efficiency Assessments of Dysarthria Severity

In research environments, two less subjective measurements of dysarthria severity are used: sentence intelligibility and speaking rate. Both are measured during administration of the Sentence Intelligibility Test (SIT) [134], in which a patient reads 11 sentences of increasing length while a listener transcribes the words he/she hears and computes the speaking rate in words per minute. The SIT software package computes the intelligibility percentage, which is defined as the percentage of words the listener transcribed correctly. Speaking rate is computed by dividing the number of words in each sentence by the time taken to speak each sentence, yielding a value in words per minute. The values for intelligibility percentage and speaking rate can be combined to provide communication efficiency, defined as the rate at which intelligible words are spoken [135].
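The sketch below (a minimal illustration, not the SIT software) shows how these metrics, and the log-transformed variants defined in the glossary, combine. How scores are pooled across the 11 sentences is an assumption here, and the variable names are invented.

    import numpy as np

    def sit_metrics(words_correct, words_total, durations_min):
        """Illustrative SIT-style metrics; inputs are per-sentence lists."""
        correct = np.sum(words_correct)    # words the listener transcribed correctly
        total = np.sum(words_total)        # words actually spoken
        minutes = np.sum(durations_min)    # total speaking time, in minutes

        ip = 100.0 * correct / total                  # intelligibility percentage (IP)
        sr = total / minutes                          # speaking rate (SR), words/min
        isr = ip * sr / 100.0                         # intelligible speaking rate (ISR)
        tli = 50.0 * (-np.log10(101.0 - ip) + 2.0)    # Tl(I) = 50[-log10(101 - IP) + 2]
        tlisr = tli * sr / 100.0                      # Tl(I)SR, communication efficiency
        return ip, sr, isr, tli, tlisr

    # Toy example with 3 sentences (the SIT uses 11).
    print(sit_metrics([8, 9, 10], [8, 10, 12], [0.05, 0.06, 0.07]))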

Problems associated with the SIT are that the intelligibility score is still subjective, as it is based on an individual listener's understanding of a subject's speech, and that the test is labor-intensive, requiring that a listener transcribe each of the 11 sentences from each session and record the length of each sentence. Further, the SIT does not provide insight into which components of the speech production system are not functioning properly. Given the shortcomings of the SIT combined with the issues with perceptual ratings, we conclude that there is a need for more accurate, quantitative assessments of the type and severity of dysarthria, informed by the contributions of speech features to intelligibility and speaking rate.

1.2 Approach

To provide more objective measurements of intelligibility and speaking rate and to provide insight regarding which components of the speech production system are not functioning properly, researchers extract acoustic, aerodynamic, nasometric, and/or kinematic features [41,140]. Of these types of features, acoustic features can be extracted most easily; only a microphone and a recording device are needed. Thus, for a speech assessment system that is fast, noninvasive, and easy to use, acoustic data lends itself most readily to this application.

Of the speech subsystems, we focus on the articulatory subsystem for two reasons: (1) although acoustic data is useful in the representation of the articulatory, phonatory, and prosodic subsystems, it is difficult to represent the respiratory and resonatory subsystems using only acoustic data, and (2) among the subsystems, the articulatory subsystem has the strongest association with intelligibility loss due to ALS [22,95,97]. Common descriptors used by speech-language pathologists (SLPs) to characterize articulatory impairments due to neuromotor impairments include articulatory imprecision [14,20,21] and inconsistency [66,77,100,125]. Quantitative metrics of articulatory impairments that correlate with intelligibility and/or ALS severity have also been extracted from speech acoustics [15,22,40,47,54,58,60,76,77,83,86,95,97,114,119,122,124,138,141] and/or directly from the articulator movements [59,65,76,77,79,95,97,119,125,138,139,141].

Although there has been a vast amount of research associating a decrease in intelligibility in ALS speech with formant features, particularly with reduced formant 2 (F2) slope [54,60,83,122,124,141] and reduced vowel space [102,114,117,124], there are three main gaps in current knowledge. The first issue is the time required for the process of manually annotating or hand-correcting automatically tracked formants. Further, there is a lack of validation that the methods employed to extract formants from dysarthric speech are accurate.

The second gap in current knowledge is that the number of acoustic features explored to date has been small. More recently developed speech features have not been applied for the purpose of better understanding the various speech deficits associated with ALS.

The third gap in current knowledge is the lack of a framework relating the auditory percepts clinicians commonly use to describe dysarthria to intelligibility and speaking rate. Clinicians have reported that dysarthric speech is perceptually characterized by poorly specified [14,20,21] and inconsistent [20,21,77] sound production patterns. Mefferd et al. (2010) [77] have hypothesized that as sound production patterns become more poorly specified and less consistent due to dysarthria, intelligibility and speaking rate decrease, as shown in Figure 1.2.

[Figure 1.2 here: axes are phonetic variability (horizontal) and speech clarity (vertical); curves are labeled Ideal, Typical, and Dysarthric.]

Figure 1.2: The hypothesized relationship between phonetic specification and phonetic variability with respect to speech clarity, defined for dysarthric speech as intelligibility, speaking rate, and communication efficiency. From Mefferd and Green (2010) [77].

Many studies have focused on acoustic formant features or kinematic articulatory features because they are strongly associated with intelligibility [22,95,97]. In the acoustic domain, articulatory specification has been defined by the acoustic distance between vowels, and inconsistent articulation has been defined by a metric called the spatiotemporal index [108]. Although Mefferd et al. (2010) [77] defined specified articulation by acoustic distance between vowels and inconsistent articulation by the spatiotemporal index, we hypothesize that other features will predict intelligibility loss, speaking rate, and communication efficiency decline, and it is unknown how these other features interact with acoustic distances and spatiotemporal indices to produce the percept of poorly specified and inconsistent articulation. Additionally, the framework lacks a variable for acoustic coupling (e.g., patterns of the movement of the formant trajectories relative to one another) between the first two formant trajectories, which may also be affected by ALS. Together, we will refer to articulatory specificity, coupling, and consistency as the three domains of articulatory degradation due to ALS.
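To make the coupling notion concrete, here is a minimal sketch (not the thesis implementation) of one such measure: the zero-lag Pearson correlation between the F1 and F2 trajectories over a sliding window. The frame rate and the 90-ms window length are illustrative assumptions.

    import numpy as np

    def sliding_zero_lag_corr(f1, f2, fs=100, win_ms=90):
        """Zero-lag Pearson correlation between F1 and F2 over a sliding window."""
        f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
        w = max(2, int(round(win_ms * fs / 1000)))       # window length in frames
        r = np.full(len(f1), np.nan)
        for t in range(len(f1) - w + 1):
            a, b = f1[t:t + w], f2[t:t + w]
            if a.std() > 0 and b.std() > 0:              # correlation undefined for flat windows
                r[t + w // 2] = np.corrcoef(a, b)[0, 1]
        return r

    # Toy trajectories: anti-phase F1/F2 oscillations, loosely mimicking vowel alternation.
    t = np.arange(0, 1, 0.01)
    f1 = 500 + 150 * np.sin(2 * np.pi * 3 * t)
    f2 = 1500 - 400 * np.sin(2 * np.pi * 3 * t)
    print(np.nanmean(sliding_zero_lag_corr(f1, f2)))     # near -1: strongly anti-phase

Under this kind of measure, the hypothesis above corresponds to the locally negative F1-F2 correlations of healthy speech drifting toward positive (in-phase) values as articulation degrades.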

1.3 Objectives

Motivated by the current state of dysarthria assessment, our objectives are as follows:

1. Validate that automatically estimated formant trajectories are sufficiently similar to manually estimated formant trajectories.

2. Identify and extract acoustic features, inspired by hypothesized or previously studied impairments in the articulators, that may contribute to decline in sentence intelligibility, speaking rate, and/or communication efficiency due to ALS.

3. Obtain the best possible prediction of sentence intelligibility, speaking rate, and communication efficiency (a sketch of this prediction setup follows the list below).

4. Identify which domain(s) contribute the most to sentence intelligibility, speaking rate, and communication efficiency prediction.
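As referenced in objective 3, the following is a rough sketch of such a prediction experiment using scikit-learn. The feature matrix and target are random stand-ins, the fold structure ignores the subject/session grouping handled in Chapter 6, and all parameters are illustrative.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import ShuffleSplit

    rng = np.random.default_rng(0)
    X = rng.normal(size=(121, 342))          # stand-in: 121 sessions x 342 acoustic features
    y = 2 * X[:, 0] + rng.normal(size=121)   # stand-in for a SIM: Tl(I), SR, or Tl(I)SR

    for name, model in [("LR", LinearRegression()),
                        ("RF", RandomForestRegressor(n_estimators=100, random_state=0))]:
        corrs = []
        for train, test in ShuffleSplit(n_splits=100, test_size=0.2,
                                        random_state=0).split(X):
            model.fit(X[train], y[train])
            corrs.append(pearsonr(y[test], model.predict(X[test]))[0])
        # Mean/Std of the Pearson correlation between actual and predicted SIM values
        print(name, round(np.mean(corrs), 2), round(np.std(corrs), 2))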

1.4 Summary of Contributions

The main contribution of this thesis is that it provides a framework for semi-automatically assessing the severity of dysarthria, based on acoustic features that describe the articulatory deficits associated with dysarthria secondary to ALS. The features in the framework comprise three domains that describe the articulatory deficits: lack of specification, abnormal coupling, and inconsistency. Within each domain, we developed features based on prior literature, as well as on hypotheses motivated by observations of relations between the kinematics of the articulators and the acoustics of a healthy control. We used the framework to predict the intelligibility, speaking rate, and communication efficiency of subjects with varying degrees of ALS, and to identify the features and domains that were most important to the predictions. Although we developed the framework specifically for ALS, it may be applied to dysarthria secondary to other diseases, such as PD.

(31)

This thesis also revealed a novel finding: in one subject who exhibited speech degradation across sessions, the local Pearson correlation coefficients between the first two formant trajectories, which are moderately negative in healthy speech, became more in phase in disordered speech. This suggests that there may be increased coupling between the tongue and lips/jaw as speech degrades.

Another contribution of this thesis is that it appears to be the first to apply automated formant trajectory estimation to dysarthric speech and to measure the accuracy of the formant estimates. It is important for the formant trajectories to be as accurate as possible because conclusions drawn from any formant-based analysis that uses inaccurate trajectories are likely to be invalid. The effects may become amplified when various functions, such as velocity or acceleration, are applied to the formant trajectories. Further, if the features do not accurately reflect the true movements of the articulators, then our inferences about the articulators and their movements from the correlation and prediction experiments may not be accurate. Using the Kalman-based autoregressive moving average (KARMA) algorithm [81] to estimate the formant trajectories, we concluded that the formant tracking algorithm we used was sufficiently accurate to be used throughout the thesis.
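For background on where raw formant candidates come from before a tracker such as KARMA smooths them across time, the sketch below implements classical autocorrelation LPC followed by root-finding. It is not KARMA: the frame length, model order, and pruning thresholds are illustrative assumptions, and the tracking step is omitted entirely.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc_formants(frame, sr, order=12):
        """Raw formant candidates for one speech frame via LPC root-finding."""
        x = frame * np.hamming(len(frame))                 # taper the analysis frame
        r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lags 0..N-1
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # Yule-Walker solve
        roots = np.roots(np.concatenate(([1.0], -a)))      # roots of the LPC polynomial A(z)
        roots = roots[np.imag(roots) > 0]                  # keep one of each conjugate pair
        freqs = np.angle(roots) * sr / (2 * np.pi)         # pole angle -> frequency (Hz)
        bws = -sr / np.pi * np.log(np.abs(roots))          # pole radius -> bandwidth (Hz)
        keep = (freqs > 90) & (bws < 400)                  # crude pruning of non-formant poles
        return np.sort(freqs[keep])

    # Toy check: noise through a single resonance at 500 Hz yields a candidate near 500.
    sr, r0, f0 = 8000, 0.98, 500.0
    a_res = [1.0, -2 * r0 * np.cos(2 * np.pi * f0 / sr), r0 ** 2]
    frame = lfilter([1.0], a_res, np.random.default_rng(0).normal(size=240))
    print(lpc_formants(frame, sr))

Per-frame root-finding like this jitters and swaps formants across frames, which is exactly the failure mode that motivates Kalman-style tracking and the validation against manual annotations in Chapter 3.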

1.5 Thesis Outline

Including the introduction, this thesis is divided into seven chapters. Chapter 2 describes the database from which we obtained the speech data, as well as our inclusion criteria. It also discusses the methods used to calculate intelligibility, speaking rate, and communication efficiency.

Chapter 3 discusses KARMA [81], which we used to automatically extract the formants. It also discusses how we compared the KARMA-based formant estimates to manually labeled formant estimates, and reports the results.

Chapter 4 reports on the simultaneous collection of acoustic speech and articulatory kinematic data from a healthy subject. The subject spoke the same phrase as the PALS, which allowed us to hypothesize differences in the formant trajectories that we would expect from the PALS, and also provided us with insight into how the formant trajectories were related to the articulatory kinematics.

Chapter 5 details all of the features we extracted from the acoustic speech data of the PALS. It presents our hypotheses about the Pearson correlations between each feature and the speech impairment metrics, and provides results. It also relates the results to prior work and/or to speculations about the kinematics of the articulators in ALS speech.

Chapter 6 discusses the methods and results for the prediction experiments. It reports on the best prediction of intelligibility, speaking rate, and communication efficiency that we were able to obtain, and quantifies the quality of the predictions. It also quantifies the relative importance of each feature, and compares the accuracy of the intelligibility, speaking rate, and communication efficiency predictions of each of the three articulatory domains individually.

Finally, Chapter 7 discusses our contributions, the limitations of this thesis, and future work.


Chapter 2

Databases and Metrics

Longitudinal data from 123 subjects with ALS was obtained from a dataset collected by several NIH-funded research projects on bulbar motor decline (NIH-NIDCD grants R01DC009890 and R01DC0135470). Of the 123 subjects and the hundreds of sessions attended by the subjects collectively, we selected a subset of 33 subjects (121 sessions) to be included in the feature analysis and prediction experiments. This chapter details the comprehensive database (Database-C) of 123 subjects, the metrics used to evaluate the subjects' dysarthria severity, and the rationale for and composition of the subset of Database-C, named Database-121, which we use in this thesis.

2.1 Comprehensive ALS Database (Database-C)

The purpose of the comprehensive database was to provide data to assess bulbar impairment due to ALS. During each session, acoustic, aerodynamic, nasometric, and/or kinematic data were collected to assess at least one of the following subsystems: respiratory, phonatory, resonatory, and articulatory. In all or nearly all cases, while an acoustic measurement was being recorded, another type of measurement (i.e., aerodynamic, nasometric, or kinematic) was simultaneously made, which often involved the subject wearing a face mask, nasal clip, tongue/lip sensors and cables, or markers on the lips and face [140]. All subjects included in this study met the following criteria: (1) were diagnosed with possible, probable, or definite ALS according to the revised El Escorial criteria [10]; (2) did not have a history of any other congenital or acquired neurological disorder; (3) possessed hearing and vision capabilities sufficient to read the stimuli; and (4) possessed literacy skills sufficient to read the stimuli.

2.1.1 Sources of Variability

For each subject, the total number of sessions and time between each session were variable, as well as the tasks performed and content of the speech collected during each session. Attempts were made to collect data from the subjects at least every three to
