The Digital Symbol Digit Test: Screening for Alzheimer’s and Parkinson’s

by

Lauren Huang

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2017

© Massachusetts Institute of Technology 2017. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

June 9, 2017

Certified by . . . .

Randall Davis

Professor

Thesis Supervisor

Certified by . . . .

Dana L. Penney

Director of Neuropsychology, Lahey Hospital and Medical Center

Thesis Reader

Accepted by . . . .

Christopher J. Terman

Chairman, Masters of Engineering Thesis Committee


The Digital Symbol Digit Test: Screening for Alzheimer’s and Parkinson’s

by

Lauren Huang

Submitted to the Department of Electrical Engineering and Computer Science on June 9, 2017, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Neurodegenerative diseases affect the cognition of millions of people worldwide, degrading their quality of life and placing a burden on their families. Early identification can be extremely beneficial in treating or slowing down the onset of these diseases. One technique used to identify early warning signs is the use of cognitive tests. Unfortunately, grading these tests is subjective. In this study, we quantitatively evaluated the digital Symbol Digit Test (dSDT), in which patients translate symbols into digits based on a given mapping. In collaboration with Dr. Penney of Lahey Clinic, we administered the dSDT to over 170 patients using a digitizing pen that measures its position on the page and the pressure applied. We developed support vector machine and logistic regression classifiers that indicate Alzheimer’s Disease and Parkinson’s Disease with an area under the curve of 0.957 and 0.963, respectively.

Thesis Supervisor: Randall Davis
Title: Professor

Thesis Reader: Dana L. Penney


Acknowledgments

I would like to thank my supervisor, Prof. Randall Davis, whose wise advice made me a better researcher and who was always a fount of interesting research questions. I would also like to thank Dr. Dana Penney for her help in testing this work in the field and for her invaluable diagnostic insight.

I give special thanks to Jason Tong, whose hard work developing the code behind the Digital Symbol Digit Test system made this research possible today.

This work was supported in part by National Science Foundation Award IIS-1404494 and by the Robert E. Wise Research and Education Institution.


Contents

1 Introduction and Motivation

2 Background
   2.1 Diagnosis with cognitive tests
      2.1.1 The Clock-Drawing Test
      2.1.2 The Mini-Mental State Examination (MMSE)
      2.1.3 The Montreal Cognitive Assessment (MoCA)
   2.2 Diagnosis with handwriting analysis
   2.3 THink
   2.4 The Digital Clock Drawing Test
   2.5 The Symbol Digit Modalities Test (SDMT)
   2.6 The Digital Symbol Digit Test (dSDT)
      2.6.1 dSDT tasks
      2.6.2 dSDT design

3 Prior Work
   3.1 Patient classes
   3.2 Test administration
   3.3 Machine learning
      3.3.1 Models
      3.3.2 Experiments
      3.3.3 Features
      3.3.4 Performance

4 System robustness and performance
   4.1 Geometric changes
   4.2 Overfitting
      4.2.1 Analyzing feature skew
      4.2.2 Normalizing features
      4.2.3 General features
      4.2.4 Redundant features
      4.2.5 Adding new tests

5 New features and feature analysis
   5.1 Model-level feature analysis
      5.1.1 Top features
      5.1.2 Performance vs. complexity analysis
   5.2 Deep-dive feature analysis
      5.2.1 Off-centeredness features
      5.2.2 Hooklet features
      5.2.3 Stray marks
   5.3 Learning and cognitive load analysis
      5.3.1 Precell delay
      5.3.2 Per-patient analysis
      5.3.3 Stroke analysis

6 Results

7 Conclusion and Future Work

List of Figures

2-1 An example key for the top of the translation task.
2-2 An example portion of the translation task presented to the patient.
2-3 An example key for the top of the copy task.
2-4 An example portion of the copy task presented to the patient.
3-1 The centroid of this stroke, marked with a red X, is calculated by averaging the points in the stroke. (Test CIN0123365438, Stroke 57)
3-2 The bounding box, in red, contains the entire stroke. The bounding box center is marked with a red X. (Test CIN0123365438, Stroke 57)
4-1 Solid dots are points recorded by the pen. The dotted lines are straight segments between the points. The points shown in this stroke are compressed, making the time interval between points uneven. (Test CIN0123365438, Stroke 57)
4-2 Points have been interpolated between compressed points to make the time interval between consecutive points 13 ms. (Test CIN0123365438, Stroke 57)
4-3 Solid dots are points recorded by the pen. The points in this stroke are connected with straight lines, whereas natural handwriting consists of smooth curves. (CIN0123365438, Stroke 57)
4-4 The dotted line represents the pen path recreated with the linear interpolation method. The solid line represents the path recreated by applying a cubic spline to this stroke. The curves of the cubic spline are more similar to natural human writing. (CIN0123365438, Stroke 57)
4-5 An example of a cubic spline interpolation that is a poor approximation of the original stroke. The noisy, jittery points at the top of the 1 caused the spline to loop back and forth. (SNF1305604878, Stroke 77)
4-6 A close-up view of a poor cubic spline interpolation of this 1. The resolution of the pen is visible here; the points fall along a grid. The cubic spline, in blue, loops back and forth excessively, causing the length of the spline to be 10% longer than the linear interpolation. (SNF1305604878, Stroke 77)
4-7 A linear interpolation better fits the densely clustered points in this stroke. (SNF1305604878, Stroke 77)
4-8 A close-up view of a linear interpolation of this 1. Instead of looping back and forth, the linear interpolation simply connects the points. (SNF1305604878, Stroke 77)
5-1 An example of a hooklet between the digits of an 11. The hooklet is drawn in red. (CIN0123365438, Stroke 13)
5-2 A hooklet between the digits of an 11 with a projected line fit to it, drawn in red. The perpendicular distance between the projection and the next stroke is shown in cyan. (CIN0123365438, Stroke 13)
5-3 An example of an extracell stroke, drawn in blue. An in-cell stroke is shown in red. The cell boundaries are drawn in black. (CIN1937073217, Stroke 83)
5-4 An example of a thinking point is drawn in blue. An example of a normal stroke is drawn in red. (CIN2045945427, Stroke 12)
5-5 A pre-stroke rest and a post-stroke rest on a drawing of a 1. The drawn 1 is shown in the left plot and the speed profile of the 1 is shown in the right plot. The pre-stroke rest is plotted in cyan; the post-stroke rest is plotted in magenta. (CIN0173214264, Stroke 25)
5-6 An example of a speed profile of one of the 1s of an 11. The drawn 1 is shown in the left plot and the speed profile of the 1 is shown in the right plot. Marked in red is the part of the speed profile corresponding to a hooklet. (CIN0123365438, Stroke 13)
5-7 An example of a direction profile of one of the 1s of an 11. The drawn 1 is shown in the left plot and the direction profile of the 1 is shown in the right plot. Marked in red is the part of the direction profile corresponding to a hooklet. (CIN0123365438, Stroke 13)
5-8 An example of a speed profile in which the patient went back and forth over their digits repeatedly, creating a distinctive spiked or peaked speed profile. (CIN0006751695, Stroke 11)
5-9 A comparison of the speed profiles of two tens digit 1s: the top from the translation task and the bottom from the copy task. The copy task speed profiles are less densely populated and are less likely to have hooklets. (CIN0123365438, Strokes 20 and 152)


Chapter 1

Introduction and Motivation

Neurodegenerative diseases affect millions of people worldwide, degrading their quality of life and placing a burden on their families. The class of neurodegenerative diseases includes Alzheimer’s Disease (AD), various forms of dementia, Parkinson’s Disease (PD), Huntington’s Disease, amyotrophic lateral sclerosis, and other related diseases. The National Institute of Environmental Health Sciences estimates that 5 million Americans are living with AD and at least 500,000 Americans are living with PD [9].

Early diagnosis can be extremely beneficial in treating or slowing down the onset of these diseases. Researchers have developed many techniques over the years to identify warning signs of neurodegenerative diseases. One such technique involves giving patients cognitive tests. These tests may include tasks like drawing figures and remembering digits. Cognitive tests are widely used and have been a fairly successful method of diagnosis, because they can capture subtle signs of neurodegeneration in very simple tasks. A study of one cognitive test, the clock-drawing test, found that the test was able to detect dementia with a true positive rate of 76% and true negative rate of 81%, even among mild dementias [4]. Although these tests are easy and quick to administer, they suffer from the qualitative nature of the tasks and their evaluation. In one form of the clock-drawing test, for example, patients are asked to draw a picture of a clock in a pre-drawn circle and set the time at 10 after 11. The test evaluator must identify “minor visuospatial errors” such as “mildly impaired spacing of times,” which is difficult for many different evaluators to judge consistently and without bias [14].

This work builds upon the THink system, which makes the diagnosis process more objective by digitizing cognitive tests and analyzing test results using machine learning. We use a digitizing pen to accurately capture a patient’s test responses, which allows us to replay the test, compute precise features from the test, and rapidly analyze the test multiple times. Our computed features are more powerful than manual grading of tests because they can take into account time data and capture subtle trends not discernible to the human eye.

In this work, we describe a new digitized cognitive test, the digital Symbol Digit Test (dSDT), and evaluate its use in identifying two classes of neurodegenerative diseases: memory conditions (AD and mild cognitive impairment) and PD. After administering the dSDT to 172 patients, we trained a machine learning classifier on the test data. We report an F1-score of 0.868 and an area under the curve (AUC) of 0.957 for distinguishing memory conditions from all other conditions (including healthy) in the data set. We also report an F1-score of 0.908 and an AUC of 0.963 for similarly distinguishing PD from all other conditions, making the dSDT an accurate, automated, and robust process for scoring cognitive tests.


Chapter 2

Background

2.1 Diagnosis with cognitive tests

Several tests that assess cognitive impairment are the clock-drawing test, the Mini-Mental State Examination (MMSE), and the Montreal Cognitive Assessment (MoCA). These tests are used as cognitive screening tests to detect cognitive impairment, a symptom associated with AD and PD [17]. The MMSE and MoCA, along with other tests, are used by clinicians as one source of guidance in diagnosing their patients.

2.1.1 The Clock-Drawing Test

One popular cognitive test is the clock-drawing test [15], in use for more than fifty years. The test occurs in two phases: the “command” phase and the “copy” phase. In the command phase, the patient is asked to draw on a blank page a clock showing 10 minutes after 11. In the copy phase, the patient is asked to copy a pre-drawn clock showing 10 minutes after 11. The command phase is meant to test language (the patient must determine the meaning of “10 minutes after 11”) and memory (the patient must remember what a clock looks like). The copy phase focuses on spatial planning and executive function, since the patient has a clock to copy [2].


2.1.2 The Mini-Mental State Examination (MMSE)

The MMSE is meant to be a quick evaluation of cognitive state in older patients. The most frequently used such exam [15], the MMSE comes in two parts. The first part involves responding to oral questions intended to gauge orientation, memory, and attention. The second part tests the patient’s ability to name objects, follow verbal and written commands, write a sentence spontaneously, and draw a complex shape. The test is not timed but requires only 5-10 minutes to administer. It focuses only on cognitive aspects of mental functions [3].

2.1.3 The Montreal Cognitive Assessment (MoCA)

The MoCA covers several cognitive domains: visuospatial/executive function, naming, memory, attention, language, abstraction, delayed recall, and orientation. The visuospatial/executive function portion of the test includes a clock-drawing task. The MoCA takes approximately 10 minutes to administer [8]. The test was developed to better characterize mild cognitive impairment (MCI) than tests like the MMSE, which suffers from a ceiling effect with respect to distinguishing individuals with pre-dementia from normal individuals [18]. Where the MMSE has a sensitivity of 18% to detect MCI, the MoCA achieves a sensitivity of 78% [8].

2.2 Diagnosis with handwriting analysis

Handwriting analysis has also been a promising route for early diagnosis of neurodegenerative disease. Patients with PD often exhibit micrographia (small lettering) and other impairments in their handwriting, making it a favorable signal for early diagnosis [19]. In addition to decreased letter size, impairments can also occur in the force, velocity, and fluency of writing [20]. These factors are not so easily detectable by the human eye. Machine learning on handwriting data may be helpful for discerning these subtler patterns. Brewer et al. successfully classified 85% of patients with and without PD using a support vector machine (SVM) trained on a carefully crafted feature set of force data [1]. Pereira et al. [11] reported accuracies ranging from 80-85% using convolutional neural networks on a dataset including data from a smart pen similar in many aspects to the Anoto pen described below.

2.3 THink

In this study, we work with THink, a system in which machine learning is applied to the subtle signals captured by cognitive tests to produce a robust, quantitative, and easy-to-use system to screen for neurodegenerative diseases. THink is the subject of U.S. Patent 8,740,819. The THink project uses an off-the-shelf digitizing pen from Anoto, Inc. to administer cognitive tests. The pen records its position on the page once every 13 ms with a resolution of ±0.005 cm. The test data is then transferred to a computer for analysis [16]. The pen provides data two orders of magnitude more precise than previously available, which opens up a new wealth of currently inaccessible information [2].

THink provides several advantages over a traditional pen-and-paper test. Where a paper test’s evaluator usually looks only at the patient’s drawing process and response after the test is completed, THink captures the pen’s position at every point in the test. That means it can analyze important timing information inaccessible to manual evaluators, such as the order in which pen strokes were made and how long the patient took to complete a part of the test: every stroke and pause is recorded and timestamped. The spatial resolution of the data from the pen allows THink to detect warning signs (e.g. the shakiness of a patient’s handwriting or the off-centeredness of the test responses) at a much more precise level than the human eye can. Because the data from a patient’s test is stored digitally, future analyses are easily run; the test does not need to be administered again in order to run a new metric on it. Additionally, the machine learning-driven analysis of THink also allows for quick, automated analysis, as opposed to the often tedious grading that comes with a paper exam [16].


2.4 The Digital Clock Drawing Test

Work done previously by Prof. Randall Davis, Dr. Dana Penney, William Souillard-Mandar, and others in a collaboration between the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and Lahey Clinic has had great success in developing the THink system for use with the clock-drawing test. The resulting test is known as the digital clock-drawing test (dCDT). Using a variety of machine learning methods, the dCDT achieved an AUC of up to 0.93 in distinguishing memory impairment disorders from healthy controls, up to 0.88 in distinguishing vascular cognitive disorders from healthy controls, and up to 0.91 in distinguishing PD from healthy controls. dCDT performance was better than the performance of existing clock-drawing test scoring algorithms. The work with the dCDT also produced classifiers based on more intuitive, understandable models that physicians could interpret, instead of simply a machine learning black box. This was important for getting physicians comfortable with using the dCDT [16].

We built on this previous work by applying the THink system to another test: the Symbol Digit Test (SDT), a cognitive test targeted at identifying a patient’s ability to learn. This test (the SDT combined with THink) is known as the digital Symbol Digit Test (dSDT).

2.5 The Symbol Digit Modalities Test (SDMT)

The Symbol Digit Modalities Test (SDMT) is the basis for the SDT used in this study. The SDMT presents nine different symbols corresponding to the numbers 1 through 9. The test-taker first practices writing the correct number under its corresponding symbol, and then fills out the blank spaces under a set of symbols with their corresponding numbers. The SDMT generally takes around 90 seconds, which is advantageous for situations where a brief cognitive test is useful. The patient is given a score, calculated by tallying the number of correct responses. This score can then be compared against SDMT norms to help reach a diagnosis [13].


The SDMT assesses key substitution task functions, including attention, visual scanning, and motor speed [13]. It has been widely used to measure cognitive impairment in neurodegenerative diseases like multiple sclerosis, thanks to its short length and unambiguous grading [7, 12]. The SDMT achieves a high sensitivity of 0.91 for identifying cognitive impairment in multiple sclerosis, while also providing a specificity of 0.60 [12].

2.6 The Digital Symbol Digit Test (dSDT)

2.6.1 dSDT tasks

The Digital Symbol Digit Test (dSDT), developed by Dr. Dana Penney and Prof. Randall Davis, is a variation of the SDMT. The dSDT involves three tasks. The first is a translation task, in which the patient is asked to translate a series of symbols into corresponding numbers, as depicted in Figure 2-2. A key with pairs of symbols and numbers is provided at the top of the test (see Figure 2-1). The second task is a copy task, in which the patient is asked to copy a series of numbers, as depicted in Figure 2-4. An example key with copied numbers is provided at the top of the copy task (see Figure 2-3). The third task is a delayed recall task, in which the patient is presented with a permutation of the original six symbols and asked to translate those into numbers. The translation key is no longer visible at this portion of the test, so the patient must rely on their memory to translate the symbols. Both the translation and copy tasks are prefaced by a short practice section, in which the patient can learn how to do the task correctly. The practice section is not included in the later analysis of the test.

The translation task tests the patient’s higher-order thinking; they are required to recognize shapes from the legend and translate them into numerals. It also tests attention, visual scanning, and motor speed in the same manner as the SDMT.

The purpose of the copy task in the dSDT is to provide a baseline for each patient’s writing speed, allowing the effect of writing speed to be extracted from the time spent in the translation task. This isolates the time the patient spends on doing the actual translation, which is what we are interested in analyzing.

Figure 2-1: An example key for the top of the translation task.

Figure 2-2: An example portion of the translation task presented to the patient.

Figure 2-3: An example key for the top of the copy task.

The delayed recall task tests whether the patient has internalized the mappings from symbol to digit: in effect, whether the patient has learned the mappings instead of just reading the translations off the key each time.

The delayed recall task tests the patient’s learning and memory; if the patient learned the test’s symbol-digit mappings in the translation task, they will get more mappings correct in the delayed recall section. While the patient is encouraged to complete the translation and copy tasks in order and as quickly as possible, they are free to complete the delayed recall section in any order and are given no time limit.

2.6.2 dSDT design

The dSDT is designed to capture evidence of learning throughout the course of the test, with the goal of screening for memory impairments. All six symbols appear equally often in the translation task, no symbol appears twice in a row, and the test is designed so that the motor load is identical in both the translation and the copy tasks.

The entire dSDT can be administered in 4-5 minutes; completion times for the dSDT among our patients ranged from 2.4 minutes to 15.2 minutes. The test takes longer than the SDMT primarily because of the added delayed recall segment, which captures learning and memory information that the SDMT does not.


Chapter 3

Prior Work

Prior to this work, Tong and Davis integrated the THink machine learning classifier with the dSDT, training the classifier on a total of 87 dSDTs. This chapter describes the previous work they did with the dSDT, which we will refer to as the Tong model and which this study builds on.

3.1 Patient classes

The dSDT was administered to patients from four different subject classes: Healthy Control (Healthy), AD+MCI, PD, and Miscellaneous (Misc). The Healthy class consisted of patients who were not diagnosed with any neurodegenerative diseases, the AD+MCI class consisted of patients diagnosed with Alzheimer’s Disease or an amnestic Mild Cognitive Impairment, the PD class included patients diagnosed with Parkinson’s Disease, and the Misc class included subjects with a variety of conditions. Miscellaneous conditions included multiple sclerosis, Huntington’s Disease, epilepsy, depression, and other conditions that could potentially impair cognitive ability.

3.2 Test administration

The dSDT was administered to 67 patients using an off-the-shelf Anoto pen. Patients taking the test were instructed to complete the task as quickly and accurately as possible, beginning from the first cell and continuing to the next one until they reached a stopping point in the test. Some of the dSDTs in the input set were given to the same patients, either as back-to-back tests or as tests at a follow-up examination. Back-to-back tests were given within a few minutes of each other.

3.3 Machine learning

3.3.1 Models

Each dSDT was put into one of the four patient classes noted above, based on the consensus diagnosis provided by clinicians. The Tong model used two models for classification: a linear support vector machine (SVM) and a logistic regression (LR). Because of the small number of input tests, simple models were preferred. The SVM and LR were also advantageous because they were transparent to the end user: a clinician using the dSDT as a diagnosis tool would like to understand the signals that suggested one diagnosis or another. In these models, a user can easily view the features that contributed to a diagnosis, unlike a more opaque method like deep learning.

To help prevent overfitting on the small amount of input data, the Tong model also trained the classifier with three-fold stratified cross-validation: input data was partitioned into three subsets, each chosen to be representative of the whole data. One subset of the data was used as the validation set and the other two were used as training sets. The validation set was then rotated through the other two subsets, adjusting the training sets accordingly.
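As an illustration, here is a minimal sketch of this cross-validation scheme using scikit-learn; the feature matrix X and labels y are random placeholders standing in for the extracted dSDT features, not the study's data:

    # Sketch of three-fold stratified cross-validation with a linear SVM.
    # X and y are random placeholders, not the study's data.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(87, 50))      # placeholder: 87 tests, 50 features
    y = rng.integers(0, 2, size=87)    # placeholder binary labels

    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    fold_aucs = []
    for train_idx, val_idx in skf.split(X, y):
        clf = SVC(kernel="linear")     # linear SVM, as in the Tong model
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.decision_function(X[val_idx])
        fold_aucs.append(roc_auc_score(y[val_idx], scores))
    print("per-fold AUCs:", fold_aucs)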

3.3.2 Experiments

The Tong model ran a classification experiment between every pair of patient classes, as depicted in Table 3.1: Healthy vs. AD+MCI, Healthy vs. PD, Healthy vs. Misc, AD+MCI vs. PD, AD+MCI vs. Misc, and PD vs. Misc. In addition, the Tong model also defined the Others class of patients, which included all patients not in the patient class of interest. For example, a Healthy vs. Others experiment would try to distinguish Healthy patients from all the patients not classified as Healthy. Experiments were run between each patient class and Others: Healthy vs. Others, AD+MCI vs. Others, PD vs. Others, and Misc vs. Others.

         AD+MCI  PD  Misc  Others
Healthy  ∗       ∗   ∗     ∗
AD+MCI           ∗   ∗     ∗
PD                   ∗     ∗
Misc                       ∗

Table 3.1: The classification experiments performed in this study are marked with ∗.
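A minimal sketch of how the binary labels for one such X vs. Others experiment could be constructed (the diagnoses list and helper name are ours, for illustration):

    # Label tests in the class of interest 1 and everything else 0.
    # `diagnoses` is a hypothetical list of per-test consensus diagnoses.
    diagnoses = ["Healthy", "AD+MCI", "PD", "Misc", "AD+MCI", "PD"]

    def one_vs_others(diagnoses, positive_class):
        """Binary labels for a '<positive_class> vs. Others' experiment."""
        return [1 if d == positive_class else 0 for d in diagnoses]

    y_ad = one_vs_others(diagnoses, "AD+MCI")   # AD+MCI vs. Others
    y_pd = one_vs_others(diagnoses, "PD")       # PD vs. Others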

Of the various experiments, the most critical were AD+MCI vs. Others and PD vs. Others, since they provide a measure of sensitivity and are the models that a clinician could use while screening for AD and PD.

3.3.3 Features

The Tong model extracted features from raw Anoto pen data, which provides timestamped data points with position and pressure information. Features fell into two categories: direct features and derived features. Direct features were computed directly from pen stroke data. These were further subdivided into six subcategories: time, stroke attribute, stroke type, geometric, correctness, and delayed recall features. These are described in the following sections. Derived features were calculated by aggregating direct feature values from multiple cells.

Time features

Time-based features are a powerful way to estimate a patient’s cognitive load. These features include precell delay, inktime, writetime, and thinktime.

The precell delay of a cell is the time elapsed between the last stroke of the previous cell and the first stroke of the current cell. During the precell delay time, the test-taker may be physically moving the pen from the previous cell to the current one and simultaneously moving their visual focus from the previous cell to the current cell’s symbol. In the translation task, the test-taker may also be checking the translation key, scanning for the corresponding symbol, and finally returning to the cell to be completed. However, if the subject begins to learn the translation key, the time to visually check and scan the translation key can be replaced with the time to recall the appropriate match.

In the copy task, the precell delay is primarily the time taken to physically move from one cell to the next. If the cell is the first in its row, it is not assigned a precell delay, since the time taken to move from the end of the prior row to the beginning of the current row would confound the usual precell delay factors.

Cell inktime measures the time the pen spends in contact with the paper in a particular cell. Writetime is the time taken to complete a cell, including the pauses in between consecutive strokes in that cell. Only the pauses in between consecutive strokes are taken into account, to avoid counting the time elapsed if the test-taker moves on to another cell and returns to make a correction in the current cell. Thinktime is the sum of the precell delay and the time between consecutive strokes in a cell.
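To make these definitions concrete, here is a minimal sketch of the four time features, assuming each stroke is a list of timestamped samples and each cell stores its strokes in time order; the data layout and names are illustrative, not THink's actual code:

    from collections import namedtuple

    # One timestamped pen sample (t in ms); the layout is illustrative.
    Sample = namedtuple("Sample", ["t", "x", "y", "pressure"])

    def inktime(cell_strokes):
        """Total time the pen spends in contact with the paper in the cell."""
        return sum(stroke[-1].t - stroke[0].t for stroke in cell_strokes)

    def writetime(cell_strokes):
        """Inktime plus the pauses between consecutive strokes. Simplified:
        assumes cell_strokes holds only the cell's consecutive strokes,
        excluding corrections made after leaving the cell."""
        return cell_strokes[-1][-1].t - cell_strokes[0][0].t

    def precell_delay(prev_cell_strokes, cell_strokes):
        """Time from the last stroke of the previous cell to the first
        stroke of the current cell."""
        return cell_strokes[0][0].t - prev_cell_strokes[-1][-1].t

    def thinktime(prev_cell_strokes, cell_strokes):
        """Precell delay plus the time between consecutive strokes."""
        pauses = writetime(cell_strokes) - inktime(cell_strokes)
        return precell_delay(prev_cell_strokes, cell_strokes) + pauses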

Stroke attribute features

Stroke attribute features include features measuring stroke inklength, pressure, velocity, and compressed segments.

The inklength of a stroke is the sum of the lengths of all the segments making up the stroke, calculated from the pen position data at each datapoint. Inklength is calculated on both the raw segments of the stroke and on the spline fit to the stroke, forming the raw length and spline length features. In addition, the ratio of the two is calculated to create the spline over raw feature.

From the inklength, the Tong model calculated the velocity of the pen along the stroke segments. The model calculated the average and maximum ink velocity in each cell, as well as the variance in velocity in the cell. Similarly, the average, maximum, and variance in pen pressure was calculated in each cell.
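A minimal sketch of the inklength and velocity statistics, using the same illustrative sample layout as the sketch above:

    import math

    def inklength(stroke):
        """Sum of the segment lengths between consecutive samples."""
        return sum(math.hypot(b.x - a.x, b.y - a.y)
                   for a, b in zip(stroke, stroke[1:]))

    def segment_velocities(stroke):
        """Ink velocity along each segment: length over elapsed time."""
        return [math.hypot(b.x - a.x, b.y - a.y) / (b.t - a.t)
                for a, b in zip(stroke, stroke[1:]) if b.t > a.t]

    def cell_velocity_features(cell_strokes):
        """Average, maximum, and variance of ink velocity in a cell
        (assumes at least one moving segment)."""
        v = [u for s in cell_strokes for u in segment_velocities(s)]
        mean = sum(v) / len(v)
        var = sum((u - mean) ** 2 for u in v) / len(v)
        return {"avg_velocity": mean, "max_velocity": max(v),
                "var_velocity": var}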


The Anoto pen compresses linear segments: when multiple consecutive datapoints fall along a line, only the endpoints of the line are recorded. The number of compressed segments is therefore a measure of the linearity of a stroke. This phenomenon is further discussed in Section 4.1. Compressed segments are used for two features: the number of compressed segments per unit of inklength indicates how linear subsections of the stroke are, while the ratio of the longest compressed segment to the inklength indicates how linear the stroke is overall.

Stroke type features

Stroke type features are counts of a particular type of stroke in each cell. The Tong model counted the total number of strokes per cell and total number of correction strokes per cell.

If the patient writes a response in one cell, moves on, and then returns later to the original cell to change their response, this is called a correction stroke. Although this method of detection captures only those corrections made after moving to another cell, it is an easy and robust way to capture at least some corrections. The number of corrections in each cell can be counted to see how often a patient had to revise an answer, usually because it was incorrect. This provides an additional measure of how accurately the patient is performing: the correction is not reflected in their grade since only the patient’s final response is considered when grading their test.

Geometric features

Geometric features are calculated from geometric aspects of the strokes in each cell, and measure response size, response off-centeredness, and stroke slope.

The width and height features of a cell are the difference between the minimum and maximum x value and the difference between the minimum and maximum y value in the cell, respectively. These were further combined into the area of the bounding box containing the points in the cell. This is a measure of the size of the test-taker’s handwriting.

Two measures of how well a response is centered within the cell are centroid-cellcenter distance and boxcenter-cellcenter distance. The centroid is calculated by averaging the coordinates of the points in the cell (see Figure 3-1). The boxcenter is the center of the bounding box containing the response (see Figure 3-2).

Figure 3-1: The centroid of this stroke, marked with a red X, is calculated by averaging the points in the stroke. (Test CIN0123365438, Stroke 57)

Figure 3-2: The bounding box, in red, contains the entire stroke. The bounding box center is marked with a red X. (Test CIN0123365438, Stroke 57)

In order to investigate handwriting variation between the translation and copy task, the Tong model added a slope of least squares fit line feature. This feature is targeted at comparing how the digit 1 is written (which appears in the responses 1, 11, 13, and 15). A least squares line is fit to either the tens place or ones place, depending on the cell response, and the slope of the line in radians is used as a feature. If the response does not have a 1 digit, the feature returns -1.
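A minimal sketch of these geometric features; `points` is all pen samples in a cell and `cell_center` the known center of that cell, both illustrative names:

    import math

    def centroid(points):
        """Average of the point coordinates (see Figure 3-1)."""
        return (sum(p.x for p in points) / len(points),
                sum(p.y for p in points) / len(points))

    def boxcenter(points):
        """Center of the bounding box of the response (see Figure 3-2)."""
        xs = [p.x for p in points]
        ys = [p.y for p in points]
        return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

    def geometric_features(points, cell_center):
        cx, cy = cell_center
        gx, gy = centroid(points)
        bx, by = boxcenter(points)
        xs = [p.x for p in points]
        ys = [p.y for p in points]
        return {
            "width": max(xs) - min(xs),
            "height": max(ys) - min(ys),
            "centroid_cellcenter_dist": math.hypot(gx - cx, gy - cy),
            "boxcenter_cellcenter_dist": math.hypot(bx - cx, by - cy),
        }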

Correctness features

Correctness features are derived from the test-taker’s response in each cell. The Tong model considered four classes of correctness: correct, incorrect, invalid, or omitted. A cell is correct if its response matches the answer key; a cell is incorrect if its response does not match the answer key but is in the set of valid answers (1, 2, 3, 11, 13, 15); a cell is invalid if its response is not in the set of valid answers; a cell is omitted if the test-taker did not write a response in it. Each patient’s responses were hand-graded to ensure that the correct intent of the patient was captured, since many tests had cross-outs and poor handwriting that a typical digit classifier would struggle with. In addition, some patients wrote unexpected invalid responses—for example, one patient copied a few shapes instead of drawing them, a behavior difficult for a digit classifier to characterize.
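A minimal sketch of the four correctness classes (response strings are assumed to come from the hand-grading step described above):

    VALID_ANSWERS = {"1", "2", "3", "11", "13", "15"}

    def correctness(response, answer):
        """Classify a cell as correct, incorrect, invalid, or omitted."""
        if not response:
            return "omitted"
        if response == answer:
            return "correct"
        if response in VALID_ANSWERS:
            return "incorrect"
        return "invalid"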

Delayed recall features

Delayed recall features are based on responses in the delayed recall task. While the test-takers were instructed to complete the rest of the test in order, they were not given any ordering requirements in the delayed recall section. The ordering of delayed recall responses therefore provides some insight into which responses the test-taker was most confident about, since they will likely complete the cells they are sure of first. The Tong model included the ordering of both the first stroke in each cell and the last stroke in each cell, since people often returned to a cell more than once to revise their answer.

Some patients did not write the original set of numerals (1, 2, 3, 11, 13, 15) into the delayed recall section. One frequent set of responses was the numbers 1-6, suggesting that the patient gave up on remembering the actual ordering. To see whether this tendency was diagnostic, the Tong model included a feature indicating whether the test-taker wrote 1-6 in their delayed recall answers. The Tong model also included a feature indicating whether the test-taker wrote 0-5 in their delayed recall answers, since that appeared in one test.

Derived features

From these direct features, the Tong model also computed derived features—aggregate features and difference features. Features are aggregated at several levels of resolution to capture more global trends in the dSDT. Continuous features (e.g. inktime or inklength) are aggregated as an average. Discrete features (e.g. cell correctness or number of compressed segments) are aggregated as a count. Delayed recall features are not aggregated, since they are already calculated across the delayed recall task.

Features are aggregated at the row, symbol, task, and test level. For example, a row-aggregated feature might be the average variance in velocity in row 4. An example of a symbol-aggregated feature is the average inktime for all stars. A task-aggregated feature could be the number of invalid responses in the translation task.

In addition, the Tong model calculated difference features to capture differences between the translation and copy task. For each translation cell with a corresponding copy cell, that copy cell’s cell-level features are subtracted from the translation cell’s features. Difference features are computed only for continuous features. An example of a difference feature might be the difference in thinktime for cell 15.
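A minimal sketch of the two kinds of derived features (dictionary layouts and values are illustrative):

    def aggregate_continuous(cell_values):
        """Aggregate a continuous feature (e.g. inktime) as an average."""
        return sum(cell_values) / len(cell_values)

    def aggregate_discrete(cell_values, value):
        """Aggregate a discrete feature (e.g. correctness) as a count."""
        return sum(1 for v in cell_values if v == value)

    def difference_feature(translation_cell, copy_cell, name):
        """Translation-cell feature minus the matching copy-cell feature
        (computed only for continuous features)."""
        return translation_cell[name] - copy_cell[name]

    # Example: the difference in thinktime for one cell (made-up values, ms).
    t_cell = {"thinktime": 2100.0}
    c_cell = {"thinktime": 800.0}
    diff_thinktime = difference_feature(t_cell, c_cell, "thinktime")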

3.3.4 Performance

The models were evaluated on two metrics: the AUC of the receiver operating characteristic (ROC) curve and the F1-score. Because of the three folds used for cross-validation, each experiment produced three ROC curves. The Tong model averaged the three curves and computed the area under the averaged ROC curve for the AUC metric. The AUCs are summarized in Table 3.2. In the SVM model, the AUC of the Healthy vs. Others experiment was 0.79, for AD+MCI vs. Others it was 0.74, and for PD vs. Others it was 0.72. In the LR model, the AUCs of the three experiments were 0.52, 0.75, and 0.72, respectively.

The F1-score is defined as:

F1 = 2 · (precision · recall) / (precision + recall)

Because it is based on precision and recall, an F1-score exists at every point on the ROC curve. The Tong model chose the (precision, recall) point that maximizes the Youden Index, defined as:

J = sensitivity + specificity − 1
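A minimal sketch of this selection with scikit-learn: pick the ROC threshold that maximizes J, then report the F1-score at that operating point (y_true and scores are placeholder arrays of labels and classifier scores):

    import numpy as np
    from sklearn.metrics import roc_curve, f1_score

    def f1_at_max_youden(y_true, scores):
        """F1-score at the threshold that maximizes J = TPR - FPR."""
        scores = np.asarray(scores)
        fpr, tpr, thresholds = roc_curve(y_true, scores)
        j = tpr - fpr                    # sensitivity + specificity - 1
        threshold = thresholds[np.argmax(j)]
        y_pred = (scores >= threshold).astype(int)
        return f1_score(y_true, y_pred)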

The F1-scores of the models are summarized in Table 3.3. In the SVM model, the F1-scores of the Healthy vs. Others, AD+MCI vs. Others, and PD vs. Others experiments were 0.44, 0.73, and 0.59, respectively. In the LR model, the F1-scores of the same experiments were 0.357, 0.732, and 0.615, respectively.


            AD+MCI        PD            Misc          Others
Classifier  SVM    LR     SVM    LR     SVM    LR     SVM    LR
Healthy     0.986  0.845  0.919  0.781  0.760  0.648  0.793  0.523
AD+MCI      -      -      0.663  0.696  0.712  0.763  0.737  0.745
PD          -      -      -      -      0.836  0.896  0.715  0.715

Table 3.2: A comparison of AUCs of ROCs generated for different experiments using SVM and LR classifiers.

            AD+MCI        PD            Misc          Others
Classifier  SVM    LR     SVM    LR     SVM    LR     SVM    LR
Healthy     0.952  0.640  0.857  0.711  0.758  0.667  0.438  0.357
AD+MCI      -      -      0.749  0.788  0.790  0.825  0.728  0.732
PD          -      -      -      -      0.882  0.859  0.591  0.615

Table 3.3: A comparison of F1-scores at the maximum value of Youden’s Index on the ROC curves for the same experiments and classifiers.


Chapter 4

System robustness and performance

Our work on the dSDT falls in two general categories. The first is system robustness and performance. The second category of work includes new features and in-depth feature analysis. Both the methods used to improve system robustness and the engineering of new features helped boost the classifier’s diagnostic performance.

4.1 Geometric changes

Some of the features used in the classifier were geometric features, calculated on geometric properties of each stroke, like bounding box size. To make these features as informative and accurate as possible, we were interested in making sure our digitized pen data accurately matched the test-taker’s actual writing.

The Anoto pen captures timestamped data points every 13 ms. This resulted in a sampling of the test-taker’s ink strokes every 13 ms, which created a jerky picture of the test-taker’s actual handwriting. To make matters more complicated, the Anoto pen automatically compressed linear segments: if multiple consecutive data points fell on a line, the Anoto pen recorded only the endpoints of the line and discarded all points in between (in order to save storage space on the pen). We post-processed data to interpolate compressed data points and make the interval between all data points a regular 13 ms. A comparison of the compressed points and post-processed points is shown in Figures 4-1 and 4-2.

However, the 13 ms intervals between sampling points resulted in poor resolution at turns in the stroke. The naïve approach to improving the resolution of the stroke was to connect consecutive data points using linear segments. This linear interpolation method resulted in a picture like Figure 4-3—not an accurate representation of human handwriting, which is smoother.

To make our data more realistically represent human handwriting, we fit a cubic spline to the points in each stroke. This resulted in the smoother segments seen in Figure 4-4, representative of the natural curves in human handwriting. However, the cubic spline does not always perform well. In a stroke with noisy, jittery points—as shown in Figures 4-5 and 4-6—the cubic spline may loop back and forth excessively in an attempt to include every single data point. This was a poor approximation of the pen’s movement and could skew geometric features like stroke ink length. These poor spline approximations only happen in noisy strokes. In the stroke in Figure 4-5, the patient left the pen nearly-stationary for a long time, resulting in lots of noise at the beginning of the stroke.

We wanted to use cubic splines where possible to simulate natural human handwriting and fall back upon linearization when necessary—that is, when a cubic spline was an inaccurate representation of the handwriting data. We detected these spline-fitting anomalies by comparing the length of the cubic spline interpolation to the length of the original linear interpolation. If the spline was 10% longer than the linear interpolation, we flagged the spline as a poor approximation. In order to prevent poorly approximated strokes from skewing geometric features, we locally replaced parts of the cubic spline with linear interpolations. We call the parts of the stroke between two sampled data points segments. For each segment with a cubic spline interpolation of length greater than 2x the length of the segment’s linear interpolation, we replaced the spline with the linear interpolation. If no single segment is causing the length difference, we linearize the whole stroke, as shown in Figures 4-7 and 4-8.

Figure 4-1: Solid dots are points recorded by the pen. The dotted lines are straight segments between the points. The points shown in this stroke are compressed, making the time interval between points uneven. (Test CIN0123365438, Stroke 57)

Figure 4-2: Points have been interpolated between compressed points to make the time interval between consecutive points 13 ms. (Test CIN0123365438, Stroke 57)

Figure 4-3: Solid dots are points recorded by the pen. The points in this stroke are connected with straight lines, whereas natural handwriting consists of smooth curves. (CIN0123365438, Stroke 57)

Figure 4-4: The dotted line represents the pen path recreated with the linear interpolation method. The solid line represents the path recreated by applying a cubic spline to this stroke. The curves of the cubic spline are more similar to natural human writing. (CIN0123365438, Stroke 57)

Figure 4-5: An example of a cubic spline interpolation that is a poor approximation of the original stroke. The noisy, jittery points at the top of the 1 caused the spline to loop back and forth. (SNF1305604878, Stroke 77)

Figure 4-6: A close-up view of a poor cubic spline interpolation of this 1. The resolution of the pen is visible here; the points fall along a grid. The cubic spline, in blue, loops back and forth excessively, causing the length of the spline to be 10% longer than the linear interpolation. (SNF1305604878, Stroke 77)

Figure 4-7: A linear interpolation better fits the densely clustered points in this stroke. (SNF1305604878, Stroke 77)

Figure 4-8: A close-up view of a linear interpolation of this 1. Instead of looping back and forth, the linear interpolation simply connects the points. (SNF1305604878, Stroke 77)

Ink Length Per Stroke (mm)

Interpolation Method          Mean   Median  Standard Deviation
Linear                        8.989  7.544   5.279
Cubic Spline                  9.075  7.629   5.315
Hybrid Cubic Spline-Linear    9.073  7.627   5.317

Table 4.1: A comparison of ink length per stroke calculated using the linear, cubic spline, and hybrid cubic spline-linear interpolation methods on raw pen data.

The updated model that combined cubic splines and linear interpolations resulted in ink lengths similar to those of the pure cubic spline model. This was expected: the poor cubic spline approximations were infrequent, so the adjustments in ink length were infrequent. A comparison of the ink lengths using the different interpolation methods is summarized in Table 4.1. As we expected, the hybrid cubic spline-linear interpolation resulted in a longer mean ink length than the linear interpolation and a slightly shorter mean ink length than the cubic spline interpolation.
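A simplified sketch of this hybrid interpolation using scipy's CubicSpline; it applies the 10% whole-stroke flag and the 2x per-segment rule, but omits the full-stroke linearization fallback for brevity (function and parameter names are ours):

    import numpy as np
    from scipy.interpolate import CubicSpline

    def path_length(xs, ys):
        """Total length of a piecewise-linear path."""
        return float(np.sum(np.hypot(np.diff(xs), np.diff(ys))))

    def interpolate_stroke(t, x, y, n=10):
        """Upsample one stroke, preferring a cubic spline but locally
        falling back to linear interpolation where the spline loops."""
        t, x, y = (np.asarray(a, dtype=float) for a in (t, x, y))
        cs_x, cs_y = CubicSpline(t, x), CubicSpline(t, y)

        fine_t = np.linspace(t[0], t[-1], n * (len(t) - 1) + 1)
        sx, sy = cs_x(fine_t), cs_y(fine_t)
        if path_length(sx, sy) <= 1.10 * path_length(x, y):
            return sx, sy              # spline within 10%: keep it

        # Flagged as a poor approximation: rebuild segment by segment,
        # linearizing segments whose spline is over 2x the linear length.
        out_x, out_y = [x[0]], [y[0]]
        for i in range(len(t) - 1):
            seg_t = np.linspace(t[i], t[i + 1], n + 1)
            gx, gy = cs_x(seg_t), cs_y(seg_t)
            lin = np.hypot(x[i + 1] - x[i], y[i + 1] - y[i])
            if path_length(gx, gy) > 2.0 * lin:
                gx = np.linspace(x[i], x[i + 1], n + 1)
                gy = np.linspace(y[i], y[i + 1], n + 1)
            out_x.extend(gx[1:])
            out_y.extend(gy[1:])
        return np.array(out_x), np.array(out_y)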

4.2 Overfitting

With only a few hundred tests, overfitting was a major concern for the dSDT classifier. To make our classifier more robust, we analyzed and normalized our features and curated our feature set.

4.2.1 Analyzing feature skew

Overfitting can sometimes happen when a test has an outlier value for a particular feature. The classifier may heavily weight the outlier feature so the test is correctly classified based primarily on that outlier feature, which is undesirable if we want our model to capture the general patterns for each diagnosis class.

We calculated the skew across the values in each feature to see which features might encourage overfitting. Skew was calculated as follows:

skew = [n / ((n − 1)(n − 2))] · Σ_j ((x_j − x̄) / s)³

where n was the number of values for the feature, x_j was the j-th element in the set, x̄ was the mean of the set, and s was the standard deviation of the set.
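A minimal sketch of this skewness computation (assumes n ≥ 3 and a nonzero standard deviation):

    import math

    def skewness(values):
        """Adjusted Fisher-Pearson sample skewness, as defined above."""
        n = len(values)
        mean = sum(values) / n
        s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        return (n / ((n - 1) * (n - 2))) * sum(((v - mean) / s) ** 3
                                               for v in values)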

Some of the heavily skewed features were the number of incorrect, omitted, and invalid responses, the average maximum pressure, and the average distance from the centroid of a stroke to the center of its containing cell (average centroid-cellcenter distance). The fact that the number of incorrect, omitted, and invalid responses were skewed was unsurprising, since a few patients performed very poorly on the dSDT compared to the other patients. The average maximum pressure was limited by the Anoto pen’s maximum pressure level, 126. Most patients saturated the pressure reading, so the few average maximum pressure readings less than 126 caused the feature to be skewed. However, the skewness of the average centroid-cellcenter distance was surprising, since the distance was roughly constrained by the size of the cell. The reason for the skew turned out to be a single outlier test with an extremely low average centroid-cellcenter distance, due to the fact that much of the digital ink in the test was missing, likely a result of the Anoto pen malfunctioning. The feature skew analysis proved to be a useful way to identify bad tests by highlighting outlier tests.

We also examined skewness in the top features for each training iteration of the classifier. By top features in a classifier we mean the features with the largest magnitude coefficients in the trained classifier. If the top features were skewed in a training iteration of our classifier, it was a sign that the classifier was overfitting to the data. Top feature analysis is further discussed in Section 5.1.1.

4.2.2 Normalizing features

We wanted to deal with the large difference in the range of features. As one example, the average variance in pressure ranges from 0.0008 to 0.1387, while average cell inktime ranges from 192 to 2080. This is a problem for SVMs, which assume that input data is within a standard range. In addition, the difference in feature ranges became a confounding factor when comparing feature coefficients. To make our model easier to interpret, we normalized all features to have a mean of 0 and a variance of 1 before beginning training.

Normalization significantly improved performance. After normalization was implemented, our model AUCs jumped an average of 26%. This also allowed us to easily compare the significance of features by directly looking at feature coefficients.
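One way to implement this normalization is scikit-learn's StandardScaler, here in a pipeline so the scaling is learned from the training data (a sketch, not necessarily the exact setup used in this study):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Each feature is rescaled to mean 0 and variance 1 before the
    # classifier sees it.
    model = make_pipeline(StandardScaler(), LogisticRegression())
    # model.fit(X_train, y_train) as usual; X_train is a placeholder name.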

4.2.3 General features

Prior to this new work, our THink model contained 4041 features, primarily cell features. (For more details on our prior feature set, refer to Section 3.3.3.) However, intuitively, there was too much noise in these cell features to capture the diagnostic patterns of each patient class. In fact, the overabundance of features could have encouraged overfitting to specific tests based on cell-level anomalies.

We removed the cell features and the cell difference features and instead created a set of general features. The general features initially included features aggregated over each row (e.g. average pen velocity across the strokes in row 1), features aggregated over each symbol (e.g. average pen velocity across all responses to the star shape), features aggregated over each task (e.g. average pen velocity across the translation task), and features aggregated over the entire test. We also computed general difference features, calculated on a row-by-row or task-by-task basis. For example, one general row difference feature might be the difference in average pen velocity across the first row of the translation task and across the first row of the copy task. One general task difference feature might be the difference in average pen velocity across the strokes in the translation task and across the strokes in the copy task.

With the general feature set, we have a total of 1019 features. (After removing features, we also added many additional feature classes, as well as row and task difference features. The new features are discussed in Chapter 5.) Though our model performance initially dropped after removing the cell features and cell difference features, our top features became less skewed, a sign that our model was overfitting less.


4.2.4 Redundant features

Some groups of features were redundant, meaning that they represented the same information multiple times—for example, the number of correct responses and the number of incorrect responses were redundant. Having the same information encoded in multiple features unfairly gave additional weight to that information, so we removed redundant features from our feature set. At the time of removal, removing the number of correct responses feature did not significantly change our classifier’s AUC.

4.2.5 Adding new tests

Prior to this work, the THink system was trained on 87 dSDTs, which is a very small sample size. These 87 tests were administered to 67 patients (some patients took the test multiple times). Nine of these patient dSDTs were in the Healthy patient class, 35 were in the AD+MCI class, 25 were in the PD class, and 18 were in the Miscellaneous class. The small number of tests overall and the small number of tests in each class was a serious problem for our classifier.

Throughout the course of this study, new tests became available; in response we retrained our classifiers. Each time new tests arrived, our classifier performance dropped, suggesting that the earlier models had been overfitting and that more data continued to be helpful in eliminating overfitting. At the time of this writing, we have a total of 317 dSDTs from 172 patients, listed in Table A.1. 144 of these dSDTs are in the Healthy patient class, 99 are in the AD+MCI class, 47 are in the PD class, and 27 are in the Miscellaneous class. The unbalanced input classes are a concern, but due to our limited number of input tests, we decided to keep all of the tests we had available instead of balancing the number of tests in each class. All the statistics presented in this paper are calculated using this set of 317 dSDTs.


Chapter 5

New features and feature analysis

5.1 Model-level feature analysis

5.1.1 Top features

To better understand our model, we were interested in seeing which features were the most important. We could use the important features to gain some additional intuition into the diagnosis classes. We used feature coefficients to determine feature importance. Once normalization had been implemented (see Section 4.2.2), the feature coefficients with the largest magnitudes influenced the model the most. We created utilities for reporting the top features from each experiment and used these for model performance vs. complexity analysis, discussed in Section 5.1.2.
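A minimal sketch of such a reporting utility for a fitted linear model (LinearSVC and LogisticRegression both expose coef_):

    import numpy as np

    def top_features(clf, feature_names, n=10):
        """The n features with the largest-magnitude coefficients."""
        coefs = clf.coef_.ravel()
        order = np.argsort(np.abs(coefs))[::-1][:n]
        return [(feature_names[i], float(coefs[i])) for i in order]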

We also used feature coefficients to find and remove feature classes that were not useful, i.e. features that consistently had no variation across all tests. Using this method, we removed the memory feature indicating whether the test-taker wrote 0-5 in the delayed recall section, since we saw it occur in only one test out of our over 300.

5.1.2 Performance vs. complexity analysis

Performance vs. complexity analysis was targeted at answering the question of how many features were needed to create a good model for classifying dSDTs. If the dSDT were to be used as a diagnostic aid, it would be desirable for the model to be as simple and lightweight as possible while still retaining its diagnostic power. To this end, we re-trained our classifiers with only the top n features and recorded the model performance, where n increased by 10s, starting from 10.
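A sketch of that sweep; evaluate_auc is a hypothetical callback standing in for the cross-validated evaluation described in Section 3.3, and clf is any linear model exposing coef_:

    import numpy as np

    def complexity_sweep(X, y, clf, evaluate_auc, step=10):
        """AUC as a function of the number of top features retained."""
        clf.fit(X, y)
        order = np.argsort(np.abs(clf.coef_.ravel()))[::-1]
        results = {}
        for n in range(step, X.shape[1] + 1, step):
            results[n] = evaluate_auc(X[:, order[:n]], y)  # retrain on top n
        return results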

All of the experiments tested quickly saturated: Healthy vs. AD+MCI, Healthy vs. PD, and AD+MCI vs. Others all reached close to their peak performance with 100 features or so. After finishing new feature engineering, it may be worthwhile to remove some of the lower-performing classes of features from the model to push its size down, since a few powerful features are what really drive the model performance.

5.2 Deep-dive feature analysis

Throughout the course of this work, we investigated some features in depth and added several new features. We undertook this deep-dive feature analysis to better understand interesting patterns we noticed while examining features.

5.2.1 Off-centeredness features

We noticed that off-centeredness was an unexpectedly strong feature for the AD+MCI classification experiments, in the form of centroid-cellcenter and boxcenter-cellcenter distance features. This was something we had actually expected in the PD class, rather than in the AD+MCI class. As a baseline, the responses of healthy patients tended to be more centered than responses from other patient classes. We conjectured that PD patients would “drift” to the left of cells because of difficulty moving the pen from cell to cell. Though off-centeredness features do appear among top features for PD experiments, they also appear among the top AD+MCI features.

We visually inspected several AD+MCI tests and found that responses from some patients did indeed drift, though the drift direction varied from patient to patient. However, the responses from one patient tended to be consistently off-centered, which makes sense: if the patient was writing slightly off from the center of a cell and moved their hand a distance approximately calibrated to a cell width, they would still be off-centered by the same amount. It is possible that the off-centeredness observed in AD+MCI patients is because of visual inattention, which would result in more drift than in healthy patients [6].

5.2.2 Hooklet features

Identifying hooklets

A hooklet is a sharp upturn at the bottom of a stroke leading to the start of the following stroke. An example hooklet is shown in Figure 5-1. The presence of a hooklet in a number is hypothesized to be an indication of anticipation: the subject is thinking about the next stroke and begins moving in that direction before finishing the current stroke [5, 10]. The ability to think ahead is a sign of cognitive health, since impaired cognition can obstruct multitasking, leaving only enough resources to attend to the current moment. Therefore, the absence of hooklets may be a hallmark of early cognitive decline, as in pre-clinical Alzheimer’s, and is thus of great interest in this study.

While some hooklets are visible to the eye, many hooklets are very short, on average less than 2 mm. Fortunately, hooklets are detected by the THink system programmatically. When the direction of a stroke changes sharply in a short time, it detects a potential hooklet corner. The rest of the stroke following a hooklet corner is a potential hooklet, or proto-hooklet. In order to extract the anticipation aspect of a hooklet, a line is fit to the potential hooklet points, as depicted in red in Figure 5-2. The projected line is extended from the current stroke to the next stroke, and the perpendicular distance between the projected line and the next stroke is calculated— this is a measure of how “accurate” the hooklet was. If the distance is within 3 mm, the potential hooklet is labeled a definite hooklet. If the distance is within 5 mm, it is labeled as a possible hooklet.

Hooklets can occur between the two digits of a single number (e.g. an 11) or between numbers in different cells. The former is known as an intracell hooklet and the latter is known as a crosscell hooklet. This gives us five categories of hooklets for each stroke: no hooklet, possible intracell hooklet, definite intracell hooklet, possible crosscell hooklet, and definite crosscell hooklet. We use the count of hooklets of each type as a cell feature.

Figure 5-1: An example of a hooklet between the digits of an 11. The hooklet is drawn in red. (CIN0123365438, Stroke 13)

Figure 5-2: A hooklet between the digits of an 11 with a projected line fit to it, drawn in red. The perpendicular distance between the projection and the next stroke is shown in cyan. (CIN0123365438, Stroke 13)
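A minimal sketch of the labeling thresholds; detecting the hooklet corner and fitting the projected line are assumed to have happened upstream:

    def classify_hooklet(dist_to_next_stroke_mm, same_cell):
        """Label a proto-hooklet by projection accuracy and cell relation."""
        kind = "intracell" if same_cell else "crosscell"
        if dist_to_next_stroke_mm <= 3.0:
            return f"definite {kind} hooklet"
        if dist_to_next_stroke_mm <= 5.0:
            return f"possible {kind} hooklet"
        return "no hooklet"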

We created utilities for identifying, labeling, and plotting hooklets, making it easy to review them by eye.

Extended hooklet features

In addition to counting the number of hooklets, we can calculate some other hooklet-related statistics. For each proto-hooklet, we calculate its length, the ratio of its length to the length of its containing stroke (relative size), the perpendicular distance between its projection and the next stroke (distance to destination), the distance from its corner to the next stroke (distance from base to destination), and its speed. Each of these statistics is used as a cell feature.
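As a companion sketch (same stroke representation and hypothetical names as above), the extended features can be computed from the proto-hooklet tail; the distance-to-destination feature is the projected perpendicular distance already computed in the detection step.

    import numpy as np

    def extended_hooklet_features(stroke, corner, next_stroke):
        """Length, relative size, base-to-destination distance, and speed
        of the proto-hooklet starting at index `corner`."""
        tail = stroke[corner:]
        seg_len = np.linalg.norm(np.diff(tail[:, :2], axis=0), axis=1)
        length = seg_len.sum()                                        # mm
        stroke_len = np.linalg.norm(np.diff(stroke[:, :2], axis=0),
                                    axis=1).sum()
        duration = tail[-1, 2] - tail[0, 2]                           # seconds
        base = stroke[corner, :2]
        return {
            "length": length,
            "relative_size": length / stroke_len if stroke_len > 0 else 0.0,
            "dist_base_to_dest": np.linalg.norm(next_stroke[:, :2] - base,
                                                axis=1).min(),
            "speed": length / duration if duration > 0 else 0.0,      # mm/s
        }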

We calculate the length-related features because we are interested in better understanding the cause behind long or short hooklets. A long hooklet could potentially be a result of fast movement: if the patient was writing rapidly and confidently, their strokes may blend together in the form of a long hooklet. However, a hooklet could intuitively also be a sign of the patient lingering on each digit and not progressing quickly through the test. We wanted to understand the actual relation between hooklet length and cognitive load.

We calculate the distance-related features to investigate the accuracy of hooklets. We expect more accurate hooklets in healthy patients, who do not exhibit memory problems or tremor.

Table 5.1 summarizes the average values of the extended hooklet features for all patient classes. Healthy patients have the longest hooklets on average, significantly longer than those of AD+MCI, PD, and Misc patients (p = 0.041, p = 0.016, and p = 0.0001, respectively, by a one-tailed unequal-variance t-test). Interestingly, PD patients have the most accurate hooklets, with distances to destination significantly smaller than those of the Healthy and AD+MCI classes (p = 0.024 and p = 0.032, respectively). This may be related to the micrographia often displayed by PD patients: hooklets from PD patients are shorter than those from Healthy and AD+MCI patients, and it may be easier to be accurate with shorter hooklets.
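The comparison used here is Welch’s (unequal-variance) t-test. A minimal sketch with SciPy, using hypothetical placeholder arrays rather than study data:

    import numpy as np
    from scipy import stats

    # Hypothetical per-hooklet lengths in mm (placeholders, not study data).
    healthy_lengths = np.array([0.21, 0.18, 0.25, 0.19, 0.22])
    pd_lengths = np.array([0.14, 0.12, 0.16, 0.13, 0.15])

    # One-tailed (Healthy > PD), unequal-variance t-test. The `alternative`
    # keyword needs SciPy >= 1.6; with older versions, halve the two-sided
    # p-value when the difference is in the predicted direction.
    t, p = stats.ttest_ind(healthy_lengths, pd_lengths,
                           equal_var=False, alternative="greater")
    print(f"t = {t:.3f}, one-tailed p = {p:.4f}")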

It is also interesting how different the Misc class appears compared to the other patient classes: its hooklet lengths and distances to the next stroke are generally shorter. Upon further inspection, this appears to be primarily due to the multiple sclerosis (MS) patient subclass within the Misc class. Table 5.2 breaks down the Misc class statistics into the most common subclasses of conditions: MS (7 tests), epilepsy (7 tests), and depression (5 tests). The hooklet lengths and distances to the next stroke are markedly shorter in the MS subclass. This is partially because one MS test contains no hooklets at all, a unique occurrence among all current dSDTs. However, even excluding that one test, the MS subclass stands out. Visual inspection of the MS tests gives no clear indication of why this is the case. It may be partially a small-sample-size problem: there are only 7 MS tests, and those 7 tests come from a total of 3 MS patients, so it is difficult to draw a reasonable conclusion without more data.


Average Extended Hooklet Features

             Length   Relative Size   Dist to Dest   Dist from Base to Dest
Healthy      0.192    0.024           3.688          5.926
AD+MCI       0.152    0.019           3.684          5.919
PD           0.138    0.018           3.204          5.742
Misc         0.090    0.012           2.694          4.248

Table 5.1: Extended hooklet statistics calculated for all hooklets and proto-hooklets. Lengths and distances are in mm; Relative Size is a unitless ratio.

Average Extended Hooklet Features Among Misc Conditions

             Length   Relative Size   Dist to Dest   Dist from Base to Dest
MS           0.039    0.004           1.142          1.827
Epilepsy     0.100    0.015           3.961          6.221
Depression   0.102    0.013           3.632          5.466
All          0.090    0.012           2.694          4.248

Table 5.2: Extended hooklet statistics calculated for all hooklets and proto-hooklets, broken down by common conditions in the Misc patient class. Units as in Table 5.1.

Hooklet distributions

To further investigate when and where hooklets occur, we counted hooklets at row, task, and test granularity for each patient class. These results are summarized in Table 5.3.
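A minimal sketch of this aggregation with pandas, assuming a hypothetical per-stroke export hooklets.csv with columns test_id, patient_class, and hooklet_type (these names are illustrative, not part of the THink system):

    import pandas as pd

    df = pd.read_csv("hooklets.csv")   # hypothetical per-stroke export

    # Count hooklets of each type within each test, then average the
    # per-test counts within each patient class (cf. Table 5.3).
    counts = (df[df["hooklet_type"] != "none"]
              .groupby(["patient_class", "test_id", "hooklet_type"])
              .size()
              .unstack(fill_value=0))
    per_class_avg = counts.groupby(level="patient_class").mean()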

We find that intracell hooklets occur more frequently than crosscell hooklets. This is expected: crosscell hooklets span two cells and involve two different numbers, so patients should have more difficulty anticipating the next response. We also see more definite hooklets than possible hooklets, suggesting that patients were fairly accurate with their hooklets.

Looking at aggregated hooklet counts, we see that, on average, the number of hooklets is greatest in the Healthy patient class and lower in the AD+MCI and PD patient classes. In addition, many more hooklets appear in the translation task than in the copy task, as shown in Table 5.4. The greater presence of hooklets in the translation task suggests that cognitive load encourages more hooklets. Since patients tend to write faster in the copy task, slower handwriting speed may also be a factor producing more hooklets.


Avg Num Hooklet Type Per Test

          Possible   Definite   Intracell   Crosscell   All
Healthy   7.43       22.58      22.57       7.44        30.01
AD+MCI    7.16       22.22      22.14       7.24        29.38
PD        5.60       22.19      19.72       8.06        27.79
Misc      5.74       17.78      17.15       6.37        23.52

Table 5.3: Average counts of hooklet types per test across patient classes.

Avg Num Hooklet Type Per Test

                 Possible   Definite   Intracell   Crosscell   All
Translation      3.98       12.88      12.25       4.61        16.86
Copy             2.37       7.51       7.35        2.53        9.87
Delayed Recall   0.40       1.13       1.29        0.24        1.53
Row 1            0.68       2.14       2.14        0.68        2.82
Row 2            0.94       2.97       2.70        1.21        3.91
Row 3            1.07       3.37       3.31        1.13        4.44
Row 4            0.87       2.94       2.80        1.01        3.81
Row 5            0.50       1.53       1.59        0.44        2.03
Row 6            0.55       1.87       1.82        0.59        2.42
Row 7            0.62       2.10       1.98        0.74        2.72
Row 8            0.33       0.87       0.87        0.33        1.20

Table 5.4: Average counts of hooklet types per test across the various tasks and rows of the dSDT. Counts are lower in rows 1, 4, 5, and 8 because they have fewer cells.

          Avg Hooklet Length (mm)   Avg Hooklet Speed (mm/s)
Healthy   0.192                     6.66
AD+MCI    0.152                     5.55
PD        0.138                     5.12
Misc      0.090                     3.94

Table 5.5: Average hooklet lengths and speeds across patient classes. Hooklet speed is measured along the length of the hooklet tail.


Are fast or slow hooklets a good sign? Table 5.5 lists average hooklet lengths and speeds across the various patient classes. The Healthy class has both the longest and the fastest hooklets, and average length and speed both decline across the AD+MCI, PD, and Misc classes. AD+MCI and PD hooklets are notably slower than Healthy ones, making hooklet speed a promising feature. So perhaps fast hooklets are a sign of cognitive health; it is less clear whether long or short hooklets have any link.

5.2.3 Stray marks

The investigation into stray marks on the paper stemmed from the idea that much information is encoded in dSDT strokes that are not themselves digits. Examples of stray marks include strokes outside any cell (extracell strokes) and small points made when the patient rests the pen on the paper while thinking or checking the translation key (thinking points). We added new stroke-type features for these stroke types.

Extracell strokes

We label a stroke as an extracell stroke if its centroid is not within any cell’s boundaries. Although an extracell stroke is not in a cell, it is associated with the cell it is closest to. In most circumstances, the distance between the extracell stroke and its associated cell is very short: the patient barely missed the cell while drawing the stroke. We added the number of extracell strokes associated with each cell as a new feature and created utilities for plotting extracell strokes. An example of a plotted extracell stroke is shown in Figure 5-3.
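A minimal sketch of this labeling rule, assuming strokes as numpy arrays of (x, y) page coordinates in mm and hypothetical cell objects carrying their boundary coordinates (none of these names come from the THink API):

    import numpy as np

    def label_extracell(stroke, cells):
        """Return the associated cell if the stroke is extracell, else None."""
        cx, cy = stroke[:, :2].mean(axis=0)           # stroke centroid
        for cell in cells:
            if cell.x0 <= cx <= cell.x1 and cell.y0 <= cy <= cell.y1:
                return None                           # centroid inside a cell
        # Otherwise associate the stroke with the nearest cell center.
        return min(cells, key=lambda c: np.hypot(cx - (c.x0 + c.x1) / 2.0,
                                                 cy - (c.y0 + c.y1) / 2.0))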

A summary of the extracell strokes in each task is shown in Table 5.6. We generally see more extracell strokes in the copy and delayed recall sections compared to the translation task. The increased frequency of extracell strokes in the delayed recall section is striking—it could be a sign that patients are less careful about writing within the cell when distracted with the additional task of recalling a symbol-digit mapping.


Figure 5-3: An example of an extracell stroke, drawn in blue. An in-cell stroke is shown in red. The cell boundaries are drawn in black. (CIN1937073217, Stroke 83)

Figure 5-4: An example of a thinking point is drawn in blue. An example of a normal stroke is drawn in red. (CIN2045945427, Stroke 12)

Avg Num Extracell Strokes Per Cell

          Translation   Copy    Delayed Recall   All
Healthy   0.000         0.002   0.019            0.002
AD+MCI    0.005         0.018   0.025            0.012
PD        0.004         0.027   0.046            0.017
Misc      0.006         0.002   0.043            0.004

Table 5.6: Average counts of extracell strokes per cell across tasks and across patient classes.

Avg Num Thinking Points Per Cell

          Translation   Copy    Delayed Recall   All
Healthy   0.034         0.014   0.065            0.026
AD+MCI    0.060         0.025   0.108            0.044
PD        0.058         0.033   0.145            0.049
Misc      0.022         0.011   0.099            0.019

Table 5.7: Average counts of thinking points per cell across tasks and across patient classes.


Thinking points

A stroke is categorized as a thinking point if the area of the bounding box containing the stroke is less than 0.54 mm². This value was empirically determined by observing many strokes classified by eye as thinking points. An example of a thinking point is shown in Figure 5-4. We believe thinking points are a sign of cognitive load: patients may put their pen down and leave it in one spot for a period of time if they are trying to recall a symbol-digit mapping or if they are looking up at the translation key to find a symbol. In the delayed recall section, we also see patients put their pen down in a cell as they attempt to recall the correct digit, then move on to another cell before returning later when they have recalled the previous symbol (or at least have a better guess at what the answer might be). Thinking points in the delayed recall section therefore often appear in order from left to right, since the patient naturally starts attempting the task in order but may end up completing it in the order they are most confident in.
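A minimal sketch of this rule, under the same stroke representation as the earlier sketches; the 0.54 mm² area threshold is the empirically chosen value described above:

    import numpy as np

    THINKING_POINT_AREA_MM2 = 0.54   # empirical threshold from the text

    def is_thinking_point(stroke):
        """True if the stroke's bounding box is smaller than the threshold."""
        xy = stroke[:, :2]
        width, height = xy.max(axis=0) - xy.min(axis=0)
        return width * height < THINKING_POINT_AREA_MM2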

Our analysis indicated there are indeed more thinking points in the translation task than in the copy task, across all patient classes (see Table 5.7). Even more striking is the frequency of thinking points in the delayed recall task, largely due to the phenomenon of patients putting the pen down in several delayed recall cells before filling one out.

It is worth noting that the AD+MCI class most frequently has thinking points in the translation task, as we intuitively expect, but the PD class has more thinking points in the copy and delayed recall tasks. The greater frequency of thinking points in the copy task may be explained by our method of identifying them: some strokes we label as thinking points in the copy section may instead be noise strokes from motor impairment, a frequent symptom of PD. Unfortunately, we currently have no clear-cut way to distinguish these two types of strokes. The high frequency of PD thinking points in the delayed recall task is more surprising and warrants further study to determine whether it stems from the same noise-stroke issue.


Thinking points are very similar to pre-stroke rests, which are discussed further in Section 5.3.1.

5.3 Learning and cognitive load analysis

One goal of the dSDT’s design is the ability to detect learning. As patients progress through the test, they are exposed to the same six symbols over and over. Learning from this repeated exposure can manifest itself in two main ways: performance in the translation task and performance in the delayed recall task. In the delayed recall task, we see many patients manage to recall symbol-digit mappings, not necessarily perfectly, but much better than by chance. Clearly, patients are learning the symbol-digit mappings.

We analyze the translation task’s precell delay in detail, a feature that may be affected by learning over time. We also look at the performance of the same patient across multiple administrations of the dSDT.

Finally, we examine the effect of cognitive load on handwriting, focusing on the cognitive load imposed by the act of translation. Whether learning and cognitive load can serve as diagnostic signals is the topic of this section.

5.3.1 Precell delay

The precell delay of a cell in the translation task is the time the patient spends between the last stroke of the previous cell and the first stroke of the current cell. In addition to physically moving the pen, the patient is deciding what the next number to write is, either by recalling the symbol-digit mapping mentally or by looking at the translation key. Assuming that recalling a mapping takes a different amount of time than checking the translation key, we should see a shift in the precell delay over the course of the translation task if the patient begins to learn the mappings. There is an additional complication, however: the time to check the translation key is not constant. Each time the patient moves down another row, the distance their eyes must traverse to check the key increases.
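A minimal sketch of the computation, assuming a hypothetical list of per-cell records ordered as the patient completed them, each holding the timestamps of its first and last strokes:

    def precell_delays(cells):
        """Delay before each cell after the first, in the cells' time units."""
        return [cur["first_stroke_t"] - prev["last_stroke_t"]
                for prev, cur in zip(cells, cells[1:])]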


Average Precell Delay (ms)

        Healthy Tests   AD Tests   PD Tests
Row 1   1019            1676       1231
Row 2   1074            1800       1395
Row 3   1032            1835       1281
Row 4   1009            1772       1235

Table 5.8: Average precell delay across each row of the translation task.

Average Precell Delay with Rests (ms)

        Healthy Tests   AD Tests   PD Tests
Row 1   1160            1836       1404
Row 2   1273            1982       1658
Row 3   1261            2064       1612
Row 4   1259            1982       1525

Table 5.9: Average precell delay, with pre- and post-stroke rests, across each row of the translation task.

The precell delay in each row of the translation task for each class of subjects is shown in Table 5.8. Here, we exclude the cell at the beginning of each row to avoid introducing the row-to-row movement time, since that movement is not present in row 1. We also do not include the practice cells in row 1.

There are several interesting observations. In the Healthy and PD tests, the precell delay starts low in the first row, spikes in the second row, and then steadily drops from row to row, ending at or below its starting value. In the AD tests, however, the precell delay continues increasing until the third row and then drops off at row 4. The fact that the precell delay begins low, increases, then decreases in both cases suggests that both learning and the changing translation key check time are taking effect. At first, the translation key is nearby; it is easy for the test-taker to glance up and find the symbol they are looking for. When they reach the second row, they need to reorient and travel double the distance with their eyes to reach the key. By the time Healthy and PD patients reach the third row, they are perhaps beginning to remember mappings and can fill them in more quickly; the AD patients take until the fourth row before their recall allows them to speed up.

It is also possible that patients are looking at their previous responses, perhaps in the row above the current row, instead of checking the translation key each time they need a mapping.
