Autonomic activity from human videos


Autonomic Activity from Human Videos

by

Weixuan Chen

B.S., Tsinghua University (2012)

M.S.E, University of Pennsylvania (2014)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Media Arts and Sciences

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author . . . .

Program in Media Arts and Sciences, School of Architecture and Planning

August 7, 2020

Certified by . . . .

Rosalind W. Picard

Professor of Media Arts and Sciences

Thesis Supervisor

Accepted by . . . .

Tod Machover

Academic Head, Program in Media Arts and Sciences


Autonomic Activity from Human Videos

by

Weixuan Chen

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning

on August 7, 2020, in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Media Arts and Sciences

Abstract

The autonomic nervous system (ANS) is the part of the nervous system responsible for regulating and integrating the functioning of internal organs. Traditionally, ANS function is assessed by measuring autonomic activity in various tests using medical devices with contact sensors. Most of these tests require wearing cumbersome equipment on the body, so they are commonly conducted in clinics and yield only sporadic data.

A potential route to more convenient analysis of autonomic activity is camera-based human sensing. Recent research has shown that cameras can be combined with computer vision algorithms to visualize ANS activity in human videos and to estimate, without contact, ANS parameters such as heart rate, respiration rate, and heart rate variability (HRV). However, several hurdles still prevent this approach from reaching the accuracy and covering the scope of clinical tests: 1) The robustness of existing methods is still unsatisfactory in ambulatory situations, especially when illumination changes and body motions are significant. 2) Previous visualization algorithms distort non-sinusoidal components of autonomic activity in motion magnification. 3) Potential ethics and privacy issues might impede the deployment of the new techniques.

To address these problems, this dissertation proposes an end-to-end convolutional attention network using both gradient descent and gradient ascent to enable robust measurement and visualization under major motions; proposes a near-infrared-based carotid pulse tracker that works when illumination is highly dynamic or absent; proposes a motion magnification algorithm that magnifies non-sinusoidal autonomic activity faithfully; discusses potential privacy issues in video-based autonomic activity monitoring; and, as a solution, proposes a framework for eliminating autonomic activity from facial videos without affecting their visual appearance. By combining these approaches, the dissertation ultimately aims to realize unobtrusive analysis of autonomic activity from human video that works in the field.

Thesis Supervisor: Rosalind W. Picard

Title: Professor of Media Arts and Sciences


Autonomic Activity from Human Videos

by

Weixuan Chen

This doctoral thesis has been reviewed and approved by the following committee members:

Professor Rosalind W. Picard . . . .
Professor of Media Arts and Sciences
Director of the Affective Computing Group
Massachusetts Institute of Technology

Professor William T. Freeman . . . .
Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Professor Ramesh Raskar . . . .
Associate Professor of Media Arts and Sciences
NEC Career Development Professor
Massachusetts Institute of Technology


Acknowledgments

I would like to express my earnest thanks to all the great people who have helped me complete this degree. I cannot imagine how I could get through these PhD years without them. The following acknowledgments are by no means exhaustive, for which I apologize.

First, I want to express my deepest gratitude to my advisor, Professor Rosalind Picard, for being the best advisor I could have wished for. It is my honor to have enjoyed this wonderful journey with you at the Media Lab. You have been and will eternally be a bright, shining star in my mind guiding my way through the darkness. You are always there as my role model, showing me how to become a better researcher, a better explorer and a better human being. I also want to thank my thesis committee, Professor Bill Freeman and Professor Ramesh Raskar, for all their time spent listening to my ideas, attending meetings and reviewing thesis drafts. Their advice and contributions to this thesis have been enormous and invaluable.

I am grateful to Javier Hernandez Rivera, Daniel McDuff, and Akane Sano, who are not only great collaborators but also like big brothers and sisters to me. I have always admired the depth of their characters and their extraordinary work ethic, and have received endless encouragement and inspiration from them.

I feel lucky to have had so many fantastic office mates in the past few years: Asma Ghandeharioun, Natasha Jaques, Grace Leslie, Fengjiao Peng, Judy Shen, and Matt Groh. Our random chats day after day have become my golden memories. I would also like to thank the rest of the Affective Computing group: Szymon Fedor, Sara Taylor, Kristy Johnson, Ognjen Rudovic, Daniel Lopez Martinez, Craig Ferguson, Noah Jones, Oliver Wilder-Smith, Yue Zhang Weninger, Agata Lapedriza Garcia, and Sable Aragon. I have found a lot of fun and comfort as a member of this big family.

Prof. Joseph Paradiso was the first person at the Media Lab I talked to and the one who introduced me to the Affective Computing group. I would like to thank him for his kindness in replying to a random applicant’s email and also for all the ingenious insights about remote human sensing I got from him afterwards. Prof. Hiroshi Ishii’s office was right next to mine. His passion for research has deeply infected me through the light in his office late at night and through the smile on his face during our daily greetings.

The findings in this thesis are only possible due to the hundreds of individuals who took part in our experiments. I would like to thank them for their time and patience. I also want to thank the Tsinghua Alumni Association at Greater Boston for connecting me to a number of young students and scholars who enthusiastically shared their stories and dreams with me.

I want to express my most sincere appreciation for the people who loved me and accompanied me even during the most difficult times - my selfless parents (Defu and Guoqin), and my considerate girlfriend (Shuwen).

Finally, I want to thank all the healthcare workers on the frontlines fighting against the coronavirus pandemic. Thank you for taking care of the world with such great compassion.


List of Figures

1-1 Common modalities used in autonomic activity estimation from human videos.

2-1 Core steps of remote physiological measurement methods.

2-2 Carotid pulse visualization. (a) The first frame of a NIR video with a red scan line on the neck. (b) The scan line plotted over time, which shows a subtle crenation along both edges corresponding to the carotid pulse.

2-3 (a) Cross section of the human neck showing the location of the carotid arteries. (b) Geometric relationships among the light source 𝑆, the camera 𝐶, and all the potential motion sources of the neck.

2-4 Visualization of a real (a) and schematic view (b) of the Leap Motion controller [99]. Note that the optical filter of the device was removed in (a) only to reveal its inside. It was kept intact in our experiments.

2-5 Template image of the neck area used in template matching.

2-6 A NIR frame, and how its AMAD map was derived from the MAD map in neck detection.

2-7 The Hidden Markov Model designed for HR smoothing. The gray nodes were observed PSD estimates, and the white nodes were the heart rates we wished to infer.

2-8 PSD estimates $p_i(f)$ and marginal distributions within the prospective HR frequency band for four consecutive windows. The red triangles denote the maxima of each spectrum.


2-9 Flow chart for BR measurement. (a) An exemplary NIR frame with the detected neck area marked in yellow and the respiration ROI marked in green. (b) The spatial average of all the pixel intensities within the ROI, fluctuating with time. (c) The breathing movement estimated by band-pass filtering the time series in (b).

2-10 Experimental setup (BVP: blood volume pulse).

2-11 Infrared frames captured by one of the Leap Motion cameras in a bright room (a) and a fully dark room (b).

2-12 Flow chart for HR measurement. (a) An exemplary NIR frame with the detected 81 × 19 pixel neck area marked in yellow, and the enlarged version of the area after being downsampled to 41 × 10 pixels. (b) Time series showing the intensity changes of the pixels on the neck area. (c) Component candidates: $c_0$ the common average of all the time series, $c_1$ the second principal component, and $c_2$ the third principal component. (d) PSD estimates of the three component candidates within [0 Hz, 5 Hz], and their normalized band power (NBP). (e) The same PSD estimates within the prospective HR frequency band [0.75 Hz, 2.5 Hz], their kurtosis (K) and pulse significance (PS).

2-13 An exemplary 30-second window in HR measurement. (a) Blood volume pulse waveform measured by the finger probe. (b) The optimal component 𝑐 extracted from the neck corresponding to the carotid pulse. (c) Lomb-Scargle periodogram of the BVP signal within the prospective HR frequency band. (d) Lomb-Scargle periodogram of 𝑐 within the prospective HR frequency band.

2-14 Bland-Altman plots for overall 744 pairs of (a) HR measurements and (b) BR measurements, with two halves of these pairs respectively measured in bright and dark conditions (HR: heart rate, BVP: blood volume pulse, BR: breathing rate, BW: breathing waveform, NIR: near-infrared, bpm: beats/breaths per minute).


2-15 An exemplary 30-second window in BR measurement. (a) Breathing waveform measured by the respiration belt. (b) Breathing movement estimated from the NIR video. (c) Lomb-Scargle periodogram of the breathing waveform within the prospective BR frequency band. (d) Lomb-Scargle periodogram of the estimated breathing movement within the prospective BR frequency band.

2-16 An example of a participant wearing a collared shirt.

2-17 We present DeepPhys, a novel approach to video-based physiological measurement using convolutional attention networks that significantly outperforms the state-of-the-art and allows spatial-temporal visualization of physiological information in video. DeepPhys is an end-to-end network that can accurately recover heart rate and breathing rate from RGB or infrared videos.

2-18 The architecture of our end-to-end convolutional attention network. The current video frame at time t and the normalized difference between frames at t+1 and t are given as inputs to the appearance and motion models respectively. The network learns spatial masks, which are shared between the models, and features important for recovering the BVP and respiration signals.

2-19 Example frames from the four datasets: (a) RGB Video I, (b) RGB Video II, (c) MAHNOB-HCI, (d) Infrared Video. The yellow bounding boxes indicate the areas cropped as the input of our models. (e) shows exemplary attention weights of the left frame in (a) for HR measurement, and of the right frame in (a) for BR measurement.

3-1 (a) 0.5 Hz sawtooth wave. (b) Its power spectral density (PSD). (c) Simulated time series with a mixture of the sawtooth-wave signal and random white Gaussian noise (left), and the signals recovered by principal component analysis (PCA), a 0.45-0.55 Hz band-pass filter, and a 0.45-1.55 Hz band-pass filter (right).


3-2 Flow charts of motion component magnification. The red patch in the power spectral density (PSD) figures indicates the frequency band of interest: 0.8258 Hz - 0.9133 Hz. All the temporal signals have been normalized to the same amplitude for better visualization.

3-3 The first frames of videos with red scan lines plotted over time: (a) the input video with sawtooth color changes, 𝛼 = 1. (b) the output video of linear EMM. (c) the output video of the new MCM. (d) the ground truth video, 𝛼 = 5. (Video: sim1)

3-4 A plot of the errors as functions of magnification factors for each method.

3-5 The first frames of videos with red scan lines plotted over time: (a) the input video with subtle motions in a square wave, 𝛽 = 1. (b) the output video of phase-based EMM. (c) the output video of the new MCM. (d) the ground truth video, 𝛽 = 4. (Video: sim2)

3-6 A plot of the errors as functions of magnification factors for each method.

3-7 The first frames of the noise videos: (a) 12 dBW. (b) 18 dBW. (c) 24 dBW. (d) 30 dBW. (e) 36 dBW. (Video: noise)

3-8 A plot of the errors as functions of noise powers for each method.

3-9 Video example making the pulse waveform visible from the face. In every subfigure, the left image is the first frame of the video with a yellow scan line, and the right image is the scan line plotted over time. (a) Original video. (b) Video magnified by Wu et al. [102] imposes a pure sinusoid. (c) Video magnified by the MCM method uncovers a varying shape more faithful to the underlying motion. In (b) and (c), the average green-channel intensities of the scan line at each time point are also superimposed. (Video: face1)


3-10 Video example showing three vibrating strings of a guitar. In every subfigure, the left image is the first frame of the video, with a yellow scan line over three strings and a green rectangle containing a non-motion area, while the right image is the scan line plotted over time. (a) Original video. (b) Video magnified by Wu et al. [102]. (c) Video magnified by Wadhwa et al. [93]. (d) Video magnified by MCM. (Video: guitar)

3-11 Video example of wrist pulse. In every subfigure, the left image is the first frame of the video, with a yellow scan line over the wrist and a green rectangle containing a non-motion area, while the right image is the scan line plotted over time. (a) Original video. (b) Video magnified by Wu et al. [102]. (c) Video magnified by Wadhwa et al. [93]. (d) Video magnified by MCM: note also the reduced noise in the non-motion area. (Video: wrist)

3-12 (a) The first frame of a video showing two men annotated by scan lines on their faces. (b) The scan lines plotted over time in the original video. (c) The scan lines plotted over time in the magnified video. (Video: duo)

3-13 The first frames of the guitar video with yellow scan lines, and the scan line plotted over time. (a) Original video. (b) Processed video with the vibration of two strings magnified. (Video: guitar)

3-14 A video example showing a burning candle in a dark room. In every subfigure, the left image is the first frame of the video with a yellow scan line, and the right image is the scan line plotted over time. (a) Original video. (b) Processed video with the dominant motion (background brightness variance) magnified. (Video: candle)

3-15 The first frames of the face video with yellow scan lines on the shoulder, and the scan line plotted over time. (a) Original video. (b) Processed video with the dominant motion (respiration) magnified. (Video: face1)


3-16 Average normalized band power and accumulated band power for different numbers of analyzed components in (a) face, (b) guitar, and (c) wrist.

3-17 Normalized band power 𝑄 of each 8 x 8 block among all pyramid levels and color channels of the guitar example. (a) Red channel. (b) Green channel. (c) Blue channel. The two lines with high 𝑄 in all the color channels correspond to the vibration of the low E string and its reflection.

3-18 Normalized band power 𝑄 of each 8 x 8 block among all pyramid levels and color channels of the wrist example. (a) Red channel. (b) Green channel. (c) Blue channel. The small dots with high 𝑄 in the 3rd, 4th and 5th pyramid levels correspond to the location of the ulnar artery on the wrist.

3-19 Normalized band power 𝑄 of each 8 x 8 block among all pyramid levels and color channels of the face example. (a) Red channel. (b) Green channel. (c) Blue channel. The vertical lines with high 𝑄 in the first three pyramid levels indicate the location of the carotid pulse, while the large high-𝑄 areas in the last three levels of the green channel correspond to the global color changes of the face.

3-20 The first frames of the face video with yellow scan lines, and the scan line plotted over time. (a) Processed video with only the facial color changes magnified. (b) Processed video with only the carotid pulse magnified. (Video: face1)


3-21 We present a novel end-to-end deep neural framework for video magnification (DeepMag). Our method allows measurement, magnification and synthesis of subtle color and motion changes from a specific source even in the presence of large motions. We demonstrate this via pulse and respiration manipulation in 2D videos. Our approach produces magnified videos with substantially fewer artifacts when compared to previous methods, such as Eulerian Video Magnification [102] shown here. Our method magnifies the red color changes more clearly and otherwise the video frames more faithfully reflect the input video (i.e., show fewer artifacts).

3-22 The architecture of DeepMag. The CNN model predicts the motion signal of interest based on a motion representation computed from consecutive video frames. Magnification of the motion signal in video can be achieved by amplifying the L2 norm of its first-order derivative and then propagating the changes back to the motion representation using gradient ascent.

3-23 We used two exemplar tasks to illustrate the benefits of DeepMag. (a) Color change (blood flow) magnification. (b) Motion (respiration) magnification. These two tasks require different input motion representations (frame differences) and CNN architectures due to the nature of the motion signals. The color change magnification input is a normalized frame difference computed from the downsampled RGB frames. The motion magnification input is the phase variations in a complex steerable pyramid; we computed a pyramid with octave bandwidth and four orientations (𝜃 = 0°, 45°, 90°, 135°).

3-24 Exemplary frames from the four tasks of our video dataset. Note the different backgrounds and head rotation speeds.


3-25 Scan line comparisons of color change magnification methods for a Task D video: (a) original video, (b) Eulerian video magnification [102], (c) video acceleration magnification [105], (d) our method. The yellow line shows the source of the scan line in the frames. The section of video shown was 15 seconds in duration. Our method produces clearer magnification of the color change due to blood flow and significantly fewer artifacts.

3-26 Scan line comparisons of motion magnification methods for a Task B video: (a) original video, (b) phase-based Eulerian video magnification [93], (c) video acceleration magnification [105], (d) learning-based motion magnification [63], (e) our method. The yellow line shows the source of the scan line in the frames. The section of video shown was 15 seconds in duration. Our method produces comparable magnification of the respiration motion and significantly fewer artifacts and blurring.

3-27 Original and magnified traces of a pixel (the yellow dot) in three color channels of a Task B video: (a) red channel, (b) green channel, (c) blue channel. Magnified traces using different step sizes 𝛾 are shown in different colors. The notches (large variations in image intensity) in the traces correspond to when the participant rotated her head to the far left/right and the pixel was no longer on the skin. Our method amplified the subtle color changes of the pixel only when it was on the skin, and kept the relative magnitudes of the pulse in three color channels, with the green channel being the strongest.

3-28 Original and magnified traces of a pixel (the red dot) in the phase representation $\varphi(r_0, \theta, t)$ of a Task C video along four orientations: (a) 𝜃 = 0°, (b) 𝜃 = 45°, (c) 𝜃 = 90°, (d) 𝜃 = 135°. Magnified traces using different step sizes 𝛾 are shown in different colors on the plots and on different scales. The pixel exhibits a respiration movement mainly in the vertical direction, so its magnified phase traces have the highest amplitude along the 𝜃 = 90° orientation.


3-29 Learning curves: (a) The change of the CNN loss with different numbers of iterations 𝑁 and different step sizes 𝛾. (b) The change of the CNN loss with different products of 𝑁 and 𝛾.

3-30 (a) Time series and histograms of the L1 norms of the input motion representation $X_1$ for a 30-second video. (b) Time series and histograms of the L1 norms of the motion gradient $\nabla \|y(X_1 \mid \theta)\|_2$ for the same video.

3-31 Pixel-wise correlation coefficients between the input and magnified motion representations in the respiration magnification task, without the sign correction mechanism (b) and with the sign correction mechanism (c). A masked version of (c) is shown with black pixels in the background of the original input video colored black. Note: Due to the rotation of the head in the video, the region is larger than the face in the example frame; some of the pixels were black in only some of the frames.

3-32 Scan lines for the motion (respiration) magnification method applied to the "head" video and color change (pulse) magnification applied to the "baby2" video from [102]. In these cases the step size 𝛾 was set to $1 \times 10^{-2}$ for motion and $1 \times 10^{-3}$ for color change magnification. The yellow line shows the source of the scan line in the frames. (a) Original input video of head. (b) Motion magnified video. (c) Input video of baby. (d) Color change magnified video. Note: Our method does not eliminate all artifacts in the magnified videos. For example, in (b) the edges of the shoulders become blurred and some definition is lost.

4-1 Flow charts of our methodology. The red patch in the power spectral density (PSD) figures indicates the frequency band of interest: 0.8258 Hz - 0.9133 Hz. All the temporal signals have been normalized to the same amplitude for better visualization.


4-2 (a) Mean absolute error of heart rate measurement, (b) normalized pulse power 1, and (c) normalized pulse power 2, averaged among all input videos and all output videos using different methods and different elimination factors. An asterisk indicates a significant difference from the input at a 0.05 significance level.

4-3 Exemplary frames and power spectra of the estimated pulse signal from an input video (a) and output videos processed by different elimination factors: (b) 𝐴 = 0, (c) 𝐴 = −0.5, (d) 𝐴 = −1.0, and (e) 𝐴 = −1.5. The yellow bounding box indicates the region of interest we applied our algorithm to, and the red vertical lines mark the ground truth heart rate in each power spectrum.

4-4 Scatter plots show the relationships between the physiological elimination indicators (MAE = mean absolute error of heart rate measurement, NPP1 = normalized pulse power 1, NPP2 = normalized pulse power 2) and the video appearance distortion indicators (RMSE = root mean square error, SSIM = structural similarity index). MAE, NPP1 and NPP2 were converted into multiples by dividing them by their corresponding input values. Least-squares lines were also fit and superimposed in each subplot.

5-1 Separation of two images with different polarizing filters into specular, shading, melanin, and hemoglobin components [86].

5-2 (Left y-axis) Absorption coefficients of melanin, oxygenated hemoglobin, and water in the human skin. (Right y-axis) Scattering coefficient of skin. $A_{diff}$ is the absorption difference between the red channel and near-infrared.


List of Tables

2.1 Core steps of video-based autonomic activity estimation methods

2.2 Heart rate measurement

2.3 Breathing rate measurement

2.4 Statistics with respect to each video

2.5 Performance of heart rate and breathing rate measurement for RGB Video I. Participant dependent (p.-dep.) and participant independent results are shown, as are task independent results for the six tasks with varying levels of head rotation

2.6 RGB Video II, MAHNOB-HCI and Infrared Video dataset results.


3.1 Video quality measured via Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) for the magnified videos. The baselines for color change magnification were EVM [102] and VAM [105], and for motion magnification were phase-EVM [93], VAM and learning-based motion magnification [63]. The table shows the average metrics among all videos within each task, while the bar charts also show the standard deviations as error bars. Our models (both participant-dependent and participant-independent) produce videos with higher PSNR and SSIM compared to the baselines for all tasks. The benefit of our model is particularly strong for videos with greater levels of head rotation. We observed that although magnification itself changes the video, artifacts dominated those changes, and thus PSNR and SSIM still provide a reasonable quantitative measure of overall magnified video quality.

3.2 Video quality measured via Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) for Task C videos magnified to different levels.

4.1 Mean absolute error of heart rate measurement (MAE, in the unit of beats per minute), normalized pulse power 1 (NPP1), and normalized pulse power 2 (NPP2) for each video. The parameter A represents the elimination factor. All the output videos were processed by our algorithm using the fine estimate of the heart rate frequency band. The last two rows show the t-test results between each output metric and its corresponding input metric.

4.2 Root mean square error (RMSE) and structural similarity (SSIM) between each output video and its corresponding input video within the regions of interest. The parameter A represents the elimination factor. All the output videos were processed by our algorithm using the fine estimate of the heart rate frequency band.


A.1 Table of the parameter values used to produce the videos in the autonomic activity visualization chapter.

A.2 Mean absolute error of heart rate measurement (MAE, in the unit of beats per minute), normalized pulse power 1 (NPP1), and normalized pulse power 2 (NPP2) for each video. The parameter A represents the elimination factor. All the output videos were processed by our algorithm using the coarse estimate of the heart rate frequency band. The last two rows show the t-test results between each output metric and its corresponding input metric.

A.3 Mean absolute error of heart rate measurement (MAE, in the unit of beats per minute), normalized pulse power 1 (NPP1), and normalized pulse power 2 (NPP2) for each video. The parameter A represents the elimination factor. All the output videos were processed by the Eulerian motion magnification algorithm. The last two rows show the t-test results between each output metric and its corresponding input metric.

A.4 Root mean square error (RMSE) and structural similarity (SSIM) between each output video and its corresponding input video within the regions of interest. The parameter A represents the elimination factor. All the output videos were processed by our algorithm using the coarse estimate of the heart rate frequency band.


Chapter 1 Introduction

1.1 Motivation

Autonomic activity is short for the activity of the autonomic nervous system (ANS). The ANS is the part of the nervous system responsible for regulating and integrating the functioning of internal organs. It acts largely unconsciously and regulates multiple physiological processes, such as blood pressure, heart rate, body temperature, digestion, metabolism, fluid and electrolyte balance, sweating, urination, defecation and sexual response. Disorders of autonomic regulation are described in many diverse diseases, both those that directly afflict the nervous system and those afflicting other organs, where they trigger or enhance pathological symptoms [48, 7, 91, 61, 82, 20, 37, 51]. In the past decades, the significance of disturbances of autonomic regulation in circulatory system diseases has been especially emphasized [77, 50, 57, 40, 106, 72, 62, 52, 36]. Clinical symptoms of these disturbances (e.g. migraine, anxiety, distress, insomnia, chronic fatigue, dry eyes, vertigo, tinnitus, palpitation, and acid reflux) are frequently non-characteristic and shared across many diseases. Therefore, in order to identify them, it is essential to assess ANS function by measuring autonomic activity.

Traditionally, for the assessment of ANS function, autonomic activity is measured by medical devices with contact sensors. Common tests include:


∙ Tests of autonomic cardiovascular reflexes [51, 27, 33]: Measuring blood pressure and ECG responses (mainly RR interval changes) to a deep breathing test, isometric handgrip test, cold pressor test (immersion of hands or feet in cold water for about 60-90 s), mental arithmetic test (usually serial subtraction of 7 from 100 or of 13 from 1000), and/or Valsalva maneuver (voluntary forced expiration against a resistance).

∙ Analysis of heart rate variability (HRV) [49, 67]: Time-domain or frequency-domain analysis of full 24-h ECG recordings measured from a Holter monitor, or 5-min recordings in response to various factors.

∙ Microneurography [84]: Invasive measurement of sympathetic nerve activity by means of tungsten microelectrodes inserted selectively into muscle or skin fascicles.

∙ Testing of sudomotor function [51]: Measuring electrodermal activity (EDA) responses to physiological stimulation (loud noise, flash, touch, inspiratory gasps) or electrical stimulation (of median, tibial, peroneal or supraorbital nerve); sweat secretion measured by color-changing powders in response to raised body temperature in a sweat chamber; quantitating sweat volume using a sudorometer following direct activation of the sudomotor nerve fibres (acetylcholine iontophoresis).
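To make the time-domain HRV analysis mentioned in the list above more concrete, the following is a minimal sketch of how a few widely used statistics could be computed from a sequence of RR intervals. The function name `hrv_time_domain` and the choice of SDNN, RMSSD and pNN50 are illustrative assumptions and are not taken from the cited tests; clinical analysis would additionally involve artifact and ectopic-beat correction, which is omitted here.

```python
import math
import statistics


def hrv_time_domain(rr_ms):
    """Compute basic time-domain HRV statistics.

    rr_ms: successive RR intervals in milliseconds (hypothetical input,
    assumed to be already cleaned of artifacts and ectopic beats).
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")

    # Differences between successive RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    mean_rr = statistics.fmean(rr_ms)

    return {
        # Mean heart rate in beats per minute.
        "mean_hr_bpm": 60000.0 / mean_rr,
        # SDNN: sample standard deviation of the RR intervals.
        "sdnn_ms": statistics.stdev(rr_ms),
        # RMSSD: root mean square of successive differences.
        "rmssd_ms": math.sqrt(statistics.fmean([d * d for d in diffs])),
        # pNN50: fraction of successive differences larger than 50 ms.
        "pnn50": sum(abs(d) > 50 for d in diffs) / len(diffs),
    }


if __name__ == "__main__":
    example_rr = [800, 810, 790, 850, 780]  # made-up values for illustration
    for name, value in hrv_time_domain(example_rr).items():
        print(f"{name}: {value:.3f}")
```

The same statistics apply whether the RR intervals come from a 24-h Holter recording or a 5-min recording; only the length of the input sequence changes.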

Most of these tests require wearing cumbersome equipment on the human body. As a result, even though many provocative maneuvers (such as deep breathing and mental arithmetic) are simple and can be performed on one’s own, most of the tests must be conducted in clinics. Similarly, even though long-term ambulatory measurement of autonomic activity is promising for the diagnosis and tracking of ANS disorders, it is largely precluded by the non-portability of the medical devices. One of the few exceptions is ambulatory HRV analysis, which enables estimation of a parasympathetic component of the ANS responses but usually relies on uncomfortable chest sensors and adhesive electrodes that cannot be worn for long. Moreover, given the complexity of the autonomic system, there is no single test that precisely reflects the function of a specific branch of the system. Therefore, it is not uncommon to order numerous tests based on diverse reflexes, which involves putting on and taking off multiple different devices.

A potential solution to long-term measurement of autonomic activity is via camera-based human sensing. We are living in a world where we are surrounded by so many intelligent video-capturing devices. These devices capture data about how we live and what we do. Recent research has shown that they can be combined with computer vision algorithms to realize non-contact estimation of human physiology such as heart rate [69], respiration rate [69], blood oxygen [30] and HRV [70]. Compared with traditional contact sensors, the cameras can be more physically unobtrusive with less burden on users. Video frames captured by cameras usually have at least thousands of pixels on the human body, which provides a huge potential for noise cancellation under ambulatory environments in comparison to one or only a handful of signal channels in contact devices. In addition, camera-based methods enable simultaneous collection and fusing of multiple modalities (e.g. cardiovascular activity and sweating) from a single device.

However, the performance of all the existing methods is still unsatisfactory in realistic ambulatory situations. Their measurement accuracy is vulnerable to many common factors: specular reflection on the skin, lighting intensity and color changes, and rigid (e.g. head rotation) and non-rigid (e.g. facial expressions and talking) body motions. Specular reflection in video exhibits the spectral distribution of the lighting source instead of the human skin underneath. For autonomic activity manifested as spectral distribution changes (e.g. rPPG), no physiological information could be inferred from the specular reflection areas. As a result, if human subjects stay in a very bright environment, specular reflection might cover the majority of their skin surfaces, making usable regions of interest much smaller. Since specular reflection usually has high pixel values and its distribution is sensitive to the changes of surface normals of any object, noise caused by rigid and non-rigid motions will also be exacerbated in video with scattered specular reflection. Rigid body motions can be compensated by face tracking or attention mechanisms, but it is difficult to take the same measures for non-rigid motions. When the intensity of skin deformation is low, non-rigid motions can be attenuated via spatial averaging. However, when the intensity is high, the shading component changes will dominate the video and bury any target signal. Different from lighting intensity changes, lighting color changes, though rare, are extremely hard to detect and remove. This is because the decoupling of the incoming lighting and the skin absorption is under-constrained without prior knowledge about the lighting source spectrum or the skin color. A common situation involving lighting color changes is when a human subject is close to a computer monitor or a phone screen showing color-changing content. The main theme of this thesis will be about how to overcome these challenges in various tasks so as to get closer to unobtrusive analysis of autonomic activity from human video that can work in the field.

1.2 Areas of Work

Figure 1-1: Common modalities used in autonomic activity estimation from human videos: (a) remote photoplethysmography (rPPG), (b) imaging ballistocardiogram (iBCG), (c) carotid pulse, (d) respiratory movement, (e) blinking rate and pupil diameter, (f) eccrine sweat gland activity.

The autonomic nervous system (ANS) is regulated by integrated reflexes through the brainstem to the spinal cord and organs. Autonomic functions are reflected as activity of autonomic effectors. Though autonomic effectors include the majority of human organs, not every one of them has activity that can be easily observed on the human body surface.

The most common modalities of autonomic activity that can be observed in human video are shown in Fig. 1-1 and summarized as follows:


(a) Subtle color changes of the skin caused by blood circulation (a.k.a., remote photoplethysmography, or rPPG) [90, 69, 70, 16, 83, 96]

(b) Subtle body motions associated with the blood ejection into the vessels (a.k.a., imaging ballistocardiography, or iBCG) [2]

(c) Subtle skin deformations of the neck caused by palpable pulsatile changes in the carotid arterial diameters

(d) Chest volume changes during breathing [81, 32]

(e) Eye-blink rates and pupil diameter changes associated with arousal responses

(f) Active sweat gland counts measured in thermal imaging [39]

To better understand these modalities, progress made in the research area has been mainly around two tasks: autonomic activity estimation and autonomic activity visualization. Below I summarize partial solutions previous researchers have provided and problems they face:

Autonomic Activity Estimation: Over the past decade, video-based autonomic activity estimation using RGB cameras has developed significantly [54]. For instance, physiological parameters such as heart rate (HR) and breathing rate (BR) have been accurately extracted from stationary facial videos. Early work on remote photoplethysmography identified that spatial averaging of skin pixel values from an imager could be used to recover the blood volume pulse [80]. The strongest pulse signal was observed in the green channel [90], but a combination of color channels provides improved results [69, 55]. Combining these insights with face tracking and signal decomposition enables a fully automated recovery of the pulse wave and heart rate [69].

In the presence of dynamic lighting and motion, advancements were needed to successfully recover the pulse signal. Leveraging models grounded in the optical properties of the skin has improved performance. The CHROM [16] method uses a linear combination of the chrominance signals. It makes the assumption of a standardized skin color profile to white-balance the video frames. The Pulse Blood Vector (PBV) method [17] relies on characteristic blood volume changes in different regions of the frequency spectrum to weight the color channels. Adapting the facial ROI can improve the performance of iPPG measurements as blood perfusion varies in intensity across the body [87].

However, none of the techniques above have achieved robust accuracy under heterogeneous lighting and major motions, which makes it still infeasible to estimate autonomic activity from human video in an unobtrusive setting. In addition, most of the recent approaches are complex multi-stage methods that are hard to tune and implement. Almost all of them require face tracking and registration, skin segmentation, color space transformation, signal decomposition and filtering steps. There is currently no end-to-end solution to autonomic activity estimation.

Another problem with all the previous studies is that they leverage pervasive RGB cameras such as webcams, which require a bright environment to work well [101]. This requirement cannot be met in many real-life scenarios such as sleep monitoring, dark office spaces, and night surveillance. To help address this problem, researchers have explored the use of other sensing modalities. For instance, microwave Doppler radar and thermal imaging have been used for non-contact HR and BR measurements [29, 26, 23, 13], and laser Doppler vibrometry has been used to remotely track the carotid pulse [73]. However, measuring these modalities usually requires expensive and custom hardware which is not readily available.

Autonomic Activity Visualization: Subtle variations on the human body related to autonomic activity are usually difficult to observe with the unaided eye, yet visualizing them can be very informative. Eulerian video magnification (EVM) proposed by Wu et al. [102] combines spatial decomposition with temporal filtering to reveal time varying signals such as pulse and respiration without estimating motion trajectories. However, it uses linear magnification that only allows for relatively small magnifications at high spatial frequencies and cannot handle spatially variant magnification. To counter the limitation, Wadhwa et al. [93] proposed a non-linear phase-based approach, magnifying phase variations of a complex steerable pyramid over time. Replacing the complex steerable pyramid [93] with a Riesz pyramid [94] produces faster results. In general, the linear EVM technique is better at magnifying small color changes (e.g. rPPG), while the phase-based pipeline is better at magnifying subtle motions (e.g. iBCG and respiratory movement) [102]. Both the EVM and the phase-EVM techniques rely on hand-crafted motion representations. To optimize the representation construction process, a learning-based method [63] was proposed, which uses convolutional neural networks as both frame encoders and decoders. With the learned motion representation, fewer ringing artifacts and better noise characteristics have been achieved.

These previous methods have achieved success in amplifying global color variations or local motions of objects, but they tend to be inaccurate for visualizing non-sinusoidal motions due to the usage of band-pass filters. This is particularly a problem for visualization of autonomic activity, as most of the related motions are not sinusoidal.
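The distortion can be illustrated with a toy signal. The sketch below assumes a hypothetical periodic motion made of a 1 Hz fundamental plus a second harmonic, and an idealized band-pass filter that keeps only the fundamental; the helper `keep_single_frequency` is an illustrative stand-in for the temporal filters used in motion magnification, not the filter of any particular method.

```python
import math

fs, f0, seconds = 30.0, 1.0, 4   # hypothetical frame rate (Hz), pulse rate (Hz), duration (s)
n = int(fs * seconds)
t = [k / fs for k in range(n)]

# Non-sinusoidal periodic motion: a fundamental plus a strong second harmonic.
signal = [math.sin(2 * math.pi * f0 * x) + 0.5 * math.sin(2 * math.pi * 2 * f0 * x) for x in t]

def keep_single_frequency(x, freq, fs):
    """Idealized narrow band-pass: keep only the DFT component at `freq`."""
    n = len(x)
    c = sum(v * math.cos(2 * math.pi * freq * k / fs) for k, v in enumerate(x)) * 2 / n
    s = sum(v * math.sin(2 * math.pi * freq * k / fs) for k, v in enumerate(x)) * 2 / n
    return [c * math.cos(2 * math.pi * freq * k / fs) + s * math.sin(2 * math.pi * freq * k / fs)
            for k in range(n)]

filtered = keep_single_frequency(signal, f0, fs)
# RMS of what the filter removed: the harmonic that carried the waveform shape.
residual_rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(signal, filtered)) / n)
```

The filtered output is a pure sinusoid, so amplifying it would magnify a motion whose morphology differs from the original one.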

Another problem with all the methods above is that they are limited to stationary objects, whereas many realistic applications would involve small motions of interest in the presence of large ones (e.g. pulse and respiration under large torso or head motions). After motion magnification, these large motions would result in large artifacts such as haloes or ripples, and overwhelm any small temporal variation.

In the course of improving autonomic activity estimation and visualization methods, two new research domains emerge that have not been explored before. I summarize them as follows:

Ethics and Privacy Issues: As the theme of this thesis involves video-capturing technology, it will inevitably raise issues about ethics and privacy. Now due to the development of autonomic activity estimation and visualization algorithms, whenever you are in front of a camera, people can not only recognize your identity based on your appearance, but also monitor some aspects of your health and affective state. This information might be misused for manipulation in marketing, negotiation, and other situations. Moreover, there has been no way to eliminate this channel of information from facial videos without affecting the visual appearance.

Though there are many algorithms measuring physiological signals from facial videos, it is not trivial to develop a new method for eliminating these signals. First, most of the measurement algorithms are irreversible, which means the attenuation of their output signals cannot be properly propagated to their input videos. For example, many methods take the average of multiple pixels in a region of interest to form raw physiological traces, but not every pixel within the region includes the needed physiological information. Therefore, uniformly removing the estimated traces from the whole region would result in artifacts on the non-relevant pixels. Second, the target of most of the measurement methods is to synthesize a single reading such as a heart rate from a video epoch instead of recovering the temporal shape of the physiological signal faithfully. Without the ability to estimate a signal faithfully, it will also be hard to eliminate it cleanly.

Difficulty in collecting and labeling data: For any research about non-contact estimation of autonomic activity, video data with corresponding ground truth is important. On one hand, the performance evaluation of algorithms relies on comparison with the ground truth on datasets. On the other hand, deep learning techniques have demonstrated superiority in various computer vision tasks, and most of the remarkable successes are owing to supervised learning algorithms using some form of ground truth as training labels.

However, collecting data for video-based autonomic activity measurement is especially difficult. In contrast to object recognition, facial expression recognition, and other common computer vision tasks, most autonomic modalities are "unlabelable" in video, because they are usually subtle variations imperceptible to the human eye. As a result, to obtain ground truth about autonomic activity, physiological signals such as ECG and skin conductance have to be collected from the bodies of human subjects during experiments using medical devices, which has a much higher cost than labeling existing images/videos. Consequently, only a limited number of datasets are available in the research community. Though a vast number of human videos are shared on the Internet, they cannot be directly used.

An optimal dataset of autonomic activity in video not only needs accurate ground truth labels but also needs to contain as many conditions as possible. Since the measurement accuracy is affected by various factors concerning lighting, movement and skin color, an algorithm that works well under one condition is not guaranteed to perform well under another. Unfortunately, nearly all the existing datasets were collected in lab settings with up to one condition simulated. Without enough data samples under each possible condition, adopting domain adaptation would also be difficult.

1.3 Thesis Aims

The specific research aims of this thesis are the following:

∙ To propose a video-based autonomic activity estimation method that can work under dynamic or absent illumination. In the method, subtle motions associated with the carotid pulse and breathing movement are captured from the neck using near-infrared (NIR) video imaging.

∙ To propose a video-based autonomic activity estimation method that can work under major body motions. The method features an end-to-end learning system using convolutional attention networks to overcome the influence of motion-related noise.

∙ To propose a video-based visualization method that can magnify autonomic activity faithfully. The method replaces linear temporal filters in traditional motion magnification algorithms with principal component analysis so that the morphology of non-sinusoidal motions can be better kept.

∙ To propose a video-based autonomic activity visualization method that can work under major body motions. By performing gradient ascent in a deep neural framework, the method allows magnification and synthesis of subtle color and motion changes from a specific source even in the presence of large motions.

∙ To propose a method for eliminating autonomic activity from facial videos based on motion component magnification. With any facial video as input, the method outputs a video visually the same but from which physiological signals such as rPPG can no longer be measured accurately.

1.4 Thesis Outline

The remainder of the thesis will cover the following materials:

Chapter 2 defines the problem of autonomic activity estimation with a general model, and introduces a notation system associated with it. Using the notation system, previous works in the area are summarized as several core steps. Then the chapter focuses on the solutions to two of the most common obstacles: how to estimate autonomic activity under extreme lighting conditions, and how to estimate autonomic activity under major body motions.

Chapter 3 provides a background on using motion magnification methods for visualizing autonomic activity in human video. Since most autonomic activities are represented by non-sinusoidal signals, a method designed for faithfully magnifying non-sinusoidal motions is then introduced. In realistic situations, autonomic activity nearly always co-exists with other body motions, so the remaining content of the chapter presents a source-specific motion magnification algorithm that can visualize autonomic activity specifically.

Chapter 4 establishes a new concept called autonomic activity elimination. After reviewing some related works, the chapter describes a novel framework for video processing that can attenuate autonomic activity information in human videos without changing their visual appearance.


Chapter 5 summarizes the conclusions and contributions of this thesis and provides a discussion of future work.


Chapter 2 Autonomic Activity Estimation

2.1 Problem Statement and Notation

All video-based autonomic activity estimation methods involve capturing color changes or subtle motions of the human skin using a camera. For modeling lighting, imagers and physiology, previous works used the Lambert-Beer law (LBL) [41, 103] or Shafer’s dichromatic reflection model (DRM) [96]. We state the problem and build our models on top of the DRM as it provides a better framework for separating specular reflection and diffuse reflection. Assume the light source has a constant spectral composition but varying intensity, and $I(t)$ is the luminance intensity level, which changes with the light source as well as the distance between the light source, skin tissue and camera. $I(t)$ is then modulated by two components in the DRM: specular reflection $\mathbf{v}_s(t)$, mirror-like light reflection from the skin surface, and diffuse reflection $\mathbf{v}_d(t)$, the absorption and scattering of light in skin-tissues. Let $\mathbf{v}_n(t)$ denote the quantization noise of the camera sensor, and we can finally define the RGB values of the $k$-th skin pixel $\mathbf{c}_k(t) \in \mathbb{R}^3$ in an image sequence by a time-varying function:

$$\mathbf{c}_k(t) = I(t) \cdot (\mathbf{v}_s(t) + \mathbf{v}_d(t)) + \mathbf{v}_n(t) \tag{2.1}$$

In Eq. 2.1, $I(t)$, $\mathbf{v}_s(t)$ and $\mathbf{v}_d(t)$ can all be further decomposed into a stationary and a time-dependent part through a linear transformation [96]. Let $\mathbf{u}_d$ denote the unit color vector of the skin-tissue, $d_0$ denote the stationary reflection strength, $\mathbf{u}_p$ denote the relative pulsatile strengths caused by hemoglobin and melanin absorption, and $p(t)$ denote the blood volume pulse (BVP). We will have:

$$\mathbf{v}_d(t) = \mathbf{u}_d \cdot d_0 + \mathbf{u}_p \cdot p(t) \tag{2.2}$$

Let $\mathbf{u}_s$ denote the unit color vector of the light source spectrum, and $s_0$ and $s(t)$ denote the stationary and varying parts of specular reflections. We will have:

$$\mathbf{v}_s(t) = \mathbf{u}_s \cdot (s_0 + s(t)) \tag{2.3}$$

Let $I_0$ be the stationary part of the luminance intensity, and $I_0 \cdot i(t)$ be the intensity variation observed by the camera. We will have:

$$I(t) = I_0 \cdot (1 + i(t)) \tag{2.4}$$

The stationary components from the specular and diffuse reflections can be combined into a single component representing the stationary skin reflection:

$$\mathbf{u}_c \cdot c_0 = \mathbf{u}_s \cdot s_0 + \mathbf{u}_d \cdot d_0 \tag{2.5}$$

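where $\mathbf{u}_c$ denotes the unit color vector of the skin reflection and $c_0$ denotes the reflection strength. Substituting (2.2), (2.3), (2.4) and (2.5) into (2.1) produces:

$$\mathbf{c}_k(t) = I_0 \cdot (1 + i(t)) \cdot (\mathbf{u}_c \cdot c_0 + \mathbf{u}_s \cdot s(t) + \mathbf{u}_p \cdot p(t)) + \mathbf{v}_n(t) \tag{2.6}$$

As the time-varying components are much smaller (i.e., orders of magnitude) than the stationary components in (2.6), we can neglect any product between varying terms and approximate $\mathbf{c}_k(t)$ as:

$$\mathbf{c}_k(t) \approx \mathbf{u}_c \cdot I_0 \cdot c_0 \cdot (1 + i(t)) + \mathbf{u}_s \cdot I_0 \cdot s(t) + \mathbf{u}_p \cdot I_0 \cdot p(t) + \mathbf{v}_n(t) \tag{2.7}$$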
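The approximation step can be made explicit. As a worked expansion using only the symbols defined above (a sketch for the reader, not an additional result), multiplying out the product of $(1 + i(t))$ with the bracketed sum yields:

```latex
\begin{aligned}
\mathbf{c}_k(t)
  &= I_0 \, (1 + i(t)) \, \bigl(\mathbf{u}_c c_0 + \mathbf{u}_s s(t) + \mathbf{u}_p p(t)\bigr) + \mathbf{v}_n(t) \\
  &= \mathbf{u}_c I_0 c_0 + \mathbf{u}_c I_0 c_0 \, i(t) + \mathbf{u}_s I_0 \, s(t) + \mathbf{u}_p I_0 \, p(t) \\
  &\quad + \underbrace{\mathbf{u}_s I_0 \, i(t) s(t) + \mathbf{u}_p I_0 \, i(t) p(t)}_{\text{products of varying terms, neglected}}
     + \mathbf{v}_n(t)
\end{aligned}
```

The two underbraced terms are products of two small time-varying signals, which is why they are dropped while the term $\mathbf{u}_c I_0 c_0 \, i(t)$, scaled by the large stationary strength $c_0$, is kept.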

To minimize the camera quantization error $\mathbf{v}_n(t)$, spatial averaging is then conducted in nearly all methods. The whole skin area is divided into $L$ skin patches in rectangles (via spatial downsampling) or triangles (via facial landmark triangulation), and the RGB values of all pixels within each patch $l \in (1, \cdots, L)$ are averaged to form the final definition of the problem

$$\mathbf{c}_l(t) \approx \mathbf{u}_c \cdot I_0 \cdot c_0 \cdot (1 + i(t)) + \mathbf{u}_s \cdot I_0 \cdot s(t) + \mathbf{u}_p \cdot I_0 \cdot p(t) \tag{2.8}$$

In the simplest case, all the skin pixels are averaged together to generate a single $\mathbf{c}_l(t)$ (i.e. $L = 1$).

In brief, every $\mathbf{c}_l(t)$ is a linear mixture of three source-signals $i(t)$, $s(t)$ and $p(t)$. It can be also regarded as a 3D space spanned by three vectors $\mathbf{u}_c$, $\mathbf{u}_s$ and $\mathbf{u}_p$, in which $\mathbf{u}_s$ depends on the light source color, $\mathbf{u}_c$ depends on the skin reflection color (decided by the light source color and the intrinsic skin color, i.e. the optical absorption of skin melanin), and $\mathbf{u}_p$ depends on the pulse-induced color variation (decided by the light source color and the optical absorption of hemoglobin). For any of the autonomic activity estimation methods, the goal is to extract $p(t)$ from $\mathbf{c}_l(t)$.
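To make the linear mixture in Eq. 2.8 tangible, the following sketch synthesizes the RGB trace of one averaged skin patch. All numerical values (color directions, strengths, frequencies, frame rate) are hypothetical choices for illustration, and the direction vectors are not normalized to unit length.

```python
import math

fs = 30.0                                   # hypothetical camera frame rate (Hz)
t = [k / fs for k in range(300)]            # 10 s of samples

# Hypothetical color directions and strengths (illustrative values only).
u_c = [0.77, 0.51, 0.38]    # skin reflection color
u_s = [0.58, 0.58, 0.58]    # light source color (white light)
u_p = [0.33, 0.77, 0.53]    # pulse-induced color variation
I0, c0 = 1.0, 100.0

i_t = [0.02 * math.sin(2 * math.pi * 0.2 * x) for x in t]   # intensity variation i(t)
s_t = [0.50 * math.sin(2 * math.pi * 0.5 * x) for x in t]   # specular variation s(t)
p_t = [0.30 * math.sin(2 * math.pi * 1.2 * x) for x in t]   # blood volume pulse p(t), 72 bpm

def patch_color(k):
    """RGB value of one averaged skin patch at sample k, following the mixture model."""
    return [u_c[ch] * I0 * c0 * (1 + i_t[k]) + u_s[ch] * I0 * s_t[k] + u_p[ch] * I0 * p_t[k]
            for ch in range(3)]

c_l = [patch_color(k) for k in range(len(t))]
```

With these values the pulse term is hundreds of times smaller than the stationary skin color, which illustrates why extracting $p(t)$ requires the transforms reviewed next.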

Figure 2-1: Core steps of remote physiological measurement methods.

2.2 Previous Work

Minute variations in light reflected from the skin can be used to extract human physiological signals (e.g., heart rate (HR) [80] and breathing rate (BR) [70]). A digital single-lens reflex (DSLR) camera is sufficient to measure the subtle blood volume pulse signal (BVP) [80, 90]. The simplest method involves spatially averaging the image color values for each frame within a time window; however, it is highly susceptible to noise from motion, lighting and sensor artifacts. Recent advancements have led to significant improvements in measurement under increasingly challenging conditions. Table 2.1 shows a common pipeline for video-based autonomic activity estimation that involves region-of-interest (ROI) selection, a color domain transform, and a spatial domain transform.

| Step | Methods |
|---|---|
| ROI selection | Face detection; face tracking; skin segmentation |
| Color domain transform | Green channel [90, 44]; principal component analysis [42]; independent component analysis [69, 70, 55]; Plane-Orthogonal-to-Skin [96]; chrominance-based [16]; pulse-blood-vector [17] |
| Spatial domain transform | Spatial averaging [16, 17, 96]; principal component analysis [97]; independent component analysis [41]; Self-Adaptive Matrix Completion [87] |

Table 2.1: Core steps of video-based autonomic activity estimation methods

2.2.1 Color Domain Transform

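A color domain transform is a projection vector $\mathbf{p} \in \mathbb{R}^3$ designed and multiplied with the three color channels (RGB) in $\mathbf{c}_l(t)$ as below

$$d_l(t) = \mathbf{p}^T \cdot \mathbf{c}_l(t) \approx \mathbf{p}^T \cdot \mathbf{u}_c \cdot I_0 \cdot c_0 \cdot (1 + i(t)) + \mathbf{p}^T \cdot \mathbf{u}_s \cdot I_0 \cdot s(t) + \mathbf{p}^T \cdot \mathbf{u}_p \cdot I_0 \cdot p(t) \tag{2.9}$$

The goal is to find a $\mathbf{p}$ that can minimize $\mathbf{p}^T \cdot \mathbf{u}_c$ and $\mathbf{p}^T \cdot \mathbf{u}_s$ and maximize $\mathbf{p}^T \cdot \mathbf{u}_p$, so that the transformed signal $d_l(t)$ is dominated by $p(t)$.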

Green: The optical properties of the skin under ambient illumination mean that the green color channel tends to give the strongest PPG signal. Motivated by this, the simplest transform $\mathbf{p} = [0\ 1\ 0]^T$ was used in initial rPPG works [90, 44].
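As an illustration of the green-channel transform, the sketch below projects a spatially averaged RGB trace with a vector that selects the green channel and reads the heart rate from the strongest spectral peak. The trace, the frame rate, the 0.7 to 4 Hz search band and the helper `dominant_frequency` are assumptions made for this example.

```python
import math

def dominant_frequency(x, fs, f_lo=0.7, f_hi=4.0, step=0.01):
    """Return the frequency (Hz) with the highest DFT power inside [f_lo, f_hi]."""
    mean = sum(x) / len(x)
    x = [v - mean for v in x]                        # remove the stationary (DC) part
    best_f, best_power = f_lo, -1.0
    steps = int(round((f_hi - f_lo) / step))
    for j in range(steps + 1):
        f = f_lo + j * step
        re = sum(v * math.cos(2 * math.pi * f * k / fs) for k, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * f * k / fs) for k, v in enumerate(x))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f

p_green = [0, 1, 0]     # projection vector selecting the green channel
fs = 30.0               # hypothetical frame rate (Hz)
# Hypothetical spatially averaged RGB trace carrying a 1.2 Hz (72 bpm) pulse.
rgb = [[77 + 0.1 * math.sin(2 * math.pi * 1.2 * k / fs),
        51 + 0.3 * math.sin(2 * math.pi * 1.2 * k / fs),
        38 + 0.2 * math.sin(2 * math.pi * 1.2 * k / fs)] for k in range(300)]
d = [sum(w * c for w, c in zip(p_green, frame)) for frame in rgb]
heart_rate_bpm = 60.0 * dominant_frequency(d, fs)
```

This toy trace contains no lighting or motion noise; with such noise present, a fixed green projection has no way to suppress it.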


Without further assumptions, it is impossible to solve the optimization problem and find the best weights. Thus different assumptions about the source signals $i(t)$, $s(t)$ and $p(t)$ or the basis vectors $\mathbf{u}_c$, $\mathbf{u}_s$ and $\mathbf{u}_p$ have been proposed in previous works:

PCA/ICA: Blind Source Separation (BSS) techniques such as principal component analysis (PCA) [42] and independent component analysis (ICA) [69, 70] assume the source signals $i(t)$, $s(t)$ and $p(t)$ to be uncorrelated or independent so that a de-mixing matrix $\mathbf{P} = [\mathbf{p}_1\ \mathbf{p}_2\ \mathbf{p}_3]$ can be estimated. Applying $\mathbf{P}$ to $\mathbf{c}_l(t)$ generates three transformed signals $\mathbf{p}_1^T \cdot \mathbf{c}_l(t)$, $\mathbf{p}_2^T \cdot \mathbf{c}_l(t)$ and $\mathbf{p}_3^T \cdot \mathbf{c}_l(t)$, which are uncorrelated with or independent of each other. Then further assumptions need to be made about the frequency characteristics of the pulse signal to select among them the best estimate of $p(t)$. For example, the best estimate is assumed to have the highest power in the common heart-rate range, or to have the biggest gap between the highest peak and the second highest peak in the power spectrum. The main problem with the usage of these BSS techniques is that the basic assumption about uncorrelatedness or independence does not hold strictly. Due to the existence of BCG, $i(t)$ and $s(t)$ can be both correlated with and dependent on $p(t)$. Also, PCA uses the covariance of $\mathbf{c}_l(t)$ to estimate $\mathbf{P}$, which requires the variation in the amplitude of pulse and noise to be sufficiently different to determine the eigenvector directions, while ICA requires $\mathbf{c}_l(t)$ to be long enough to enable a statistical measurement of independence.
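The component selection step described above can be sketched as follows. The example assumes three already de-mixed components (hypothetical pure tones standing in for PCA or ICA outputs) and picks the one with the largest share of spectral power in an assumed heart-rate band of 0.7 to 4 Hz; the function names are illustrative.

```python
import math

def band_power_ratio(x, fs, f_lo=0.7, f_hi=4.0):
    """Fraction of the non-DC spectral power that falls inside [f_lo, f_hi] Hz."""
    n = len(x)
    in_band = total = 0.0
    for m in range(1, n // 2 + 1):                   # DFT bins up to Nyquist, DC excluded
        re = sum(v * math.cos(2 * math.pi * m * k / n) for k, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * m * k / n) for k, v in enumerate(x))
        power = re * re + im * im
        total += power
        if f_lo <= m * fs / n <= f_hi:
            in_band += power
    return in_band / total if total > 0 else 0.0

def select_pulse_component(components, fs):
    """Return the index of the component with the highest heart-rate band power ratio."""
    ratios = [band_power_ratio(c, fs) for c in components]
    return max(range(len(components)), key=lambda j: ratios[j])

fs, n = 30.0, 300
# Three hypothetical de-mixed components: slow lighting drift, pulse at 1.2 Hz, fast flicker.
components = [
    [math.sin(2 * math.pi * 0.2 * k / fs) for k in range(n)],
    [math.sin(2 * math.pi * 1.2 * k / fs) for k in range(n)],
    [math.sin(2 * math.pi * 6.0 * k / fs) for k in range(n)],
]
best = select_pulse_component(components, fs)
```

Any periodic motion whose frequency falls inside the heart-rate band would also score highly under this criterion, so the selection rule is only as reliable as its frequency assumption.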

PBV: The assumptions can be also based on prior knowledge about the unique properties of skin reflection. For example, the Pulse Blood Vector (PBV) method [17] assumes the direction of the pulse-induced color variations normalized by the skin reflection color is constant, i.e.

$$\mathbf{u}_{pbv} = (\mathbf{u}_p \cdot I_0) ./ (\mathbf{u}_c \cdot I_0 \cdot c_0) = [0.33, 0.77, 0.53]^T \tag{2.10}$$

in which $./$ indicates element-wise division. Combining it with the assumption of uncorrelatedness, $\mathbf{p}$ can then be estimated by a least squares regression. The main limitation with this method is that it relies on accurate knowledge about $\mathbf{u}_{pbv}$. As the numbers given in (2.10) are specific to a recording setup (a halogen lighting source and a UI-2220SE-C camera), the performance will be worse under a different condition. Also, solving the least squares regression requires $i(t) \neq 0$ and $s(t) \neq 0$; thus, the method will not work in near-perfect conditions with very low noise.
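A minimal sketch of the least squares idea is given below, assuming zero-mean, temporally normalized RGB samples. The weights are obtained by solving $Q\mathbf{w} = \mathbf{u}_{pbv}$ with $Q$ the 3x3 channel covariance, up to a scale factor. The synthetic signals, the specular direction `spec_dir` and the helper names are hypothetical; this is a simplified single-window version and not the reference implementation of [17].

```python
import math

def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def pbv_pulse(cn, u_pbv):
    """cn: zero-mean normalized RGB samples. Returns the projected pulse estimate."""
    q = [[sum(s[a] * s[b] for s in cn) for b in range(3)] for a in range(3)]  # Q = Cn Cn^T
    w = solve3(q, u_pbv)                       # weights, up to a scale factor
    return [sum(wi * si for wi, si in zip(w, s)) for s in cn]

def corr(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

fs, n = 30.0, 300
u_pbv = [0.33, 0.77, 0.53]
spec_dir = [1.3, 2.0, 2.6]                     # hypothetical specular direction
pulse = [math.sin(2 * math.pi * 1.2 * k / fs) for k in range(n)]
light = [math.sin(2 * math.pi * 0.3 * k / fs) for k in range(n)]
spec = [math.sin(2 * math.pi * 0.5 * k / fs) for k in range(n)]
cn = [[0.01 * u_pbv[c] * pulse[k] + 0.05 * light[k] + 0.005 * spec_dir[c] * spec[k]
       for c in range(3)] for k in range(n)]
estimate = pbv_pulse(cn, u_pbv)
```

If the intensity or specular variation were removed from `cn`, the matrix `q` would become singular and the solve would fail, which mirrors the limitation noted above for near-perfect, low-noise conditions.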

CHROM: The chrominance-based method (CHROM) [16] estimates $\mathbf{p}$ by assuming a standardized skin color profile to white-balance the video frames:

$$\mathbf{u}_{skin} = (\mathbf{u}_c \cdot I_0 \cdot c_0) ./ (\mathbf{u}_s \cdot I_0) = [0.77, 0.51, 0.38]^T \tag{2.11}$$

Based on the assumption, two projection vectors are manually designed as $\mathbf{p}_1 = [3, -2, 0]^T$ and $\mathbf{p}_2 = [1.5, 1, -1.5]^T$ to ensure $i(t)$ and $p(t)$ are in-phase or anti-phase in $\mathbf{p}_1^T \cdot \mathbf{c}_l(t)$ and $\mathbf{p}_2^T \cdot \mathbf{c}_l(t)$. Then $\mathbf{p}$ can be found as a linear combination of $\mathbf{p}_1$ and $\mathbf{p}_2$ by "alpha-tuning" [16]. The main problem with the method is the assumption of $\mathbf{u}_{skin}$. Though the numbers in (2.11) were estimated from a large-scale experiment, any skin tone deviating from it will cause estimation error.
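The alpha-tuning step can be sketched with the two projection vectors given above. The example assumes temporally normalized RGB samples (each channel divided by its mean) and replaces the band-pass filtering usually applied before alpha-tuning with simple mean removal; the synthetic intensity and pulse signals are hypothetical.

```python
import math

def std(x):
    """Population standard deviation."""
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def corr(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def chrom_pulse(cn):
    """cn: temporally normalized RGB samples. Returns the alpha-tuned pulse estimate."""
    p1, p2 = [3.0, -2.0, 0.0], [1.5, 1.0, -1.5]        # projection vectors from the text
    x = [sum(a * b for a, b in zip(p1, s)) for s in cn]
    y = [sum(a * b for a, b in zip(p2, s)) for s in cn]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    x = [v - mx for v in x]                            # simplification: mean removal only
    y = [v - my for v in y]
    alpha = std(x) / std(y)                            # "alpha-tuning"
    return [a - alpha * b for a, b in zip(x, y)]

fs, n = 30.0, 300
u_pbv = [0.33, 0.77, 0.53]
pulse = [math.sin(2 * math.pi * 1.2 * k / fs) for k in range(n)]
light = [math.sin(2 * math.pi * 0.3 * k / fs) for k in range(n)]
# Intensity variation affects all normalized channels equally; the pulse follows u_pbv.
cn = [[1.0 + 0.05 * light[k] + 0.01 * u_pbv[c] * pulse[k] for c in range(3)] for k in range(n)]
estimate = chrom_pulse(cn)
green = [s[1] for s in cn]                             # green channel alone, for comparison
```

In this toy case the intensity variation appears with nearly equal strength in both projections, so subtracting the alpha-scaled second projection cancels it. The sign of the recovered pulse is inverted, which does not affect heart rate estimation.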

POS: To soften the knowledge assumed in PBV and CHROM, the Plane-Orthogonal-to-Skin (POS) method [96] proposes to project $\mathbf{c}_l(t)$ onto a plane orthogonal to the skin reflection color using two hand-crafted vectors $\mathbf{p}_1 = [0, 1, -1]^T$ and $\mathbf{p}_2 = [-2, 1, 1]^T$.

In this way, $s(t)$ and $p(t)$ will be in-phase or anti-phase in $\mathbf{p}_1^T \cdot \mathbf{c}_l(t)$ and $\mathbf{p}_2^T \cdot \mathbf{c}_l(t)$, and $\mathbf{p}$ can be estimated by "alpha-tuning" again. Though not directly based on $\mathbf{u}_{skin}$ and $\mathbf{u}_{pbv}$, the method still assumes the relative directions of them and the magnitude order of $\mathbf{u}_{pbv}$ (green>blue>red). In addition, its performance becomes sub-optimal when the pulsatile strength ($p(t)$) and specular strength ($s(t)$) are close to each other.
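The projection and alpha-tuning of POS can be sketched in the same way. The example assumes temporally normalized RGB samples and processes a single window, whereas practical implementations typically work on short overlapping windows; the synthetic signals and the specular direction `spec_dir` are hypothetical.

```python
import math

def std(x):
    """Population standard deviation."""
    m = sum(x) / len(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def corr(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def pos_pulse(cn):
    """cn: temporally normalized RGB samples. Returns the alpha-tuned pulse estimate."""
    p1, p2 = [0.0, 1.0, -1.0], [-2.0, 1.0, 1.0]        # projection vectors from the text
    s1 = [sum(a * b for a, b in zip(p1, s)) for s in cn]
    s2 = [sum(a * b for a, b in zip(p2, s)) for s in cn]
    alpha = std(s1) / std(s2)                          # "alpha-tuning"
    return [a + alpha * b for a, b in zip(s1, s2)]

fs, n = 30.0, 300
u_pbv = [0.33, 0.77, 0.53]
spec_dir = [1.3, 2.0, 2.6]                             # hypothetical specular direction
pulse = [math.sin(2 * math.pi * 1.2 * k / fs) for k in range(n)]
light = [math.sin(2 * math.pi * 0.3 * k / fs) for k in range(n)]
spec = [math.sin(2 * math.pi * 0.5 * k / fs) for k in range(n)]
cn = [[1.0 + 0.05 * light[k] + 0.01 * u_pbv[c] * pulse[k] + 0.005 * spec_dir[c] * spec[k]
       for c in range(3)] for k in range(n)]
estimate = pos_pulse(cn)
```

Both projection vectors sum to zero, so the intensity variation shared by all channels vanishes in `s1` and `s2`. The pulse is in-phase and the specular term anti-phase across the two projections, so their alpha-weighted sum reinforces the former and largely cancels the latter.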

2.2.2 Spatial Domain Transform

A spatial domain transform is a projection vector $\mathbf{m} \in \mathbb{R}^L$ designed and multiplied with the color-transformed signals from all the skin patches $d_l(t)$, $l \in (1, \cdots, L)$:

$$y(t) = \mathbf{m}^T \cdot [d_1(t), d_2(t), \cdots, d_L(t)]^T \tag{2.12}$$

Sometimes a spatial transform can be also applied before a color transform like $\mathbf{p}^T \cdot \mathbf{m}^T \cdot \mathbf{c}_l(t)$, though not common. Similar to the color transform, the goal is to find an $\mathbf{m}$ that can maximize the strength of $p(t)$ in the transformed signal $y(t)$.

Average: The simplest and most widely used form of the spatial transform is an unweighted average of all patches, i.e. $\mathbf{m} = [1, 1, \cdots, 1]^T$ [69, 70, 55]. However, the distribution of physiological signals is not uniform on the human body, so assigning higher weights to skin areas with stronger and less noisy signals should improve the measurement accuracy.
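The benefit of non-uniform weights can be illustrated with a toy example. The sketch below compares the unweighted average with weights set to each patch's share of spectral power in an assumed heart-rate band. This weighting rule is a simple heuristic introduced here for illustration only and is not one of the reviewed methods; the three patch signals are hypothetical.

```python
import math

def in_band_fraction(x, fs, f_lo=0.7, f_hi=4.0):
    """Fraction of non-DC spectral power inside the heart-rate band [f_lo, f_hi] Hz."""
    n = len(x)
    in_band = total = 0.0
    for m in range(1, n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * m * k / n) for k, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * m * k / n) for k, v in enumerate(x))
        power = re * re + im * im
        total += power
        if f_lo <= m * fs / n <= f_hi:
            in_band += power
    return in_band / total if total > 0 else 0.0

def spatial_transform(d, m):
    """y(t) = m^T [d_1(t), ..., d_L(t)]^T for patch signals d and weights m."""
    return [sum(w * d_l[k] for w, d_l in zip(m, d)) for k in range(len(d[0]))]

def corr(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

fs, n = 30.0, 300

def tone(freq):
    return [math.sin(2 * math.pi * freq * k / fs) for k in range(n)]

pulse = tone(1.2)
# Two clean patches and one patch dominated by out-of-band noise (hypothetical).
d = [
    [p + 0.1 * a for p, a in zip(pulse, tone(0.2))],
    [p + 0.1 * a for p, a in zip(pulse, tone(0.3))],
    [0.2 * p + 3.0 * a for p, a in zip(pulse, tone(6.0))],
]
m_uniform = [1.0] * len(d)
m_weighted = [in_band_fraction(d_l, fs) for d_l in d]
corr_avg = corr(spatial_transform(d, m_uniform), pulse)
corr_weighted = corr(spatial_transform(d, m_weighted), pulse)
```

The noisy patch receives a weight close to zero, so the weighted output follows the pulse far more closely than the plain average does.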

PCA/ICA: PCA [97] and ICA [41] have also been used in the estimation of $\mathbf{m}$. Similarly, an assumption in addition to uncorrelatedness / independence needs to be made to select the best component representing $p(t)$. Several metrics have been proposed to sort the components, e.g. the ratio of the amplitude at the highest peak to the amplitude at the second highest peak [41], and the ratio between the maximum power and total power of the signal spectrum in the pulse-frequency band [97]. The weaknesses of using these BSS techniques are the same as those described for color transforms.

SAMC: The method in [87] finds a low-rank matrix that best approximates $[d_1(t), d_2(t), \cdots, d_L(t)]^T$ using self-adaptive matrix completion (SAMC), and takes the dominant eigenvector of the matrix as an estimate of $y(t)$. The estimation assumes that $y(t)$ is within the common heart rate frequency range, and that the variation of $d_l(t)$ with small local standard deviation is only caused by the heart beats instead of the other motion sources. The second assumption will not hold if SAMC is applied independently, because slight head movements and facial expressions can also cause low-amplitude variation. Therefore, the method has to rely on CHROM [16] as a preceding color transform to attenuate these noises beforehand.

2.2.3 Machine Learning Approaches

Few approaches have made use of supervised learning for video-based physiological measurement. Formulating the problem is not trivial. Template matching and Support Vector approaches [65] have obtained modest results. Linear regression and Nearest Neighbor (NN) techniques have been combined with signal decomposition methods [60] to solve the problem of selecting the appropriate source signal. However, these are still limited by the performance of the color and/or spatial domain transforms (e.g., ICA or PCA).

2.3 Illumination-robust Autonomic Activity Estimation in Near-infrared

Video-based autonomic activity estimation makes it possible to monitor human vital signs like heart rate (HR) and respiration rate (RR) comfortably and unobtrusively. It is now popular to do this using a regular camera based on remote photoplethysmography (PPG) or ballistocardiography (BCG). However, these systems can only work in brightly lit environments. To overcome this problem, we introduce a novel methodology to measure carotid pulse and respiration movement from near-infrared (NIR) video of the neck. This approach is more robust to different lighting conditions, easier to set up, and captures less private information compared with methods using visible light. It also has lower cost, more widespread availability, and higher automation levels than alternative methods using non-visible light.

2.3.1 Theoretical Model

Fig. 2-3 (a) shows a cross section of the neck and the location of the carotid arteries, which can be simplified to a cylindrical model (Fig. 2-3 (b)). With each cardiac cycle, the heart pumps blood to the periphery, and causes palpable pulsatile changes in the carotid arterial diameter, reflected as subtle skin deformations $\Delta r(t)$ along the two sides of the neck. To model these deformations in video and their relationship with lighting, we adopted the Blinn-Phong reflection model. Compared with other models commonly used for modeling skin reflection such as the Lambert-Beer law and Shafer’s dichromatic reflection model, the Blinn-Phong reflection model is better at illustrating geometric relationships.


Figure 2-2: Carotid pulse visualization. (a) The first frame of a NIR video with a red scan line on the neck. (b) The scan line plotted over time (16 s), which shows a subtle crenation along both edges corresponding to the carotid pulse.

Figure 2-3: (a) Cross section of the human neck showing the location of the carotid arteries. (b) Geometric relationships among the light source $S$, the camera $C$, and all the potential motion sources of the neck.

Assuming a surface point $p$ on the neck illuminated by a single NIR light source $S$, its illumination captured by a NIR camera $C$ can be expressed as

$$I_p(t) = I_a k_a + I_i(t) [k_d (\vec{l} \cdot \vec{n}(t)) + k_s (\vec{n}(t) \cdot \vec{h})^{\alpha}] + I_n(t) \tag{2.13}$$

It consists of four terms:

1. 𝐼𝑎𝑘𝑎 is the ambient light intensity, which is the result of multiple reflections from walls and objects. 𝐼𝑎 is the ambient component of the light source, and 𝑘𝑎 is the ambient reflection coefficient. The whole term is usually considered to be

(46)

constant for a particular object and uniform at every point.

2. 𝐼𝑖(𝑡)𝑘𝑑(⃗𝑙 · ⃗𝑛(𝑡)) is the diffuse reflection intensity, associated with the absorption and scattering of the light in skin tissues. 𝐼𝑖(𝑡) is the intensity of the light source after distance attenuation. 𝑘𝑑 is the diffuse reflection coefficient, which depends on the nature of the material (skin) and the wavelength of the incident light. In theory, there is a pulsatile component in 𝑘𝑑, similar to photoplethysmography, due to the variations of hemoglobin absorption. However, in NIR, hemoglobin absorption of the skin is one order of magnitude smaller than dermal scattering and two orders of magnitude smaller than melanin absorption [58], causing the component to be much weaker than in visible light [88]. Thus we assume 𝑘𝑑 to be a constant. ⃗𝑙 is the direction vector from point 𝑝 toward the light source. ⃗𝑛 is the surface normal at 𝑝.

3. 𝐼𝑖(𝑡)𝑘𝑠(⃗𝑛(𝑡) · ⃗ℎ)^𝛼 is the specular reflection intensity, which is a mirror-like light reflection from the skin surface. 𝑘𝑠 is the specular reflection coefficient, usually taken to be a material-dependent constant. ⃗ℎ is the halfway vector defined as ⃗ℎ = (⃗𝑙+⃗𝑣)/2, in which ⃗𝑣 is the direction vector from point 𝑝 toward the camera. As we assume 𝑆 is a distant light source at the same location as 𝐶, ⃗ℎ = ⃗𝑙 = ⃗𝑣. 𝛼 is the shininess coefficient controlling the strength of specular highlights.

4. 𝐼𝑛(𝑡) is the quantization noise of the camera sensor.
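As a minimal numerical sketch of Eq. 2.13 (not the implementation used in this work), the Python function below evaluates the four terms at a single surface point under the co-located assumption ⃗ℎ = ⃗𝑙 = ⃗𝑣. The function name, the coefficient values, and the clamping of the dot product at zero are illustrative assumptions that do not come from the text.

```python
import math

def blinn_phong_intensity(n, l, I_a, k_a, I_i, k_d, k_s, alpha, I_n=0.0):
    """Evaluate Eq. 2.13 at one surface point, assuming h = l = v.

    n, l: 3-vectors (surface normal, direction toward the light); normalized here.
    All coefficient values passed in are hypothetical, chosen for illustration.
    """
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    n, l = unit(n), unit(l)
    # Cosine between the normal and the light direction. The clamp at 0 is an
    # added assumption so that back-facing points receive no direct light.
    cos_nl = max(0.0, sum(a * b for a, b in zip(n, l)))

    ambient = I_a * k_a
    diffuse = I_i * k_d * cos_nl
    specular = I_i * k_s * cos_nl ** alpha  # n . h equals n . l when h = l
    return ambient + diffuse + specular + I_n

# Hypothetical parameter values, not measured skin properties.
params = dict(I_a=0.2, k_a=0.5, I_i=1.0, k_d=0.6, k_s=0.3, alpha=10)
facing = blinn_phong_intensity((0, 0, 1), (0, 0, 1), **params)
tilted = blinn_phong_intensity((math.sin(0.2), 0, math.cos(0.2)), (0, 0, 1), **params)
print(facing, tilted)
```

In this sketch, tilting the normal away from ⃗𝑙 lowers both the diffuse and the specular term, while the ambient term stays fixed.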

The movement of the neck involves rigid motions and non-rigid motions. Assuming a Cartesian coordinate system with its 𝑧-axis parallel to ⃗𝑙 and its 𝑦-axis perpendicular to the neck cross section, the rigid motions will include translations along three axes Δ𝑥(𝑡), Δ𝑦(𝑡) and Δ𝑧(𝑡), and rotations around the axes Δ𝜃𝑥(𝑡), Δ𝜃𝑦(𝑡) and Δ𝜃𝑧(𝑡). Due to the assumption of a distant light source, Δ𝑥(𝑡), Δ𝑦(𝑡) and Δ𝜃𝑧(𝑡), which are orthogonal to ⃗𝑙, will have no influence on 𝐼𝑝(𝑡). With interferences like talking, eating and swallowing avoided, the skin deformation Δ𝑟(𝑡) caused by the carotid pulse will be the only non-rigid motion.
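The statement about Δ𝜃𝑧(𝑡) can be checked numerically: with ⃗𝑙 along the 𝑧-axis, rotating the surface normal about 𝑧 leaves ⃗𝑙 · ⃗𝑛 unchanged, whereas rotating it about 𝑥 or 𝑦 does not. The following Python sketch uses hypothetical helper names and an arbitrary example normal. It only illustrates the geometry under the distant-light assumption and is not taken from the original work.

```python
import math

def rotate(n, axis, angle):
    """Rotate the 3-vector n about a coordinate axis ('x', 'y' or 'z') by angle in radians."""
    x, y, z = n
    c, s = math.cos(angle), math.sin(angle)
    if axis == "x":
        return (x, c * y - s * z, s * y + c * z)
    if axis == "y":
        return (c * x + s * z, y, -s * x + c * z)
    if axis == "z":
        return (c * x - s * y, s * x + c * y, z)
    raise ValueError(f"unknown axis: {axis}")

def cos_incidence(n, l=(0.0, 0.0, 1.0)):
    """Dot product l . n, with the light direction l along the z-axis by default."""
    return sum(a * b for a, b in zip(l, n))

# Example unit normal tilted 0.3 rad from the z-axis (arbitrary choice).
n0 = (math.sin(0.3), 0.0, math.cos(0.3))

# Change in l . n after a small 0.05 rad rotation about each axis.
changes = {axis: cos_incidence(rotate(n0, axis, 0.05)) - cos_incidence(n0)
           for axis in "xyz"}
for axis, delta in changes.items():
    print(axis, delta)
```

Only the rotations about 𝑥 and 𝑦 alter the dot product that enters the diffuse and specular terms, which is consistent with keeping Δ𝜃𝑥(𝑡) and Δ𝜃𝑦(𝑡) in the analysis and dropping Δ𝜃𝑧(𝑡).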

Next, we elaborate on how the motions Δ𝑧(𝑡), Δ𝜃𝑥(𝑡), Δ𝜃𝑦(𝑡) and Δ𝑟(𝑡) influence the illumination of the neck point 𝐼𝑝(𝑡). First, the relationship between the light
