
The Application of Deep Learning to Nucleus Images for Early Cancer Diagnostics

by

Ali Can Soylemezoglu

B.S., MASSACHUSETTS INSTITUTE OF TECHNOLOGY, CAMBRIDGE (2017)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

May 22, 2018

Certified by . . . .

Caroline Uhler

Assistant Professor

Thesis Supervisor

Accepted by . . . .

Katrina LaCurts

Chairman, Master of Engineering Thesis Committee


The Application of Deep Learning to Nucleus Images for Early Cancer Diagnostics

by

Ali Can Soylemezoglu

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 2018, in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering

Abstract

Cancer remains a major health concern, and early diagnosis can go a long way in treating patients. Current cancer diagnosis usually involves a pathologist looking at tissue slices of patients for specific features associated with cancer prognosis, such as nuclear morphometric measures. However, early diagnosis remains a major challenge. Recent studies have shown that changes in fibroblast nuclei play a critical role in the early development of cancer. In addition, it is crucial that computational models are capable of justifying themselves when used in critical decisions such as diagnosing a patient with cancer. In this thesis, we use machine learning techniques on two-dimensional nuclei images to show that computational models are capable of presenting human-interpretable features as a means of justifying themselves. In addition, we use machine learning techniques on volumetric images of nuclei of cells in a co-culture model that represents the cancer tissue microenvironment to study the changes the fibroblasts undergo. These studies pave the way for various approaches to early disease diagnosis.

Thesis Supervisor: Caroline Uhler
Title: Assistant Professor


Acknowledgments

I would like to thank my advisor, Professor Caroline Uhler, for her guidance and support on this project. For the past two years, she has given me valuable advice along the way. Working with her has taught me a lot in terms of research. Her approachable and friendly attitude has made this experience as smooth as possible.

I would also like to thank Professor G.V. Shivashankar of the Mechanobiology Institute in Singapore. Throughout this project, he has given valuable guidance in understanding the biological processes that take place in the cancer tissue microenvironment, as well as providing valuable biological interpretation of the data and results.

Furthermore, I would also like to thank Adit Radha and Charles Durham for their contributions to the PatchNet chapter presented in the paper and Karren Yang for her contributions to the machine learning approaches for the Co-Culture chapters. I would also like to thank Karthik Dormadan and Saradha Paty from the Mechanobiology Institute in Singapore for imaging the nuclei and undertaking the segmentation of the images.

I would also like to thank my friends who have provided me with moral support throughout my time at MIT. Their friendship has given me incredible joy in my 5 years here and has helped me always stay optimistic.

Finally, I would like to thank my mother, father and brother along with the rest of my extended family. They have been my number one supporters in all of my undertakings. They have constantly pushed me to be a better person and helped me get through even the hardest of tasks. Their support gave me tremendous motivation to complete this project.


Contents

1 Introduction 15
 1.1 Problem Statement 15
 1.2 Motivation 15
 1.3 Related Works 17
  1.3.1 Machine Learning 18
  1.3.2 Cancer Diagnostics 20
 1.4 Organization of this Thesis 20
 1.5 Contributions 21

2 Background 23
 2.1 Convolutional Neural Networks 23
  2.1.1 Motivation 23
  2.1.2 Preliminary Definitions 24
  2.1.3 Evaluating Models: Introduction to Loss Functions 24
  2.1.4 Finding Optimal Parameters: Adam Optimization 25
  2.1.5 The Building Block: A Single Neuron 26
  2.1.6 Convolutional Layers 28
  2.1.7 Activation Functions 29
  2.1.8 Pooling Layers 30
  2.1.9 Fully Connected Layers 31
  2.1.10 Softmax Layer 32
 2.2 Fundamentals of Cancer 33
  2.2.1 The Role of the Nucleus 33
  2.2.2 Cancer Tissue Microenvironment 34

3 PatchNet 37
 3.1 Motivation 37
 3.2 The Model 38
 3.3 Results 39
 3.4 Conclusions 43

4 Co-Culture Experiment 45
 4.1 Experiment Setup 45
 4.2 Data 46

5 Analysis of Co-Culture Experiment Images 51
 5.1 Overview 51
 5.2 Pre-processing of the Raw Images 52
  5.2.1 Segmentation and Patch Generation 52
  5.2.2 Labeling Co-Culture Nuclei 54
 5.3 Classification Tasks Conducted 55
 5.4 Logistic Regression 55
  5.4.1 Features Used 55
  5.4.2 Results 56
 5.5 VoxNet 57
  5.5.1 Architecture 57
  5.5.2 Results 58
 5.6 Conclusions 60

6 Conclusions and Future Work 63
 6.1 Conclusions 63
 6.2 Future Work 64


List of Figures

2-1 A schematic of a single computational neuron. Each input x_i is multiplied by a weight w_i and summed together along with a bias term b. This sum is passed through an activation function f and the output of the function is passed along to other neurons in the network. 28

2-2 A 2 × 2 MaxPool layer with a slide of 2. One can think of the representation on the left side of the arrow as a single slice of the internal representation the neural model has for the input image. Once the pooling layer is applied, the representation is downsampled by a factor of 2 in all directions. The window starts in the top right corner and then takes slides of step size 2. 31

3-1 The PatchNet architecture. The left schematic displays the local network S. S acts on a small patch of the image. The patch is fed into six 3 × 3 convolutional layers of 64 filters each, with each layer followed by a ReLU unit. 64 linear models are applied as a dot product to the filters of the last layer to produce a 64-element vector. The vector goes through a hyperbolic tangent activation unit and is then condensed to a vector that is the length of the number of classes. Finally, a softmax layer generates a probability distribution over the classes for the specific patch. The schematic on the right shows the global network G, which applies S on possibly overlapping patches followed by averaging the probability distributions generated by S across all the patches to arrive at a classification result. The image is from [37]. 39


4-1 The setup of the co-culture experiment. Cancer (MCF7) cells (marked as red in the schematic) are placed as a clump in the center of the collagen matrix. Fibroblast (NIH3T3) cells (marked as green in the schematic) are scattered around the clump of MCF7 cells. The experiment is sustained for 5 days. On each day, the cells are stained and imaged. Image credit goes to Karren Yang. 46

4-2 Four consecutive z-slices of a raw DAPI-stained image of the co-culture are shown above. The total number of slices in this image was 51. These four slices are roughly around the middle of the stack. As one goes from one slice to another, it is possible to see that certain nuclei become more prominent while others fade. 47

4-3 In (a), a slice of the co-culture experiment imaged on the first day is shown. The clump in the middle corresponds to the cancerous cells and the non-cancerous fibroblasts are scattered around the clump. (b) shows a slice of the co-culture experiment imaged on day 5. At this point the cells have grown and divided along with having become activated. The fibroblasts that have become activated have now oriented themselves such that they point towards a cancerous clump. Note that both slices come from roughly the middle of their respective stacks. 48

4-4 The triplet of images are of the same slice of the stack from day 5 of the co-culture. (a) is the EPCAM stain of the slice, (b) is the vimentin stain of the slice and (c) is the DAPI stain of the slice. In the EPCAM-stained image, pixels are most intense where the cancer cells are (i.e. the clumps of nuclei in the DAPI stain). In the vimentin-stained image, the pixel intensities are higher in the region of fibroblasts, which corresponds to the long tail of nuclei trailing the larger of the clumps. 49


4-5 A comparison of EPCAM- and vimentin-stained images where the top row images are slices from the first day of the co-culture while the bottom row images are from the fifth day of the co-culture. The intensities of the stains for both EPCAM and vimentin are higher in day 5, as expected. 50

5-1 Overall workflow for the analysis of the images. The nuclei in the raw DAPI-stained images are segmented and then either are input into VoxNet, a convolutional neural network, or have features extracted from them which are input into a logistic regression model. These models are trained to accomplish various classification tasks. 51

5-2 Example individual nuclei. Consecutive slices of two nuclei are shown above. These nuclei were extracted using the segmentation process described in Section 5.2.1. (a) shows an example MCF7 nucleus while (b) displays an example NIH3T3 nucleus. 54

5-3 Example patches. The same slice from two different patches generated from the same nucleus are shown above. 54

A-1 The segmentation workflow is displayed. After a raw DAPI image is thresholded, watershed is run to retrieve an initial segmentation. A size filter is used to remove any objects that are too large to be a single nucleus. Then, a convex hull is found for each of the remaining objects, thus extracting individual nuclei. Image credit goes to Saradha Pathy. 67


A-2 VoxNet Architecture. The flowchart above displays the VoxNet architecture used in the classification tasks. The input image is of size 32 × 32 × 32. The notation l × Conv3D-k (3 × 3 × 3) means that there are l 3D convolutional layers (one feeding into the other), each with k filters of size 3 × 3 × 3. MaxPool3D(2 × 2 × 2) indicates a 3D max pooling layer with pooling size 2 × 2 × 2. FC-k indicates a fully connected layer with k neurons. Note that the PReLU activation function is used in every convolutional layer while ReLU activation functions are used in the fully connected layers. Finally, every convolutional layer is followed by batch normalization. 68


List of Tables

3.1 A comparison of various models' performance on the task of classifying between two different cell lines. PatchNet is run with 3 different patch sizes (11, 17 and 31). It is possible to see the trade-off between patch size and generalization error, as a larger patch size results in smaller generalization error. PatchNet is compared to CAM, another CNN model that offers a means of visualizing features, and Grad-CAM. Grad-CAM's VGG architecture performs the best, but PatchNet competes with CAM and is capable of beating it out. In addition, the trade-off between patch size and the time it takes to train per epoch is shown, as smaller patches result in longer training times. This table can also be found in [37]. 40

3.2 A comparison of various models' feature visualizations on the task of classifying between two different cell lines. PatchNet is run with 3 different patch sizes (11, 17 and 31). Smaller patches generate sharper feature maps, while the CAM and Grad-CAM maps are less sharp. PatchNet provides relevant features for all patch sizes while CAM and Grad-CAM fail to do so. This table can also be found in [37]. 41

3.3 Comparison of PatchNet-11, CAM and Grad-CAM visualizations with that of CENP-A markers, which are associated with heterochromatin regions that are a feature of the early onset of cancer. PatchNet is capable of picking out features correlated with CENP-A markers but Grad-CAM and CAM pick out biologically irrelevant features. This table can also be found in [37]. 43


5.1 Results of the logistic regression model on the task of classifying NIH3T3 patches from the co-culture and the control groups on the same day. Validation accuracy increases as the day increases, as hypothesized. 56

5.2 Results of the VoxNet neural network on the task of classifying NIH3T3 patches from the co-culture and the control groups on the same day. Validation accuracy increases as the day increases, as hypothesized. 58

5.3 Results of the VoxNet neural network on the task of classifying NIH3T3 patches from the co-culture group on different days. Validation accuracy increases as the temporal distance between the days increases. 59

5.4 Results of the VoxNet neural network on the task of classifying NIH3T3 patches from the control group on different days. Validation accuracy increases as the temporal distance between the days increases. 60


Chapter 1

Introduction

1.1 Problem Statement

The focus of this thesis is applying deep learning to 2D and 3D nuclei images of cancerous and non-cancerous cells. First, the thesis concerns itself with developing a model that is capable of visualizing the features it uses to make predictions on two-dimensional nuclei images. This study validates that neural models used in critical healthcare decisions are capable of presenting human-interpretable features as a means of justifying their classifications while maintaining high accuracy. Second, given images from a co-culture of cancerous and non-cancerous cells that mimics a cancer tumor in the human body, the thesis concerns itself with developing a pipeline that allows for single-cell-resolution classification of the nuclei, thus validating that deep learning methods are capable of detecting that cells in a co-culture change more than their counterparts in a control group over the same time period. The results of this thesis validate the use of deep learning methods for cancer identification and also pave the way for manifold learning of images.

1.2 Motivation

The American Cancer Society estimates that there will be 1.7 million new cancer cases and 609,640 cancer deaths in the United States in 2018 [2]. Early cancer diagnostics continues to be a crucial factor in the treatment of cancer [1]. Current cancer diagnostics usually involve a pathologist examining tissue slices to determine whether a patient has cancer or not. However, this approach can be subjective due to human involvement, as well as inefficient, as it might take a pathologist a while to examine a single patient [11]. It would be desirable instead to be able to diagnose a patient automatically with as little human intervention as possible.

One source of information on whether a cell is cancerous or non-cancerous is the cell's nucleus. Nuclear morphometric information plays an important role in diagnosing patients with cancer [9, 27, 29, 31, 32]. The role of the nucleus is crucial in homeostasis, as it acts as an integrator of various mechanical and chemical signals that regulate homeostasis [5, 39, 44]. Diseases are known to occur when these mechanical and chemical signals are altered [13, 22, 45]. Hence, changes to nuclear structure can lead to various diseases including cancer [12, 15, 19, 26, 28, 48]. Thus, studying nuclei images of cells provides valuable information for determining whether a cell is cancerous or not. The question then is whether deep learning methods can be used to leverage the information found in these images.

In 2012, a deep convolutional neural network (CNN), AlexNet, won the ImageNet Large Scale Visual Recognition Competition for image classification [25]. Since then, machine learning techniques involving CNNs have become the state of the art for image classification. The use of CNNs has become prevalent in tasks such as facial recognition as well as scene segmentation for self-driving cars. CNNs are also being used in biological applications including cancer diagnostics [10]. While these deep learning techniques are useful for late-stage cancer diagnostics, early cancer diagnostics remains difficult. Nevertheless, some studies have shown that deep learning techniques are capable of picking up on very slight abnormalities in cell nuclei [36], thus strengthening the belief that CNNs can help with early cancer diagnostics. Furthermore, current deep learning approaches use two-dimensional images. More information can be extracted by using three-dimensional images. Hence, using deep learning on 3D images of cell nuclei imaged under a scenario which mimics that of cancer progression can provide us with more insight into the capabilities of deep learning for early cancer diagnostics.

In addition, if computational models are ever to see widespread use in the diagnosis of patients, it is crucial that such models are capable of justifying their decisions with human interpretable features. Otherwise, doctors will not know whether the model is accurate in its classifications or not. To this end, it is of importance to develop models that can offer human interpretable features while not suffering much in terms of accuracy.

The thesis first describes a computational model called PatchNet [37] that makes a global classification of a two-dimensional nucleus by aggregating classification decisions from small patches of the image. It then describes a co-culture experiment that is set up to mimic the interactions between a cancer tumor and non-cancerous cells in the body. The co-culture, along with control groups for the two cell lines in the co-culture, have their nuclei stained and imaged. These 3D images are then used as inputs into various deep learning methods to validate the hypothesis that cells in the co-culture change more, as they become more activated, than their control counterparts.

1.3 Related Works

This thesis has its foundations in two major areas. The first of the two areas is machine learning. The thesis uses deep learning techniques to classify images. Hence, it draws upon various work already presented in the field of machine learning. The second of the two areas is cancer diagnostics. The thesis focuses on the application of deep learning techniques to cancer diagnostics. Specifically, since the images are nuclei images, this paper draws from previous work done on studying how nuclear morphometric information plays a role in cancer diagnostics. We now proceed to discuss previous work in these two fields.

1.3.1 Machine Learning

AlexNet has shown that convolutional neural networks are capable of reaching extremely high levels of accuracy in image classification tasks [25]. Since then, several different state-of-the-art models have been used. Currently, two of the most commonly used CNN models are ResNet [17] and VGG [40]. While both of these models could be used for the problem at hand, this thesis uses a 3D extension of the VGG model. The reasoning for this stems from the documented success of the VGG model in cancer detection. Radhakrishnan et al use the VGG-19 model on various classification tasks [36]. The most basic of these tasks is binary classification of cancerous and non-cancerous cells, on which it achieves 88.2% validation accuracy. This thesis will look to expand on the results presented in this paper by working with 3D images of nuclei.

While work in 3D image classification is not as widespread as 2D image classification, Korolev et al apply 3D extensions of ResNet and VGG to 3D brain images [24]. Their models, which are named VoxResNet (a 3D extension of ResNet) and VoxCNN (a 3D extension of VGG), are tested on data acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI). The images carry one of four labels: Alzheimer's Disease (AD), Late Mild Cognitive Impairment (LMCI), Early Mild Cognitive Impairment (EMCI), Normal Cohort (NC). They report results on six different binary classification tasks between various combinations of these four labels. While the two models perform roughly the same, VoxCNN tends to beat out VoxResNet by the slightest of margins on most of these tasks. Coupled with the success of the VGG model reported in [36], this thesis chooses to implement the VoxCNN model for the classification tasks.

One portion of this thesis concerns itself with developing neural models capable of offering feature visualizations alongside their classification results. The model presented in this paper is most similar to the Class Activation Map (CAM) model [47] and Grad-CAM [38], an augmented version of the CAM model. CAM differs from conventional feed-forward neural models in that the values obtained in the last layer of filters are averaged across the filters before being reduced to the number of classes through a fully-connected layer. Then, the final layer is fed into a softmax layer to obtain classification results. As for feature visualization, CAM takes the weights of a given class label from the final layer of the model and uses them to compute a weighted sum of the filters of the last convolutional layer. The resulting weighted feature map is upscaled to match the input image size, thus producing a heatmap of that image for a given class label. Grad-CAM augments CAM's techniques to provide sharper visualizations. However, both models suffer from some drawbacks. For instance, both models tend to blur areas of interest, as the resulting heatmaps are obtained from upscaling smaller filters to larger images. Furthermore, both models still make their classifications based off the entire image; hence, all a model needs is a single region in the image to base its classification on. However, in medical diagnosis it is important that the feature visualizations include all features related to a class that the model considered when making the decision. Such visualizations would aid doctors and pathologists in patient diagnosis.

We compare the results of the deep learning models in the co-culture experiment to that of a logistic regression model [8]. The features extracted for the logistic regression can be separated into two broad categories: shape and texture features. The features used for the logistic regression are a subset of the features that Radhakrishnan et al use for their linear models in the 2D case. The feature ablation study conducted by [36] demonstrates that not all features impact classification results, and extending some features into 3D is more computationally expensive than others. Hence, Zernike moments, which tend to not impact the classification results, were not extended to 3D. Local binary patterns (LBP) are extended to 3D using LBP-TOP [46], which is more efficient than computing LBP over the entire volume.


1.3.2 Cancer Diagnostics

The work in this paper builds on what is known regarding the information that the nucleus provides about whether a cell is cancerous or not. Zink et al state that cancerous cells' nuclei are structurally different from non-cancerous cell nuclei [48]. In fact, certain tumor types and cancer stages can be associated with certain alterations in the nucleus. Specifically, changes in the nuclear matrix, a network of fibers that plays a key role in the structural integrity of the nucleus, can impact the regulation of various processes such as RNA splicing and DNA replication [48]. Hence, analyzing nuclei of cells can lead to diagnosing a patient with cancer. In fact, Radhakrishnan et al use deep learning to classify nuclei of the same cell line in which one set of nuclei has been slightly perturbed in ways that mimic a cancer tissue microenvironment [36]. For instance, one set of cells was perturbed using tumor necrosis factor α (TNF-α), which is known to play a role in cancer progression. Radhakrishnan et al report that their deep learning model was able to reach 88.9% validation accuracy when classifying between a control group and a group exposed to TNF-α of the same cell line [36]. The work in this paper extends the findings in [36] by working with three-dimensional nuclei images on similar classification tasks.

1.4 Organization of this Thesis

The thesis is organized as follows. Chapter 2 familiarizes the reader with the necessary background associated with the models used throughout this paper (i.e. convolutional neural networks, logistic regression) as well as the biological background. Chapter 3 discusses the PatchNet model and its results. Chapter 4 lays out the co-culture experimental setup and the data that is extracted from it. Chapter 5 proceeds to explain the models used, followed by an analysis of the results. Chapter 6 concludes the thesis and suggests new avenues to expand upon the work presented here.


1.5 Contributions

The work in this thesis contributes to the ongoing studies on deep learning applied to cancer diagnostics. Specifically, the thesis first presents a model that is capable of offering feature visualizations alongside its classification results of the two-dimensional nuclei images. Then, the thesis presents a model that operates on 3D nuclei images of cells in a co-culture environment that models the cancer tissue microenvironment. The thesis studies the application of deep learning methods on such images with the goal of validating that cells in the co-culture change more than their control counterparts in the same time period. These findings support the fact that fibroblasts in a cancer tissue microenvironment become polarized. Furthermore, it provides more evidence for the predictive power of deep learning methods in early cancer prognosis. It also opens up avenues for studies on manifold learning of cells using temporal and spatial information which can enable predicting the trajectory of cells in a tissue.


Chapter 2

Background

Before diving into the machine learning models and results of this thesis, it is necessary to give background knowledge on the machine learning and biological aspects of the work presented in this thesis. We proceed to explain what convolutional neural networks are, followed by an explanation of the fundamentals of cancer, which involves a discussion of the cancer tissue microenvironment and a more detailed look at the role of the nucleus.

2.1 Convolutional Neural Networks

2.1.1 Motivation

Neural networks, as the name suggests, are inspired by the human nervous system. As with the nervous system, one primary use case for neural networks is image classification. With the increase in computational power and larger training sets, neural networks have seen an incredible increase in use as they easily beat out other image classification methods. Specifically, since a convolutional neural network (CNN) won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012, CNNs have been considered the state of the art when it comes to image classification. Thus, it is crucial to understand how a CNN works before diving into how CNNs are used to solve image classification problems.


2.1.2 Preliminary Definitions

Before going in depth, we define some preliminary terms:

• Training set, S_t: The training set is a set of pairs (x, y) such that x is an input datapoint into the model and y is the associated class label (or ground truth) used to train a model. In the case of 2D grayscale image classification, x ∈ R^{n×m} and y ∈ {0, 1, 2, . . . , k − 1}, where n × m are the dimensions of the image and k is the number of classes. In this work, y is a binary variable, meaning that any given model is trying to distinguish between two different classes. Furthermore, the images used are 3D images, hence throughout this thesis x ∈ R^{32×32×32} (in general, for 3D grayscale images, x ∈ R^{n×n×n} for various n ∈ Z_+).

• Validation set, S_v: The validation set is a set of pairs (x, y) similar to the training set. The only difference is that the training set is used to train a model while S_v is used to validate the model. Thus, it is important that S_v ∩ S_t = ∅.
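For illustration, a disjoint training/validation split can be sketched as follows. This is a minimal example; the dataset here is synthetic and the split fraction is an arbitrary choice:

```python
import numpy as np

# Hypothetical dataset: N grayscale volumes of size 32x32x32 with binary labels.
N = 1000
images = np.random.rand(N, 32, 32, 32)
labels = np.random.randint(0, 2, size=N)

# Shuffle indices, then carve out a validation set disjoint from the training set.
rng = np.random.default_rng(seed=0)
indices = rng.permutation(N)
n_val = int(0.15 * N)
val_idx, train_idx = indices[:n_val], indices[n_val:]

# S_v and S_t share no datapoints, so S_v ∩ S_t = ∅ as required.
assert len(set(val_idx) & set(train_idx)) == 0
```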

2.1.3 Evaluating Models: Introduction to Loss Functions

One can think of a CNN as a function f that maps an input x to a vector of probability scores ŷ, where ŷ_i indicates the probability of x being labelled class i. Specifically,

f : R^{n×m} → R^k

Then, what is meant by training a model on a training set S_t is learning a set of parameters θ for the function f. We can denote a function parameterized by θ as f_θ. Then, we need a way to measure how good a certain set of parameters θ is. To this end, we introduce loss functions.

As mentioned above, a loss function L measures how well a set of parameters θ for the function f fits the data. While there are several different loss functions one can choose from, one of the most commonly used ones is the cross-entropy loss function. In the case that there are only two classes, the loss function specializes to the binary cross-entropy loss function. The cross-entropy loss for a datapoint (x_i, y_i) with ŷ_i = f_θ(x_i) is as follows:

L(ŷ_i, y_i) = −∑_{j=0}^{k−1} y_{i,j} ln ŷ_{i,j}

Since a given image can be only one of the k different classes, y_{i,j} is 0 for all j except one. Hence, only one term in the loss is non-zero. Furthermore, for the binary case the loss simplifies as follows:

L(ŷ_i, y_i) = −(y_{i,0} ln ŷ_{i,0} + (1 − y_{i,0}) ln(1 − ŷ_{i,0}))

The total loss over an entire training set is the average of the individual cross-entropy losses:

L(S_t; θ) = (1 / |S_t|) ∑_{i=1}^{|S_t|} L(y_i, f_θ(x_i))
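To make the formulas concrete, the following is a minimal NumPy sketch of these losses; the clipping constant eps is an addition for numerical stability and not part of the definitions above:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """General k-class cross-entropy for one-hot y_true and predicted probabilities y_pred."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary special case: y_true in {0, 1}, y_pred is the probability of class 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Average loss over a toy training set, matching L(S_t; θ) above.
y_true = np.array([1, 0, 1], dtype=float)
y_pred = np.array([0.9, 0.2, 0.6])
total_loss = binary_cross_entropy(y_true, y_pred).mean()
```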

2.1.4 Finding Optimal Parameters: Adam Optimization

Now that we know how to determine how well a set of parameters θ fits a training set S_t, we would like to be able to find an optimal θ; that is, we would like to find a θ that minimizes the loss. The search space of the parameters θ is too large for us to randomly try different θ's hoping to find a good one; this is simply infeasible. However, we can start with a random initialization of θ and work to improve it. In fact, this is the core idea behind how we optimize neural networks.

Given a random initialization for θ, we can compute L(S_t; θ). Then, we have to improve θ to θ′ such that L(S_t; θ′) < L(S_t; θ). This can be done by taking a step proportional to the negative of the gradient of the loss function at the current point. The gradient is a multi-variable generalization of the derivative. In other words, the gradient tells us how the function changes at a given point as we change the variables. In this case θ is shorthand notation for a set of parameters, meaning that L is actually a function over multiple variables. We can compute the gradient of L, denoted ∇L, using a method such as backpropagation, and then do the following update:

θ ← θ − η · ∇L

where η is the learning rate. This method is known as gradient descent. Gradient descent computes the gradient after having computed the loss over the entire training set. However, one can compute the loss over a smaller batch of the training set, update the parameters and then move on to the next batch. An extreme case of this is when the batch size is set to 1. In other words, the parameters are updated after every training set input. This is known as stochastic gradient descent (SGD).
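A sketch of the minibatch loop follows, under the assumption that a gradient routine (e.g. backpropagation) is available; `compute_gradient` is a hypothetical placeholder for such a routine:

```python
import numpy as np

def sgd(theta, data, compute_gradient, lr=0.01, batch_size=32, epochs=10):
    """Minibatch stochastic gradient descent: theta <- theta - lr * grad(L)."""
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)                      # new batch order each epoch
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            grad = compute_gradient(theta, batch)    # assumed to return dL/dtheta
            theta = theta - lr * grad                # step against the gradient
    return theta
```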

In this thesis, the optimization method used is an extension of SGD known as Adam [23]. Adam gets its name from adaptive moment estimation. The motivation behind Adam is that in SGD the learning rate η is fixed. Adam instead computes adaptive learning rates for the parameters, which improves upon SGD. Adam keeps track of moving averages of the gradient (the first moment) and of the square of the gradient (the second moment). Specifically, the optimization includes calculations of m̂_t (the bias-corrected first moment estimate) and v̂_t (the bias-corrected second moment estimate) for iteration t. Then, the update for θ at iteration t is as follows:

θ_t ← θ_{t−1} − η · m̂_t / (√v̂_t + ε)

for some small ε.
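A minimal sketch of a single Adam update, using the default hyperparameter values suggested in [23] (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are the running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for iteration t >= 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```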

2.1.5 The Building Block: A Single Neuron

We start the discussion on neural network architecture by first introducing its building block: the neuron. As mentioned previously, neural networks are biologically-inspired models. The building block, a neuron, is modeled off of the neurons that make up the nervous system. While the biological neuron has several parts, the parts that have influenced the computational neuron are as follows:


• Dendrites: Dendrites are small branch-like structures from which the neuron receives inputs.

• Axon: The axon is a single thin terminal down which the neuron fires signals towards the axon terminals. The axon terminals are small branch-like structures that form synapses with other neurons. Hence, the signal that one neuron outputs through its axon terminals is sent to other neurons through their dendrites.

For a signal to travel through the axon, the signal has to overcome the activation potential.

The computational neuron is quite similar to the biological neuron in this respect. A single neuron acts as follows (the same process is shown in a schematic in Figure 2-1):

1. It receives inputs x_1, x_2, . . . , x_n (think of the neuron receiving each input on a dendrite).

2. All of the inputs are multiplied by their respective weights w_1, w_2, . . . , w_n (think of this as the synaptic strength through which the input reaches the neuron).

3. The neuron sums them together, usually along with a bias b, and passes the sum through an activation function (think of this as the activation potential threshold the biological neuron has).

4. The output of the activation function is then passed on to other neurons (think of this as the biological neuron passing the signal through its axon terminals to other neurons).


Figure 2-1: A schematic of a single computational neuron. Each input x_i is multiplied by a weight w_i and summed together along with a bias term b. This sum is passed through an activation function f, and the output of the function is passed along to other neurons in the network.

Then, the goal of fitting a model is to learn a set of weights w_i for each neuron in the network such that the loss is minimized. This set of weights w_i is what is meant by the parameter variable θ in the previous sections.

2.1.6 Convolutional Layers

The convolutional layer is at the core of CNNs. The convolutional layer applies a set of filters to its input volume. First, we cover the two dimensional case and then discuss the extension to three dimensional images.

Consider a grayscale 2D image of size n × n. Then, the first convolutional layer will apply l filters each of dimension m × k with a step size of s. The output of this layer will be an n × n × l volume where each slice j will be an n × n feature map generated by a single filter. A given feature map is computed by placing the m × k filter on top of the input image, computing the sum of the product of the weight in the filter and the pixel value of the image that overlaps with that entry in the filter (think of this as a dot product) followed by sliding the filter by s along the height and width of the image. In most cases, the resulting feature maps are passed through an activation function before being input into the next layer.

In convolutional layers further down the network, the inputs are now volumes of some size n × n × l rather than a 2D image. The general operation of the convolutional layer stays the same. However, now a given filter of size m × k is applied to an m × k area in the input volume but along the entirety of the depth l. This is different from a 3D convolution, where each value along the depth would be convolved with a different weight.

At its core, the convolutional layer computes a dot product, which is essentially what a single neuron does. Hence, one can think of a single neuron as a filter with adjustable weights. The inputs to the neuron correspond to the overlapping region of the input volume with the filter. Then, since this filter is applied over and over again to the input space by sliding the filter along the width and height of the image, we can copy over the same neuron enough times such that it mimics the sliding of the filter. It is important that each of these neurons share the same parameters for a given filter; otherwise the number of parameters will make learning a model computationally infeasible.

Some of the images that are studied in this thesis are 3D grayscale images. Hence, we work with 3D convolutions. The idea behind 3D convolutions remains the same. The only difference now is that the first layer acts on a volumetric image rather than a 2D image and does a 3D convolution instead. Thus, now the filters are cubes that slide over the width, height and depth of the image. The convolution is not only computed across the width and height but also across the depth of the image. As a result, the output is a 4-dimensional volume with each slice consisting of a feature map the same dimension as the input image. As a result, the inputs to the convolutional layers further down the network are 4-dimensional. These convolutional layers still compute 3D convolutions; thus, they compute the convolutions by sliding over the width, height and depth but extend through the entire fourth dimension.
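The following naive sketch illustrates a single 2D filter applied with step size s. It omits the zero-padding that would keep the output the same size as the input, and real implementations are far more optimized, but the per-position computation is exactly the dot product described above:

```python
import numpy as np

def conv2d_single_filter(image, kernel, stride=1):
    """Slide an m x k filter over a 2D image, computing a dot product at each position."""
    m, k = kernel.shape
    h = (image.shape[0] - m) // stride + 1
    w = (image.shape[1] - k) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i * stride:i * stride + m, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)   # elementwise product, then sum
    return out

feature_map = conv2d_single_filter(np.random.rand(128, 128), np.random.rand(3, 3))
```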

2.1.7 Activation Functions

Activation functions are used to introduce non-linearities into the neural network. There are several different activation functions that are used in practice. One of the most widely used ones is the Rectified Linear Unit (ReLU) activation function [30].


The ReLU activation function f is defined as follows:

f(x) = max(0, x)

There are a couple of reasons why one would use ReLU. The first is that it is computationally efficient: ReLU thresholds an input x at 0, which can be implemented efficiently, whereas other activation functions are computationally much more expensive as they involve operations such as exponentials. A second reason is that training a model takes fewer iterations when ReLU is used (i.e. the optimizer converges faster) [25].

Another activation function that is commonly used is the parametric ReLU (PReLU) [16]. The PReLU is as follows:

f(x) = max(0, x) + a · min(0, x)

for a variable a. Thus, for x ≥ 0 it is the same as ReLU, but for x < 0 it outputs a · x instead of 0. Since a is a variable and can be changed, it becomes part of the parameters of the neural model. Thus, through training a neural model, an optimal value of a is learned. This has been shown to improve performance [16, 36]. Given the advantages of ReLU coupled with the increase in performance using PReLU, this thesis uses these two activation functions.
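Both activation functions are one-liners in code; here a is the learnable PReLU parameter:

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

def prelu(x, a):
    """PReLU: identity for x >= 0, slope a (a learned parameter) for x < 0."""
    return np.maximum(0, x) + a * np.minimum(0, x)
```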

2.1.8 Pooling Layers

The pooling layer is inserted in order to downsize the internal representation of the input image which helps reduce the number of parameters necessary. One can think of the pooling layer as using a sliding window of size n × n with a slide of k across a single slice of the current representation of the input image to apply a function on each n × n grid of the slice. Generally, the function applied is the max function, in which case the pooling layer is referred to as a MaxPool layer. Other functions are also seen such as computing the average of the numbers in the n × n grid on which the sliding window falls on top of (in which case the pooling layer would be an average pooling layer).


Figure 2-2: A 2×2 MaxPool layer with a slide of 2. One can think of the representation on the left side of the arrow as a single slice of the internal representation the neural model has for the input image. Once the pooling layer is applied, the representation is downsampled by a factor of 2 in all directions. The window starts in the top right corner and then takes slides of step size 2.

The most common MaxPool layer uses a 2 × 2 window with a slide of 2. This results in the representation being downsized by a factor of 2 in all dimensions. Figure 2-2 shows such a pooling layer on a toy example. The window starts on the top right corner and slides 2 over in one direction before computing the maximum again. One can imagine that an average pool would instead average the numbers in the 2 × 2 window. For instance, the pooling layer would compute the average of 1, 2, 5, 6 (i.e. 3.5) and place it in place of 6 in the output of the layer.
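A minimal sketch of 2 × 2 max pooling with a slide of 2 on a toy slice (the values here are arbitrary, not those of Figure 2-2):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with a stride of 2 on a 2D slice (assumes even dimensions)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

slice_2d = np.array([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [3, 2, 1, 0],
                     [1, 2, 3, 4]])
pooled = max_pool_2x2(slice_2d)   # [[6, 8], [3, 4]]
```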

2.1.9 Fully Connected Layers

A fully connected layer is simply a layer in which each neuron is connected to all the outputs of the neurons in the previous layer. Hence, one can think of each neuron in this layer doing a matrix multiplication of the output of the previous layer with the weights and adding an offset. Up until the fully connected layer, the image is processed through local convolutions. The fully connected layer, on the other hand, does a global computation on the representation of the image. Hence, this usually comes at the end of an architecture as a means of bringing it all together.
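In code, a fully connected layer amounts to a single matrix multiplication plus an offset; a minimal sketch with arbitrary shapes:

```python
import numpy as np

def fully_connected(x, W, b):
    """Each of the len(b) neurons sees every entry of x: a matrix multiply plus offset."""
    return W @ x + b

x = np.random.rand(128)          # flattened output of the previous layer
W = np.random.rand(10, 128)      # one row of weights per neuron
b = np.random.rand(10)
out = fully_connected(x, W, b)
```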


2.1.10 Softmax Layer

The softmax layer typically has as many neurons as there are class labels. One can think of each neuron as associated with a label. The purpose of the softmax layer is to compute the softmax function [4]. Assuming that there are k classes, the softmax function σ is defined as follows:

σ : R^k → (0, 1)^k,  σ(x)_i = e^{x_i} / ∑_{j=1}^{k} e^{x_j}

In other words, σ takes a real-valued vector of size k and converts it into a vector where each entry is between 0 and 1. Furthermore, the entries of the output vector sum to 1. Hence, one can think of this function as a probability distribution over the k classes.

In the case where there are only 2 classes (i.e. a binary classification task), one can construct a softmax layer with a single neuron which computes the sigmoid function. The sigmoid function is as follows:

σ(x) = e^x / (e^x + 1)

Hence, the sigmoid function takes in a single real value and maps it into (0, 1). One can treat the output of the sigmoid function as the probability of the class with label 1. Then, 1 − σ(x) will be the probability of the class with label 0.

Given a probability distribution over the different class labels for a given input image, one can choose to assign the label with the highest probability afterwards.
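A minimal sketch of both functions follows; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def softmax(x):
    """Map a real vector to a probability distribution over k classes."""
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / np.sum(e)

def sigmoid(x):
    """Binary special case: a single score mapped into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

probs = softmax(np.array([2.0, 1.0, 0.1]))
predicted_class = int(np.argmax(probs))   # assign the most probable label
```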


2.2 Fundamentals of Cancer

2.2.1 The Role of the Nucleus

Having stressed the importance of the nuclei in cancer diagnostics and the fact that the images we are working with are nuclei images, we now give a more detailed look into the role of the nucleus in cancer diagnostics.

First, we define homeostasis. Homeostasis refers to an organism's stable state as well as the regulation of that stable state [18]. Cells in a tissue microenvironment maintain homeostasis through a combination of mechanical and chemical signals. Cells in a tissue form the extracellular matrix (ECM), which in turn provides support to the cells in the matrix [20]. By sending compressive force signals, the ECM helps the cells regulate cellular processes such as gene expression [20]. These signals make their way to the nucleus through various means, such as dynamic cytoskeletal-nuclear links found within the cell [39]. Thus, one can consider the nucleus an integrator of such signals.

The nucleus itself is a very crowded environment due to the fact that it houses the DNA of the cell. The spatial organization of the DNA within the nucleus determines gene expression. As a result, different packaging of the DNA can lead to different genes being expressed. An important part of gene expression is how chromosomes are packaged, as contact between different chromosomes facilitates gene expression [39]. It is believed that cytoskeletal-nuclear links help facilitate the positioning of chromosomes within the nucleus [39]. Thus, one can see that the structure of the nucleus plays a crucial part in maintaining homeostasis, and alterations to this structure can lead to various diseases.

In fact, studies have shown that cancer cells have an altered nuclear structure. Zink et al [48] discuss various structural differences between non-cancerous and cancerous nuclei. Some of the differences are listed below:


• Nuclear shape: Cancerous nuclei have an irregular shape. It is usually the case that alterations in shape correlate with alterations in heterochromatin structure, which can cause alterations in gene expression.

• Nuclear matrix: The nuclear matrix is a network of fibers and proteins that helps give structural integrity to the nucleus. The matrix plays a crucial role in processes such as DNA replication. Thus, it is believed that changes in the composition of the nuclear matrix in cancer cells can lead to abnormalities in such processes.

• Nucleoli size: Nucleoli are structures within the cell nucleus that play a crucial role in the creation of ribosomes. Cancer cells see an increase in nucleoli size which is thought to be correlated with ribosome production rate which in turn is associated with the cell proliferation rate.

To summarize, the ECM that makes up a big part of the tissue microenvironment sends various mechanical and chemical signals to the cells. These signals make their way to the nuclei. The nuclei integrate these signals and use them to regulate various processes such as gene expression. However, if a nucleus’ structure is abnormal, the regulation of such processes is disrupted. Such disruptions are believed to be linked with cancer. Thus, studying the nuclear structure can validate existing beliefs on the link between the cell nucleus and cancer.

2.2.2 Cancer Tissue Microenvironment

The images of the cells’ nuclei are obtained from a co-culture model which is supposed to mimic the cancer tissue microenvironment. Thus, we now study how the cancer tissue microenvironment influences the cells in the microenvironment.

The cancer tissue microenvironment or tumor microenvironment refers to the microenvironment consisting of the tumor, various non-cancerous cells and the ECM. The co-culture experiment presented in this thesis mimics this environment by dispersing non-cancerous fibroblasts around a clump of cancerous cells (meant to resemble a tumor) while the collagen matrix serves as the ECM.

In the cancer tissue microenvironment, the tumor can secrete transforming growth factor (TGF)-β, which causes resident fibroblasts to convert to cancer-associated fibroblasts (CAFs) [43]. CAFs in return help sustain the tumor through the secretion of various signals that promote cell proliferation and the degradation of the ECM [7, 43]. CAFs are referred to as polarized cells, which indicates the different activation states they can find themselves in [3]. There are various biomarkers that are associated with CAFs. Hence, as cells become more activated, they display higher levels of such biomarkers.

In summary, in the cancer tissue microenvironment the cancer tumor secretes signals that cause fibroblasts to change into cancer-associated fibroblasts. The CAFs are polarized and those that become activated help sustain the cancer tumor through various signals. In this thesis, we study how the nuclear structure of these CAFs differ from regular fibroblasts. Specifically, it is hypothesized that since CAFs see a change in the secretion of signals, the nuclear structure of these cells should change more than the structure of the nuclei of cells not in the presence of cancer tumors.


Chapter 3

PatchNet

3.1 Motivation

Disease diagnosis is a critical decision process that deeply impacts the patient. Hence, if computational models such as neural networks are ever to be used in such settings, it is of utmost importance that they be able to provide a visualization of the features they pick up on for each possible output class. This way, pathologists and doctors can examine these feature visualizations in order to understand why the model made the decision it made and either agree or disagree with it. In other words, such feature visualizations allow the computational model to justify itself to human experts in the field.

It is important that such justifications don't come at the cost of classification accuracy or generalization error. Thus, this trade-off between generalization error and sharpness of feature visualizations is at the crux of such computational models. To this end, we propose PatchNet [37], which aggregates classification results based on local context (i.e. small patches) to generate a global classification label for the image. The trade-off between generalization error and sharpness of feature visualizations depends on the patch size chosen for the model to operate on. Larger patch sizes result in lower generalization error but less sharp visualizations.


3.2 The Model

PatchNet is a rather straightforward model. There are two networks of interest: the global neural network G and the local network S. The parameters of the local network are global in the sense that they are shared across all patches that S acts on, rather than learning a specific local network for each patch.

The global network G takes in a 2D image and first splits the image into k × k patches for some parameter k. These patches are then fed into the local network S, which consists of six convolutional layers, each applying a 3 × 3 two-dimensional convolution. Each convolutional layer consists of 64 filters and is followed by a ReLU activation unit. Afterwards, 64 linear models are applied to the 64 filters of the last convolutional layer as a dot product to obtain 64 values. These values are input into a hyperbolic-tangent activation unit. This is followed by a fully-connected layer that reduces the length-64 vector to a vector whose length is the number of classes (note that if there are only two classes, we reduce it to a single value using a dot product, which represents an unnormalized probability of the class label being 1). The vector is then normalized to produce a probability distribution over the class labels for the given patch using a softmax layer. G averages the probabilities for each class label over the patches to produce a classification result for the global image. This model can be seen in Figure 3-1.

Then, in order to construct the feature visualization heatmaps for a given class label, one can generate the intensity value of a pixel in the heatmap as follows: run S on a patch centered on that specific pixel and use the output of the softmax layer for the given class label as the intensity value.

Since we are interested in binary classification tasks, the model uses the binary cross-entropy loss and the softmax layers are sigmoid layers instead. In addition, since the local network S operates on small patches, we forgo the use of pooling layers, as it does not make sense to downsize such images.

Figure 3-1: The PatchNet architecture. The left schematic displays the local network S. S acts on a small patch of the image. The patch is fed into six 3 × 3 convolutional layers of 64 filters each, with each layer followed by a ReLU unit. 64 linear models are applied as a dot product to the filters of the last layer to produce a 64-element vector. The vector goes through a hyperbolic tangent activation unit and is then condensed to a vector that is the length of the number of classes. Finally, a softmax layer generates a probability distribution over the classes for the specific patch. The schematic on the right shows the global network G, which applies S on possibly overlapping patches followed by averaging the probability distributions generated by S across all the patches to arrive at a classification result. The image is from [37].
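The global aggregation step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released PatchNet implementation: `local_network` stands in for the trained local network S, and the stride is an arbitrary choice:

```python
import numpy as np

def patchnet_global(image, local_network, patch_size=31, stride=8):
    """Global network G: apply the local network S to (possibly overlapping)
    patches and average the per-patch class distributions."""
    k = patch_size
    patch_probs = []
    for i in range(0, image.shape[0] - k + 1, stride):
        for j in range(0, image.shape[1] - k + 1, stride):
            patch = image[i:i + k, j:j + k]
            patch_probs.append(local_network(patch))  # distribution over classes
    return np.mean(patch_probs, axis=0)               # averaged distribution

# Toy usage with a dummy local network returning a fixed two-class distribution.
dummy_S = lambda patch: np.array([0.5, 0.5])
global_probs = patchnet_global(np.random.rand(128, 128), dummy_S)
```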

3.3 Results

PatchNet is used to classify between two-dimensional nuclei images of two different cell lines. The two cell lines used in the classification task are BJ, which are normal/benign cells, and MCF10A, which are fibrocystic cells. Fibrocystic cells are cells that have undergone changes that represent either benign or pre-malignant cancer states [31]. The dataset consists of 1267 BJ cell nuclei and 1282 MCF10A cell nuclei images, with each image being of size 128 × 128. The training and validation set split is as follows: 190 of each class are used in the validation set while the remaining 2169 images are used as the training set.


We compare PatchNet to CAM and Grad-CAM in terms of classification performance as well as the sharpness of feature visualizations. We are also interested in examining the trade-off between patch size and generalization error for PatchNet. To this end, we first compare the classification results that PatchNet attains on three different patch sizes (11, 17 and 31) to the classification results attained by CAM and Grad-CAM. Table 3.1 displays the comparison of these classification results. Grad-CAM uses the state-of-the-art VGG-11 [40] architecture in its entirety, and hence comparing PatchNet to Grad-CAM is equivalent to comparing its performance to the VGG architecture. The first trend to notice is that as the patch size increases, PatchNet's performance improves. In addition, a larger patch size means that it takes less time to train the model. Grad-CAM achieves the best performance, as expected due to its use of the VGG architecture. PatchNet is capable of competing with CAM and in fact beats it out when using a patch size of 31. One possible reason for this behavior is that PatchNet effectively trains on several patches for the same image while CAM only uses the entire image.

Comparison of Various Models

Model        Training Loss  Training Acc.  Validation Loss  Validation Acc.  Training Time per Epoch (s)
PatchNet-11  0.439          92.7%          0.553            80.8%            212.81
PatchNet-17  0.381          92.6%          0.504            81.8%            131.63
PatchNet-31  0.226          95.4%          0.416            83.2%            79.22
CAM          0.325          85.9%          0.456            82.1%            13.66
Grad-CAM     0.110          95.8%          0.242            90.5%            41.94

Table 3.1: A comparison of various models' performance on the task of classifying between two different cell lines. PatchNet is run with 3 different patch sizes (11, 17 and 31). It is possible to see the trade-off between patch size and generalization error, as a larger patch size results in smaller generalization error. PatchNet is compared to CAM, another CNN model that offers a means of visualizing features, and Grad-CAM. Grad-CAM's VGG architecture performs the best, but PatchNet competes with CAM and is capable of beating it out. In addition, the trade-off between patch size and the time it takes to train per epoch is shown, as smaller patches result in longer training times. This table can also be found in [37].

We also compare the feature visualizations of PatchNet, CAM and Grad-CAM in Table 3.2. When one examines Tables 3.1 and 3.2 together, one can see that while smaller patch sizes cause higher generalization error, the resulting feature maps are much sharper and more localized. It is also possible to see that PatchNet generates features that are biologically relevant. Specifically, it is known that heterochromatin (i.e. condensed DNA) regions are indicative of early-onset cancer and the fibrocystic state [31]. PatchNet picks up on such condensed DNA regions in nuclei as indicative of MCF10A, which is expected. However, CAM and Grad-CAM do not pick up on such relevant features and use the background as indicative of the BJ cell line.

Feature Visualization Comparison of Various Models

[Table 3.2 consists of example images: one row each for the original nuclei and for the PatchNet-11, PatchNet-17, PatchNet-31, CAM and Grad-CAM visualizations, shown for normal (BJ) and abnormal (MCF10A) nuclei.]

Table 3.2: A comparison of various models' feature visualizations on the task of classifying between two different cell lines. PatchNet is run with 3 different patch sizes (11, 17 and 31). Smaller patches generate sharper feature maps, while the CAM and Grad-CAM maps are less sharp. PatchNet provides relevant features for all patch sizes while CAM and Grad-CAM fail to do so. This table can also be found in [37].

Finally, we are also interested in studying whether the features picked out by these models are of any biological relevance. An increase in DNA condensation is known to be associated with the onset of cancer and the fibrocystic state of cells [31]. A fluorescent marker for CENP-A, a histone that is recruited by heterochromatin regions, can be used as an indicator of the fibrocystic state in cells. If we dye BJ cells with the marker for CENP-A, we expect to see only some small regions light up. Furthermore, we expect PatchNet to pick out those regions as indicative of the fibrocystic state and the MCF10A class. To see whether PatchNet picks up on the same regions, we generate heatmaps (by rounding using a threshold value of 0.5) of 354 BJ nuclei stained with the CENP-A marker. Then, we overlay the actual CENP-A marker regions as a mask on top and compute the following statistics:

• Average Recall: the average across all images of the number of CENP-A dye pixels identified as features for MCF10A divided by the total number of CENP-A dye pixels (i.e. the recall). Since the CENP-A marker is a means of measuring DNA condensation, which is linked with the fibrocystic state, recall measures the fraction of DNA condensation markers picked up by the various models.

• Average Exact Match: the average across all images of the fraction of pixels that match exactly between the CENP-A dye mask and the heatmap for a given nucleus. Since it is possible to obtain a high average recall simply by marking the entire nucleus as indicative of the fibrocystic state, average exact match is used to check the overall percentage of the nucleus that was tagged correctly in the heatmap.

• Average AUROC: the average across all images of the area under the receiver operating characteristic curve, used to compare how different thresholds impact classification accuracies.
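As a rough illustration, the following sketch shows how these three statistics could be computed from a model heatmap and the corresponding CENP-A mask; the function and variable names are hypothetical, and the sketch assumes both classes appear in every mask. The 0.5 rounding threshold matches the one stated above.

import numpy as np
from sklearn.metrics import roc_auc_score

def heatmap_statistics(heatmaps, masks):
    """Average recall, exact match and AUROC over a set of nuclei."""
    recalls, exact_matches, aurocs = [], [], []
    for heatmap, mask in zip(heatmaps, masks):
        # Round the real-valued heatmap at the 0.5 threshold, as in the text.
        pred = (heatmap >= 0.5).astype(int)
        mask = mask.astype(int)
        # Recall: fraction of CENP-A dye pixels flagged as MCF10A features.
        recalls.append(pred[mask == 1].mean())
        # Exact match: fraction of pixels where heatmap and mask agree.
        exact_matches.append((pred == mask).mean())
        # AUROC over all pixels (assumes both classes occur in the mask).
        aurocs.append(roc_auc_score(mask.ravel(), heatmap.ravel()))
    return np.mean(recalls), np.mean(exact_matches), np.mean(aurocs)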

It is of note that we do not use average precision as a statistic, as high precision would only be achieved if DNA condensation were the sole factor in determining whether a cell is cancerous, which is not the case. Table 3.3 compares how PatchNet-11, CAM and Grad-CAM perform on these statistics. Class 1 is the MCF10A class while class 0 is the BJ class. Hence, a pixel value of 1 in the heatmap means that that pixel was indicative of class 1. PatchNet-11 achieves above 50% on all of the statistics, which suggests that the features PatchNet-11 picks up are indeed related to DNA condensation. While CAM and Grad-CAM achieve higher recall, their average exact match numbers are below 50%, which, coupled with their sample heatmaps, suggests that these models tend to mark the entire nucleus as indicative of MCF10A.

Biological Relevance of Features Picked Out By Models

Model                  Average Recall   Average Exact Match   Average AUROC
original               -                -                     -
CENP-A Feature Masks   -                -                     -
PatchNet-11            50.4%            52.4%                 0.515
CAM                    89.1%            28.4%                 0.452
Grad-CAM               78.2%            32.2%                 0.473

[The "Examples" column of the original table, which contained sample images for each row, cannot be reproduced in text.]

Table 3.3: Comparison of PatchNet-11, CAM and Grad-CAM visualizations with the CENP-A markers, which are associated with heterochromatin regions, a feature of the early onset of cancer. PatchNet is capable of picking out features correlated with the CENP-A markers, while Grad-CAM and CAM pick out biologically irrelevant features. This table can also be found in [37].

3.4 Conclusions

PatchNet provides us with valuable results. First, PatchNet further validates that neural models are capable of picking up on biologically relevant features when diagnosing patients. It also further validates that the nucleus is able to help identify the early onset of cancer and that computational models are capable of picking up on such features. Finally, PatchNet paves the way for sharp feature visualization models that maintain competitive accuracies. If computational models are ever to be widely adopted in hospital settings, pathologists and doctors will appreciate the feature visualization capabilities of such models.


Chapter 4

Co-Culture Experiment

This chapter details the setup of the co-culture experiment and presents examples of the images produced.

4.1 Experiment Setup

The co-culture experiment is set up to mimic the onset of cancer in the human body. Specifically, a clump of cancerous cells is placed in the center of a dish and non-cancerous cells are scattered around the clump. This co-culture is then imaged every day for five days. When imaging, the cells are stained using DAPI, Vimentin and EPCAM. DAPI is used to identify the nuclei of the cells. Vimentin is a biomarker that helps identify the NIH3T3 cells and EPCAM is a biomarker used to identify the MCF7 cells. Figure 4-1 is a schematic of the co-culture experimental setup. It is important to note that once a cell is stained, the cell dies. Hence, the same co-culture is not imaged on all 5 days. Rather, several co-culture experiments are started at the same time, but each one is imaged on a different day.


Figure 4-1: The setup of the co-culture experiment. Cancer (MCF7) cells (marked red in the schematic) are placed as a clump in the center of the collagen matrix. Fibroblast (NIH3T3) cells (marked green in the schematic) are scattered around the clump of MCF7 cells. The experiment is sustained for 5 days. On each day, the cells are stained and imaged. Image credit goes to Karren Yang.

Besides the co-culture cells, the two cell lines are also imaged for five days in control settings. These cells are also stained using DAPI to identify their nuclei prior to being imaged.

The cells in both the co-culture and the control setups are expected to grow and divide as time elapses. The hypothesis is that the fibroblasts in the co-culture model should be activated due to the presence of the cancerous cells. As a result of becoming activated, they should change more than their counterparts in the control group.

4.2 Data

The 3D imaging of the cells is done by Professor G.V. Shivashankar's lab at the Mechanobiology Institute in Singapore. Specifically, the microscope starts from the bottom of the collagen gel and takes images at 0.1 micron intervals (each image is known as a z-slice) as it works its way up the gel. When stacked on top of each other, the images form a 3D reconstruction of the nuclei in the gel. Figure 4-2 shows four consecutive z-slices of a given co-culture image.
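As a minimal sketch, assembling the z-slices into a single volume could look as follows; the per-slice file layout and the tifffile package are assumptions made for illustration.

import numpy as np
import tifffile  # assumed dependency for reading the microscope TIFFs

def load_stack(slice_paths):
    # Stack the 2D z-slices, ordered bottom to top, into one 3D array
    # of shape (num_slices, height, width).
    return np.stack([tifffile.imread(p) for p in slice_paths], axis=0)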

Figure 4-3 displays a single slice of the co-culture experiment imaged on day 1 and on day 5. As seen in these two slices, the number of nuclei on day 1 is far smaller than the number of nuclei seen on day 5.


(a) z=19 (b) z=20

(c) z=21 (d) z=22

Figure 4-2: Four consecutive z-slices of a raw DAPI-stained image of the co-culture are shown above. The total number of slices in this image was 51. These four slices are roughly around the middle of the stack. As one goes from one slice to another, it is possible to see that certain nuclei become more prominent while others fade.


(a) Day 1 (b) Day 5

Figure 4-3: In (a), a slice of the co-culture experiment imaged on the first day is shown. The clump in the middle corresponds to the cancerous cells and the non-cancerous fibroblasts are scattered around the clump. (b) shows a slice of the co-culture experiment imaged on day 5. At this point, the cells have grown and divided and have become activated. The fibroblasts that have become activated have oriented themselves such that they point towards a cancerous clump. Note that both slices come from roughly the middle of their respective stacks.

In addition to the DAPI-stained images, we collect EPCAM- and Vimentin-stained images of the same nuclei. EPCAM and Vimentin are biomarkers that allow us to distinguish between the two types of cells present in the co-culture: EPCAM allows us to identify the MCF7 cell line while Vimentin allows us to identify the NIH3T3 cell line. Figure 4-4 shows the three different stained images of the same slice from day 5 of the co-culture. The EPCAM stains are most intense where the clump of nuclei is (corresponding to the MCF7 cell line) and the Vimentin stains are most intense at the trail of nuclei following the clump, which corresponds to activated NIH3T3 cells. However, not all days of the co-culture have the same intensity of EPCAM and Vimentin stains. As cells become activated, the intensity increases; hence we should see an increase in the intensity of these images as the day number increases. Figure 4-5 shows precisely that.


(a) EPCAM (b) Vimentin

(c) DAPI

Figure 4-4: This triplet of images shows the same slice of the stack from day 5 of the co-culture. (a) is the EPCAM stain of the slice, (b) is the Vimentin stain and (c) is the DAPI stain. In the EPCAM-stained image, pixels are most intense where the cancer cells are (i.e. the clumps of nuclei in the DAPI stain). In the Vimentin-stained image, the pixel intensities are higher in the region of fibroblasts, which corresponds to the long tail of nuclei trailing the larger of the clumps.


(a) Day 1-EPCAM (b) Day 1-Vimentin

(c) Day 5-EPCAM (d) Day 5-Vimentin

Figure 4-5: A comparison of EPCAM- and Vimentin-stained images, where the top row images are slices from the first day of the co-culture and the bottom row images are from the fifth day. The intensities of the stains for both EPCAM and Vimentin are higher on day 5, as expected.


Chapter 5

Analysis of Co-Culture Experiment Images

5.1 Overview

This chapter focuses on the analysis of the images produced by the co-culture experiment. We explain how the images are pre-processed, introduce the machine learning models we run on the images, and state the results on various classification tasks. The overall pipeline can be seen in Figure 5-1.

Figure 5-1: Overall workflow for the analysis of the images. The nuclei in the raw DAPI-stained images are segmented and then either input into VoxNet, a convolutional neural network, or have features extracted from them that are input into a logistic regression model. These models are trained to accomplish various classification tasks.
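For reference, a VoxNet-style 3D convolutional network following the layer sizes of the original VoxNet paper can be sketched in PyTorch as below. The 32 × 32 × 32 single-channel input and the binary output are assumptions based on the patches described later in this chapter; the exact model used in our pipeline may differ.

import torch
import torch.nn as nn

class VoxNet(nn.Module):
    """VoxNet-style 3D CNN for 32x32x32 single-channel nucleus patches."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, stride=2),   # 32^3 -> 14^3
            nn.ReLU(),
            nn.Conv3d(32, 32, kernel_size=3, stride=1),  # 14^3 -> 12^3
            nn.ReLU(),
            nn.MaxPool3d(2),                             # 12^3 -> 6^3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6 * 6, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # x has shape (batch, 1, 32, 32, 32).
        return self.classifier(self.features(x))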


5.2 Pre-processing of the Raw Images

Before training a model, preprocessing has to be done on these raw images to standardize them. The bulk of the preprocessing is segmenting the raw images to extract individual nuclei. In addition, the segmented nuclei from the co-culture have to be labeled as either NIH3T3 or MCF7. We first discuss the segmentation process and then explain how labels are produced for the co-culture nuclei.

5.2.1 Segmentation and Patch Generation

The segmentation workflow consists of several steps. Figure A-1 displays these segmentation steps, while Figure 5-2 shows consecutive z-slices from a segmented MCF7 and a segmented NIH3T3 nucleus. Specifically, the segmentation workflow given a raw DAPI-stained 3D image can be detailed as follows (a code sketch of the workflow is given after the list):

1. Gaussian blur: a Gaussian blur is run on the image to reduce noise and smooth the image

2. Thresholding: a thresholding method (such as Otsu’s method [34]) is used to convert the grayscale image into a binary image

3. Fill holes: Some of the pixels within the nuclei might not have been thresholded properly, making it appear as if there are holes in the nuclei. These holes are filled using standard image processing software functions

4. Watershed: The image is eroded and dilated [41] before running the watershed algorithm [42]. The idea behind the watershed algorithm is to pick starting points and start "flooding" the image until the "water" from two starting points meets, at which point a boundary is drawn between them

5. Size Filter: Some nuclei are so close to each other that watershed identifies them as a single nucleus. To get rid of such cases, we run a size filter that removes any segmented object larger than a specific threshold


6. Convex Hull: The watershed algorithm produces an initial segmentation from which we can identify the 3D objects that it picks up. However, these objects are not exactly the nuclei we want as they might still have holes in them. Due to the 3D nature of these objects, we can find a convex hull that encapsulates the entirety of the points that watershed assigned to this object. This convex hull will also include points in the holes within this object since these holes are bounded by the object in 3D.

7. Extracting the nuclei: Having identified the convex hull of the 3D object produced by watershed, we can create a bounding box around the convex hull and zero out all pixels outside of the convex hull. Thus, we have extracted the nucleus

8. Overexposure and Blurriness check: We then check whether nuclei are overexposed or blurry. The overexposure check is done by checking whether any one slice of the nucleus has more than 80% of its pixels above a certain threshold intensity.

9. Pixel Intensity Normalization: Since technical imaging conditions can show variation on different days, the images are then normalized to have the same mean pixel intensity.
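The following is a minimal sketch of this workflow using SciPy and scikit-image. All parameter values (the Gaussian sigma, the marker spacing, the size and intensity thresholds, the target mean) are illustrative assumptions rather than the exact values used in our pipeline, and the blurriness check is omitted.

import numpy as np
from scipy import ndimage as ndi
from skimage.filters import gaussian, threshold_otsu
from skimage.feature import peak_local_max
from skimage.measure import regionprops
from skimage.segmentation import watershed

def segment_nuclei(volume, sigma=1.0, max_voxels=200_000,
                   overexposure_level=0.9, target_mean=0.5):
    vol = volume.astype(float) / volume.max()  # scale intensities to [0, 1]
    # Step 1: Gaussian blur to reduce noise and smooth the image.
    smoothed = gaussian(vol, sigma=sigma)
    # Step 2: Otsu thresholding to obtain a binary volume.
    binary = smoothed > threshold_otsu(smoothed)
    # Step 3: fill holes left by imperfect thresholding.
    binary = ndi.binary_fill_holes(binary)
    # Step 4: erode and dilate, then run watershed seeded at the peaks
    # of the distance transform.
    binary = ndi.binary_dilation(ndi.binary_erosion(binary))
    distance = ndi.distance_transform_edt(binary)
    lbl, _ = ndi.label(binary)
    peaks = peak_local_max(distance, labels=lbl, min_distance=10)
    markers = np.zeros(vol.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    labels = watershed(-distance, markers, mask=binary)

    nuclei = []
    for region in regionprops(labels):
        # Step 5: size filter to drop objects that merged several nuclei.
        if region.area > max_voxels:
            continue
        # Steps 6-7: crop the bounding box and zero out every voxel that
        # falls outside the convex hull of the object.
        z0, y0, x0, z1, y1, x1 = region.bbox
        crop = vol[z0:z1, y0:y1, x0:x1].copy()
        crop[~region.convex_image] = 0.0
        # Step 8: overexposure check; reject the nucleus if any z-slice
        # has more than 80% of its voxels above the intensity threshold.
        if any((zslice > overexposure_level).mean() > 0.8 for zslice in crop):
            continue
        # Step 9: normalize the mean (non-zero) intensity so that images
        # taken on different days are comparable.
        nonzero = crop[crop > 0]
        if nonzero.size > 0:
            crop *= target_mean / nonzero.mean()
        nuclei.append(crop)
    return nuclei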

To generate patches from the nuclei, we take a 32 × 32 × 32 window and place it on the standardized full nucleus image. If 80% of the portion of the image that falls into that window belongs to the nucleus (this computation is done by counting the number of non-zero pixels in the window), then that portion is kept as a patch. This process is repeated by sliding the window across the height and width of the image with a slide of 4. Thus, from the 32 × 64 × 64 images of the full nuclei, several patches of size 32 × 32 × 32 are generated; a code sketch of this procedure is given below. Figure 5-3 shows the same-level slice from the stacks of two different patches generated from the same nucleus.
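A minimal sketch of the patch-generation procedure, assuming each standardized nucleus is a 32 × 64 × 64 array and implementing the 80% rule as the fraction of non-zero voxels in the window:

import numpy as np

def generate_patches(nucleus, window=32, slide=4, min_fill=0.8):
    depth, height, width = nucleus.shape  # expected (32, 64, 64)
    patches = []
    # Slide the window across the height and width of the image.
    for y in range(0, height - window + 1, slide):
        for x in range(0, width - window + 1, slide):
            patch = nucleus[:, y:y + window, x:x + window]
            # Keep the patch only if at least 80% of the window covers
            # the nucleus (measured by the fraction of non-zero voxels).
            if (patch > 0).mean() >= min_fill:
                patches.append(patch)
    return patches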


(a) MCF7

(b) NIH3T3

Figure 5-2: Example individual nuclei. Consecutive slices of two nuclei are shown above. These nuclei were extracted using the segmentation process described in Section 5.2.1. (a) shows an example MCF7 nucleus while (b) displays an example NIH3T3 nucleus.

(a) slice from patch 1 (b) slice from patch 2

Figure 5-3: Example patches. The same slice from two different patches generated from the same nucleus is shown above.

5.2.2 Labeling Co-Culture Nuclei

Having identified the individual nuclei in the co-culture, producing labels is straightforward. Since the raw images also contain Vimentin and EPCAM stains of the same cells, we check whether the Vimentin biomarker or the EPCAM biomarker is more highly activated in the region of the nucleus to determine whether the cell is NIH3T3 or MCF7, respectively.
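As a minimal sketch, assuming the Vimentin and EPCAM stain volumes are aligned with the DAPI volume and a boolean mask marks the voxels of the segmented nucleus, the labeling rule could be written as:

import numpy as np

def label_nucleus(vimentin, epcam, mask):
    # Compare the mean biomarker intensity within the nucleus region;
    # a stronger Vimentin signal indicates NIH3T3, while a stronger
    # EPCAM signal indicates MCF7.
    return "NIH3T3" if vimentin[mask].mean() > epcam[mask].mean() else "MCF7"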

