
Attojoule Scale Computation of Large Optical

Neural Networks

by

Alexander Sludds

B.S., Massachusetts Institute of Technology (2018)

Submitted to the Department of Electrical Engineering and Computer

Science

in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2019

© Massachusetts Institute of Technology 2019. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

May 18, 2019

Certified by . . . .

Dirk Englund

Associate Professor

Thesis Supervisor

Accepted by . . . .

Katrina LaCurts

Chair, Master of Engineering Thesis Committee


Attojoule Scale Computation of Large Optical Neural

Networks

by

Alexander Sludds

Submitted to the Department of Electrical Engineering and Computer Science on May 18, 2019, in partial fulfillment of the

requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

Abstract

The ultra-high bandwidth and low energy cost of modern photonics offer many opportunities for improving both speed and energy efficiency in classical information processing. Recently, a new architecture has been proposed which allows for substantial energy reductions in matrix-matrix products by utilizing balanced homodyne detection for computation and optical fan-out for data delivery. In this thesis I work towards the analysis and implementation of both analog and digital optical neural networks. For analog optical neural networks I discuss both the physical implementation of this system and an analysis of the limits imposed on it by shot noise, crosstalk, and electro-optic/opto-electronic information conversion. From these results, it is found that femtojoule-scale computation per multiply-and-accumulate operation is achievable in the near term, with further energy gains foreseeable with emerging technology. This thesis also presents a system-scale throughput and energy analysis of digital optical neural networks, which can enable very high data rates (> 10 GHz) with CMOS-compatible voltages and weight-transmitter power dissipation comparable to that of a modern CPU.

Thesis Supervisor: Dirk Englund
Title: Associate Professor


Acknowledgments

I would first like to thank my thesis advisor, Professor Dirk Englund. His consistent vision and encouragement have enabled the work in this thesis as well as my substantial personal development.

I would also like to thank my collaborators Dr. Ryan Hamerly and Liane Bernstein. I believe we make a great team, and I am always inspired by the unique thought processes and ideas that both of you bring, as well as your patience.

I would like to thank Professors Vivienne Sze and Joel Emer for their advice and support. They have taught me to think deeply about what makes a good benchmark and how to weigh system-level tradeoffs, which has in turn helped me think about good experimental next steps for pushing existing benchmarks.

I would like to thank other lab members who have gifted their expertise and experience in many subjects to aid this research. In particular, I would like to thank Christopher Panuski for his wealth of knowledge in applied integrated optics, Christopher Foy and Mohamed Ibrahim for their knowledge of modern CMOS processes and integration with photonics, and Ian Christen for his experience in integrated optical processes.

I would like to thank the National Science Foundation, which has given me a Graduate Research Fellowship. With this funding and support I will seek out positive contributions I can make through research and teaching to better our world.

Finally, I must express my very profound gratitude to my parents and to my friends for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Contents

1 Introduction

2 Motivation: Neural Networks and their Computational Complexity
2.1 Fully-Connected Neural Networks
2.1.1 Back Propagation
2.2 Measures of Neural Network success
2.3 Convolutional Neural Networks (CNNs)
2.4 Computational Complexity of Fully-Connected Neural Networks
2.4.1 Strassen Method
2.4.2 Coppersmith-Winograd Algorithm
2.5 Computational Complexity of CNNs
2.5.1 Fast Fourier Transform for Convolution

3 Analog Optical Neural Networks
3.1 System Architecture
3.2 Energy Analysis and Standard Quantum Limit
3.2.1 Analysis of Shot Noise
3.2.2 Analysis of Thermal Noise
3.2.3 Analysis of Crosstalk
3.2.4 Standard Quantum Limit
3.3 Lowering the Standard Quantum Limit through hardware-aware training
3.4 A discussion of the Landauer Limit
3.5 System Design
3.5.1 Modeling Digital Micromirror Device (DMD) Diffraction Patterns
3.5.2 Polarization and Star Configuration
3.5.3 Current Status of Experiment

4 Digital Optical Neural Networks
4.1 Digital ONN device architecture
4.2 Experiment
4.2.1 Results
4.3 Discussion
4.3.1 Transmitter Source Energy Consumption
4.3.2 Receiver Energy Consumption
4.3.3 Discretization Scheme
4.3.4 Energy Efficiency
4.3.5 Speed of Computation
4.4 Conclusion

List of Figures

2-1 Fully-Connected Neural Network Architecture
2-2 Convolutional Neural Network Architecture
2-3 Toeplitz Matrix Conversion Scheme
2-4 Patch Matrix Method
3-1 Our Analog ONN Architecture
3-2 The ratio of shot noise to thermal noise in our architecture
3-3 Data batching on our ONN architecture
3-4 Definite integral over an incident Gaussian beam, Gaussian mode bleed into neighboring photodetectors, and the generated crosstalk matrix
3-5 Inference and training under the presence of crosstalk
3-6 A comparison of shot noise for several neural network models
3-7 A plot of DRAM access energy for our architecture
3-8 A comparison of shot noise during training and inference
3-9 A high-level overview of our experimental setup
3-10 DMD diffraction pattern imaged onto camera with interference
3-11 DMD device specifications
3-12 Simulated DMD far-field diffraction pattern, all mirrors on
3-13 Experimental DMD diffraction pattern, all mirrors on
3-14 Simulated DMD diffraction pattern, every other row off
3-15 Experimental measurement of far-field DMD diffraction pattern, every other row off
3-16 0th-order diffraction mode imaged onto a camera with all other modes blocked by an iris
3-17 Star configuration for imaging the DMD onto a camera with polarizing beamsplitters shown
4-1 A description of our Digital ONN architecture
4-2 A description of our experimental setup
4-3 A plot of SNR and BER for a digital ONN with 1 MHz bandwidth per transmitter
4-4 A plot of SNR and BER for a digital ONN with 100 MHz bandwidth per transmitter
4-5 A plot of SNR and BER for a digital ONN with 10 GHz bandwidth per transmitter

List of Tables

4.1 Digital ONN Experimental Results
4.2 The two figures of merit for the optical architecture: the energy consumption per MAC and the latency for a matrix-matrix product. The latency is discussed in Subsection 4.3.5

Chapter 1

Introduction

Modern computation is the study and creation of machines which are especially tailored towards processing, combining, and analyzing vast amounts of data. Recently, there has been substantial interest in the creation of computers which can efficiently perform the computation for machine learning. Machine learning has become an indispensable statistical tool that has substantially advanced the fields of machine vision [1], game playing [2], [3], and healthcare diagnostics [4], to name a few. In recent years, deep learning hardware has become increasingly specialized, moving from CPU-based systems to Application Specific Integrated Circuits (ASICs) optimized for machine learning [5], [6]. A reason for this is that the end of Moore's law and Dennard scaling, together with Amdahl's law [7], [8], limits the ability of CMOS to continue scaling the energy efficiency and throughput of general-purpose computation. However, for ASICs there is still a critical problem limiting the ability to scale: copper is a very lossy material. The bandwidth required for modern computing pushes the material limits of copper, so that computation becomes limited by interconnect and memory-access energies [9], [10]. For this reason, memory access is the dominant factor in the energy consumption of modern machine learning hardware [11].

In order to overcome this problem, optical neural networks have been developed which can combine the high bandwidth and high communication efficiency of optics with the cost effectiveness, scalability, and computational efficiency of CMOS electronics. Past research from our group resulted in the design and experimental testing of a photonic integrated circuit capable of performing the multiplication of arbitrary 8 × 8 unitary matrices with a 100 kHz reprogramming speed [12]. This circuit was shown to have high fidelity on vowel recognition tasks, making it a unique proof-of-concept system. This platform may in the future allow for low-power reprogrammable photonic information processing, similar to what an FPGA can perform for electronic signal processing. Other recent research in the field has exploited the novel properties of photonics for computation, such as the explicit implementation of brain-inspired computation through optical spiking neural networks [13], passive computation with millimeter-wave scale diffractive optical neural networks [14], and advanced dataflow implementations with reservoir computation [15]. Recently, we have proposed a large-scale optical neural network which can theoretically enable attojoule-scale computation of fully-connected and convolutional neural networks in both the inference and training phases [16].

In this thesis, we analyze the theoretical limits of this large-scale optical neural network, specifically simulations of the energy limits imposed by shot noise, the accuracy effects of crosstalk on the receiver, and the energy consumption of data access and movement. In Chapter 2 of this thesis, we give a brief overview of modern machine learning, the associated hardware, and the algorithms used in modern deep learning systems. In Chapter 3, we outline the architecture for our proposed analog optical neural network, which uses balanced homodyne detection as a method for computing products. In Chapter 4, we consider an architecture for a digital optical neural network, as well as the limiting factors placed on this system by modern fabrication constraints. Finally, in Chapter 5, we discuss ongoing experiments and expected near-term results.


Chapter 2

Motivation: Neural Networks and their Computational Complexity

Modern neural networks are an emerging tool in science that has allowed for the classification, inference, and processing of large amounts of data with limited human feedback. There are several architectures for neural networks, but the most common architectures currently in use are perceptron (fully-connected) neural networks [17], convolutional neural networks (CNNs) [18], recurrent neural networks (RNNs) [19] [20], and Generative Adversarial Neural Networks (GANNs) [21]. All of these networks, including the CNN through a Toeplitz matrix/patch matrix, have as their fundamental operation the GEneral Matrix Multiplication (GEMM). GEMM operations have long been a bottleneck for neural network computation, with substantial research being performed on specific algorithms and hardware to optimally perform the computation for training and inference of neural network models [22]–[30]. For this thesis, we shall consider fully-connected networks and convolutional networks, with the underlying ideas extending to many other types of networks. This chapter gives a mathematical description of neural networks, as well as their complexity.

2.1 Fully-Connected Neural Networks

Fully-connected neural networks are an architecture which can learn features of a dataset via optimization and accurately perform classification of large datasets. The architecture of a fully-connected neural network, shown in Figure 2-1, consists of several layers of "neurons" which are connected together via "synapses".

Figure 2-1: Here, several layers of a neural network are connected together. Each layer passes its activation values through synapses represented by a weighting matrix. Then, once the next layer of neurons has accumulated all partial products, a nonlinear activation function is applied (typically ReLU). This image is from Stanford's Convolutional Neural Network course, CS231 [31].

After each layer, a nonlinear activation is used, most commonly the Rectified Linear Unit (ReLU):

$$\mathrm{ReLU}(x_i) = \max(0, x_i)$$

This is very easy to compute and does not suffer from the vanishing and exploding gradient problems that significantly hamper training. Often, the final output layer of a neural network uses a softmax activation function, which applies the following function to all inputs:

$$\mathrm{SOFTMAX}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

This activation function allows for a far greater separation between output values. In fact, we can derive the need for the softmax activation function if we consider the cross-entropy of the output. The derivations that follow, of the KL divergence, the softmax, and back propagation, come from the Goodfellow textbook on deep learning [32].
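As a quick sanity check of these two activation functions, here is a minimal NumPy sketch; the function names are mine, and the max-subtraction in the softmax is only a numerical-stability detail, not part of the definition above.

```python
import numpy as np

def relu(x):
    # ReLU(x_i) = max(0, x_i), applied elementwise
    return np.maximum(0.0, x)

def softmax(x):
    # Subtracting max(x) does not change the result but avoids overflow in exp
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(softmax(logits))  # non-negative values that sum to 1
```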


We define cross-entropy as $H(p, q) = \mathbb{E}_p[-\log(q)]$, a measure of the information difference between two distributions $p$ and $q$ over some common dataset. Suppose the output $Y \mid x_1, x_2, \cdots, x_m$ is Bernoulli distributed. The reason for this is that our outcome, given all of the accumulated values in a layer, must give a single classification. We wish to find the activation function under this Bernoulli distribution that maximizes the probability that the given layer values give a correct classification. Note: we can extend this Bernoulli distribution to a multinoulli distribution and the derivation does not change substantially, but it becomes more notationally complex. We know from the Bernoulli distribution that

$$E(Y \mid X) = \Pr(Y = y \mid x_1, \cdots, x_m) = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases}$$

This leads to the overall distribution

$$\Pr(Y = y \mid X) = p^{y}(1 - p)^{1-y}$$

We consider the logit function (also known as log-odds) for the sake of mathematical manipulation. This means:

$$\mathrm{logit}(E[Y \mid X]) = \mathrm{logit}(p) = \ln\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m$$

Taking the exponential of both sides, we find that

$$E[Y \mid X] = p = \mathrm{logit}^{-1}(\beta \cdot X) = \frac{1}{1 + e^{-\beta \cdot X}}$$

If we let $Z$ be a normalization factor, then $\ln(\Pr(Y = Q)) = \beta_Q \cdot X - \ln(Z)$. If we use the constraint $\sum_{k=1}^{K} \Pr(Y = k) = 1$, we find that

$$1 = \sum_{k=1}^{K} \Pr(Y = k) = \frac{1}{Z} \sum_{k=1}^{K} e^{\beta_k \cdot X}$$


Solving for $Z$, we find that $Z = \sum_{k=1}^{K} e^{\beta_k \cdot X}$. So, finally, we find that

$$\Pr(Y = Q) = \frac{e^{\beta_Q \cdot X}}{\sum_{k=1}^{K} e^{\beta_k \cdot X}} \tag{2.1}$$

We have derived that the optimal guess for a probability distribution given an input vector $x$ can be found from the softmax function. However, we are missing one critical component: we do not know how to update the weight values of our neural network so that we obtain the best guess of our probability distribution.

2.1.1 Back Propagation

Back propagation is a tool which will allow us, given an error at the output of our neural network, to optimize the weights of the network to classify a given dataset. The optimization scheme is Gradient Descent:

$$\theta^{t+1} = \theta^{t} - \alpha \frac{\partial E(X, \theta^{t})}{\partial \theta}$$

where $\theta^{t}$ are the weights at iteration $t$, $X$ are the input values, and $\alpha$ is the learning rate. In our formulation of back-propagation, we define the error as:

$$E(X, \theta) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

where $\hat{y}_i$ is the output of a neural network and $y_i$ is the actual value. Note that here we are defining the error as the mean squared error.

For the rest of these calculations, we are going to ignore bias terms to simplify the arithmetic. This is fine because bias values can be folded into the weighting matrix. We now attempt to compute the change in the error with respect to the weighting matrix, which we need in order to perform gradient descent:

$$\frac{\partial E}{\partial w_{ij}^k} = \frac{1}{N} \sum_{d=1}^{N} \frac{\partial E_d}{\partial a_j^k} \frac{\partial a_j^k}{\partial w_{ij}^k}$$

where $E_d = \frac{1}{2}(\hat{y}_d - y_d)^2$. Since $a_j^k = \sum_{l=0}^{r_{k-1}} w_{lj}^k o_l^{k-1}$, we can rewrite the above equation as:

$$\frac{\partial E}{\partial w_{ij}^k} = \frac{1}{N} \sum_{d=1}^{N} \frac{\partial E_d}{\partial a_j^k} \frac{\partial}{\partial w_{ij}^k} \left( \sum_{l=0}^{r_{k-1}} w_{lj}^k o_l^{k-1} \right)$$

The final piece of information that we need before we can calculate the desired quantity is an expression for $\frac{\partial E_d}{\partial a_j^k}$:

$$\frac{\partial E_d}{\partial a_j^k} = \frac{\partial}{\partial a_j^k} \frac{1}{2}(\hat{y}_d - y_d)^2 = \frac{\partial}{\partial a_j^k} \frac{1}{2}\big(f(a_1^m) - y\big)^2 = \big(f(a_1^m) - y\big) \frac{\partial}{\partial a_j^k}\big(f(a_1^m) - y\big) = (\hat{y}_d - y_d)\, \frac{\partial f(a_1^m)}{\partial a_j^k}$$

With these in mind, we will first calculate the weight update for the output layer, and then "back propagate" that value to all of the preceding hidden layers.

Output Layer

The output layer is relatively straight-forward. Here we see that we can write the update equation for the output layer as:

$$\frac{\partial E}{\partial w_{ij}^m} = \frac{\partial E}{\partial a_j^k} \frac{\partial a_j^k}{\partial w_{ij}^m} = \frac{\partial E}{\partial a_j^k}\, o_i^{m-1} = (\hat{y}_d - y_d)\, \frac{\partial f(a_1^m)}{\partial a_j^k}\, o_i^{m-1}$$

From this, we see that the output layer's weight update is only dependent upon $E_d$, $o_i^{m-1}$, and the partial derivative of the activation function from the previous layer with respect to $a_j^k$.


Hidden layers

Hidden layers are a bit more complicated. Their update term will be dependent upon the error from the succeeding layer.

$$\frac{\partial E}{\partial a_j^k} = \left(\sum_{l=1}^{r_{k+1}} \frac{\partial E}{\partial a_l^{k+1}}\, w_{lj}^{k+1}\right) \frac{\partial f(a_j^k)}{\partial a_j^k}$$

$$\frac{\partial E}{\partial w_{ij}^k} = \frac{\partial E}{\partial a_j^k}\frac{\partial a_j^k}{\partial w_{ij}^k} = \frac{\partial f(a_j^k)}{\partial a_j^k} \left(\sum_{l=1}^{r_{k+1}} w_{lj}^{k+1}\, \frac{\partial E}{\partial a_l^{k+1}}\right) \frac{\partial a_j^k}{\partial w_{ij}^k}$$

One detail to note about this derivation is that we have taken the error to be the mean-squared error. To convert to another error function, we simply substitute in the new definition of $\frac{\partial E_d}{\partial a_j^k}$, which in turn gives us $\frac{\partial E}{\partial w_{ij}^m}$. These two fundamental equations are the backbone of modern deep learning (though some research has been done into alternatives to back-propagation [33]–[36]).
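To make these update equations concrete, here is a minimal NumPy sketch of gradient descent on a two-layer fully-connected network with a mean-squared-error loss. All names are illustrative, the network is deliberately tiny, and the data are random.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    return (a > 0).astype(a.dtype)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))                    # batch of inputs
Y = rng.normal(size=(32, 10))                    # targets
W1 = rng.normal(scale=0.1, size=(64, 128))
W2 = rng.normal(scale=0.1, size=(128, 10))
alpha = 1e-2                                     # learning rate

for step in range(100):
    # Forward pass
    a1 = X @ W1                                  # hidden pre-activations a^k
    o1 = relu(a1)                                # hidden activations o^k
    y_hat = o1 @ W2                              # linear output layer

    # Mean-squared-error loss E = 1/(2N) sum (y_hat - y)^2
    err = y_hat - Y

    # Backward pass: delta = dE/da for each layer
    delta2 = err / X.shape[0]                    # output layer
    delta1 = (delta2 @ W2.T) * relu_grad(a1)     # hidden-layer recursion

    # Gradient descent update: theta <- theta - alpha * dE/dW
    W2 -= alpha * (o1.T @ delta2)
    W1 -= alpha * (X.T @ delta1)
```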

2.2 Measures of Neural Network success

The success of a neural network on a given task is primarily measured by its accuracy on that task. Many benchmarking problems, such as ImageNet, are classification-based, meaning that given an input we wish to classify it into one of several categories. However, other measures exist, for example in self-supervised learning, such as the Kullback-Leibler divergence (KL divergence). To motivate the KL divergence, we first review maximum likelihood estimation:

Maximum Likelihood Estimation

Consider a set of $m$ examples $X = (x_1, x_2, \cdots, x_m)$ drawn independently from a true data-generating distribution $p_{\mathrm{data}}(x)$. If $p_{\mathrm{model}}(x; \theta)$ is a family of probability distributions over the same space indexed by $\theta$, then to maximize the likelihood we seek the $\theta$ under which the observed data are most probable. We write this as:

$$\theta_{\mathrm{MaxL}} = \mathrm{argmax}_{\theta}\; p_{\mathrm{model}}(X; \theta) = \mathrm{argmax}_{\theta} \prod_{i=1}^{m} p_{\mathrm{model}}(x_i; \theta)$$

The last step can be made since we are drawing independently from the distribution. In order to get around several computational problems (such as the fact that the product of many small probabilities is susceptible to numerical underflow), we consider the log-likelihood. Since the logarithm is injective (one-to-one) and monotonic, we can rewrite the above as:

$$\theta_{\mathrm{MaxL}} = \mathrm{argmax}_{\theta} \sum_{i=1}^{m} \log\big(p_{\mathrm{model}}(x_i; \theta)\big)$$

This final function leads us to the KL divergence, which we define as the dissimilarity between the distribution estimated from our training data and the model distribution. We define the KL divergence as:

$$D_{\mathrm{KL}}\big(\hat{p}_{\mathrm{data}} \,\|\, p_{\mathrm{model}}\big) = \sum_{i=1}^{m} \Big( \log\big(\hat{p}_{\mathrm{data}}(x_i)\big) - \log\big(p_{\mathrm{model}}(x_i)\big) \Big)$$

We note that the term on the left is a function only of our data-generating process and not the model itself, so in order to minimize the KL divergence we have to minimize $-\sum_{i=1}^{m} \log\big(p_{\mathrm{model}}(x_i)\big)$. Minimizing the KL divergence minimizes the cross-entropy between the distributions, thus making our model match the actual data.
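As a small numerical illustration, written in the distribution (expectation) form of the KL divergence rather than the per-sample sum above, and with made-up probabilities:

```python
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])    # empirical distribution (illustrative)
p_model = np.array([0.6, 0.3, 0.1])   # model distribution (illustrative)

kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)))
cross_entropy = -np.sum(p_data * np.log(p_model))
entropy = -np.sum(p_data * np.log(p_data))

# KL = cross-entropy - entropy; only the cross-entropy depends on the model,
# so minimizing cross-entropy minimizes the KL divergence.
assert np.isclose(kl, cross_entropy - entropy)
```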

2.3 Convolutional Neural Networks (CNNs)

Another very popular form of neural network is the CNN. CNNs exploit the fact that images contain spatial data, so low-level features, such as edges, that we wish to extract from images are based on neighboring values. The idea behind this neural network is to perform successive convolutions of 3D filters with an incoming image.


After each convolutional layer, a non-linear function is applied (once again, we usually choose ReLU for the previously mentioned benefits and use softmax on the output layer). This architecture is shown in Figure 2-2. Back propagation works using a similar principle as in the fully-connected case.

Figure 2-2: Example of a CNN. This image is from a Mathworks webinar [37].

One thing to note is that CNNs are sparse networks, since not all neurons in a layer are connected to all neurons in the next layer. This can prove challenging for computational tasks where cache coherence is important. However, we can use a Toeplitz matrix/patch matrix encoding scheme (Figure 2-3) to convert this sparse convolution problem into a matrix-matrix multiplication problem. First, we will describe the Toeplitz method for performing this operation:


Toeplitz Conversion

Figure 2-3: The top image describes the standard VALID convolution operation be-tween a filter and input feature map. The input feature map is converted to a Toeplitz matrix, which has redundant data, and the filter is flattened into a vector. The cor-responding matrix-vector product gives the convolution. We can then tile according to the bottom image in order to generate batched outputs. Image from [38].
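As a rough sketch of this kind of conversion, here is a simple im2col-style routine (closely related to the Toeplitz construction in Figure 2-3) that builds the patch rows whose product with the flattened filter reproduces the convolution. The VALID padding, stride of 1, single channel, and function name are assumptions for illustration.

```python
import numpy as np

def im2col_valid(image, kh, kw):
    """Flatten every kh x kw patch of a 2D image into a row (VALID padding, stride 1)."""
    H, W = image.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(image[i:i + kh, j:j + kw].ravel())
    return np.array(rows)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

patches = im2col_valid(image, 2, 2)   # shape (9, 4): one row per output position
out = patches @ kernel.ravel()        # matrix-vector product = CNN-style convolution
out = out.reshape(3, 3)               # fold back into the output feature map
```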

Next, we will describe an alternative method (patch matrix) for performing this conversion.


Patch Matrix

Figure 2-4: The input feature map is divided into several patches, each of which has depth C. The kernels are flattened into a matrix with one row per kernel. The patches are flattened in a similar manner into a patch matrix, as described here. The matrix-matrix product then corresponds to the convolutional output, which we can convert back into the output feature map using an inverse patching step. Figure adapted from [16], created by Dr. Ryan Hamerly.

It is important to note that the Toeplitz method and patching method are functionally equivalent.

2.4 Computational Complexity of Fully-Connected Neural Networks

In order to support the basic linear algebra, such as matrix multiplication, that these machine learning architectures require, techniques have been developed which decrease the total number of required multiplications. If we have matrices of size $(B, N)$ and $(N, M)$, the number of multiplication operations that must occur is $BNM$; a naive implementation for square matrices has a complexity of $O(N^3)$. However, there has been significant progress in numerical methods which are able to do matrix multiplication faster. To understand prior work in these numerical methods, I discuss their formulation here:

2.4.1 Strassen Method

The Strassen method works by computing matrix products of $2^n \times 2^n$ matrices. In the event our matrix dimensions are not powers of 2, we pad with zeros to make this true. We then use recursion to break the large $2^n \times 2^n$ matrix into many $2 \times 2$ blocks that we multiply. This $2 \times 2$ matrix multiplication building block is

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}$$

Reformulating this using

$$\begin{aligned}
k_1 &= a(f - h), & k_2 &= (a + b)h, \\
k_3 &= (c + d)e, & k_4 &= d(g - e), \\
k_5 &= (a + d)(e + h), & k_6 &= (b - d)(g + h), \\
k_7 &= (a - c)(e + f), &&
\end{aligned}$$

we can rewrite this product as

$$AB = \begin{pmatrix} k_5 + k_4 - k_2 + k_6 & k_1 + k_2 \\ k_3 + k_4 & k_1 + k_5 - k_3 - k_7 \end{pmatrix}$$

Here we see that our 8 multiplications are now 7! This algorithm, however, suffers from several problems: Strassen has poor numerical precision and usually only proves beneficial for matrix multiplications much larger than what most machine learning models require [39].
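A minimal one-level NumPy sketch of this recombination, applied to the 2 × 2 blocks of a larger (even-dimensioned) matrix and recursing only once for illustration:

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen recursion; block products use ordinary matmul."""
    n = A.shape[0] // 2
    a, b, c, d = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    e, f, g, h = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    k1 = a @ (f - h)
    k2 = (a + b) @ h
    k3 = (c + d) @ e
    k4 = d @ (g - e)
    k5 = (a + d) @ (e + h)
    k6 = (b - d) @ (g + h)
    k7 = (a - c) @ (e + f)

    top = np.hstack([k5 + k4 - k2 + k6, k1 + k2])
    bottom = np.hstack([k3 + k4, k1 + k5 - k3 - k7])
    return np.vstack([top, bottom])

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(strassen_one_level(A, B), A @ B)   # 7 block products instead of 8
```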


2.4.2 Coppersmith-Winograd Algorithm

Winograd is another method with better scaling, applied here to the small matrix-vector product that arises from convolving a length-4 input with a length-3 filter, written with the Toeplitz (overlapping) input matrix:

$$\begin{pmatrix} i_0 & i_1 & i_2 \\ i_1 & i_2 & i_3 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \\ w_2 \end{pmatrix} = \begin{pmatrix} o_0 \\ o_1 \end{pmatrix}$$

Using the substitutions

$$k_1 = (i_0 - i_2)\, w_0, \qquad k_2 = (i_1 + i_2)\, \frac{w_0 + w_1 + w_2}{2}, \qquad k_3 = (i_2 - i_1)\, \frac{w_0 - w_1 + w_2}{2}, \qquad k_4 = (i_1 - i_3)\, w_2$$

gives

$$\begin{pmatrix} o_0 \\ o_1 \end{pmatrix} = \begin{pmatrix} k_1 + k_2 + k_3 \\ k_2 - k_3 - k_4 \end{pmatrix}$$

Here we see that this algorithm requires only 4 multiplications and 8 additions. The Winograd transform has been implemented on several modern GPUs, and prominent researchers have considered the benefits that sparse Winograd methods provide to CNN computation [40]. Winograd proves beneficial for smaller filter sizes (e.g., 3 × 3), while other methods (such as the Fast Fourier Transform, which we describe below) dominate for larger filter sizes.
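A quick NumPy check of this F(2, 3) identity; the input and filter values are arbitrary, and the function name is mine.

```python
import numpy as np

def winograd_f23(i, w):
    """Compute two outputs of a length-3 sliding correlation using 4 multiplications."""
    k1 = (i[0] - i[2]) * w[0]
    k2 = (i[1] + i[2]) * (w[0] + w[1] + w[2]) / 2
    k3 = (i[2] - i[1]) * (w[0] - w[1] + w[2]) / 2
    k4 = (i[1] - i[3]) * w[2]
    return np.array([k1 + k2 + k3, k2 - k3 - k4])

i = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
direct = np.array([i[0]*w[0] + i[1]*w[1] + i[2]*w[2],
                   i[1]*w[0] + i[2]*w[1] + i[3]*w[2]])
assert np.allclose(winograd_f23(i, w), direct)
```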

2.5 Computational Complexity of CNNs

We now discuss the computational complexity of a CNN algorithm. For this analysis we use SAME padding, where the output is the same size as the input, but other padding schemes have the same algorithmic scaling. The computational complexity for an input of size $(N_0, N_0, C)$ and filter $(F_x, F_y, F_c)$ is $O(N_0^2 C F_x F_y F_c)$.

2.5.1 Fast Fourier Transform for Convolution

In an introductory signal processing class, students are introduced to the notion that convolution in real space is dual to multiplication in Fourier space, and vice versa. This prompted researchers to investigate the effectiveness of the Fast Fourier Transform (FFT) for convolutional computation. The FFT reduces the computational complexity of the convolution. Suppose we are given an input array of dimensions

$$(M_1, M_2, \cdots, M_i)$$

A 1D FFT on an array of size $N$ takes $O(N \log N)$. Since a 2D FFT is effectively an extension of a 1D FFT (a quick induction argument shows this is true), the computational complexity becomes $O\big( \big(\prod_{d=1}^{i} M_d\big) \log\big(\prod_{d=1}^{i} M_d\big) \big)$. To see the boost this gives to CNN performance, we consider the 3D filter case. Here we have input data of shape $(N_0, N_0, C)$ and filters of shape $(F_x, F_y, F_c)$. Doing a 2D FFT on the input data (since we do not convolve across channels), the total cost of our FFTs is $O\big(C N_0^2 \log(N_0^2) + F_c F_x F_y \log(F_x F_y)\big)$. The cost of doing elementwise multiplication with the kernel is $O(N_0^2 C F_c)$, and the cost of the inverse FFT is $O\big(N_0^2 F_c \log(N_0^2)\big)$. This gives a substantial speedup over the naive CNN algorithm, which would run in $O(C N_0^2 F_x F_y F_c)$. So, we see from the above that the FFT approach is advantageous, particularly for larger filters.
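As a small illustration of FFT-based convolution, here is a single-channel NumPy sketch with arbitrary sizes; zero-padding to the full output size makes the circular convolution equal the linear one.

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Linear 2D convolution computed via FFT (single channel)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_shape = (H + kh - 1, W + kw - 1)
    F_img = np.fft.fft2(image, out_shape)
    F_ker = np.fft.fft2(kernel, out_shape)
    return np.real(np.fft.ifft2(F_img * F_ker))

image = np.random.rand(32, 32)
kernel = np.random.rand(5, 5)

# Reference via direct shift-and-add ("full" convolution)
ref = np.zeros((36, 36))
for di in range(5):
    for dj in range(5):
        ref[di:di + 32, dj:dj + 32] += kernel[di, dj] * image

assert np.allclose(fft_conv2d(image, kernel), ref)
```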


Chapter 3

Analog Optical Neural Networks

As we have mentioned so far, the problems that dominate modern hardware for deep learning are associated with interconnect energy and memory access energy [41]. Modern GPUs can consume 20 pJ per multiply-and-accumulate operation (MAC), while ASICs specifically designed for deep learning consume 1 pJ/MAC [5]. However, ASICs are still fundamentally based on CMOS technology, meaning that even in the most highly optimized architectures, where all values are kept in register files, a majority of the energy consumption will still come from data movement, not logic [41], [42]. In order to overcome these issues with CMOS, we have developed an architecture that uses balanced homodyne detection as a means of computing the required matrix-matrix multiplication and optical fan-out as a means of decreasing the redundant memory access that bottlenecks throughput and energy efficiency. It is worth reiterating here that these large matrix multiplications are the fundamental operation of fully-connected neural networks.

3.1 System Architecture

As we have discussed before, a fully connected neural network is composed of several layers of artificial neurons which are densely connected together via a weighting matrix. Here we discuss an architecture, designed by Dr. Ryan Hamerly and proposed in [16], which allows for a substantial reduction in the number of components. The architecture is presented in Figure 3-1.

Figure 3-1: Mapping from a fully connected neural network architecture (a) to our photonic implementation (b). Here, (b) implements a single layer of the neural network. Input data is streamed in on the left side as $\vec{x}^{(k)}$, with the amplitude of the electric field proportional to the data value. We feed in the weight values, where $\vec{A}^{(k)}_1$ represents the first row of the matrix. The integral on the right of (b), which has been expanded in the inset, represents balanced homodyne detection, which is explained below. The photodetector accumulates charge over time, reading out only after the GEMM is complete. Figure from [16].

In Figure 3-1 we encode our input $\vec{x}^{(k)}$ and weights $\vec{A}_i^{(k)}$ onto amplitude-modulated coherent light using a time-multiplexing scheme. In particular, the amplitude of the electric field of the input signal at a given timestep is linearly proportional to the value it represents. The same is true for the weights. For the input values we wish to maximize data reuse and reduce memory access. To do this we use optical fan-out, often through a cylindrical lens, as a way of taking the input values and imaging them onto all output photodetectors where they are needed. For the weights we encode the $i$th column of our weight matrix for layer $k$ into a vector $\vec{A}_i^{(k)}$. We step through the values of these column vectors in time as we simultaneously step through the input vector. In this way, corresponding values can interfere on our beamsplitter for balanced homodyne detection.


Here, the detectors on the receiver chip receive the optical signal which is the interference of the input and weight values. Once the necessary charge has been accumulated at the photodetector, it can be read out to have a non-linear function applied by the electronics (ADC/compute unit/DAC). One benefit of this scheme is that electronic readout happens at a rate much slower than the data input, because we operate the photodetector as an integrator, integrating each vector-vector dot product, so we only need to read out after each dot product. If the weight matrix is of size 1000 × 1000, and we encode data at a speed of 1 GHz, electronic readout is performed at 1 MHz (feasible with low-cost hardware such as an FPGA or microcontroller). In addition, we can tolerate higher error in individual multiplications because we only care about the overall accuracy of the accumulated vector-vector dot product.

Balanced Homodyne Detection

A beamsplitter with incident light $x e^{j\phi_1}$ and $A e^{j\phi_2}$ on its two input ports is described by the lossless beamsplitter matrix

$$B = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{j}{\sqrt{2}} \\ \frac{j}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$$

Note here that we have a 90 degree phase shift between the transmitted and reflected light. We place photodetectors at both output ports of the beamsplitter, with equal distance between the beamsplitter and the photodetectors. This leads to the outputs:

$$B \times \mathrm{input} = \begin{pmatrix} \dfrac{x e^{j\phi_1} + A e^{j(\phi_2 + \frac{\pi}{2})}}{\sqrt{2}} \\[2mm] \dfrac{x e^{j(\phi_1 + \frac{\pi}{2})} + A e^{j\phi_2}}{\sqrt{2}} \end{pmatrix}$$

Measuring the power of each output by multiplying by its complex conjugate, we find:

$$\mathrm{Power} = \begin{pmatrix} \dfrac{x^2 + A^2 + A x\, e^{j(\phi_1 - \phi_2 - \frac{\pi}{2})} + A x\, e^{-j(\phi_1 - \phi_2 - \frac{\pi}{2})}}{2} \\[2mm] \dfrac{x^2 + A^2 + A x\, e^{j(\phi_1 - \phi_2 + \frac{\pi}{2})} + A x\, e^{-j(\phi_1 - \phi_2 + \frac{\pi}{2})}}{2} \end{pmatrix}$$

Taking the difference between the two detector powers, we find

$$\frac{A x\, e^{-j(\phi_1 - \phi_2)}\big(e^{j\frac{\pi}{2}} - e^{-j\frac{\pi}{2}}\big) + A x\, e^{j(\phi_1 - \phi_2)}\big(e^{-j\frac{\pi}{2}} - e^{j\frac{\pi}{2}}\big)}{2}$$

Thus, we see that for $\phi_1 - \phi_2 = \frac{\pi}{2}$ we obtain a differential power of $2Ax$, which is linearly proportional to our desired scalar product. One might ask why we choose to amplitude modulate both the $A$ and $x$ input signals rather than phase modulating one amplitude-modulated signal against a constant-amplitude local oscillator reference. The reason is that it is far easier to fabricate one coherent source that generates both the $A$ and $x$ signals than to generate two coherent sources and balance them.
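A quick numerical sanity check of this relationship; the field amplitudes are arbitrary, and this is a sketch rather than a model of a physical detector.

```python
import numpy as np

def balanced_homodyne(x, A, phi1, phi2):
    """Return the differential detector power for fields x*e^{j phi1} and A*e^{j phi2}."""
    in1 = x * np.exp(1j * phi1)
    in2 = A * np.exp(1j * phi2)
    B = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)   # lossless 50/50 beamsplitter
    out = B @ np.array([in1, in2])
    p1, p2 = np.abs(out) ** 2                       # powers at the two detectors
    return p1 - p2

x, A = 0.7, 1.3
signal = balanced_homodyne(x, A, phi1=np.pi / 2, phi2=0.0)
assert np.isclose(signal, 2 * A * x)                # differential power = 2*A*x
```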

3.2 Energy Analysis and Standard Quantum Limit

The two noise sources which dominate in optical data transfer are thermal noise and shot noise. Noise limits the energy at which we can operate, since large receiver noise corresponds to decreased accuracy in the matrix-matrix computation. To find the energy limit of this system, we analyze its thermal noise and shot noise below.

3.2.1 Analysis of Shot Noise

To analyze shot noise we must first create a model of the receiver and its response to individual photons. The following was adapted from a derivation by Dr. Ryan Hamerly [16].

In our architecture, each detector performs balanced homodyne detection. We will first consider what happens when a pulse of electric field amplitude $u$ enters the detector. A Poisson-distributed charge is generated from the optical input:

$$\frac{Q}{e} \approx \mathrm{Poisson}\big(|u|^2\big)$$

where $Q$ is the accumulated charge and $e$ is the electron charge. This distribution has a mean of $|u|^2$ and a standard deviation of $|u|$. Our homodyne signal comes from interfering our input signals, $x_j$, and weight signals, $A_{ij}$, on a beamsplitter. The output of the balanced homodyne detector is

$$y_i = \frac{1}{e}\big(Q_i^{+} - Q_i^{-}\big)$$

If we operate in the limit of many photons per neuron (but not necessarily many photons per MAC, since each detector integrates over an entire vector-vector product), we find that

$$\frac{Q_i^{\pm}}{e} = \sum_j \frac{1}{2}\big(\bar{A}_{ij} \pm \bar{x}_j\big)^2 + w_i^{\pm} \left[ \sum_j \frac{1}{2}\big(\bar{A}_{ij} \pm \bar{x}_j\big)^2 \right]^{\frac{1}{2}}$$

where $w_i^{\pm} \sim N(0, 1)$ is a normal random variable.

This equation represents signal and noise: the left part of the RHS is the measured signal from the photodetector, and the right part of the RHS is shot noise, whose amplitude scales as the square root of the received number of photons.

The subtraction in balanced homodyne detection subtracts the means from each other while the noises add in quadrature. Using this, we can rewrite the above as:

$$\bar{y}_i = \frac{Q_i^{+}}{e} - \frac{Q_i^{-}}{e} = 2\sum_j \bar{A}_{ij}\bar{x}_j + w_i \left[ \sum_j \big(\bar{A}_{ij}^2 + \bar{x}_j^2\big) \right]^{\frac{1}{2}} = 2\bar{A}_{ij}\bar{x}_j + \bar{w}_i \sqrt{|\bar{A}_i|^2 + |\bar{x}|^2} \tag{3.1}$$

Here we have written $\sum_j \bar{A}_{ij}^2$ as $|\bar{A}_i|^2$ for clarity.

We relate the physical values of the variables $\bar{x}$ and $\bar{A}$ to their logical values using scaling constants $\bar{x} = \xi_x x$ and $\bar{A} = \xi_A A$. This logical value represents the scaling factor going from number of photons to either a digital signal or an analog voltage. We now wish to see what the overall layer output is. To do this we must consider the ReLU activation function. We know that:

$$\mathrm{ReLU}(cx) = c\,\mathrm{ReLU}(x) \quad \text{for } c \geq 0$$

So, we can use this property to say that the input into the following layer is $\bar{x}'_i = \alpha\,\mathrm{ReLU}(\bar{y}_i)$. Substituting in the result from Equation 3.1, we see

$$\bar{x}'_i = \alpha\,\mathrm{ReLU}\Big( 2\bar{A}_{ij}\bar{x}_j + \bar{w}_i \sqrt{|\bar{A}_i|^2 + |\bar{x}|^2} \Big)$$

Substituting in our relationship between physical and logical values, we find that

$$\xi_x x'_i = \alpha\,\mathrm{ReLU}\Big( 2\xi_A A_{ij}\, \xi_x x_j + w_i \sqrt{|A_i|^2\, \xi_A^2 + |x|^2\, \xi_x^2} \Big)$$

We let $\alpha = \frac{1}{2\xi_A}$ in order to remove the coefficient in front of the ReLU.

We wish to find a way to remove $\xi_x$ and $\xi_A$ from the above equation so that we can better model the dynamics. The relationship between logical values and number of photons is:

$$n_{\mathrm{tot}} = n^{x}_{\mathrm{tot}} + n^{A}_{\mathrm{tot}} = N' \Big( \sum_j |x_j|^2 \Big) \xi_x^2 + \Big( \sum_{i,j} |A_{ij}|^2 \Big) \xi_A^2$$

We note that since there are $N N'$ operations per layer, the number of photons per MAC is $n_{\mathrm{MAC}} = \frac{n_{\mathrm{tot}}}{N N'}$.


A final assumption we make is that the row vectors of $A_{ij}$ have equal norm; that is, $|A_k| \approx |A_l|\ \forall k, l$. This is a reasonable assumption: because our rows contain many values, we are effectively averaging over a large number of points. Also, we often have regularization on the weights of our neural networks, which makes this assumption more accurate. Because of this we can replace

$$\Big(\sum_j |A_{ij}|^2\Big)^{\frac{1}{2}} \approx \frac{\big(\sum_{i,j} |A_{ij}|^2\big)^{\frac{1}{2}}}{\sqrt{N'}}$$

Combining all of the parts together, we see that:

$$x'_i = \mathrm{ReLU}\left( A_{ij} x_j + w_i \frac{\sqrt{\big(\sum_{i,j} |A_{ij}|^2\big)\big(\sum_j |x_j|^2\big)}}{2\sqrt{N N'}} \sqrt{\frac{1}{n^{x}_{\mathrm{MAC}}} + \frac{1}{n^{A}_{\mathrm{MAC}}}} \right)$$

For a fixed energy per MAC, we can maximize the signal-to-noise ratio by using the same number of photons per MAC for both the weight and input values, i.e. $n^{x}_{\mathrm{MAC}} = n^{A}_{\mathrm{MAC}} = \frac{1}{2} n_{\mathrm{MAC}}$. This gives

$$x'_i = \mathrm{ReLU}\left( A_{ij} x_j + w_i \frac{\sqrt{\big(\sum_{i,j} |A_{ij}|^2\big)\big(\sum_j |x_j|^2\big)}}{\sqrt{N N' n_{\mathrm{MAC}}}} \right) \tag{3.2}$$

We refer to Equation 3.2 as the analytical expression for shot noise in an optical neural network. We will use this expression to model the energy bounds set on our system by shot noise and how it establishes an energy limit for our architecture, which we call the standard quantum limit.
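For intuition, here is a minimal NumPy sketch of this noise model applied to one layer; the layer sizes and photon budget are arbitrary, and the function name is illustrative.

```python
import numpy as np

def noisy_layer(A, x, n_mac, rng):
    """Apply one fully-connected layer with the shot-noise model of Eq. 3.2."""
    N_out, N_in = A.shape
    signal = A @ x
    noise_scale = np.sqrt(np.sum(A**2) * np.sum(x**2)) / np.sqrt(N_in * N_out * n_mac)
    w = rng.standard_normal(N_out)
    return np.maximum(0.0, signal + noise_scale * w)   # ReLU

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 784)) * 0.05
x = rng.random(784)
for n_mac in (0.1, 1.0, 10.0):
    y = noisy_layer(A, x, n_mac, rng)
    # deviation from the noiseless output shrinks as the photon budget grows
    print(n_mac, np.linalg.norm(y - np.maximum(0.0, A @ x)))
```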

3.2.2 Analysis of Thermal Noise

We also must consider whether thermal noise will have a significant impact on the lower energy bound of the neural network architecture. This derivation is an expanded version of the one originally performed by Dr. Ryan Hamerly in the supplemental of [16]. In order to model the thermal noise of the architecture, we will start with a

modified version of Equation 3.1 such that we include a constant noise term. This gives us:

$$\bar{y}_i = 2\bar{A}_{ij}\bar{x}_j + \bar{w}_i \sqrt{|\bar{A}_i|^2 + |\bar{x}|^2 + 2\langle \Delta n_e^2 \rangle} \tag{3.3}$$

Similar to the shot noise derivation, we have several relationships which we can use to reduce this equation. Namely:

$$\bar{x}'_i = \alpha\,\mathrm{ReLU}(\bar{y}_i), \qquad \bar{x} = \xi_x x, \qquad \bar{x}' = \xi_x x', \qquad \bar{A} = \xi_A A$$

and for relating the number of photons per MAC we have:

$$n^{(x)}_{\mathrm{MAC}} = \frac{\sum_i x_i^2}{N}\, \xi_x^2, \qquad n^{(A)}_{\mathrm{MAC}} = \frac{\sum_{i,j} A_{ij}^2}{N N'}\, \xi_A^2$$

Finally, we again assume that the norms of the rows of the $A$ matrix have the same magnitude and that $n^{(A)}_{\mathrm{MAC}} = n^{(x)}_{\mathrm{MAC}} = \frac{1}{2} n^{\mathrm{total}}_{\mathrm{MAC}}$.

We can use these relations along with Equation 3.3 to write:

$$x'_i = \mathrm{ReLU}\left( A_{ij} x_j + w_i \frac{|A||x|}{\sqrt{N N' n_{\mathrm{MAC}}}} \sqrt{1 + \frac{2\langle \Delta n_e^2 \rangle}{N n_{\mathrm{MAC}}}} \right)$$

Here we see that the $\sqrt{1 + \frac{2\langle \Delta n_e^2 \rangle}{N n_{\mathrm{MAC}}}}$ term includes both shot noise and thermal noise. If $N n_{\mathrm{MAC}} \ll 2\langle \Delta n_e^2 \rangle$, thermal noise dominates relative to shot noise, but if $N n_{\mathrm{MAC}} \gg 2\langle \Delta n_e^2 \rangle$, shot noise dominates.

If we assume that we operate with a femtofarad-capacitance receiver, which has been proposed as a theoretical design for low energy-per-bit optical receivers [9] and is being enabled by recent advances in the field [43], we can estimate $\langle \Delta n_e^2 \rangle = \frac{kTC}{e^2}$, as is standard for thermal ($kTC$) noise.

Thus, the ratio of thermal noise to shot noise is

$$\frac{\langle \Delta n^2 \rangle_{\mathrm{thermal}}}{\langle \Delta n^2 \rangle_{\mathrm{shot}}} = \frac{2kTC}{e^2 N n_{\mathrm{MAC}}}$$

For $T = 300\,\mathrm{K}$ we generate the following plot of this ratio:

Figure 3-2: Ratio of thermal noise to shot noise for a small (100, 100) × (100, 100) GEMM and a large (1000, 1000) × (1000, 1000) GEMM. As the number of photons per MAC increases, the ratio of thermal noise to shot noise decreases. For larger networks, shot noise dominates at > 1 photon per MAC.

As illustrated in Figure 3-2, for larger neural networks, we operate in a shot-noise-limited regime. Therefore, the lower energy bound of our optical neural network will be defined by the number of photons required to overcome shot noise, which we term the standard quantum limit (SQL).
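A quick numerical check of this ratio, assuming a 1 fF receiver capacitance and T = 300 K; the layer sizes and photon counts are illustrative.

```python
k_B = 1.380649e-23   # J/K
e = 1.602176634e-19  # C
T = 300.0            # K
C = 1e-15            # F (femtofarad-class receiver, assumed)

def thermal_over_shot(N, n_mac):
    return 2 * k_B * T * C / (e**2 * N * n_mac)

for N in (100, 1000):
    for n_mac in (0.1, 1.0, 10.0):
        print(N, n_mac, thermal_over_shot(N, n_mac))
```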

3.2.3 Analysis of Crosstalk

A concern for the practical implementation of this Optical Neural Network (ONN) architecture is optical crosstalk from spatial modes overlapping at the receiver. In order to characterize the effects of crosstalk, what contribution it may have to the standard quantum limit, and whether the effects of crosstalk can be removed by training under the presence of crosstalk, we created simulation software which models the crosstalk of the system.

Figure 3-3: Here we see that inputs $M_1$ and $M_2$ are interfered on the beamsplitter and fanned out to the receiver. We define the axis $m$ as the batch dimension of the input data.

Suppose that we are performing a matrix-vector product with the ONN architecture. Using the terminology of Deep Neural Network (DNN) hardware designers, the architecture is output stationary, which means that the partial products are accumulated at the output and are not moved back into memory or moved around between separate processing elements. This means crosstalk presents itself as a bleeding across neighboring photodetectors in our receiver. To better demonstrate how batched information is passed into our scheme, I show an image (Figure 3-3) created by Dr. Ryan Hamerly for our ONN paper [16].

In Figure 3-3 we see that the rows of our receiver carry batch information and the columns are the results of applying the weight matrix $M_2$ to a given input value in $M_1$. For this reason, crosstalk on our architecture is a cross-batch phenomenon.

Assuming the spatial modes incident on our receiver are Gaussian, crosstalk can be represented as a definite integral over select areas of that Gaussian corresponding to the active regions of the photodetectors in the receiver. Because of this, we implement crosstalk as a 3 × 3 kernel in a TensorFlow 2 custom layer, which we then convolve across the batch dimension of our output activation data; a sketch of such a layer is shown below. High-level implementation details are demonstrated in Figure 3-4.
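A rough sketch of such a layer follows. The kernel values here are placeholders; in our simulations they come from the Gaussian overlap integrals illustrated in Figure 3-4, and the (batch, neuron) grid is treated as a single-channel image to mirror the physical receiver layout.

```python
import tensorflow as tf

class CrossBatchCrosstalk(tf.keras.layers.Layer):
    """Blur a (batch, neurons) activation map with a fixed 3x3 crosstalk kernel (sketch)."""
    def __init__(self, kernel=None, **kwargs):
        super().__init__(**kwargs)
        if kernel is None:
            # Illustrative values only; the real kernel comes from the overlap integrals.
            kernel = [[0.01, 0.02, 0.01],
                      [0.02, 0.88, 0.02],
                      [0.01, 0.02, 0.01]]
        self.kernel = tf.constant(kernel, dtype=tf.float32)

    def call(self, activations):
        # Treat the (batch, neuron) activation grid as a single-channel image,
        # matching the receiver array (rows = batch, columns = neurons).
        img = activations[tf.newaxis, :, :, tf.newaxis]     # (1, B, N, 1)
        k = tf.reshape(self.kernel, (3, 3, 1, 1))           # (H, W, in_ch, out_ch)
        blurred = tf.nn.conv2d(img, k, strides=1, padding="SAME")
        return blurred[0, :, :, 0]
```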

Figure 3-4: A shows the definite integral over an incident Gaussian beam, representing the fraction of power absorbed by a photodetector. B visualizes an incident Gaussian mode bleeding over into neighboring photodetectors. Note here that there is a non-unity fill factor on the receiver array. C shows the generated crosstalk matrix.

We run these simulations while sweeping both the spot diameter and the fill factor for various models. Results are shown in Figure 3-5.

Results and Discussion

Figure 3-5 shows that when we train under the presence of crosstalk, we can recover most of the accuracy lost to crosstalk. Crosstalk, in this case, is a cross-batch phenomenon, but the weight matrix that we are altering has no knowledge of the batch size, so in some sense we are able to compensate for this cross-batch effect anyway. Currently I have no theory to describe or examine this behavior, but Liane Bernstein and/or I may investigate it in the future.


Figure 3-5: (Top) We sweep the spot diameter at a fill factor of 25% for both the LeNet [44] and CIFAR10 models, in both training and inference. (Bottom) For both LeNet and a CIFAR10 model, we sweep the fill factor for a fixed lens design, in both inference and training + inference.

3.2.4 Standard Quantum Limit

Now that we have considered shot noise, thermal noise, and crosstalk and concluded that the system is limited by shot noise, we wish to consider the energy bound that this Standard Quantum Limit (SQL) places on our system. A Keras [45] custom layer is created that simulates the insertion of shot noise at the receivers in our ONN architecture. Using this layer, we were able to simulate the error rate of various models trained on the MNIST dataset. The results are shown and discussed in Figure 3-6.



Figure 3-6: Here (a) shows the models that are being used as well as their zero-noise error rate. (b) shows a plot of how networks which we trained without the presence of shot noise respond to shot noise in inference. We see that as we increase the total number of photons used that our error rate drops substantially. (c) highlights the energy consumption of our ONN as we increase the number of neurons per layer.

It is now worth discussing why we believe that this limit, which is fundamental to the quantization of light into photons, bounds our energy consumption as opposed to, for example, memory read energy. There are four other sources of significant energy consumption in our system:

• Energy for memory read
• Energy for opto-electronic conversion (ADC)
• Energy for non-linear function computation
• Energy for electro-optic conversion (DAC)


Energy from Memory Read

An order-of-magnitude approximation for the energy consumption of a DRAM access is 100 pJ per access [8], [41]. This means that for a 1000 × 1000 matrix we will require 100 µJ of energy in order to access all of its elements. We will also require an additional 1000 accesses in order to get our input values. However, we will be performing $10^9$ MACs in the matrix-matrix product. Below is a quick chart that describes the energy consumption of DRAM memory access for a naive standard architecture and our ONN architecture.

Figure 3-7: This plot compares the DRAM access energy per MAC for a standard CMOS architecture not specifically tuned for GEMM operations versus the ONN architecture. The dotted red line represents the energy consumption per MAC of the TPU architecture.

Figure 3-7 shows that the DRAM access energy per MAC for our architecture falls as we move to larger networks. We see that for larger batch sizes and numbers of neurons, our memory access costs can be substantially lower than those of standard architectures. Other memory technologies, such as SRAM, have lower access energies, which can help to enable femtojoule-scale computation.
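A back-of-the-envelope check of these numbers, using the 100 pJ/access figure from above; the matrix dimension and batch size are illustrative.

```python
E_ACCESS = 100e-12            # J per DRAM access (order-of-magnitude figure from the text)

def dram_energy_per_mac(N, B):
    weight_accesses = N * N   # each weight read once (optical fan-out removes re-reads)
    input_accesses = N * B    # each input value read once
    macs = N * N * B          # MACs in the (N, N) x (N, B) product
    return E_ACCESS * (weight_accesses + input_accesses) / macs

for N in (100, 1000, 10000):
    print(N, dram_energy_per_mac(N, B=N))   # falls roughly as 1/N
```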

Energy for activation functions: ADC conversion + Computation + DAC Conversion

As [16] mentions, reasonable near-term estimates for ADCs and modulators are a few pJ per sample, and the cost of performing basic arithmetic on a digital device like an FPGA or microcontroller is a few pJ. Thus, in the near term, the energy cost will be a few pJ per neuron in order to apply the activation function. If we divide by the number of accumulated partial products, we see that sub-fJ-per-MAC computation is realizable for computing activation functions.

3.3 Lowering the Standard Quantum Limit through hardware-aware training

It is well known in the machine learning hardware community that systematic errors can be mostly removed at run time by training under the presence of those errors [46]. One prominent example is the error incurred by converting continuous variables to discrete ones, also known as quantization. This is done heavily when converting trained 32-bit floating-point models to 8-bit integer models. As a consequence, models destined for production are increasingly trained under the presence of quantization noise at every step, so that when the model is ready to be shipped, the weights can be quantized without significant extra work. Recently, functions were added to TensorFlow 2 which add quantization noise to all layer parameters during training [47]. For this reason, we investigated the feasibility of reversing the effects of shot noise and thermal noise on our ONN architecture with in situ training. Since we found above that thermal noise is not a significant noise source for larger neural networks, I consider only the shot-noise model here.

In Keras, we created a custom layer which injects shot noise during both training and inference; a rough sketch of such a layer is shown below. From doing this we found the results shown in Figure 3-8.
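This sketch is not the exact code used to produce Figure 3-8: the noise scale follows Equation 3.2, and the interface that passes the kernel and the layer input alongside the pre-activation is an assumption made for illustration.

```python
import tensorflow as tf

class ShotNoise(tf.keras.layers.Layer):
    """Inject the Eq. 3.2 shot-noise term after a Dense layer's matrix product (sketch)."""
    def __init__(self, n_mac=1.0, **kwargs):
        super().__init__(**kwargs)
        self.n_mac = n_mac

    def call(self, inputs):
        # inputs = (pre-activation y, kernel A, layer input x); noise is applied in
        # both training and inference, matching the in situ training scheme.
        y, A, x = inputs
        N = tf.cast(tf.shape(A)[0], tf.float32)        # fan-in
        N_prime = tf.cast(tf.shape(A)[1], tf.float32)  # fan-out
        scale = tf.sqrt(tf.reduce_sum(A**2) * tf.reduce_sum(x**2, axis=1, keepdims=True))
        scale = scale / tf.sqrt(N * N_prime * self.n_mac)
        return y + scale * tf.random.normal(tf.shape(y))
```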

Figure 3-8: Here shot noise has been added during both training and inference. Each of the curves represents a different number of photons per MAC during training. The x-axis represents the number of photons per MAC during inference. One thing we note from these plots is that we do not have to choose the training photon count exactly in order to get close to ideal shot-noise resilience from our ONN architecture. In addition, we see that we are able to overcome the accuracy drop that shot noise can cause, and we can push the SQL even lower than previously discussed.

A different way to look at the fundamental energy bound of the system would be from the point of view of entropy. Thus, the next section will discuss the Landauer limit.


3.4 A discussion of the Landauer Limit

The Landauer limit [48] is a bound on the energy for non-reversible computations. It is set by $kT \ln 2 = 2.87 \times 10^{-21}\,\mathrm{J} = 2.87$ zeptojoules at 300 K. There are two things to note about our architecture. The first is that interference on a beamsplitter is a reversible process, since no information is discarded. The second is that we integrate the charge and only read out from the photodetector once per vector-vector dot product. This means that we only perform an irreversible operation once for every $N$ computations that we perform. Because of this, the ONN can operate well below the Landauer limit set for CMOS technologies, but is limited by a (much lower) thermodynamic limit specific to this architecture. This new thermodynamic limit is lower than the standard quantum limit.

A limit that many discuss for reversible computation is the quantum limit of computation, which comes from the energy-time form of the uncertainty principle. It says that the average energy for computation, $E$, is limited by $E \Delta t \geq \frac{\pi \hbar}{2} \implies E \geq \frac{\pi \hbar}{2 \Delta t}$ [49]. If we assume a similar computation time of 1 ns per multiply, then this limit is $1.66 \times 10^{-25}\,\mathrm{J}$. However, this absolute quantum limit of computation is not relevant within the foreseeable future, even within the field of quantum computing.
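A quick numerical check of the two limits quoted in this section, with the physical constants hard-coded:

```python
import math

k_B = 1.380649e-23      # J/K
hbar = 1.054571817e-34  # J*s

landauer = k_B * 300 * math.log(2)              # ~2.87e-21 J at 300 K
quantum_limit = math.pi * hbar / (2 * 1e-9)     # energy-time bound for a 1 ns operation

print(f"Landauer limit at 300 K: {landauer:.2e} J")
print(f"Quantum limit for a 1 ns operation: {quantum_limit:.2e} J")
```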

3.5 System Design

Now that the theoretical merits of our optical neural network have been discussed, we will address its physical implementation, occurring in multiple stages. First, a proof of concept system will be designed, showcasing balanced homodyne detection as a means of computation. Future steps will focus on integrating modulators on-chip and using glass interposers as a means of generating arrays of coherent spatial modes. The experimental implementation currently under construction is discussed below.


Figure 3-9: This image shows a high-level description of our experimental setup. A coherent (laser) source is imaged onto both a DMD and a phase shifter. Then, the reflected DMD signal is interfered with the phase-shifted signal. The output is imaged onto a CMOS camera (sometimes referred to as the receiver).

Our initial experiment is shown in Figure 3-9. A digital micromirror device (DMD) is used to modulate the amplitude of a signal representing the weight matrix. A phase modulator, represented here by $\phi$, is used to take the same coherent source and vary its phase relative to the DMD signal. Both the inputs and weights are interfered on a beamsplitter and the final result is measured by the camera. In this experiment, we measure a matrix-vector product over two stages, in phase and out of phase: in the in-phase stage $\phi = 0$, and in the out-of-phase stage $\phi = \pi$.

However, when we first started using the DMD, we observed optical imperfections. For example, the striations shown in Figure 3-10 are a result of imaging off-axis modes onto the DMD simultaneously, creating interference, as we will show in the next section.


Figure 3-10: Here the DMD’s diffraction pattern is imaged onto a camera. Note that alignment is not perfect, resulting in ghosting in the image. However, the striations in the image are not a function of poor optical alignment.

To better understand the observed pattern, we simulated the DMD by modeling it as a diffraction grating.

3.5.1 Modeling Digital Micromirror Device (DMD) Diffraction Patterns

The DMD is inherently a diffraction grating, since the mirrors of the DMD form a periodic reflective structure with a non-unity fill-factor. Most DMDs are a 2D blazed diffraction grating.

Diffraction Model

The DMD that we wish to model has the specifications shown in Figure 3-11.

Figure 3-11: (a) En face view of the DMD. The axis of rotation of the micro-mirrors is aligned with the vertical axis, and the mirror sizes and spacings are given. (b) Side-on view of an individual micro-mirror and how light is reflected.

To model the diffraction pattern of this device, we consider a blazed diffraction grating using the grating equation:

$$\sin(\theta_{\mathrm{out}}) + \sin(\theta_{\mathrm{in}}) = \frac{m\lambda}{d}$$

where d is the period of the grating, λ is the wavelength of incident light and m is the mode order.

Because this device is a mirror, we must always have $\theta_{\mathrm{out}} = \theta_{\mathrm{in}}$.

The DMD mirrors tilt by $\theta = 12$ degrees with respect to the neutral (flat/horizontal) position. Because of this, $\theta_{\mathrm{out}} = 2\theta - \theta_{\mathrm{in}}$. For simplicity, because the mirror axis of our DMD in Figure 3-11 is diagonal, we rotate it by 45 degrees in our model so that we obtain a 2D blazed grating. In this new coordinate system, we define $\theta'_{\mathrm{out}} = \tan^{-1}\!\big(\frac{\tan(\theta_{\mathrm{out}})}{2}\big)$ and $\theta'_{\mathrm{in}} = \tan^{-1}\!\big(\frac{\tan(\theta_{\mathrm{in}})}{2}\big)$. We know that the phase shift between neighboring pixels is $\frac{2\pi d}{\lambda}\big(\sin(\theta'_{\mathrm{in}}) + \sin(\theta'_{\mathrm{out}})\big)$. This gives us the left image in Figure 3-12, where we see columns of constant phase on the DMD, as we expect (since we rotated our coordinate system 45 degrees). Using an FFT, we project this phase pattern into the far field, giving the right image in Figure 3-12. We note that the diffraction modes in the far field have sinc functions superimposed on them. This is because the DMD pixels effectively multiply by a box-shaped filter; in the frequency domain, this looks like the convolution of our delta-function modes by a sinc function.
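A condensed sketch of this DMD simulation; the array size, wavelength, pitch, and incidence angle below are placeholders rather than the device values in Figure 3-11.

```python
import numpy as np

# Placeholder parameters (illustrative, not the actual device values)
n_mirrors = 256           # mirrors per side after the 45-degree rotation
d = 7.6e-6                # mirror pitch [m]
wavelength = 633e-9       # [m]
theta_in = np.deg2rad(24.0)
theta_out = 2 * np.deg2rad(12.0) - theta_in   # reflection off a 12-degree tilted mirror

# Phase advance per column of the (rotated) grating
dphi = 2 * np.pi * d / wavelength * (np.sin(theta_in) + np.sin(theta_out))
cols = np.arange(n_mirrors)
phase = np.tile(dphi * cols, (n_mirrors, 1))  # columns of constant phase

mirrors_on = np.ones((n_mirrors, n_mirrors))  # e.g. zero out every other row to steer
field = mirrors_on * np.exp(1j * phase)

# Far field = 2D FFT of the near-field pattern
far_field = np.fft.fftshift(np.fft.fft2(field))
intensity = np.abs(far_field) ** 2
```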


Figure 3-12: (Left) Phase pattern of the DMD. (Right) Far-field diffraction pattern.

We find that this simulation matches the measured far-field diffraction pattern, as shown in Figure 3-13:


Figure 3-13: Here we see the far field diffraction pattern when all of the DMD mirrors are on.

We want to keep only the zero-order diffraction mode, and for it to be on the optical axis, to minimize optical aberration. Therefore, an encoding scheme is required such that the diffraction pattern is on-axis.

To achieve this, we simply turn off every other row of the DMD. A simulation of this method is shown here in Figure 3-14.


Figure 3-14: (Left) The DMD configuration with every other row of mirrors off. (Right) The far-field diffraction pattern of this configuration.

Using the same setup as we used for Figures 3-13 and 3-10, we apply this new DMD configuration and then place a piece of paper after a focusing lens so that we can see the far-field diffraction pattern.


Figure 3-15: Far-field diffraction pattern measured from the DMD. There is a bright 0th-order mode at the center of the diffraction pattern. The spacing between the 1st-order modes and the 0th-order mode matches simulation. Note that the brightest off-axis spots in this diffraction pattern are the up, down, left, and right spots, whereas in simulation they are the top-left, top-right, bottom-left, and bottom-right spots. This is because the simulation is rotated 45 degrees relative to reality.

Because of time constraints I could not set up the DMD in the same interferometer configuration to image the full diffraction pattern onto the camera. However, when I did this with the earlier setup I did not see any interference patterns, just ghosting. Finally, to represent what we will do in our finished interferometer, we use an iris to pass only the 0th-order mode and then image that mode onto the camera, as shown in Figure 3-16.


Figure 3-16: 0th order diffraction mode imaged onto a camera with all other modes blocked by an iris.

3.5.2 Polarization and Star Configuration

We wish to image the DMD's surface onto a camera. However, the "on" state of the DMD is not normal to the DMD surface; rather, it is 12 degrees off axis. As a result, Dr. Ryan Hamerly devised the star configuration below as a means of imaging the DMD onto the camera. This has the benefit that it can be very compact, minimizing diffraction. In addition, we can modulate the amplitude of the two polarizations separately, as we will describe below.

One way to think about the combination of two polarizations on a beamsplitter is that the $\hat{s}$ and $\hat{p}$ polarizations are orthogonal, so when they combine on a polarizer oriented along $\hat{s} + \hat{p}$, they form a wave with half the power at polarization $\hat{s} + \hat{p}$. In this way, we can recombine the signals from the s and p arms of the interferometer. It is not shown here, but we can replace one of the mirrors in the interferometer with the DMD.

We designed the star configuration, shown in Figure 3-17, so that we can properly image the face of the DMD (whose mirrors are tilted by 12 degrees in the on state) onto a camera.


Figure 3-17: Here we can see that the $\hat{s}$ and $\hat{p}$ polarizations are split by our polarizing beamsplitter, recombined on the second beamsplitter, and all of the light is sent to the camera.

3.5.3 Current Status of Experiment

Currently we are in the final stages of building the interferometer. The final step for the optical setup is to properly image the DMD’s surface onto the camera. Once this is complete, significant effort will be put into creating the timing system to ensure that the DMD and camera stay in sync with each other. In this timing scheme the DMD is the master clock, telling all other devices when it has displayed a new signal. We pass this clock to a microcontroller, which applies any required time delay to the signal. Then, the microcontroller sends a trigger to the camera to capture the incoming signal. The captured frame is sent over USB to the computer at a high frame rate, where it is stored and used in a later layer.


Chapter 4

Digital Optical Neural Networks

Machine learning is a tool for the extraction of patterns from vast amounts of information, allowing for fast inference on data without the need for explicit instructions. The application of machine learning tools, especially “Deep Learning”, has received enormous attention in academia and industry, warranted by advances in areas such as natural language processing [50], computer vision [51] [52], healthcare [53], robotics [54] and game playing [2] [55]. The use of general-purpose computing hardware, such as a CPU, for performing deep learning is inefficient. As a result, significant effort has been made by industry and academia to generate new hardware architectures which target artificial neural networks (ANNs) and deep learning. Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), and Application-Specific Integrated Circuits (ASICs), including Google’s TPU [5], have improved both the energy efficiency and speed of computation for learning tasks. However, this hardware is limited by the energy required to move data through metal interconnects and by the redundant memory access required to use ANN models [11] [7] [9] [56] [57] [10].

Optical Neural Networks (ONNs), implemented completely optically or through an opto-electronic architecture, offer a promising alternative to microelectronic implementations [58][16]. However, these analog computing schemes require careful calibration in order to be compatible with digital computation systems. Digital ONNs, where data is encoded digitally, are a promising architecture to host ANNs for several reasons: (1) Optical fanout in free space allows for the transmission of data from one transmission point to many receivers without transmitting through a lossy metal waveguide. This enables the elimination of redundant memory access and a substantial decrease in interconnect energy. (2) Photonic systems have been designed which allow for the modulation and detection of data at speeds exceeding 100 GHz with minimal energy consumption [59] [58]. Fanout and the reduction of memory access improve the speed at which ONNs can operate. (3) Digital architectures allow for easy integration of optical components with modern foundry processes, such as CMOS. These features can enable an ONN to operate at substantially higher speed with lower energy than its electronic counterparts. However, implementing optical interconnects on chip with CMOS compatibility has been a major technical barrier [60], [61]. Our technology seeks to address this problem by providing a free-space solution that is highly scalable and parallelizable.

Here, we begin with a theoretical proposal of a digital ONN architecture for the implementation of general deep learning algorithms. The speed and power efficiency of our architecture come from the use of high-speed and low-energy silicon modulators as a means of fanning data out to where it is most needed on the receiver. Within the design constraints of modern CMOS and silicon processes, this architecture can outperform state-of-the-art electronic designs by orders of magnitude in speed with comparable energy efficiency, while remaining highly parallelizable and scalable. Next, we experimentally demonstrate the principle of our system by using free-space optical components which allow for the transfer and inference of ANN models. To test the performance of this theoretical proposal, we benchmarked the experimental system’s performance on the MNIST handwritten digit recognition dataset, which verified the validity of the encoding scheme that we propose.

4.1 Digital ONN device architecture

An ANN consists of several layers of artificial neurons, each one represented by a circle in Figure 4-2(b). In each layer, information is passed to the succeeding layer via a linear operation, such as matrix multiplication. Following the application of this weighting matrix, a nonlinear operation, often called the activation function, is applied. In addition, many popular types of neural networks, such as Recurrent Neural Networks [20] and Generative Adversarial Networks [21], are highly dependent on matrix-matrix multiplication.
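As a concrete illustration of one such layer, the operation is simply a matrix-vector product followed by an elementwise nonlinearity. The sketch below uses an assumed ReLU activation and random weights (shapes chosen to match the 196-input, 100-neuron hidden layer used later in the experiment), not the trained network itself.

```python
import numpy as np

def relu(x):
    # One common choice of activation function.
    return np.maximum(x, 0.0)

def layer(x, W, b):
    """A single ANN layer: linear weighting followed by a nonlinear activation."""
    return relu(W @ x + b)

# Example shapes matching the MNIST network described in the next section:
x = np.random.rand(196)          # flattened 14x14 input image
W = np.random.randn(100, 196)    # weighting matrix of the 100-neuron hidden layer
b = np.zeros(100)
h = layer(x, W, b)               # hidden-layer activations
```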

Our architecture is depicted in Figure 4-1.

Figure 4-1: Architectural design. (a) Two rows of transmitters fan out their data to a receiver array, with the routing described in (b). An individual pixel of the receiver is shown in (c), which depicts a potential Talbot grating and demultiplexing system. (d) This system not only allows inference, but can also include logic for training and back-propagation. (a,b,d) are from [16].

4.2 Experiment

To demonstrate the principle of this architecture, an experimental implementation was designed. Its components included a single-board computer, a CMOS CCD camera, an LED display, and an optical imaging relay.

Figure 4-2: Experimental implementation. A) The physical layout of the experimental implementation, with the receiver in the foreground and the LED and weight-array transmitter in the background. B,C) A neural network under consideration, colored so that we can see how data is transmitted in time.

The layout of the physical setup is described by Figure 4-2(a). If we consider an example network such as the one in Figure 4-2(b), we step through the data as described in Figure 4-2(c). For each neuron in the input, we display all of the weights it would be multiplied by on the weight transmitter. Then, we use the LED to transmit the input signal simultaneously with the weight values. The CCD camera acts as the logic and processing stage, transferring the data back to the single-board computer for storage and processing.
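The data flow just described — one input neuron per time step, with all of its weights displayed simultaneously — can be emulated in a few lines. This is only a sketch of the scheme (illustrative shapes, no camera or display control code):

```python
import numpy as np

def time_multiplexed_matvec(W, x):
    """Emulate the transmission scheme: at step k the LED transmits x[k] while the
    weight transmitter displays column k of W, and each receiver pixel accumulates
    its partial product."""
    n_out, n_in = W.shape
    acc = np.zeros(n_out)               # one accumulator per receiver pixel
    for k in range(n_in):               # one displayed frame per input neuron
        acc += W[:, k] * x[k]           # partial products accumulated locally
    return acc

W = np.random.randn(100, 196)
x = np.random.rand(196)
assert np.allclose(time_multiplexed_matvec(W, x), W @ x)   # same result as a matmul
```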

This hardware was used to verify the performance of networks trained on two datasets: the MNIST digit classification dataset [62] and the IMDb movie review dataset [63]. The MNIST dataset is a collection of 28x28 grayscale images of handwritten digits. For this architecture we use a neural network with an input layer, a hidden layer with 100 neurons, and an output layer. For MNIST, in order to reduce the runtime, we downsample the images from 28x28 to 14x14 using cubic interpolation; this is done without a significant reduction in accuracy.
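The downsampling step can be reproduced with any standard cubic-interpolation resize; the snippet below uses scipy.ndimage.zoom as one possible implementation, which is an assumption rather than the exact routine used here.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_mnist(image_28x28):
    """Downsample a 28x28 grayscale digit to 14x14 with cubic interpolation and
    flatten it for the 196-input network described above."""
    small = zoom(image_28x28.astype(np.float64), 0.5, order=3)   # order=3 -> cubic spline
    return small.reshape(-1)                                     # 196-element vector

# Hypothetical usage on a single image:
img = np.random.rand(28, 28)
x = preprocess_mnist(img)
assert x.shape == (196,)
```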

The IMDb dataset is a collection of movie reviews from the IMDb movie review website. A word2vec process [64] is used to encode the most commonly used words in the reviews into integer vectors, making each review distinct from the others in a 200-dimensional space. This 200-dimensional input is followed by a 50-neuron hidden layer and an output layer.


4.2.1 Results

Dataset   Quantized Model Accuracy   Transferred Accuracy
MNIST     88.6%                      86.7%
IMDb      79.5%                      79.2%

Table 4.1: For each of the two datasets, the accuracy of the quantized model and the accuracy of the same quantized model after it has been optically transferred to the receiver.

From the above results, we conclude that the optically transferred model closely matches the original quantized model.

4.3 Discussion

4.3.1 Transmitter Source Energy Consumption

Necessary power generation for sources

In order to find the necessary power incident on the receiver, we first calculate the signal-to-noise ratio (SNR) of the receiver.

SNR = \frac{\text{signal power}}{\text{shot noise} + \text{thermal noise}} = \frac{R^2 P^2}{2q(I + 2I_{dark})\Delta f + 8kT\Delta f / R_L}

where \Delta f is the receiver bandwidth, I is the photodetector current, R is the photodiode responsivity, P is the incident optical power, and R_L is the load resistance. Note that we have two sources of dark current and thermal noise, but only a single shot-noise source at any time.

Since the detector bandwidth is set by the load resistance through \Delta f = 1/(2\pi R_L C_{load}), and the load capacitance is fixed, we can rewrite the thermal-noise term of this equation as:

SNR = \frac{\text{signal power}}{\text{shot noise} + \text{thermal noise}} = \frac{R^2 P^2}{2q(I + 2I_{dark})\Delta f + 16\pi kT C_{load}\Delta f^2}

For practical systems we wish to operate with an SNR ≈ 100 to minimize the bit error rate. We take the responsivity of a germanium PIN photodiode to be 0.9 A/W at 1550 nm and the load capacitance to be 10 fF.
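As an illustration of how curves like those in Figures 4-3 through 4-5 can be generated, the sketch below evaluates the SNR expression above together with one common on-off-keying BER approximation. The dark-current value and the mapping from energy per bit to average power are assumptions, not parameters taken from the experiment.

```python
import numpy as np
from scipy.special import erfc

q, kB, T = 1.602e-19, 1.381e-23, 300.0   # electron charge, Boltzmann constant, temperature
R = 0.9            # Ge PIN responsivity at 1550 nm [A/W]
C = 10e-15         # load capacitance [F]
I_dark = 1e-9      # assumed dark current [A] (not specified in the text)

def snr(P_opt, df):
    """Shot- plus thermal-noise-limited SNR at received optical power P_opt [W]
    and receiver bandwidth df [Hz], with R_L set by the RC bandwidth limit."""
    I_ph = R * P_opt
    R_L = 1.0 / (2 * np.pi * C * df)            # load resistance consistent with C and df
    shot = 2 * q * (I_ph + 2 * I_dark) * df
    thermal = 8 * kB * T * df / R_L             # two thermal-noise sources (see text)
    return (R * P_opt) ** 2 / (shot + thermal)

def ber(snr_val):
    # One common on-off-keying approximation: BER = 0.5*erfc(Q/sqrt(2)), Q = sqrt(SNR)/2.
    return 0.5 * erfc(np.sqrt(snr_val) / (2 * np.sqrt(2)))

for df in (1e6, 100e6, 10e9):                   # the bandwidths of Figures 4-3 to 4-5
    P = 0.39e-15 * df                           # assumed: energy per bit times bit rate
    s = snr(P, df)
    print(f"df = {df:.0e} Hz: SNR = {s:.1f}, BER ~ {ber(s):.1e}")
```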

Figure 4-3: A plot of SNR and BER for a digital ONN with 1MHz bandwidth per transmitter

Figure 4-4: A plot of SNR and BER for a digital ONN with 100MHz bandwidth per transmitter

Given the SNR plot of the receiver, we want to consider what energy we should operate at. This is dependent upon the robustness of neural networks to bit-error rates.


Figure 4-5: A plot of SNR and BER for a digital ONN with 10GHz bandwidth per transmitter

Prior work has studied the impact of bit-error rates from storage media on running neural networks [65]. This research finds very few examples where low bit error rates (< 10^{-5}) cause any loss of accuracy in neural network models.

So, based on all of the above information, we decide to operate at 0.39 fJ/bit = 3.12 fJ/MAC. However, this brings up an inherent problem. The number of photons incident on a photodetector is N_{ph} = E_A/(h\nu), where E_A is the optical energy incident on the photodetector. This gives an accumulated charge of Q = E_A e/(h\nu), which produces a voltage swing of \Delta V = E_A e/(h\nu C_{load}). If we have an incident optical energy of 1 fJ and operate at 1550 nm (h\nu = 0.8 eV) with a load capacitance of 10 fF, then we find that this can only generate 0.125 V. So, in order to generate a voltage which can drive a FET, we must have either a larger incident energy or, more likely, a smaller capacitance.
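The numbers above follow directly from the photon energy at 1550 nm; a short check (assuming unit quantum efficiency):

```python
h, c, q = 6.626e-34, 2.998e8, 1.602e-19   # Planck constant, speed of light, electron charge

E_A = 1e-15        # incident optical energy [J] (1 fJ)
lam = 1550e-9      # wavelength [m]
C_load = 10e-15    # load capacitance [F]

E_photon = h * c / lam          # ~1.28e-19 J, i.e. ~0.8 eV per photon
N_ph = E_A / E_photon           # ~7800 photons
Q = N_ph * q                    # ~1.25e-15 C of accumulated charge
dV = Q / C_load                 # ~0.125 V voltage swing, as stated above
print(N_ph, Q, dV)
```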

4.3.2 Receiver Energy Consumption

We wish to know the energy consumption of the logic and processing circuitry in modern CMOS processes. It is known that an 8-bit multiplication costs 0.2 pJ and a 32-bit addition costs 0.1 pJ in 45 nm CMOS [66]. In addition, energy scaling in CMOS is linear with the size of the process. From this we conclude that the energy consumption in 10 nm CMOS is approximately (0.2 pJ + 0.1 pJ) × 10/45 ≈ 67 fJ/MAC.

4.3.3 Discretization Scheme

In order to use digital circuitry to perform the computation locally we must first discretize the floating point values. Our discretization scheme is the following:

A_{8bit} = \left\lceil \frac{127}{\max(|A|)} \cdot A \right\rceil

This maps all values into the range [-127, 127]. Note that we purposefully exclude -128 because it can lead to an overflow in 8-bit multiplication. This naive quantization scheme has several benefits. For large matrices it is quick to compute, taking time linearly proportional to the number of elements of the matrix. It is also quickly invertible, with the original floating-point value approximately recoverable as A_{float} = \max(|A|)/127 \cdot A_{8bit}.
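A minimal NumPy sketch of this scheme, following the ceiling written in the formula above (the exact rounding choice of the original code is not specified beyond that formula):

```python
import numpy as np

def quantize_8bit(A):
    """Map a floating-point matrix onto signed 8-bit integers in [-127, 127];
    -128 is excluded to avoid overflow in 8-bit multiplication."""
    scale = 127.0 / np.max(np.abs(A))
    return np.ceil(A * scale).astype(np.int8), scale

def dequantize(A_8bit, scale):
    """Approximately recover the original floating-point values."""
    return A_8bit.astype(np.float64) / scale

# Usage: quantize a weight matrix and check the worst-case round-trip error.
W = np.random.randn(100, 196)
W_q, s = quantize_8bit(W)
print(np.max(np.abs(W - dequantize(W_q, s))))   # bounded by one quantization step
```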

4.3.4 Energy Efficiency

A key problem in the field of computer science is the design of systems and algorithms which can process big data at high speed with low power. Memory bandwidth limits the processing speed of many of these systems and is often the constraint that performance engineers optimize around.

Our digital ONN architecture takes advantage of the low energy cost of fan-out and the high parallelizability of optics in order to decrease the energy consumption from memory access in neural networks. Once values have been retrieved from memory, they are fanned out to all locations on the receiver where a partial product will be accumulated. In our implementation only a single receiver unit is used, but in practice many receiver units can be used. This provides the benefit of having the network weights fanned out to many inputs at once, thus increasing the parallelization of the system.
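The benefit of fanning the same weights out to many receiver units can be seen in a simple batched product: the weights are read from memory once and reused across every input, which is the reuse that optical fan-out provides directly. Shapes here are illustrative only.

```python
import numpy as np

W = np.random.randn(100, 196)    # weights: fetched from memory a single time
X = np.random.rand(196, 64)      # 64 different inputs held at 64 receiver units
Y = W @ X                        # every column reuses the same broadcast weights
```

Optical fan-out performs this broadcast of W to all receiver units in free space, removing the repeated memory accesses that the same reuse would otherwise require electronically.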
