HAL Id: hal-01403124
https://hal.inria.fr/hal-01403124
Submitted on 25 Nov 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Distributed under a Creative Commons Attribution 4.0 International License
Yuan Gao, Rui Wang, Ning An, Yanjiang Wei, Depei Qian
To cite this version:
Yuan Gao, Rui Wang, Ning An, Yanjiang Wei, Depei Qian. Speedup Critical Stage of Machine Learning with Batch Scheduling in GPU. 11th IFIP International Conference on Network and Parallel Computing (NPC), Sep 2014, Ilan, Taiwan. pp. 522-525, 10.1007/978-3-662-44917-2_43. hal-01403124
Speedup Critical Stage of Machine Learning with Batch Scheduling in GPU
Yuan Gao, Rui Wang, Ning An, Yanjiang Wei, and Depei Qian
BeiHang University, XueYuan Road No. 37, HaiDian District, Beijing, China
rui.wang@jsi.buaa.edu.cn
Abstract. As a superior data analysis method, Machine Learning has suffered from the bottleneck of limited computing capability for many years. With the advent of numerous parallel computing hardware platforms, the modern GPU is becoming a promising carrier for Machine Learning tasks. In this paper, we propose an efficient GPU execution framework to speed up the forward propagation process of a convolutional neural network. By extending the convolution unrolling method to fit this batch mode, we obtain a significant increase in throughput with very little overhead.
Keywords: convolutional neural network, framework, GPU, batch process
1 Introduction
With the fast development of computer hardware, the GPU has become an important computation carrier, offering dramatic speedup and better energy efficiency.
More and more computing-intensive applications with natural data parallelism have been migrated to the GPU environment. However, although NVIDIA CUDA helps programmers develop faster applications, programming on a GPU is much more difficult than on CPUs. Programmers need to determine when to copy data from the host to the GPU, how to divide the work into blocks and threads so as to take full advantage of the computing resources, and what size of data slice to allocate to each thread. It is quite difficult for a beginner to write efficient GPU programs for arbitrary purposes. Research on high-productivity methods, both in coding and in execution, for special domains is therefore promising.
2 CNN GPU Execution Framework
In this section, we take the image recognition application as an example to illustrate the operating mechanism of our framework.
2.1 Batch Process
Improving GPU computing resource utilization is a good way to speed up forward propagation, since the computing hardware resources of a GPU are quite rich, and forward propagation cannot take full advantage of the GPU's ability if it handles only one small image at a time. What we can do, however, is increase the throughput of a single task, which means the CNN processes several images in one forward propagation. Figure 1 presents the proposed framework and shows its two most important modules: the input organization module and the convolution module. The input organization module automatically splices tiny input images into a larger one before the CNN executes on the GPU.
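The splicing step of the input organization module can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `splice` and the side-by-side concatenation layout are assumptions, since the paper does not specify the splicing geometry.

```python
import numpy as np

def splice(images):
    """Splice several equal-height images into one larger input.

    Concatenates the images side by side and records the column
    boundaries, so that a later layer can split them apart again.
    """
    spliced = np.hstack(images)
    bounds = list(np.cumsum([img.shape[1] for img in images])[:-1])
    return spliced, bounds
```

Recording the boundaries keeps the batch reversible: `np.split(spliced, bounds, axis=1)` recovers the original images, which is what the convolution layer's split operation relies on.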
Fig. 1. Proposed GPU execution framework schematic. (Pictures are spliced on the host and passed through the convolution, sub-sampling, and fully connected layers on the GPU.)
The convolution layer requires the largest amount of calculation and is therefore the optimization emphasis of our framework; the traditional method for speeding up the convolution layer is to unroll the convolution.
In our framework, we extend the simple convolution-unrolling technique to a batch mode, which enables the convolution layer to process multiple images (input features) at the same time.
When the spliced image (input features) arrives at the convolution layer, coming either from the host or from the previous layer, a split operation is needed first.
Each layer of the neural network has a fixed input size; the processing layer's input size is obviously 3×3 in Figure 2. According to this input size, the framework splits the spliced images (input features) into the standard size before unrolling the convolution. The convolution operation then works in the usual way, except that the unrolled input features from different images are appended to the bottom of the input feature matrix. As shown in Figure 2, the spliced image (input features) is divided into nine 3×3 input features and unrolled into a 12×12 input feature matrix, while the kernel matrix is exactly the same as in traditional unrolling convolution. After multiplying the input feature matrix by the kernel matrix, we get a 12×2 output feature matrix in which the boundaries between the different images can still be identified. Instead of convolving the three images one after another, the batch mode unrolls the three images into one matrix and performs only one multiplication to obtain the output features.
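The batch-mode unrolling described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: each fixed-size input feature is unrolled into patch rows, the rows of all images are stacked into one input feature matrix, and a single matrix product yields all output features at once. With three 3×3 images and two 2×2 kernels, the result is a 12×2 output feature matrix, matching the output dimensions described for Figure 2.

```python
import numpy as np

def im2col(img, k):
    """Unroll every k-by-k patch of one image into a row of a matrix."""
    h, w = img.shape
    return np.stack([img[i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

def batch_unrolled_conv(images, kernels):
    """Batch-mode unrolling convolution: one matrix product per batch.

    images:  list of equal-size 2-D arrays (the split input features)
    kernels: array of shape (n_kernels, k, k)
    """
    k = kernels.shape[-1]
    # Append each image's unrolled rows to the bottom of the matrix.
    inp = np.vstack([im2col(img, k) for img in images])
    ker = kernels.reshape(len(kernels), -1).T  # kernel matrix, (k*k, n_kernels)
    return inp @ ker                           # one multiplication for the batch
```

Because the rows are stacked image by image, the boundaries remain recoverable: with 2×2 kernels on 3×3 inputs, rows 0-3 of the result belong to the first image, rows 4-7 to the second, and so on.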
Fig. 2. Example of batch-mode unrolling convolution: the input features of three images (Img1-Img3) are unrolled and stacked into one input feature matrix, which is multiplied by the kernel matrix to produce the output feature matrix; the intuitive per-image convolution is shown for comparison.
3 Evaluation
Image dimension:     16×16   32×32   64×64   128×128  256×256  384×384  512×512
Sequence time (ms):  0.211   0.232   0.321   0.842    2.786    6.1      10.61
Batch time (ms):     0.0224  0.0493  0.161   0.665    2.584    5.79     10.35

Fig. 3. Execution time for image convolution operations with varying image dimensions on the GPU; the splicing number is 10.
Figure 3 compares the execution time of the two methods. In this experiment, the filter size is 5 and the number of output features is fixed at 10, while the image dimension and the splicing number are taken as variables.
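The speedup of the batch mode over sequential execution follows directly from the timings in Figure 3; the short calculation below just divides the two rows of numbers read off the figure:

```python
# Timings (ms) from Fig. 3, per image dimension 16x16 ... 512x512.
seq   = [0.211, 0.232, 0.321, 0.842, 2.786, 6.1, 10.61]
batch = [0.0224, 0.0493, 0.161, 0.665, 2.584, 5.79, 10.35]

speedups = [s / b for s, b in zip(seq, batch)]
# The advantage is largest for small images (about 9.4x at 16x16) and
# shrinks as the image grows (about 1.03x at 512x512), since a large
# image already keeps the GPU busy on its own.
```

This is consistent with the motivation in Section 2.1: batching pays off precisely when a single small image cannot saturate the GPU.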
4 Conclusion
In this paper, we propose an efficient neural network framework for GPU platforms. Our framework addresses the gap in abstraction between domain experts and current GPU programming frameworks, and accelerates the process of CNN forward propagation. The framework organizes the input data automatically to make use of GPU resources as fully as possible for better performance. We demonstrate the advantage of our framework in each CNN phase with five experiments. By exploiting the power of the GPU, the framework can greatly increase the throughput of a CNN. The optimization method we adopt in the framework does not modify the structure of the CNN or require training extra neurons, so it can be combined with other optimization methods to achieve better performance.
Acknowledgment
This research is supported by the 863 Program of China under grant 2012AA010902, by the NSFC under grants 61133004, 61073011, and 61202425, and by Huawei under grant No. YB2012120105.