HAL Id: hal-01403124
https://hal.inria.fr/hal-01403124
Submitted on 25 Nov 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Distributed under a Creative Commons Attribution 4.0 International License
Yuan Gao, Rui Wang, Ning An, Yanjiang Wei, Depei Qian
To cite this version:
Yuan Gao, Rui Wang, Ning An, Yanjiang Wei, Depei Qian. Speedup Critical Stage of Machine Learning with Batch Scheduling in GPU. 11th IFIP International Conference on Network and Parallel Computing (NPC), Sep 2014, Ilan, Taiwan. pp. 522-525, 10.1007/978-3-662-44917-2_43. hal-01403124
Speedup Critical Stage of Machine Learning with Batch Scheduling in GPU
Yuan Gao, Rui Wang, Ning An, Yanjiang Wei, and Depei Qian
BeiHang University, XueYuan Road No. 37, HaiDian District, Beijing, China
rui.wang@jsi.buaa.edu.cn
Abstract. As a superior data analysis method, Machine Learning has suffered from the bottleneck of limited computing capability for many years. With the advent of numerous parallel computing hardware platforms, the modern GPU is becoming a promising carrier for Machine Learning tasks. In this paper, we propose an efficient GPU execution framework to speed up the forward propagation process of a convolutional neural network. By extending the convolution unrolling method to fit this batch mode, we obtain a significant increase in throughput with very little overhead.
Keywords: convolutional neural network, framework, GPU, batch process
1 Introduction
With the fast development of computer hardware, the GPU has become an important computation carrier, offering dramatic speedup and better energy efficiency.
More and more computing-intensive applications with natural data parallelism have been migrated to the GPU environment. However, although NVIDIA CUDA helps programmers develop faster applications, programming on a GPU is much more difficult than on CPUs. Programmers need to determine when to copy data from the host to the GPU, how to divide the work into blocks and threads so as to take full advantage of the computing resources, and what size of data slice to allocate to each thread. It is quite difficult for a beginner to write efficient GPU programs for arbitrary purposes. Research on high-productivity methods, both in coding and in execution, for special domains is therefore promising.
2 CNN GPU Execution Framework
In this section, we take the image recognition application as an example to illustrate the operating mechanism of our framework.
2.1 Batch Process
Improving GPU computing resource utilization is a good way to speed up forward propagation, since the computing hardware resources of a GPU are quite rich, and forward propagation cannot take full advantage of the GPU's ability if it handles only one small image at a time. What we can do, however, is increase the throughput of a single task, which means the CNN processes several images in one forward propagation. Figure 1 presents the proposed framework and shows its two most important modules: the input organization module and the convolution module. The input organization module automatically splices tiny input images into a larger one before the CNN executes on the GPU.
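The splicing step of the input organization module can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `splice` and the side-by-side concatenation layout are assumptions, since the paper does not specify the splicing geometry.

```python
import numpy as np

def splice(images):
    """Splice several equal-height images into one larger input.

    Concatenates the images side by side and records the column
    boundaries, so that a later layer can split them apart again.
    """
    spliced = np.hstack(images)
    bounds = list(np.cumsum([img.shape[1] for img in images])[:-1])
    return spliced, bounds
```

Recording the boundaries keeps the batch reversible: `np.split(spliced, bounds, axis=1)` recovers the original images, which is what the convolution layer's split operation relies on.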
Fig. 1. Proposed GPU execution framework schematic. (Pictures are spliced on the host and passed through the convolution, sub-sampling, and fully connected layers on the GPU.)
The convolution layer requires the largest amount of calculation and is therefore the optimization emphasis of our framework; the traditional method for speeding up the convolution layer is to unroll the convolution.
In our framework, we extend the simple convolution-unrolling technique to a batch mode, which enables the convolution layer to process multiple images (input features) at the same time.
When the spliced image (input features) arrives at the convolution layer, coming either from the host or from the previous layer, a split operation is needed first.
Each layer of the neural network has a fixed input size; the processing layer's input size is obviously 3×3 in Figure 2. According to this input size, the framework splits the spliced images (input features) into the standard size before unrolling the convolution. The convolution operation then works in the usual way, except that the unrolled input features from different images are appended to the bottom of the input feature matrix. As shown in Figure 2, the spliced image (input features) is divided into nine 3×3 input features and unrolled into a 12×12 input feature matrix, while the kernel matrix is exactly the same as in traditional unrolling convolution. After multiplying the input feature matrix by the kernel matrix, we get a 12×2 output feature matrix in which the boundaries between the different images can still be identified. Instead of convolving the three images one after another, the batch mode unrolls the three images into one matrix and performs only one multiplication to obtain the output features.
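The batch-mode unrolling described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: each fixed-size input feature is unrolled into patch rows, the rows of all images are stacked into one input feature matrix, and a single matrix product yields all output features at once. With three 3×3 images and two 2×2 kernels, the result is a 12×2 output feature matrix, matching the output dimensions described for Figure 2.

```python
import numpy as np

def im2col(img, k):
    """Unroll every k-by-k patch of one image into a row of a matrix."""
    h, w = img.shape
    return np.stack([img[i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

def batch_unrolled_conv(images, kernels):
    """Batch-mode unrolling convolution: one matrix product per batch.

    images:  list of equal-size 2-D arrays (the split input features)
    kernels: array of shape (n_kernels, k, k)
    """
    k = kernels.shape[-1]
    # Append each image's unrolled rows to the bottom of the matrix.
    inp = np.vstack([im2col(img, k) for img in images])
    ker = kernels.reshape(len(kernels), -1).T  # kernel matrix, (k*k, n_kernels)
    return inp @ ker                           # one multiplication for the batch
```

Because the rows are stacked image by image, the boundaries remain recoverable: with 2×2 kernels on 3×3 inputs, rows 0-3 of the result belong to the first image, rows 4-7 to the second, and so on.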
Fig. 2. Example of batch-mode unrolling convolution: the input features of three images (Img1-Img3) are unrolled and stacked into one input feature matrix, which is multiplied by the kernel matrix to produce the output feature matrix; the intuitive per-image convolution is shown for comparison.
3 Evaluation
Image dimension:     16×16   32×32   64×64   128×128  256×256  384×384  512×512
Sequence time (ms):  0.211   0.232   0.321   0.842    2.786    6.1      10.61
Batch time (ms):     0.0224  0.0493  0.161   0.665    2.584    5.79     10.35

Fig. 3. Execution time for image convolution operations with varying image dimensions on the GPU; the splicing number is 10.
Figure 3 compares the execution time of the two methods. In this experiment, the filter size is 5 and the number of output features is fixed at 10, while the image dimension and the splicing number are taken as variables.
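The speedup of the batch mode over sequential execution follows directly from the timings in Figure 3; the short calculation below just divides the two rows of numbers read off the figure:

```python
# Timings (ms) from Fig. 3, per image dimension 16x16 ... 512x512.
seq   = [0.211, 0.232, 0.321, 0.842, 2.786, 6.1, 10.61]
batch = [0.0224, 0.0493, 0.161, 0.665, 2.584, 5.79, 10.35]

speedups = [s / b for s, b in zip(seq, batch)]
# The advantage is largest for small images (about 9.4x at 16x16) and
# shrinks as the image grows (about 1.03x at 512x512), since a large
# image already keeps the GPU busy on its own.
```

This is consistent with the motivation in Section 2.1: batching pays off precisely when a single small image cannot saturate the GPU.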
4 Conclusion
In this paper, we propose an efficient neural network framework for GPU platforms. Our framework addresses the gap in abstraction between domain experts and current GPU programming frameworks, and accelerates the process of CNN forward propagation. The framework organizes the input data automatically to make use of GPU resources as fully as possible for better performance. We demonstrate the advantage of our framework in each CNN phase with five experiments. By exploiting the power of the GPU, the framework can greatly increase the throughput of a CNN. The optimization method we adopt in the framework does not modify the structure of the CNN or require training extra neurons, so it can be combined with other optimization methods to achieve better performance.
Acknowledgment
This research is supported by the 863 Program of China under grant 2012AA010902, by the NSFC under grants 61133004, 61073011, and 61202425, and by Huawei under grant No. YB2012120105.