
(1)

A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems

Yong Zhuang

Department of Computer Science National Taiwan University

Joint work with Wei-Sheng Chin, Yu-Chin Juan, and Chih-Jen Lin

(2)

Outline

1 Introduction
  Motivation
  Matrix Factorization
  Parallel Matrix Factorization

2 Existing Problems in Parallel Matrix Factorization
  Locking Problem
  Memory Discontinuity

3 Our approach
  Lock-Free Scheduling
  Partial Random Method

4 Experiments

5 Conclusion

Yong Zhuang (National Taiwan Univ.) Oct 16, 2013 2 / 41



(5)

Motivation

Matrix factorization is an effective method for recommender systems, e.g., the Netflix Prize and KDD Cup 2011.

But training is slow: when we trained on the KDD Cup 2011 data for a homework assignment, it took too long to run experiments.

(6)

Motivation (Cont’d)

Distributed MF is possible, but complicated.

Yahoo!Music: 1 million users, 600 thousand items, and 252 million ratings; this fits in the memory of a single machine. We did not see many studies on parallel MF on multi-core systems.

Our goal is to develop a parallel MF package for shared-memory systems, so that other students will not run into the same trouble we did.


(8)

Matrix Factorization

In recommender systems, a group of users gives ratings to some items:

User  Item  Rating
 1      5    100
 1     10     80
 1     13     30
...    ...    ...
 u      v     r

i.e., user u gives rating r to item v.


(9)

Matrix Factorization (Cont’d)

(Figure: the m × n rating matrix R; entry r_{u,v} sits in row u and column v, and unknown entries such as ?_{2,2} are to be predicted.)

m, n: numbers of users and items
u, v: indices of the u-th user and the v-th item
r_{u,v}: the rating the u-th user gives to the v-th item

(10)

Matrix Factorization (Cont’d)

(Figure: R (m × n) ≈ P^T (m × k) × Q (k × n), where P^T has rows p_1^T, ..., p_m^T and Q has columns q_1, ..., q_n.)

k: number of latent dimensions
r_{u,v} = p_u^T q_v
e.g., the unknown rating ?_{2,2} is predicted as p_2^T q_2

(11)

Matrix Factorization (Cont’d)

Objective function:

min_{P,Q} Σ_{(u,v)∈R} (r_{u,v} − p_u^T q_v)^2 + λ_P ||p_u||_F^2 + λ_Q ||q_v||_F^2

SGD loops over all ratings in the training set.

Prediction error: e_{u,v} ≡ r_{u,v} − p_u^T q_v

SGD update rule:

p_u ← p_u + γ (e_{u,v} q_v − λ_P p_u)
q_v ← q_v + γ (e_{u,v} p_u − λ_Q q_v)
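The update rule above can be sketched in plain Python (a minimal single-threaded illustration under assumed hyperparameters, not the FPSGD implementation; all names here are made up):

```python
import random
import numpy as np

def sgd_mf(ratings, m, n, k=8, gamma=0.05, lam_p=0.05, lam_q=0.05,
           epochs=500, seed=0):
    """Plain (single-threaded) SGD for matrix factorization.

    ratings: list of (u, v, r) triples with 0-based user/item indices.
    Minimizes, over observed (u, v),
        (r_uv - p_u^T q_v)^2 + lam_p ||p_u||^2 + lam_q ||q_v||^2.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((m, k))   # row u is p_u
    Q = 0.1 * rng.standard_normal((n, k))   # row v is q_v
    shuffler = random.Random(seed)
    for _ in range(epochs):
        shuffler.shuffle(ratings)           # random update order
        for u, v, r in ratings:
            pu = P[u].copy()                # keep the old p_u for q_v's update
            e = r - pu @ Q[v]               # prediction error e_uv
            P[u] += gamma * (e * Q[v] - lam_p * pu)
            Q[v] += gamma * (e * pu - lam_q * Q[v])
    return P, Q
```

Note that q_v is updated with the old p_u, matching the simultaneous rule on the slide.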


(13)

Parallel Matrix Factorization

After r_{3,3} is selected, the ratings in the gray blocks cannot be updated concurrently because of race conditions:

r_{3,1} = p_3^T q_1, r_{3,2} = p_3^T q_2, ..., r_{3,6} = p_3^T q_6 all share p_3 with r_{3,3} = p_3^T q_3.

In contrast, r_{6,6} = p_6^T q_6 shares neither p_3 nor q_3, so it can be updated at the same time.

(14)

Parallel Matrix Factorization (Cont’d)

We can split the matrix into blocks, then use multiple threads to update blocks in parallel, where the ratings in different blocks do not share any p or q.

(Figure: a 6 × 6 grid of blocks; blocks in different rows and columns are independent.)
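The independence condition can be stated as a tiny check (an illustrative sketch; the function name is mine, not from the talk): two blocks may be updated concurrently iff they share no user (row of P) and no item (column of Q).

```python
def can_run_concurrently(block_a, block_b):
    """Blocks are lists of (u, v, r) ratings; they conflict iff some rating in
    one block touches the same p_u or q_v as a rating in the other."""
    users_a = {u for u, _, _ in block_a}
    items_a = {v for _, v, _ in block_a}
    return all(u not in users_a and v not in items_a for u, v, _ in block_b)

# Diagonal blocks of a 2 x 2 split share no users or items:
assert can_run_concurrently([(0, 0, 5.0)], [(1, 1, 3.0)])
# Two blocks in the same block-row both update p_0, so they conflict:
assert not can_run_concurrently([(0, 0, 5.0)], [(0, 1, 3.0)])
```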



(17)

Locking Problem

DSGD [Gemulla et al., 2011]

In a distributed system, communication cost is an issue. With T nodes, DSGD splits the matrix into T × T blocks to reduce the communication cost.

(18)

Locking Problem (Cont’d)

We apply it in a shared-memory system.

(Figure: the matrix split into a 3 × 3 grid of blocks.)

In shared memory, idle time becomes the issue. Suppose updating Block 1 takes 20s, Block 2 takes 10s, and Block 3 takes 20s, and we have 3 threads:

Thread   0s-10s   10s-20s
  1       Busy     Busy
  2       Busy     Idle
  3       Busy     Busy

10s are wasted!
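The wasted time follows directly from the synchronization: every thread waits for the slowest block in the round. A one-line sketch of that arithmetic (illustrative only; the function name is mine):

```python
def round_idle_time(block_times):
    """Total idle time in one synchronized round: the round ends only when the
    slowest block finishes, so each faster thread waits for the difference."""
    wall = max(block_times)                      # round's wall-clock length
    return sum(wall - t for t in block_times)    # waiting summed over threads

# The slide's example: three threads updating blocks that take 20s, 10s, 20s.
assert round_idle_time([20, 10, 20]) == 10   # thread 2 idles for 10s
```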


(20)

Memory Discontinuity

HogWild [Niu et al., 2011] assumes that the probability of a race condition is very low, and lets all threads update without locking.

But then R, P, and Q are all accessed discontinuously in memory.

(Figure: ratings selected in the order 1, 2, ..., 6 jump around R, so the corresponding rows of P^T and columns of Q are scattered.)


(22)

Our approach

We propose a fast parallel SGD for matrix factorization in shared memory systems (FPSGD).

It applies two strategies to speed up matrix factorization:

Lock-Free Scheduling
Partial Random Method


(24)

Lock-Free Scheduling

We split the matrix into enough blocks. E.g., for 2 threads, we split the matrix into 4 × 4 blocks.

(Figure: a 4 × 4 grid of blocks, each carrying an update counter initialized to 0.)

The counter records the number of times each block has been updated.

(25)

Lock-Free Scheduling (Cont’d)

First, T1 randomly selects a block.

(Figure: T1’s block is marked green, and blocks sharing its row or column are grayed out; all counters are still 0.)

(26)

Lock-Free Scheduling (Cont’d)

Then T2 randomly selects a block that is neither green (being updated) nor gray (sharing a row or column with a block being updated).

(Figure: T1 and T2 each hold a block; all counters are still 0.)

(27)

Lock-Free Scheduling (Cont’d)

After T1 finishes, the counter of the corresponding block is increased by one.

(Figure: T1’s finished block now has counter 1; all other counters are 0.)

(28)

Lock-Free Scheduling (Cont’d)

T1 can then select any of the available blocks to update. Rule: select the one that has been updated least.

(Figure: T2 still holds its block; among the free blocks, those with counter 0 are preferred over the block with counter 1.)
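The scheduling policy of the last few slides can be sketched as follows (a simplified illustration, not LIBMF’s actual code; the class and method names are mine). A small mutex guards only the scheduler’s bookkeeping, never the rating data, and the sketch assumes the grid has more rows and columns than there are running threads, so a free block always exists:

```python
import random
import threading

class BlockScheduler:
    """Grid of blocks with per-block update counters.

    get_block() hands out a block whose row and column are not in use,
    preferring the least-updated blocks and breaking ties at random.
    """
    def __init__(self, grid, seed=0):
        self.grid = grid
        self.counts = [[0] * grid for _ in range(grid)]
        self.busy_rows, self.busy_cols = set(), set()
        self.rng = random.Random(seed)
        self.mutex = threading.Lock()   # guards this bookkeeping only

    def get_block(self):
        with self.mutex:
            # Blocks whose row and column conflict with no running block.
            free = [(i, j) for i in range(self.grid) for j in range(self.grid)
                    if i not in self.busy_rows and j not in self.busy_cols]
            least = min(self.counts[i][j] for i, j in free)
            i, j = self.rng.choice(
                [(i, j) for i, j in free if self.counts[i][j] == least])
            self.busy_rows.add(i)
            self.busy_cols.add(j)
            return i, j

    def put_block(self, i, j):
        with self.mutex:
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)
            self.counts[i][j] += 1      # one more pass over this block done
```

A worker thread would loop: take a block with `get_block()`, run SGD over its ratings, then release it with `put_block()`; no thread ever sits waiting for another thread’s slow block.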

(29)

Lock-Free Scheduling (Cont’d)

FPSGD: with lock-free scheduling
FPSGD**: with DSGD-like scheduling

(Figure: RMSE versus time on (a) MovieLens 10M and (b) Yahoo!Music; FPSGD reaches the target RMSE faster than FPSGD**.)

Time to reach the target RMSE:
MovieLens 10M: 18.71s → 9.72s (RMSE 0.835)
Yahoo!Music: 728.23s → 462.55s (RMSE 21.985)


(31)

Partial Random Method

For SGD, there are two types of update order:

Update order   Advantages          Disadvantages
Random         Faster and stable   Memory discontinuity
Sequential     Memory continuity   Unstable

(Figure: a random order jumps around R; a sequential order scans R in storage order.)

Lock-free scheduling already provides randomness. How can we make it more cache-friendly?

(32)

Partial Random Method (Cont’d)

Within one block, access both R̂ (the block’s ratings) and P̂ (the corresponding factors) continuously in memory.

(Figure: inside a block, ratings 1, 2, ..., 6 are processed in storage order, so the corresponding factor vectors are accessed sequentially.)

Partial: FPSGD is sequential within each block.
Random: FPSGD is random when selecting blocks.
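The two-level order can be sketched as a generator (illustrative only; names are mine): random across blocks, sequential within a block.

```python
import random

def partial_random_order(blocks, seed=0):
    """Yield ratings block by block: the block order is random (keeping the
    convergence benefit of randomness), while ratings inside each block are
    visited in storage order (contiguous, cache-friendly memory access)."""
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)   # "random": choose blocks randomly
    for b in order:
        for rating in blocks[b]:         # "partial": sequential inside a block
            yield rating
```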

(33)

Partial Random Method (Cont’d)

(Figure: RMSE versus time on (a) MovieLens 10M and (b) Yahoo!Music, comparing the random method and the partial random method.)

The partial random method performs better than the random method.


(35)

Experiments

Some state-of-the-art methods:

CCD++ [Yu et al., 2012]: coordinate descent (best paper award at ICDM 2012)
DSGD [Gemulla et al., 2011]: stochastic gradient descent
HogWild [Niu et al., 2011]: stochastic gradient descent

(36)

Experiments (Cont’d)

We compare FPSGD with DSGD, HogWild, CCD++, and FPSGD*:

FPSGD: with fast SSE instructions
FPSGD*: without SSE instructions

(Figure: RMSE versus time on (a) MovieLens 10M and (b) Yahoo!Music for DSGD, HogWild, CCD++, FPSGD, and FPSGD*.)

(37)

Experiments (Cont’d)

Method    Category     Problem
DSGD      Stochastic   Locking problem
HogWild   Stochastic   Memory discontinuity

Stochastic methods may reach a better solution because of their randomness.

Deterministic methods (e.g., CCD++) avoid tuning a learning rate, which is their advantage over stochastic methods.


(39)

Conclusion

We point out some computational bottlenecks in existing parallel SGD methods.

We propose FPSGD to address these issues and confirm its effectiveness by experiments.

One SGD iteration on Yahoo!Music (252 million ratings): around 30s with 1 thread → around 4s with 8 threads.

(40)

Conclusion (Cont’d)

We developed the package LIBMF, available at http://www.csie.ntu.edu.tw/~cjlin/libmf

An updated paper and instructions for LIBMF are at http://www.csie.ntu.edu.tw/~cjlin/papers/libmf.pdf

Your comments on our work are very welcome.
