Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications

(1)

HAL Id: hal-00991709

https://hal.inria.fr/hal-00991709

Submitted on 16 May 2014



Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, Gilles Muller


(2)

Remote Core Locking

Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications

to appear at USENIX ATC’12

(4)

Problem: scalability

•  Many legacy applications don’t scale well on multicore architectures
•  For instance, Memcached (GET/SET requests):

[Figure: speedup (1 to 5, higher is better) vs. number of cores (1 to 22) for Memcached/GET and Memcached/SET. Annotation: "Collapses!"]

Experiments run on a 48-core AMD “Magny-Cours” x86 machine

(5)

Why?

•  Critical sections = bottleneck on multicore architectures
•  High contention ⇒ lock acquisition is costly
   –  More cores ⇒ more contention

[Figure: % of time spent in critical sections (0% to 100%) at 1, 4, 8, 16, 22, 32 and 48 cores for SPLASH-2/Radiosity, SPLASH-2/Raytrace, Phoenix2/LG, Phoenix2/SM, Phoenix2/MM, Memcached/GET, Memcached/SET, Berkeley DB/OS and Berkeley DB/SL.]

(7)

Solution: designing better locks

•  Better resistance to contention
•  No need to redesign the application
•  Custom microbenchmark to compare locks:

[Figure: execution time (cycles, lower is better) vs. delay (cycles), from higher contention (short delays) to lower contention (long delays), for a CAS spinlock and the MCS lock [Mellor-Crummey91]. Annotation: "Improvement!"]
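The two locks compared on this slide can be sketched in C11 as follows. This is a minimal illustration, not the talk's microbenchmark: the names `cas_lock`, `mcs_lock`, `bench` and `run_bench` are hypothetical, and the `sched_yield()` calls in the spin loops are an addition so the sketch also terminates quickly on machines with few cores. The point it tries to show is structural: every CAS-spinlock waiter hammers the same word, while each MCS waiter spins on a flag in its own queue node.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* CAS spinlock: every waiter retries an atomic instruction on one shared word. */
typedef struct { atomic_int locked; } cas_lock;

static void cas_acquire(cas_lock *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;   /* a failed CAS overwrote `expected`; reset it */
        sched_yield();  /* assumption: keeps the sketch usable on few cores */
    }
}

static void cas_release(cas_lock *l) { atomic_store(&l->locked, 0); }

/* MCS lock: waiters queue up and each one spins on a flag in its own node. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool waiting;
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->waiting, true);
    mcs_node *prev = atomic_exchange(&l->tail, me);  /* enqueue at the tail */
    if (prev != NULL) {
        atomic_store(&prev->next, me);               /* link behind predecessor */
        while (atomic_load(&me->waiting))
            sched_yield();                           /* wait on our own flag only */
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
            return;                                  /* queue empty: lock is free */
        while ((succ = atomic_load(&me->next)) == NULL)
            sched_yield();                           /* successor is still linking */
    }
    atomic_store(&succ->waiting, false);             /* hand the lock over */
}

/* Hypothetical harness: threads increment a plain counter under one lock. */
typedef struct {
    cas_lock cas;
    mcs_lock mcs;
    bool use_mcs;
    int iters;
    long counter;
} bench;

static void *worker(void *arg) {
    bench *b = arg;
    mcs_node node;  /* per-thread queue node, reused across acquisitions */
    for (int i = 0; i < b->iters; i++) {
        if (b->use_mcs) mcs_acquire(&b->mcs, &node); else cas_acquire(&b->cas);
        b->counter++;
        if (b->use_mcs) mcs_release(&b->mcs, &node); else cas_release(&b->cas);
    }
    return NULL;
}

long run_bench(int nthreads, int iters, bool use_mcs) {
    bench b = { .use_mcs = use_mcs, .iters = iters, .counter = 0 };
    atomic_init(&b.cas.locked, 0);
    atomic_init(&b.mcs.tail, NULL);
    pthread_t threads[nthreads];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, &b);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    return b.counter;
}
```

Calling `run_bench(4, 1000, true)` runs four threads through the MCS lock and returns the final counter; timing the call for different amounts of work between acquisitions would be one way to approximate the delay axis of the figure.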

(8)

Remote Core Locking

Objective: remove atomic instructions and reduce cache misses

•  Execute contended critical sections on a dedicated server core
•  Very fast transfer of control, no sync on global variable

Server continuously checks mailboxes and executes critical sections
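The mailbox idea described above can be sketched in C11 as follows. This is a simplified illustration under stated assumptions, not the RCL runtime: the names `mailbox`, `rcl_server`, `rcl_execute` and `rcl_demo` are hypothetical, the 64-byte cache-line size is assumed, and the `lock` field is only recorded because a single server thread already serializes every request. A real server would spin on its dedicated core; the `sched_yield()` calls are there so the sketch behaves on small machines.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLIENTS 8

typedef void (*cs_fn)(void *);

/* One mailbox per client, aligned to a 64-byte cache line so that two
 * clients never write to the same line (64 is an assumption). */
typedef struct {
    _Alignas(64) _Atomic(cs_fn) fn;  /* NULL means "no pending request" */
    void *lock;                      /* recorded only in this sketch */
    void *context;                   /* argument for the critical section */
} mailbox;

static mailbox boxes[MAX_CLIENTS];
static atomic_bool server_stop;

/* Server: scan the mailboxes and run whatever critical section is posted. */
static void *rcl_server(void *unused) {
    (void)unused;
    while (!atomic_load(&server_stop)) {
        for (int i = 0; i < MAX_CLIENTS; i++) {
            cs_fn fn = atomic_load_explicit(&boxes[i].fn, memory_order_acquire);
            if (fn != NULL) {
                fn(boxes[i].context);
                /* Resetting the field to NULL tells the client it is done. */
                atomic_store_explicit(&boxes[i].fn, NULL, memory_order_release);
            }
        }
        sched_yield();  /* a real server would spin on its dedicated core */
    }
    return NULL;
}

/* Client: post the request, then wait for the server to clear it. */
static void rcl_execute(int id, void *lock, cs_fn fn, void *context) {
    boxes[id].lock = lock;
    boxes[id].context = context;
    atomic_store_explicit(&boxes[id].fn, fn, memory_order_release);
    while (atomic_load_explicit(&boxes[id].fn, memory_order_acquire) != NULL)
        sched_yield();
}

/* Demo: clients ask the server to increment a plain, unprotected counter. */
typedef struct { int id; int iters; long *counter; } client_arg;

static void increment(void *ctx) { ++*(long *)ctx; }

static void *client(void *p) {
    client_arg *a = p;
    for (int i = 0; i < a->iters; i++)
        rcl_execute(a->id, NULL, increment, a->counter);
    return NULL;
}

long rcl_demo(int nclients, int iters) {
    long counter = 0;
    pthread_t server, threads[MAX_CLIENTS];
    client_arg args[MAX_CLIENTS];
    if (nclients > MAX_CLIENTS) nclients = MAX_CLIENTS;
    atomic_store(&server_stop, false);
    pthread_create(&server, NULL, rcl_server, NULL);
    for (int i = 0; i < nclients; i++) {
        args[i] = (client_arg){ i, iters, &counter };
        pthread_create(&threads[i], NULL, client, &args[i]);
    }
    for (int i = 0; i < nclients; i++)
        pthread_join(threads[i], NULL);
    atomic_store(&server_stop, true);
    pthread_join(server, NULL);
    return counter;
}
```

The counter is never touched by a client thread, so no atomic instruction or lock protects it: mutual exclusion comes from the fact that only the server executes `increment`, and the shared data stays in the server core's cache.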

(18)

Implementation: general idea

[Animation, slides 18 to 26: a request entry showing the values &foo, &lock4 and 0xa0dc5f3a. The build steps are annotated, in order: "cache miss"; "Server executes critical section"; an entry reset to NULL; then three further "cache miss" annotations.]

(27)

Performance

[Figure: execution time (cycles, lower is better) vs. delay (cycles), from higher contention (short delays) to lower contention (long delays). The first builds compare CAS spinlock, MCS and RCL; the final build adds POSIX and Flat Combining [Hendler10].]

(30)

Using RCL in legacy applications (1)

RCL Runtime:

•  Handles blocking in critical sections (I/O, page faults…)

–  Pool of servicing threads on server

–  Able to service other (independent) critical sections when blocked

•  Makes it possible to use condition variables (cond/wait)

(31)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)


(32)

Using RCL in legacy applications (2)

Reengineering:

Original code:

    void func(void) {
        int a, b, x;
        …
        a = …;
        …
        pthread_mutex_lock();
        a = f(a);
        f(b);
        pthread_mutex_unlock();
        …
    }

Reengineered code:

    struct context { int a, b; };

    void func(void) {
        struct context c;
        int x;
        …
        c.a = …;
        …
        execute_rcl(__cs, &c);
        …
    }

    void __cs(struct context *c) {
        …
    }
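A compilable version of this transformation might look like the sketch below. Several parts are assumptions made for illustration: `execute_rcl` here is a local stand-in that runs the function under a mutex instead of shipping it to a server core, `f` is a placeholder body, the critical-section function is named `cs_func` rather than `__cs` because identifiers starting with a double underscore are reserved in C, and its body is inferred from the two statements between the lock and unlock calls in the original code. `func` also takes and returns a value purely so the result can be observed.

```c
#include <pthread.h>

/* Variables shared between func() and its critical section. */
struct context { int a, b; };

/* Hypothetical stand-in for the RCL runtime: runs the critical section
 * locally under a mutex instead of sending it to a server core. */
static pthread_mutex_t stand_in_lock = PTHREAD_MUTEX_INITIALIZER;

static void execute_rcl(void (*cs)(struct context *), struct context *c) {
    pthread_mutex_lock(&stand_in_lock);
    cs(c);
    pthread_mutex_unlock(&stand_in_lock);
}

/* Placeholder for the slide's f(); the real function is not shown. */
static int f(int v) { return v + 1; }

/* The former critical section, now a function that reads and writes the
 * caller's locals through the context (assumed body, see lead-in). */
static void cs_func(struct context *c) {
    c->a = f(c->a);
    f(c->b);
}

int func(int start) {
    struct context c = { .a = start, .b = 0 };
    execute_rcl(cs_func, &c);
    return c.a;  /* the update made inside the critical section is visible */
}
```

Only `a` and `b` move into the context because they are the only locals the critical section touches; `x` from the original code would stay an ordinary local of `func`.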

Reengineering:

•  Tool to reengineer applications automatically
   –  Possible to pick which locks use RCL
   –  To avoid false serialization: possible to pick which server(s) handle which lock(s)

(37)

Profiling:

•  Custom profiler to find good candidates
•  Metric: time spent in critical sections
•  Running the profiler on the microbenchmark shows that:
   –  If time spent in CS > 20%, RCL is more efficient than POSIX locks
   –  If time spent in CS > 70%, RCL is more efficient than all other locks

[Figure: % of time in CS (0 to 100) vs. delay (cycles, 100 to 1e+06). Annotations: "Collapse of POSIX (120000 cycles): 20%" and "All other locks have collapsed (60000 cycles): 70%".]
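The two thresholds above can be turned into a small decision helper. This is only a sketch of the rule of thumb stated on the slide, not the profiler itself: the names `cs_time_percent`, `advise` and the `rcl_advice` values are hypothetical, and the cycle counts are assumed to come from whatever measurement the profiler provides.

```c
#include <stdint.h>

typedef enum {
    NO_CLEAR_GAIN,    /* at most 20% of time in critical sections */
    RCL_BEATS_POSIX,  /* more than 20%: RCL more efficient than POSIX locks */
    RCL_BEATS_ALL     /* more than 70%: RCL more efficient than all other locks */
} rcl_advice;

/* Share of execution time spent in critical sections, in percent. */
double cs_time_percent(uint64_t cycles_in_cs, uint64_t total_cycles) {
    if (total_cycles == 0)
        return 0.0;
    return 100.0 * (double)cycles_in_cs / (double)total_cycles;
}

/* Apply the 20% and 70% thresholds reported for the microbenchmark. */
rcl_advice advise(double percent_in_cs) {
    if (percent_in_cs > 70.0)
        return RCL_BEATS_ALL;
    if (percent_in_cs > 20.0)
        return RCL_BEATS_POSIX;
    return NO_CLEAR_GAIN;
}
```

For example, a lock whose critical sections account for 75 out of every 100 cycles would be classified as `RCL_BEATS_ALL`, which marks it as a strong candidate for conversion.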

(38)

Experiments

•  Benchmarks (highly contended ⇒ 70% time spent in CS):
   –  SPLASH-2 benchmark suite
      –  3 applications out of 10 are highly contended
   –  Phoenix2 benchmark suite
      –  3 applications out of 7 are highly contended
   –  Memcached
      –  Highly contended with the GET workload
   –  Berkeley DB / TPC-C
      –  Highly contended with 2 workloads (Order Status, Stock Level)

(39)

Best performance relative to best POSIX performance (y-axis 0 to 5); bar labels per benchmark and lock:

Benchmark           POSIX      CAS spinlock   Flat Combining   MCS        RCL
Memcached Set       x1.9:3     x2.6:7         -                x2.5:11    x5.0:22/4
Raytrace Balls4     x25.8:35   x24.4:31       x30.2:48         x34.1:48   x34.5:48/40
String Match        x12.8:35   x10.0:26       x15.7:41         x15.9:42   x15.4:35/23
Linear Regression   x4.7:22    x4.1:11        x6.6:21          x6.0:22    x9.5:22/7
Memcached Get       x4.8:10    x8.8:10        -                x10.3:16   x10.6:15/12
Radiosity           x10.2:15   x10.5:13       x16.1:36         x15.9:32   x23.6:46/18
Raytrace Car        x3.6:5     x4.5:6         x5.4:16          x4.8:11    x12.2:31/7
Matrix Multiply     x3.6:21    x3.5:8         x5.1:18          x5.0:29    x5.8:20/7

•  Better performance and scalability when time in CS > 70%
   –  Performance improvement correlated with time in CS
•  Only one or two locks replaced each time

(40)

(41)

(42)

•  Berkeley DB with TPC-C (100 clients)
•  Large gains, % in CS underestimated

Performance relative to base application (100 clients), y-axis 0 to 10:

Lock              Order Status   Stock Level
POSIX             x1.4           x1.4
CAS spinlock      x0.1           x0.2
Flat Combining    x1.0           x1.6
MCS-TP            x0.9           x1.4
RCL               x4.3/10        x7.7/10

(43)

•  Memcached, SET requests:

[Figure: speedup (1 to 5) vs. number of cores (1 to 22) for POSIX, CAS spinlock, MCS and RCL.]

(44)

RCL Scalability (2)

•  Berkeley DB / TPC-C, Stock Level requests:

[Figure: throughput (transactions/s, 1 to 1000, higher is better) vs. number of clients (20 to 200, with 48 marked; 1 client = 1 thread) for Base, POSIX, CAS spinlock, Flat Combining, MCS, MCS-TP and RCL. Two later builds highlight MCS and RCL, then MCS-TP, MCS and RCL, each annotated "Collapse!".]

(47)

Conclusion

•  RCL reduces lock acquisition time and improves data locality
•  Profiler to detect when RCL can be useful
•  Tool to ease the transformation of legacy code
•  Future work: adaptive RCL runtime
   –  Dynamically switch between locking strategies
   –  Load balancing between servers
