HAL Id: hal-00991709
https://hal.inria.fr/hal-00991709
Submitted on 16 May 2014
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded
Applications
Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, Gilles Muller
To cite this version:
Remote Core Locking
Migrating Critical-Section Execution to Improve the
Performance of Multithreaded Applications
to appear at USENIX ATC’12
1 1.5 2 2.5 3 3.5 4 4.5 5 1 5 10 15 20 22 Speedup Number of cores Memcached/GET Memcached/SET
Problem: scalability
• Many legacy applications don’t scale well on multicore architectures • For instance, Memcached (GET/SET requests):
2
Experiments run on a 48-core, “magny-cours” x86 AMD machine
Higher
is
better
1 1.5 2 2.5 3 3.5 4 4.5 5 1 5 10 15 20 22 Speedup Number of cores Memcached/GET Memcached/SET
Problem: scalability
• Many legacy applications don’t scale well on multicore architectures • For instance, Memcached (GET/SET requests):
2
Collapses !
Experiments run on a 48-core, “magny-cours” x86 AMD machine
Higher
is
better
Why?
• Critical sections = bottleneck on multicore architectures • High contention ⇒ lock acquisition is costly
– More cores ⇒ more contention
0%" 20%" 40%" 60%" 80%" 100%" 1" 4" 8" 16" 22" 32" 48" SPLASH-2/Radiosity" SPLASH-2/Raytrace" Phoenix2/LG" Phoenix2/SM" Phoenix2/MM" Memcached/GET" Memcached/SET" Berkeley DB/OS" Berkeley DB/SL" Number of cores" % o f ti me sp e n t in cri ti ca l se ct io n * 3
100 1000 10000 100000 1e+06
1000 10000 100000 1e+06
Execution time (cycles)
Delay (cycles)
• Better resistance to contention
• No need to redesign the application
• Custom microbenchmark to compare locks:
Solution: designing better locks
4
Lower
is
better
"
Higher contention" Lower contention " MCS "
[Mellor-Crummey91] "
CAS spinlock "
100 1000 10000 100000 1e+06
1000 10000 100000 1e+06
Execution time (cycles)
Delay (cycles)
• Better resistance to contention
• No need to redesign the application
• Custom microbenchmark to compare locks:
Improvement!
Solution: designing better locks
4
Lower
is
better
"
Higher contention" Lower contention " MCS "
[Mellor-Crummey91] "
CAS spinlock "
Objective: remove atomic instructions and reduce cache misses
• Execute contended critical sections on a dedicated server core • Very fast transfer of control, no sync on global variable
– Faster than lock acquisitions when contention is high • Shared data remains on server core ⇒ fewer cache misses
Remote Core Locking
Objective: remove atomic instructions and reduce cache misses
• Execute contended critical sections on a dedicated server core • Very fast transfer of control, no sync on global variable
– Faster than lock acquisitions when contention is high • Shared data remains on server core ⇒ fewer cache misses
Remote Core Locking
Objective: remove atomic instructions and reduce cache misses
• Execute contended critical sections on a dedicated server core • Very fast transfer of control, no sync on global variable
– Faster than lock acquisitions when contention is high • Shared data remains on server core ⇒ fewer cache misses
Remote Core Locking
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
Implementation: general idea
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
Implementation: general idea
6
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&lock4!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
cache miss
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
Server executes critical section"
cache miss
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
NULL!
Server executes critical section"
cache miss
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
NULL! cache miss cache miss cache miss
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
• Implementation based on cache line-sized mailboxes • Three fields: lock, context, function
• Client fills the field and waits for the function to be reset • Server loops across the fields
&foo!
&lock4! 0xa0dc5f3a!
Implementation: general idea
6
Client thread 2 wants to execute a critical section protected by “lock4”!
Server continuously checks mailboxes and executes critical sections!
NULL! cache miss cache miss cache miss
100 1000 10000 100000 1e+06
1000 10000 100000 1e+06
Execution time (cycles)
Delay (cycles)
Performance
7 CAS spinlock " MCS " RCL " Higher contention" Lower contention "
Lower
is
better
100 1000 10000 100000 1e+06
1000 10000 100000 1e+06
Execution time (cycles)
Delay (cycles)
Performance
7 CAS spinlock " MCS " RCL " Higher contention" Lower contention "
Lower
is
better
"
100 1000 10000 100000 1e+06
1000 10000 100000 1e+06
Execution time (cycles)
Delay (cycles) CAS spinlock " MCS " RCL " POSIX " Flat Combining " [Hendler10] " "
Higher contention" Lower contention "
Using RCL in legacy applications (1)
8
RCL Runtime :
• Handles blocking in critical sections (I/O, page faults…)
– Pool of servicing threads on server
– Able to service other (independent) critical sections when blocked
• Makes it possible to use condition variables (cond/wait)
Reengineering:
• Critical sections must be encapsulated into functions
– Local variables sent as parameters (context)
9
Reengineering:
• Critical sections must be encapsulated into functions
– Local variables sent as parameters (context)
9
Using RCL in legacy applications (2)
void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!
struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !
void __cs(struct context *c) {!
!c->a = f(c->a)! !f(c->b)!
Reengineering:
• Critical sections must be encapsulated into functions
– Local variables sent as parameters (context)
9
Using RCL in legacy applications (2)
void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!
struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !
void __cs(struct context *c) {!
!c->a = f(c->a)! !f(c->b)!
Reengineering:
• Critical sections must be encapsulated into functions
– Local variables sent as parameters (context)
9
Using RCL in legacy applications (2)
void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!
struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !
void __cs(struct context *c) {!
!c->a = f(c->a)! !f(c->b)!
Reengineering:
• Critical sections must be encapsulated into functions
– Local variables sent as parameters (context)
• Tool to reengineer applications automatically
– Possible to pick which locks use RCL – To avoid false serialization:
Possible to pick which server(s) handle which lock(s).
9
Using RCL in legacy applications (2)
void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!
struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !
void __cs(struct context *c) {!
!c->a = f(c->a)! !f(c->b)!
Reengineering:
• Critical sections must be encapsulated into functions
– Local variables sent as parameters (context)
• Tool to reengineer applications automatically
– Possible to pick which locks use RCL – To avoid false serialization:
Possible to pick which server(s) handle which lock(s).
10
0 20 40 60 80 100 100 1000 10000 100000 1e+06 % of time in CS Delay (cycles)
All other locks have collapsed (60000 cycles): 70%
Collapse of POSIX (120000 cycles): 20%
% of time in CS
Profiling:
• Custom profiler to find good candidates • Metric: time spent in critical sections
• Running the profiler on the microbenchmark shows that:
– If time spent in CS > 20%, RCL is more efficient than POSIX locks – If time spent in CS > 70%, RCL is more efficient than all other locks
11
Experiments
• Benchmarks (highly contended ⇒ 70% time spent in CS):
– SPLASH-2 benchmark suite
– 3 applications out of 10 are highly contended – Phoenix2 benchmark suite
– 3 applications out of 7 are highly contended – Memcached
– Highly contended with the GET workload – Berkeley DB / TPC-C
– Highly contended with 2 workloads (Order Status, Stock Level)
0 1 2 3 4 5 Memcached Set Raytrace Balls4 String Match Linear Regression Memcached Get Radiosity Raytrace Car Matrix Multiply 0 1 2 3 4 5
Best performance relative to best POSIX performance
POSIX x 1.9:3 x25.8:35 x12.8:35 x 4.7:22 x 4.8:10 x10.2:15 x 3.6:5 x 3.6:21 CAS spinlock x 2.6:7 x24.4:31 x10.0:26 x 4.1:11 x 8.8:10 x10.5:13 x 4.5:6 x 3.5:8 Flat Combining x30.2:48 x15.7:41 x 6.6:21 x16.1:36 x 5.4:16 x 5.1:18 MCS x 2.5:11 x34.1:48 x15.9:42 x 6.0:22 x10.3:16 x15.9:32 x 4.8:11 x 5.0:29 RCL x 5.0:22/4 x34.5:48/40 x15.4:35/23 x 9.5:22/7 x10.6:15/12 x23.6:46/18 x12.2:31/7 x 5.8:20/7
• Better performance and scalability when time in CS > 70%
– Performance improvement correlated with time in CS
• Only one or two locks replaced each time
0 1 2 3 4 5 Memcached Set Raytrace Balls4 String Match Linear Regression Memcached Get Radiosity Raytrace Car Matrix Multiply 0 1 2 3 4 5
Best performance relative to best POSIX performance
POSIX x 1.9:3 x25.8:35 x12.8:35 x 4.7:22 x 4.8:10 x10.2:15 x 3.6:5 x 3.6:21 CAS spinlock x 2.6:7 x24.4:31 x10.0:26 x 4.1:11 x 8.8:10 x10.5:13 x 4.5:6 x 3.5:8 Flat Combining x30.2:48 x15.7:41 x 6.6:21 x16.1:36 x 5.4:16 x 5.1:18 MCS x 2.5:11 x34.1:48 x15.9:42 x 6.0:22 x10.3:16 x15.9:32 x 4.8:11 x 5.0:29 RCL x 5.0:22/4 x34.5:48/40 x15.4:35/23 x 9.5:22/7 x10.6:15/12 x23.6:46/18 x12.2:31/7 x 5.8:20/7
• Better performance and scalability when time in CS > 70%
– Performance improvement correlated with time in CS
• Only one or two locks replaced each time
0 1 2 3 4 5 Memcached Set Raytrace Balls4 String Match Linear Regression Memcached Get Radiosity Raytrace Car Matrix Multiply 0 1 2 3 4 5
Best performance relative to best POSIX performance
POSIX x 1.9:3 x25.8:35 x12.8:35 x 4.7:22 x 4.8:10 x10.2:15 x 3.6:5 x 3.6:21 CAS spinlock x 2.6:7 x24.4:31 x10.0:26 x 4.1:11 x 8.8:10 x10.5:13 x 4.5:6 x 3.5:8 Flat Combining x30.2:48 x15.7:41 x 6.6:21 x16.1:36 x 5.4:16 x 5.1:18 MCS x 2.5:11 x34.1:48 x15.9:42 x 6.0:22 x10.3:16 x15.9:32 x 4.8:11 x 5.0:29 RCL x 5.0:22/4 x34.5:48/40 x15.4:35/23 x 9.5:22/7 x10.6:15/12 x23.6:46/18 x12.2:31/7 x 5.8:20/7
• Better performance and scalability when time in CS > 70%
– Performance improvement correlated with time in CS
• Only one or two locks replaced each time
0 2 4 6 8 10
Order Status Stock Level
0 2 4 6 8 10
Performance relative to base application (100 clients)
Base POSIX CAS spinlock Flat Combining MCS-TP RCL x 1.4 x 0.1 x 1.0 x 0.9 x 4.3/10 x 1.4 x 0.2 x 1.6 x 1.4 x 7.7/10
• Berkeley DB with TPC-C (100 clients) • Large gains, % in CS underestimated
1 1.5 2 2.5 3 3.5 4 4.5 5 1 5 10 15 20 22 Speedup Number of cores POSIX CAS spinlock MCS RCL
• Memcached, SET requests:
1 10 100 1000 20 40 48 60 80 100 120 140 160 180 200 Throughput (transactions/s)
Number of clients (1 client = 1 thread) Base POSIX CAS spinlock Flat Combining MCS MCS-TP RCL
RCL Scalability (2)
H ig h er is b et ter• Berkeley DB / TPC-C, Stock Level requests:
1 10 100 1000 20 40 48 60 80 100 120 140 160 180 200 Throughput (transactions/s)
Number of clients (1 client = 1 thread)
MCS RCL
RCL Scalability (2)
Collapse! H ig h er is b et ter• Berkeley DB / TPC-C, Stock Level requests:
1 10 100 1000 20 40 48 60 80 100 120 140 160 180 200 Throughput (transactions/s)
Number of clients (1 client = 1 thread)
MCS-TP MCS RCL
RCL Scalability (2)
Collapse! H ig h er is b et ter• Berkeley DB / TPC-C, Stock Level requests:
• RCL reduces lock acquisition time and improves data locality • Profiler to detect when RCL can be useful
• Tool to ease the transformation of legacy code • Future work: adaptive RCL runtime
– Dynamically switch between locking strategies – Load balancing between servers