• Aucun résultat trouvé

Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications

N/A
N/A
Protected

Academic year: 2021

Partager "Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications"

Copied!
47
0
0

Texte intégral

(1)

HAL Id: hal-00991709

https://hal.inria.fr/hal-00991709

Submitted on 16 May 2014

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded

Applications

Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, Gilles Muller

To cite this version:

(2)

Remote Core Locking



Migrating Critical-Section Execution to Improve the

Performance of Multithreaded Applications



to appear at USENIX ATC’12

(3)

1 1.5 2 2.5 3 3.5 4 4.5 5 1 5 10 15 20 22 Speedup Number of cores Memcached/GET Memcached/SET

Problem: scalability

•  Many legacy applications don’t scale well on multicore architectures •  For instance, Memcached (GET/SET requests):

2

Experiments run on a 48-core, “magny-cours” x86 AMD machine

Higher

is

better

(4)

1 1.5 2 2.5 3 3.5 4 4.5 5 1 5 10 15 20 22 Speedup Number of cores Memcached/GET Memcached/SET

Problem: scalability

•  Many legacy applications don’t scale well on multicore architectures •  For instance, Memcached (GET/SET requests):

2

Collapses !

Experiments run on a 48-core, “magny-cours” x86 AMD machine

Higher

is

better

(5)

Why?

•  Critical sections = bottleneck on multicore architectures •  High contention ⇒ lock acquisition is costly

–  More cores ⇒ more contention

0%" 20%" 40%" 60%" 80%" 100%" 1" 4" 8" 16" 22" 32" 48" SPLASH-2/Radiosity" SPLASH-2/Raytrace" Phoenix2/LG" Phoenix2/SM" Phoenix2/MM" Memcached/GET" Memcached/SET" Berkeley DB/OS" Berkeley DB/SL" Number of cores" % o f ti me sp e n t in cri ti ca l se ct io n * 3

(6)

100 1000 10000 100000 1e+06

1000 10000 100000 1e+06

Execution time (cycles)

Delay (cycles)

•  Better resistance to contention

•  No need to redesign the application

•  Custom microbenchmark to compare locks:

Solution: designing better locks

4

Lower

is

better

"

 Higher contention" Lower contention " MCS "

[Mellor-Crummey91] "

CAS spinlock "

(7)

100 1000 10000 100000 1e+06

1000 10000 100000 1e+06

Execution time (cycles)

Delay (cycles)

•  Better resistance to contention

•  No need to redesign the application

•  Custom microbenchmark to compare locks:

Improvement!

Solution: designing better locks

4

Lower

is

better

"

 Higher contention" Lower contention " MCS "

[Mellor-Crummey91] "

CAS spinlock "

(8)

Objective: remove atomic instructions and reduce cache misses

•  Execute contended critical sections on a dedicated server core •  Very fast transfer of control, no sync on global variable

–  Faster than lock acquisitions when contention is high •  Shared data remains on server core ⇒ fewer cache misses

Remote Core Locking

(9)

Objective: remove atomic instructions and reduce cache misses

•  Execute contended critical sections on a dedicated server core •  Very fast transfer of control, no sync on global variable

–  Faster than lock acquisitions when contention is high •  Shared data remains on server core ⇒ fewer cache misses

Remote Core Locking

(10)

Objective: remove atomic instructions and reduce cache misses

•  Execute contended critical sections on a dedicated server core •  Very fast transfer of control, no sync on global variable

–  Faster than lock acquisitions when contention is high •  Shared data remains on server core ⇒ fewer cache misses

Remote Core Locking

(11)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

Implementation: general idea

(12)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

Implementation: general idea

6

(13)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&lock4!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

(14)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

(15)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

(16)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

(17)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

(18)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

(19)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

(20)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

cache  miss 

(21)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

Server executes critical section"

cache  miss 

(22)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

NULL!

Server executes critical section"

cache  miss 

(23)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

(24)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

NULL! cache  miss  cache  miss  cache  miss 

(25)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

(26)

•  Implementation based on cache line-sized mailboxes •  Three fields: lock, context, function

•  Client fills the field and waits for the function to be reset •  Server loops across the fields

&foo!

&lock4! 0xa0dc5f3a!

Implementation: general idea

6

Client thread 2 wants to execute a critical section protected by “lock4”!

Server continuously checks mailboxes and executes critical sections!

NULL! cache  miss  cache  miss  cache  miss 

(27)

100 1000 10000 100000 1e+06

1000 10000 100000 1e+06

Execution time (cycles)

Delay (cycles)

Performance

7 CAS spinlock " MCS " RCL "

 Higher contention" Lower contention "

Lower

is

better

(28)

100 1000 10000 100000 1e+06

1000 10000 100000 1e+06

Execution time (cycles)

Delay (cycles)

Performance

7 CAS spinlock " MCS " RCL "

 Higher contention" Lower contention "

Lower

is

better

"

(29)

100 1000 10000 100000 1e+06

1000 10000 100000 1e+06

Execution time (cycles)

Delay (cycles) CAS spinlock " MCS " RCL " POSIX " Flat Combining " [Hendler10] " "

 Higher contention" Lower contention "

(30)

Using RCL in legacy applications (1)

8

RCL Runtime :

•  Handles blocking in critical sections (I/O, page faults…)

–  Pool of servicing threads on server

–  Able to service other (independent) critical sections when blocked

•  Makes it possible to use condition variables (cond/wait)

(31)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)

9

(32)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)

9

Using RCL in legacy applications (2)

void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!

struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !

void __cs(struct context *c) {!

!c->a = f(c->a)! !f(c->b)!

(33)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)

9

Using RCL in legacy applications (2)

void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!

struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !

void __cs(struct context *c) {!

!c->a = f(c->a)! !f(c->b)!

(34)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)

9

Using RCL in legacy applications (2)

void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!

struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !

void __cs(struct context *c) {!

!c->a = f(c->a)! !f(c->b)!

(35)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)

•  Tool to reengineer applications automatically

–  Possible to pick which locks use RCL –  To avoid false serialization:

Possible to pick which server(s) handle which lock(s).

9

Using RCL in legacy applications (2)

void func(void) {! int a, b, x;! …! a = …;! …! pthread_mutex_lock();! a = f(a);! f(b);! pthread_mutex_unlock();! …! }!

struct context { int a, b };! ! void func(void) {! !struct context c;! !int x;! !…! !c.a = …;! !…! !execute_rcl(__cs, &c);! !…! }! !

void __cs(struct context *c) {!

!c->a = f(c->a)! !f(c->b)!

(36)

Reengineering:

•  Critical sections must be encapsulated into functions

–  Local variables sent as parameters (context)

•  Tool to reengineer applications automatically

–  Possible to pick which locks use RCL –  To avoid false serialization:

Possible to pick which server(s) handle which lock(s).

10

(37)

0 20 40 60 80 100 100 1000 10000 100000 1e+06 % of time in CS Delay (cycles)

All other locks have collapsed (60000 cycles): 70%

Collapse of POSIX (120000 cycles): 20%

% of time in CS

Profiling:

•  Custom profiler to find good candidates •  Metric: time spent in critical sections

•  Running the profiler on the microbenchmark shows that:

–  If time spent in CS > 20%, RCL is more efficient than POSIX locks –  If time spent in CS > 70%, RCL is more efficient than all other locks

11

(38)

Experiments

•  Benchmarks (highly contended ⇒ 70% time spent in CS):

–  SPLASH-2 benchmark suite

–  3 applications out of 10 are highly contended –  Phoenix2 benchmark suite

–  3 applications out of 7 are highly contended –  Memcached

–  Highly contended with the GET workload –  Berkeley DB / TPC-C

–  Highly contended with 2 workloads (Order Status, Stock Level)

(39)

0 1 2 3 4 5 Memcached Set Raytrace Balls4 String Match Linear Regression Memcached Get Radiosity Raytrace Car Matrix Multiply 0 1 2 3 4 5

Best performance relative to best POSIX performance

POSIX x 1.9:3 x25.8:35 x12.8:35 x 4.7:22 x 4.8:10 x10.2:15 x 3.6:5 x 3.6:21 CAS spinlock x 2.6:7 x24.4:31 x10.0:26 x 4.1:11 x 8.8:10 x10.5:13 x 4.5:6 x 3.5:8 Flat Combining x30.2:48 x15.7:41 x 6.6:21 x16.1:36 x 5.4:16 x 5.1:18 MCS x 2.5:11 x34.1:48 x15.9:42 x 6.0:22 x10.3:16 x15.9:32 x 4.8:11 x 5.0:29 RCL x 5.0:22/4 x34.5:48/40 x15.4:35/23 x 9.5:22/7 x10.6:15/12 x23.6:46/18 x12.2:31/7 x 5.8:20/7

•  Better performance and scalability when time in CS > 70%

–  Performance improvement correlated with time in CS

•  Only one or two locks replaced each time

(40)

0 1 2 3 4 5 Memcached Set Raytrace Balls4 String Match Linear Regression Memcached Get Radiosity Raytrace Car Matrix Multiply 0 1 2 3 4 5

Best performance relative to best POSIX performance

POSIX x 1.9:3 x25.8:35 x12.8:35 x 4.7:22 x 4.8:10 x10.2:15 x 3.6:5 x 3.6:21 CAS spinlock x 2.6:7 x24.4:31 x10.0:26 x 4.1:11 x 8.8:10 x10.5:13 x 4.5:6 x 3.5:8 Flat Combining x30.2:48 x15.7:41 x 6.6:21 x16.1:36 x 5.4:16 x 5.1:18 MCS x 2.5:11 x34.1:48 x15.9:42 x 6.0:22 x10.3:16 x15.9:32 x 4.8:11 x 5.0:29 RCL x 5.0:22/4 x34.5:48/40 x15.4:35/23 x 9.5:22/7 x10.6:15/12 x23.6:46/18 x12.2:31/7 x 5.8:20/7

•  Better performance and scalability when time in CS > 70%

–  Performance improvement correlated with time in CS

•  Only one or two locks replaced each time

(41)

0 1 2 3 4 5 Memcached Set Raytrace Balls4 String Match Linear Regression Memcached Get Radiosity Raytrace Car Matrix Multiply 0 1 2 3 4 5

Best performance relative to best POSIX performance

POSIX x 1.9:3 x25.8:35 x12.8:35 x 4.7:22 x 4.8:10 x10.2:15 x 3.6:5 x 3.6:21 CAS spinlock x 2.6:7 x24.4:31 x10.0:26 x 4.1:11 x 8.8:10 x10.5:13 x 4.5:6 x 3.5:8 Flat Combining x30.2:48 x15.7:41 x 6.6:21 x16.1:36 x 5.4:16 x 5.1:18 MCS x 2.5:11 x34.1:48 x15.9:42 x 6.0:22 x10.3:16 x15.9:32 x 4.8:11 x 5.0:29 RCL x 5.0:22/4 x34.5:48/40 x15.4:35/23 x 9.5:22/7 x10.6:15/12 x23.6:46/18 x12.2:31/7 x 5.8:20/7

•  Better performance and scalability when time in CS > 70%

–  Performance improvement correlated with time in CS

•  Only one or two locks replaced each time

(42)

0 2 4 6 8 10

Order Status Stock Level

0 2 4 6 8 10

Performance relative to base application (100 clients)

Base POSIX CAS spinlock Flat Combining MCS-TP RCL x 1.4 x 0.1 x 1.0 x 0.9 x 4.3/10 x 1.4 x 0.2 x 1.6 x 1.4 x 7.7/10

•  Berkeley DB with TPC-C (100 clients) •  Large gains, % in CS underestimated

(43)

1 1.5 2 2.5 3 3.5 4 4.5 5 1 5 10 15 20 22 Speedup Number of cores POSIX CAS spinlock MCS RCL

•  Memcached, SET requests:

(44)

1 10 100 1000 20 40 48 60 80 100 120 140 160 180 200 Throughput (transactions/s)

Number of clients (1 client = 1 thread) Base POSIX CAS spinlock Flat Combining MCS MCS-TP RCL

RCL Scalability (2)

H ig h er is b et ter

•  Berkeley DB / TPC-C, Stock Level requests:

(45)

1 10 100 1000 20 40 48 60 80 100 120 140 160 180 200 Throughput (transactions/s)

Number of clients (1 client = 1 thread)

MCS RCL

RCL Scalability (2)

Collapse! H ig h er is b et ter

•  Berkeley DB / TPC-C, Stock Level requests:

(46)

1 10 100 1000 20 40 48 60 80 100 120 140 160 180 200 Throughput (transactions/s)

Number of clients (1 client = 1 thread)

MCS-TP MCS RCL

RCL Scalability (2)

Collapse! H ig h er is b et ter

•  Berkeley DB / TPC-C, Stock Level requests:

(47)

•  RCL reduces lock acquisition time and improves data locality •  Profiler to detect when RCL can be useful

•  Tool to ease the transformation of legacy code •  Future work: adaptive RCL runtime

–  Dynamically switch between locking strategies –  Load balancing between servers

Conclusion

Références

Documents relatifs

The gain in schedule length from Figure 2b is obtained by introducing the following flexibilities in the scheduling of communication fragments, while respecting read-exec-write

Precision of the estimation of communication costs Compared to the ILP formulation, the heuristic does not suffer from the aforementioned over-estimation, thus the communication cost

Pour éliminer cette corrélation et permettre l’exécution des applications indépendam- ment les unes des autres, nous proposons de contraindre l’accès à ces trois ressources (la

New design rules are necessary, which will facilitate the parallelization of the control algorithms, but at the same time have to minimize the re-design effort: on

ALAI, Les Front&es du droit dauteur: ses limites et exceptions. Australian Copyright Council, 1999. Cyberspace and the American Dream: A Magna Carta for the

The Sherwood number and the average value of Damköhler number show that the photocatalytic reaction under our experimental conditions is limited by the mass transfer. There

If we come to hate argument, and refuse obstinately to consider and follow only the best arguments in decision-making because we have grown cynical at the (sophistic) misuse