FTB-Enabled Fault Tolerance in MVAPICH2


Academic year: 2021


(2)

Fault Tolerance with MVAPICH2

(3)

FTB Workflows in MVAPICH2

• Checkpoint-Restart Mechanism

• Fast Checkpointing by Write Aggregation with Dynamic Buffer Interleaving

(4)

Checkpoint-Restart Workflow

[Figure: timeline with a Job Launcher and two Compute Nodes. Labels: Start application, Ckpt Rqst, and alternating Computation and Checkpoint intervals on each compute node.]

• Phase 1: Suspend communication between all processes

• Phase 2: Use the checkpoint library (BLCR) to checkpoint the individual processes
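
The two phases listed above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `SimProcess`, `coordinated_checkpoint`, and `restart` are hypothetical names, `pickle` stands in for the checkpoint library (BLCR in the slides), and the resume step at the end is an assumption, since the slide text shows only Phases 1 and 2.

```python
import pickle


class SimProcess:
    """Stand-in for one MPI process (hypothetical; not an MVAPICH2 type)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"iteration": 0}
        self.comm_suspended = False

    def compute(self):
        self.state["iteration"] += 1

    def send(self, payload):
        # Sending is refused while a checkpoint is in progress.
        if self.comm_suspended:
            raise RuntimeError("communication is suspended")
        return (self.rank, payload)


def coordinated_checkpoint(procs):
    # Phase 1: suspend communication between all processes.
    for p in procs:
        p.comm_suspended = True
    # Phase 2: checkpoint each process individually.
    # pickle is a stand-in for the checkpoint library here.
    images = {p.rank: pickle.dumps(p.state) for p in procs}
    # Assumption: communication resumes once every image is taken.
    for p in procs:
        p.comm_suspended = False
    return images


def restart(images):
    """Rebuild the simulated processes from their saved images."""
    procs = []
    for rank, image in sorted(images.items()):
        p = SimProcess(rank)
        p.state = pickle.loads(image)
        procs.append(p)
    return procs
```

Suspending communication before any image is taken is what keeps the per-process images mutually consistent: no message can be sent by one process after its own checkpoint and received by another before that process's checkpoint.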

(5)

Write Aggregation Workflow

• Aggregate many VFS writes into larger chunk writes

– Reduce disk seeks, improve bandwidth utilization

• Overlap application progress with slow checkpoint file writing

– Improve application execution time

– Application-perceived checkpoint time = t2 - t1

– (t3 - t2) is the overlap between Application-Progress and File-Write

– Vulnerable to failures in this window

– Provide an interface to poll for File-Write completion

[Figure: timeline of Application Processes and IO Processes with time marks t1, t2, t3 and the labels Phase 1 and Phase 2.]
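
The write-aggregation idea can be sketched in a few lines of Python. This is a minimal illustration, not the MVAPICH2 implementation: the class `WriteAggregator`, its methods, and the chunk size are all hypothetical, and a background thread stands in for the IO processes. Small writes are copied into a buffer and returned immediately; full chunks are handed to the IO thread, and `done()` plays the role of the polling interface for file-write completion.

```python
import io
import queue
import threading


class WriteAggregator:
    """Collects small writes into large chunks written by an IO thread."""

    def __init__(self, sink, chunk_size=4096):
        self.sink = sink                  # any object with a write(bytes) method
        self.chunk_size = chunk_size
        self.buffer = bytearray()
        self.chunks = queue.Queue()
        self.sink_writes = 0              # number of large writes issued to the sink
        self.io_thread = threading.Thread(target=self._drain, daemon=True)
        self.io_thread.start()

    def write(self, data):
        """Application side: buffer the data and return without touching the sink."""
        self.buffer.extend(data)
        while len(self.buffer) >= self.chunk_size:
            self.chunks.put(bytes(self.buffer[:self.chunk_size]))
            del self.buffer[:self.chunk_size]

    def close(self):
        """Hand over the final partial chunk and signal end of data."""
        if self.buffer:
            self.chunks.put(bytes(self.buffer))
            self.buffer.clear()
        self.chunks.put(None)

    def done(self):
        """Poll for file-write completion without blocking."""
        return not self.io_thread.is_alive()

    def wait(self):
        """Block until the IO thread has written every chunk."""
        self.io_thread.join()

    def _drain(self):
        # IO side: write one large chunk at a time until the end marker arrives.
        while True:
            chunk = self.chunks.get()
            if chunk is None:
                return
            self.sink.write(chunk)
            self.sink_writes += 1


sink = io.BytesIO()
agg = WriteAggregator(sink, chunk_size=1024)
for _ in range(100):
    agg.write(b"x" * 100)   # 100 small writes of 100 bytes each
agg.close()                 # the application can resume here (t2 in the slide)
agg.wait()                  # the file write has finished here (t3 in the slide)
```

In this sketch the data is only safe once `done()` returns true, which mirrors the slide's point that the interval between t2 and t3 is vulnerable to failures.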

(6)

[Figure: process-migration diagram. Components: MPI Proc Manager, Login Node, Compute Nodes running MPI Procs and Agents, Compute Node (Source), Spare (Target). Labels, in order of appearance: Drain Inflight Messages, Trigger, MIGRATE, Phase 1, Phase 2, Proc Images, PII_COMPLETE, RESTART, Phase 3, migration, Phase 4, Barrier + Resume, Barrier.]

(7)

FTB Events in MVAPICH2 CR Framework

CR_FTB_CHECKPOINT

– Request a checkpoint of the MPI job

– Sent by mpirun_rsh

CR_FTB_CKPT_DONE

– Checkpoint completed successfully

– Sent by MPI processes that were able to checkpoint

CR_FTB_CKPT_FAIL

– Checkpoint failed

– Sent by MPI processes that failed to take the checkpoint

CR_FTB_RSRT_DONE

– Restart completed successfully

– Sent by MPI processes that were able to restart

CR_FTB_RSRT_FAIL

– Restart failed

– Sent by MPI processes that failed to restart

CR_FTB_CKPT_FINALIZE

– CR module has shut down

– Sent by all MPI processes calling MPI_Finalize()

CR_FTB_APP_CKPT_REQ

– Sent by the MPI process from which the user requested a checkpoint through
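
To show how the per-process DONE and FAIL events could be combined into a job-wide result, here is a minimal sketch. It does not use the FTB client API; the function `checkpoint_outcome`, the `(rank, event_name)` tuple format, and the rule "any failure fails the checkpoint" are assumptions made for illustration. Only the event names come from the slide.

```python
from collections import Counter

CKPT_DONE = "CR_FTB_CKPT_DONE"
CKPT_FAIL = "CR_FTB_CKPT_FAIL"


def checkpoint_outcome(events, num_procs):
    """Return 'done', 'failed', or 'pending' from (rank, event_name) pairs."""
    latest = {}
    for rank, name in events:
        # Ignore events that do not report a checkpoint result.
        if name in (CKPT_DONE, CKPT_FAIL):
            latest[rank] = name
    counts = Counter(latest.values())
    if counts[CKPT_FAIL] > 0:
        return "failed"       # assumption: one failure invalidates the checkpoint
    if counts[CKPT_DONE] == num_procs:
        return "done"
    return "pending"          # some processes have not reported yet
```

The same pattern would apply to CR_FTB_RSRT_DONE and CR_FTB_RSRT_FAIL on restart.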

(8)

FTB Events in MVAPICH2 Migration Framework

PREDICTOR_NODES_FAILURE

– An FTB-aware failure predictor informs the Migration Manager about a possible failure

– Thrown by: Failure Predictor

REQ_MIGRATE

– The Migration Manager selects a spare node from a given set and provides it to mpispawn

– Thrown by: Migration Manager

MIGRATE_DONE

– mpispawn informs the Migration Manager about successful migration

– Thrown by: mpispawn on target node

MIG_NODE_MIGRATED

– Migration successfully completed

– Thrown by: Migration Manager

MIG_NODE_MIGRATE_FAILED

– Migration failed
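
The order in which these events are listed suggests a simple state machine, sketched below. The event names are from the slide; the state names, the `replay` helper, and the choice of state from which MIG_NODE_MIGRATE_FAILED may be raised are assumptions, since the slide does not spell out the transitions.

```python
# (current_state, event_name) -> next_state; state names are hypothetical.
TRANSITIONS = {
    ("idle", "PREDICTOR_NODES_FAILURE"): "failure_predicted",
    ("failure_predicted", "REQ_MIGRATE"): "migrating",
    ("migrating", "MIGRATE_DONE"): "target_restarted",
    ("target_restarted", "MIG_NODE_MIGRATED"): "migrated",
    # Assumption: a failure is reported while the migration is in progress.
    ("migrating", "MIG_NODE_MIGRATE_FAILED"): "failed",
}


def replay(events, state="idle"):
    """Apply a sequence of event names and return the final state."""
    for name in events:
        key = (state, name)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected event {name!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state


final_state = replay([
    "PREDICTOR_NODES_FAILURE",
    "REQ_MIGRATE",
    "MIGRATE_DONE",
    "MIG_NODE_MIGRATED",
])
```

Rejecting out-of-order events makes it easy to spot a component that publishes, for example, MIGRATE_DONE before any REQ_MIGRATE was seen.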

(9)

Availability and Future Plans

• Availability

– FTB-CR was incorporated in MVAPICH2 1.4

– FTB-enabled CR with write aggregation and job migration has been incorporated in MVAPICH2 1.6

– It is available at http://mvapich.cse.ohio-state.edu

– Detailed guidelines for using FTB support in MVAPICH2 are available at https://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5.1.html#x1-400006.6

• Future Plans
