Failure Detection and Propagation in HPC systems

George Bosilca; Aurélien Bouteiller; Amina Guermouche; Thomas Hérault; Yves Robert; Pierre Sens; Jack Dongarra

Conference paper, Year: 2016

George Bosilca

Role: Author

Innovative Computing Laboratory [Knoxville]

Aurélien Bouteiller

Role: Author

Innovative Computing Laboratory [Knoxville]

Amina Guermouche

Role: Author
PersonId: 170800
IdHAL: aguermouche

Innovative Computing Laboratory [Knoxville]

Thomas Hérault

Role: Author
PersonId: 954004

Innovative Computing Laboratory [Knoxville]

Yves Robert

Role: Author
PersonId: 739318
IdHAL: yves-robert
ORCID: 0000-0003-2361-055X
IdRef: 029813611

Optimisation des ressources : modèles, algorithmes et ordonnancement

Innovative Computing Laboratory [Knoxville]

Pierre Sens

Role: Author
PersonId: 737442
IdHAL: pierre-sens
ORCID: 0000-0002-5156-7715
IdRef: 259987166

Large-Scale Distributed Systems and Applications

Jack Dongarra

Role: Author
PersonId: 863940

Innovative Computing Laboratory [Knoxville]

Abstract

Building an infrastructure for Exascale applications requires, among many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector that maintains and distributes the correct list of alive resources within proven and scalable bounds. The detection and the distribution of fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by having each node observed by a single other node, which keeps the detector unobtrusive. The propagation stage uses a non-uniform variant of a reliable broadcast over a circulant graph overlay network and guarantees logarithmic fault propagation. Extensive simulations, together with experiments on the Titan ORNL supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an Exascale-ready algorithm.
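The two overlays named in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes that each node is watched by its nearest alive successor on the ring, that the circulant graph links node i to nodes i ± 2^k (mod n), and that propagation is a plain flood rather than the non-uniform reliable broadcast the authors describe. All function names are hypothetical.

```python
from collections import deque


def observer_of(node, alive, n):
    """Return the nearest alive successor of `node` on a ring of n nodes.

    Assumption: the observer of a node is the next alive node clockwise.
    """
    for step in range(1, n):
        candidate = (node + step) % n
        if candidate in alive:
            return candidate
    return None  # no other alive node is left


def circulant_neighbors(node, n):
    """Neighbors of `node` at offsets +/- 2^k (an assumed jump set)."""
    neighbors = set()
    jump = 1
    while jump < n:
        neighbors.add((node + jump) % n)
        neighbors.add((node - jump) % n)
        jump *= 2
    neighbors.discard(node)
    return neighbors


def broadcast_rounds(source, alive, n):
    """Flood from `source` over alive nodes only.

    Returns a dict mapping each reached node to the round in which it
    first learns of the failure (breadth-first search distance).
    """
    rounds = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in circulant_neighbors(current, n):
            if nxt in alive and nxt not in rounds:
                rounds[nxt] = rounds[current] + 1
                queue.append(nxt)
    return rounds
```

For example, with 16 nodes and node 5 failed, `observer_of(5, alive, 16)` names the node that would notice the missing heartbeats, and `broadcast_rounds` started from that node gives the number of rounds each survivor needs to learn of the failure. Under these assumptions the largest round count stays within log2(n), which is the kind of logarithmic bound the abstract refers to.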

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

sc16-hal.pdf (450.81 KB)

Origin: Files produced by the author(s)

Contributor: Pierre Sens

https://inria.hal.science/hal-01352109

Submitted on: Wednesday, October 26, 2016, 15:22:56

Last modified on: Tuesday, October 3, 2023, 17:18:04

Dates and versions

hal-01352109, version 1 (26-10-2016)

Identifiers

HAL Id: hal-01352109, version 1

Cite

George Bosilca, Aurélien Bouteiller, Amina Guermouche, Thomas Hérault, Yves Robert, et al. Failure Detection and Propagation in HPC systems. SC 2016 - The International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2016, Salt Lake City, United States. ⟨hal-01352109⟩

Collections

ENS-LYON UPMC CNRS INRIA UNIV-LYON1 LIP6 INRIA2 SORBONNE-UNIVERSITE SU-SCIENCES TSP-PARALLEL-DISTRIBUTED-SYSTEMS UDL
