FETOL – Towards Fault Tolerant Massively Parallel Computations on Peta-scale Platforms
Manfred Krafczyk, iRMB, TU Braunschweig
Bernhard Schott, Platform Computing GmbH
Erich Focht, NEC Deutschland GmbH
(slides)
It is well known that for massively parallel computations beyond the Teraflop scale, the combined probability of local hardware and network failures reaches a level that substantially decreases the productivity of HPC systems, because submitted jobs fail even at moderate runtimes. This also holds for sub-Teraflop applications with extreme runtimes, such as molecular dynamics (MD) applications. It is therefore mandatory to create software frameworks that increase the resilience of HPC applications to partial failures of the underlying hardware resources and thus avoid a complete restart of a massively parallel application run.
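The scaling argument above can be illustrated with a minimal sketch. It assumes that node failures are independent and exponentially distributed, which is a common textbook simplification and not a model taken from the FETOL project. The function names and the per-node MTBF of five years are hypothetical values chosen only for illustration.

```python
import math


def job_failure_probability(nodes: int, runtime_hours: float,
                            node_mtbf_hours: float) -> float:
    """Probability that at least one node fails during the run.

    Assumption: failures are independent and exponentially distributed
    with the same mean time between failures (MTBF) on every node.
    """
    return 1.0 - math.exp(-nodes * runtime_hours / node_mtbf_hours)


def system_mtbf_hours(nodes: int, node_mtbf_hours: float) -> float:
    """MTBF of the whole machine under the same independence assumption."""
    return node_mtbf_hours / nodes


# Hypothetical per-node MTBF of five years, for illustration only.
NODE_MTBF_HOURS = 5 * 365 * 24

for nodes in (100, 1_000, 10_000, 100_000):
    p_fail = job_failure_probability(nodes, runtime_hours=24.0,
                                     node_mtbf_hours=NODE_MTBF_HOURS)
    mtbf = system_mtbf_hours(nodes, NODE_MTBF_HOURS)
    print(f"{nodes:>7} nodes: system MTBF {mtbf:8.1f} h, "
          f"P(24 h job fails) = {p_fail:.3f}")
```

Under these assumptions the failure probability depends only on the product of node count and runtime, which is why both very large jobs of moderate duration and smaller jobs with extreme runtimes are affected.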
In this talk we will present the structural approach for such a framework, which is to be developed in the recently started BMBF-HPC project FETOL. The talk will illustrate the project concept and describe methodological extensions for several software layers, including the communication frameworks, the target applications (from CFD and MD), the scheduler, and an additional middleware layer tentatively termed Job-Manager. We will conclude by discussing potential shortcomings and limits of our approach.