A Scalable and Robust Run-Time Environment for Modern Many-core Architectures
Thomas Fuhrmann, Technische Universität München, Germany
The J-Cell project aims to mitigate the hardships of modern, highly parallel processors, namely limited memory bandwidth and the high cost of cache consistency, with a run-time system that hides the processors' distributed nature. It creates a single system image in which all processor cores seemingly share a global address space. J-Cell is fully decentralized, so a failing processor or memory component does not affect the system as a whole.
To date, there are two main architectural models for parallel computers. High-performance computing has long applied the Message Passing Interface (MPI) model, in which all mutual interaction is based on messages; in the partitioned global address space (PGAS) model, processors communicate via the system's main memory. While the former is well suited for applications such as fluid dynamics or molecular dynamics, the latter is a better fit for irregular applications.
The J-Cell run-time system pursues a PGAS approach. It is based on the transactional memory paradigm, in which all computation occurs in the form of atomic transactions. Either the software developers structure their programs accordingly at the source-code level, or the J-Cell run-time system automatically splits the instruction sequence into transactions. A distributed consistency algorithm merges the transactions' modifications of the global state into one causally consistent partial order; if consistency cannot be achieved, the respective transaction has to roll back and retry. Even before global consensus about consistency and potential rollbacks has been reached, transactions can proceed speculatively, so that the communication latency of the consistency algorithm is hidden from the application threads.
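The commit-or-retry behavior described above can be illustrated with a minimal, single-process sketch of optimistic transactions. This is not the J-Cell consistency algorithm, which is distributed and speculative; the names `Store`, `Transaction`, `Conflict`, and `run_atomically` are hypothetical and chosen only for this illustration. A transaction buffers its writes, remembers the version of every cell it read, and commits only if those versions are still current; otherwise it discards its writes and runs again.

```python
class Conflict(Exception):
    """Raised when a cell read by a transaction changed before commit."""


class Store:
    """Toy global state (hypothetical): address -> (version, value)."""

    def __init__(self):
        self.cells = {}

    def read(self, addr):
        # Unknown addresses behave like version 0 with no value.
        return self.cells.get(addr, (0, None))


class Transaction:
    """Buffers writes and records the versions of all cells it has read."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # addr -> version seen at first read
        self.writes = {}         # addr -> buffered new value

    def read(self, addr):
        if addr in self.writes:  # a transaction sees its own writes
            return self.writes[addr]
        version, value = self.store.read(addr)
        self.read_versions.setdefault(addr, version)
        return value

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        # Validate: every cell read must still have the version we saw.
        for addr, seen in self.read_versions.items():
            if self.store.read(addr)[0] != seen:
                raise Conflict(addr)
        # Publish the buffered writes, bumping each cell's version.
        for addr, value in self.writes.items():
            version, _ = self.store.read(addr)
            self.store.cells[addr] = (version + 1, value)


def run_atomically(store, body, max_retries=10):
    """Run body(tx) until it commits; returns (result, attempts)."""
    for attempt in range(1, max_retries + 1):
        tx = Transaction(store)
        try:
            result = body(tx)
            tx.commit()
            return result, attempt
        except Conflict:
            continue  # rollback is implicit: the buffered writes are dropped
    raise RuntimeError("transaction did not commit")
```

A caller would wrap each unit of work in a function that takes the transaction object, for example `run_atomically(store, lambda tx: tx.write("x", tx.read("x") + 1))`. Buffering writes until validation succeeds is what makes the rollback cheap in this sketch: an aborted transaction leaves no trace in the shared state.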
As a side effect of this asynchronous consistency algorithm, the J-Cell run-time system creates a version history of the global application state. If parts of that state are lost due to a hardware failure, the J-Cell run-time system can transparently roll back to the most recent consistent state that is still available. Thread scheduling heuristics and a distributed garbage collection algorithm control the amount of state that is kept in the system. In this way, the available redundancy can be adjusted to the expected failure rate in the system.
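The interplay between version history, garbage collection, and recovery can be sketched as follows. This is a simplified model under stated assumptions, not the J-Cell implementation: each version is treated as a complete snapshot held by a set of nodes, a version counts as available while at least one of its holders is alive, and the hypothetical parameter `keep` stands in for the garbage collection policy that bounds how much history is retained.

```python
class VersionHistory:
    """Toy version history (hypothetical): retains the newest `keep` versions."""

    def __init__(self, keep):
        self.keep = keep      # redundancy knob: number of versions retained
        self.versions = {}    # version -> (state snapshot, nodes holding it)
        self.latest = 0
        self.failed = set()   # nodes known to have failed

    def commit(self, state, holders):
        """Record a new consistent version stored on the given nodes."""
        self.latest += 1
        self.versions[self.latest] = (dict(state), set(holders))
        # Garbage collection: drop everything older than the newest `keep`.
        for old in [v for v in self.versions if v <= self.latest - self.keep]:
            del self.versions[old]
        return self.latest

    def fail(self, node):
        """Mark a node as failed; whatever it alone held is lost."""
        self.failed.add(node)

    def recover(self):
        """Return (version, state) of the newest version still available."""
        for v in sorted(self.versions, reverse=True):
            state, holders = self.versions[v]
            if holders - self.failed:  # at least one live holder remains
                return v, dict(state)
        return None  # no consistent version survived
```

In this model, raising `keep` or spreading each version over more holders makes it more likely that `recover` finds a usable state after a failure, at the cost of more retained memory. That is the trade-off between redundancy and expected failure rate described above.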
The J-Cell run-time system has been developed as a library for C/C++ applications and as a Java virtual machine. Both systems are proof-of-concept prototypes that demonstrate the potential of the J-Cell approach. When using the C/C++ library, the application programmer has to instrument the source code with calls into that library, for example, to indicate the application's memory layout and the transaction boundaries. The Java VM automatically retrieves the required information from the application's bytecode, so that it can run unmodified applications. The J-Cell system is complemented by a fully decentralized file system that applies the underlying algorithmic ideas to the context of cloud computing.
J-Cell has demonstrated its potential with the help of a bioinformatics use case, the FTrees application of BioSolveIT GmbH. The project is supported by the German Ministry of Education and Research under grant number 01IH08011.