Tools for Intelligent Management of very large Computing Systems (TIMaCS)
Eugen Volk¹, Jochen Buchholz¹, Stefan Wesner¹,
Götz Isenmann², Jürgen Schwitalla², Marc Lohrer²,
Erich Focht³, Andreas Jeutter³,
Daniela Koudela⁴, Holger Mickler⁴, Polina Belonozhka⁴,
Matthias Schmidt⁵, Roland Schwarzkopf⁵
¹ High Performance Computing Center Stuttgart (HLRS), Stuttgart, Germany
² Science Computing AG, Tübingen, Germany
³ NEC High Performance Computing Europe, Stuttgart, Germany
⁴ Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH), Dresden, Germany
⁵ Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
The increasing complexity of current and future very large computing systems, consisting of 100,000 nodes and beyond with rapidly growing numbers of cores, leads to increasing effort for the administration and maintenance of these systems. Existing monitoring tools are neither scalable nor capable of reducing the overwhelming flow of information to the essential, high-value information. Current management tools lack the scalability and the capability to process huge amounts of information intelligently, that is, to relate data and information from various sources to one another in order to make the right decisions on error and fault handling. To solve these problems, we developed, within the scope of the TIMaCS project (www.timacs.de), a hierarchical, scalable, policy- and knowledge-based monitoring and management framework capable of proactively monitoring and managing very large computing systems.
This presentation will first outline the system architecture of the TIMaCS framework, which is based on technologies for virtualization, knowledge-based analysis and validation of the collected information, and administrator-defined metrics and policies. Second, we will show evaluation results of the TIMaCS framework, realized as an open and modular framework designed to integrate existing solutions such as Nagios and Ganglia. The evaluation results will highlight experiences gained with scalability and robustness during the operation of the solution on cluster systems at HLRS, ZIH, and the University of Marburg. Furthermore, we will present the role of the solution as part of the product and service offerings of Science Computing AG and NEC High Performance Computing Europe.