Article ID: iaor20121312
Volume: 63
Issue: 2
Start Page Number: 365
End Page Number: 377
Publication Date: Jan 2012
Journal: Computers and Mathematics with Applications
Authors: Yang Guangwen, Zheng Weimin, Yuan Yulai, Wu Yongwei, Wang Qiuping
Keywords: quality & reliability, networks, computers: information, statistics: empirical
The growing complexity and size of High Performance Computing systems (HPCs) lead to frequent job failures, which can cause significant performance degradation. To provide high-performance and reliable computing services, an in-depth understanding of the characteristics of HPC job failures is essential. In this paper, we present an empirical study of job failures across 10 public workload data sets collected from 8 large-scale HPCs around the world. Multiple analysis methods are applied to build a comprehensive and in-depth understanding of job failures. To facilitate the design, testing, and management of HPCs, we study job failures from four aspects: proportion of the workload and resource consumption, submission inter-arrival time, locality, and runtime. Our analysis shows that job failure rates are significant in most HPCs and that, on average, a failed job consumes more computational resources than a successful one. We also observe that the submission inter-arrival time of failed jobs is better fitted by the Generalized Pareto and Lognormal distributions, and that the probability of failed-job submission follows a 'V' shape: decreasing during the first 100 seconds after the submission of the last failed job and increasing afterward. The majority of job failures come from a small number of users and applications; furthermore, users are a stronger factor in job failures than applications. We find evidence that failed jobs’
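The distribution-fitting step described in the abstract can be sketched as follows. This is an illustrative example with synthetic data, not the paper's actual data sets or methodology; the distribution parameters and the use of the Kolmogorov-Smirnov statistic as the goodness-of-fit measure are assumptions made here for demonstration.

```python
# Sketch: fit Generalized Pareto and Lognormal distributions to
# inter-arrival times and compare goodness of fit. The data below are
# synthetic stand-ins for failed-job submission inter-arrival times.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical inter-arrival times in seconds (synthetic, lognormal).
inter_arrivals = rng.lognormal(mean=4.0, sigma=1.5, size=2000)

# Fit both candidate distributions (location fixed at 0 for simplicity).
gp_params = stats.genpareto.fit(inter_arrivals, floc=0)
ln_params = stats.lognorm.fit(inter_arrivals, floc=0)

# Kolmogorov-Smirnov statistic: smaller values indicate a better fit.
ks_gp = stats.kstest(inter_arrivals, 'genpareto', args=gp_params).statistic
ks_ln = stats.kstest(inter_arrivals, 'lognorm', args=ln_params).statistic
print(f"K-S statistic: Generalized Pareto={ks_gp:.4f}, Lognormal={ks_ln:.4f}")
```

On a real workload trace, the same comparison could be run per system to decide which distribution better models failed-job submissions, as the study does across its 8 HPCs.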