E-book Supercomputing Frontiers
High performance computing (HPC) clusters are no longer deployed only in large research centers; they are also widely adopted by industries such as chip design and manufacturing and the life sciences. This trend brings more diverse workload patterns to HPC clusters than traditional scientific applications do. Because those clusters normally consist of thousands of nodes, it is common to use resource managers (e.g. IBM Spectrum LSF [1], Slurm [2], Moab [3]) to manage resources and decide which resources to allocate to the applications submitted by end users. Resource managers enable multiple users to share massive cluster resources by scheduling applications as batch jobs in queue systems. However, end users generally have little knowledge of computing resources, while resource managers normally rely on accurate user-specified resource requirements to schedule and allocate resources. This conflict makes it hard for cluster administrators to achieve high resource utilization and job execution efficiency in their clusters. For example, when users over-estimate the resource usage of their applications, the resource manager ultimately places fewer jobs in the cluster, because the additional reserved resources cannot be used by other waiting jobs. Conversely, an application may fail because of resource contention when users under-estimate its resource usage. Another consequence of inaccurate memory requirements is budget wasted on excess memory when bursting workloads to the cloud, where resources are charged by size over time [25].

Recent rapid progress in machine learning offers an opportunity to make resource managers smarter. Specifically, resource managers normally record job resource usage together with the job submission options (e.g. submission queue, job command) as accounting information after applications complete. Applications in large production clusters are normally run repeatedly. Therefore, it is possible to explore the relationship between resource usage and job patterns from historical job records. Previous work has addressed predicting job memory usage [4,5], job runtimes [6,7], etc. Most of that work focuses on building models using all of the historical data and on comparing the prediction accuracy of various machine learning algorithms.
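To make the idea of learning from accounting records concrete, the following is a minimal sketch only, and not the method of this paper or of any cited work. It assumes a hypothetical record layout of (queue, command, peak memory in MB) and uses a deliberately simple rule in place of a trained model: take the largest peak observed for the same queue and command, add an integer percentage of headroom, and fall back to the user's request when no history exists. The names `HISTORY`, `build_index`, and `predict_memory_mb` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical accounting records: (queue, command, peak memory in MB).
# Real resource managers record many more fields; this layout is an assumption.
HISTORY = [
    ("normal", "sim_a", 900),
    ("normal", "sim_a", 1000),
    ("normal", "sim_a", 1200),
    ("short", "lint", 100),
]


def build_index(records):
    """Group observed peak memory values by (queue, command)."""
    index = defaultdict(list)
    for queue, command, peak_mb in records:
        index[(queue, command)].append(peak_mb)
    return index


def predict_memory_mb(index, queue, command, user_request_mb, headroom_pct=10):
    """Estimate memory for a repeated job from its own history.

    Uses the largest observed peak plus `headroom_pct` percent (integer
    arithmetic). Falls back to the user's request when the job pattern
    has never been seen before.
    """
    samples = index.get((queue, command))
    if not samples:
        return user_request_mb
    return max(samples) * (100 + headroom_pct) // 100


if __name__ == "__main__":
    index = build_index(HISTORY)
    # A user asks for 4000 MB, but past runs of this job peaked at 1200 MB.
    print(predict_memory_mb(index, "normal", "sim_a", user_request_mb=4000))
    # No history for this job, so the user's request is kept.
    print(predict_memory_mb(index, "normal", "new_job", user_request_mb=2048))
```

In this sketch an over-estimated request is reduced to a value grounded in past runs, while unseen jobs are left untouched. A learned model would replace the max-plus-headroom rule with a prediction from the submission options.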