Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Yeung, Ging-Fung and Borowiec, Damian and Yang, Renyu and Friday, Adrian and Harper, R.H.R. and Garraghan, Peter (2021) Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems. IEEE Transactions on Parallel and Distributed Systems. ISSN 1045-9219 (In Press)

Text (TPDS_Horus_ging_fung_yeung)
TPDS_Horus_4_.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial.

Download (5MB)

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference causing slowdown. In this paper, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model’s computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5% for GPU resource utilization, 23.7–30.7% for makespan reduction and 68.3% in job wait time reduction.
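
The idea of using predicted GPU utilization as a proxy for co-location decisions can be illustrated with a minimal sketch. This is not the Horus algorithm, which the abstract does not specify; it is a simple greedy heuristic written only to show the general shape of utilization-based placement. The function names `place_job` and `schedule`, the 100% utilization cap, and the example utilization figures are all assumptions made for illustration, and the predicted values are taken as given rather than derived from a computation graph.

```python
from typing import Dict, List, Optional


def place_job(gpu_load: Dict[str, float], predicted_util: float,
              cap: float = 100.0) -> Optional[str]:
    """Pick the least-loaded GPU that can absorb the job without exceeding `cap`.

    `gpu_load` maps a GPU name to the sum of predicted utilizations (in percent)
    of the jobs already placed on it. Returns None if no GPU has room.
    The cap of 100% is an illustrative assumption, not a value from the paper.
    """
    candidates = [g for g, load in gpu_load.items()
                  if load + predicted_util <= cap]
    if not candidates:
        return None
    # Tie-break on the GPU name so the result is deterministic.
    best = min(candidates, key=lambda g: (gpu_load[g], g))
    gpu_load[best] += predicted_util
    return best


def schedule(jobs: Dict[str, float], gpus: List[str],
             cap: float = 100.0) -> Dict[str, Optional[str]]:
    """Greedily place jobs, largest predicted utilization first."""
    gpu_load = {g: 0.0 for g in gpus}
    placement: Dict[str, Optional[str]] = {}
    for name, util in sorted(jobs.items(), key=lambda kv: (-kv[1], kv[0])):
        placement[name] = place_job(gpu_load, util, cap)
    return placement


if __name__ == "__main__":
    # Hypothetical predicted utilizations (percent) for four jobs.
    jobs = {"a": 70.0, "b": 50.0, "c": 40.0, "d": 30.0}
    print(schedule(jobs, ["gpu0", "gpu1"]))
```

In this sketch, jobs whose combined predicted utilization would exceed the cap are never co-located, which is one plausible way to limit interference; a job that fits nowhere is left unplaced (`None`) and would wait in a queue.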

Item Type:
Journal Article
Journal or Publication Title:
IEEE Transactions on Parallel and Distributed Systems
Additional Information:
©2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Uncontrolled Keywords:
/dk/atira/pure/subjectarea/asjc/1700/1703
Subjects:
ID Code:
154311
Deposited By:
Deposited On:
28 Apr 2021 08:45
Refereed?:
Yes
Published?:
In Press
Last Modified:
25 Oct 2021 04:47