Horus : An Interference-aware Resource Manager for Deep Learning Systems

Yeung, Gingfung and Borowiec, Damian and Yang, Renyu and Friday, Adrian and Harper, R.H.R. and Garraghan, Peter (2020) Horus : An Interference-aware Resource Manager for Deep Learning Systems. In: Algorithms and Architectures for Parallel Processing (ICA3PP 2020). Lecture Notes in Computer Science. Springer, pp. 492-508. ISBN 9783030602383

Text (ICA3PP - Horus - Yeung (Accepted))
ICA3PP_Horus_Yeung_Accepted_.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - ranging from a single GPU device to machine clusters - require state-of-the-art resource management to increase resource utilization and job throughput. While co-location - multiple jobs co-located within the same GPU - has been identified as an effective means to achieve this, such co-location incurs performance interference that directly degrades DL training and inference performance. Existing approaches to mitigate interference require resource-intensive and time-consuming kernel profiling, ill-suited for runtime scheduling decisions, and current DL system resource managers are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of relying on expensive kernel profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement, minimizing the likelihood of interference while improving system resource utilization and makespan. Our analysis shows that interference causes up to a 3.2x DL job slowdown. We integrate our approach within the Kubernetes resource manager and conduct experiments in a DL cluster, training 2,500 DL jobs using 13 different model types. Results demonstrate that Horus outperforms other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.
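The placement idea the abstract describes - estimating each job's GPU utilization and choosing a co-location that keeps predicted interference low - can be sketched as a simple greedy scheduler. This is an illustrative sketch only, not the paper's actual Horus algorithm; all class names, the saturation cap, and the utilization estimates are assumptions made for the example.

```python
# Illustrative sketch (not the paper's Horus algorithm): greedily place a DL job
# on the GPU whose predicted combined utilization stays lowest, skipping GPUs
# that a co-location would push past a saturation cap, where interference and
# job slowdown grow sharply. All names and thresholds here are assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    predicted_util: float  # estimated GPU utilization fraction, 0.0-1.0

@dataclass
class GPU:
    name: str
    jobs: List[Job] = field(default_factory=list)

    @property
    def utilization(self) -> float:
        # Predicted utilization of all jobs currently co-located on this GPU.
        return sum(j.predicted_util for j in self.jobs)

def place(job: Job, gpus: List[GPU], cap: float = 1.0) -> Optional[GPU]:
    """Pick the GPU with the lowest predicted utilization after adding the
    job; return None (i.e. queue the job) if every GPU would exceed the cap."""
    feasible = [g for g in gpus if g.utilization + job.predicted_util <= cap]
    if not feasible:
        return None  # queue rather than over-saturate a GPU
    best = min(feasible, key=lambda g: g.utilization)
    best.jobs.append(job)
    return best

gpus = [GPU("gpu0"), GPU("gpu1")]
place(Job("resnet50", 0.6), gpus)    # lands on gpu0 (both empty, first wins)
place(Job("bert-small", 0.3), gpus)  # lands on gpu1 (lower predicted load)
placed = place(Job("vgg16", 0.5), gpus)
print(placed.name if placed else "queued")  # gpu0 would exceed the cap
```

A real interference-aware manager would replace the additive utilization model with a learned estimate of co-location slowdown, but the control flow - predict, filter saturated devices, pick the least-loaded feasible one - is the same shape.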

Item Type:
Contribution in Book/Report/Proceedings
Uncontrolled Keywords:
/dk/atira/pure/subjectarea/asjc/1700/1708
Subjects:
machine learning systems, performance interference, deep learning, gpu scheduling, cluster resource management, hardware and architecture, artificial intelligence, computer networks and communications, software
ID Code:
145713
Deposited By:
Deposited On:
16 Jul 2020 16:15
Refereed?:
Yes
Published?:
Published
Last Modified:
11 Nov 2024 01:44