Reducing late-timing failure at scale : straggler root-cause analysis in cloud datacenters

Ouyang, Xue and Garraghan, Peter and Yang, Renyu and Townend, Paul and Xu, Jie (2016) Reducing late-timing failure at scale : straggler root-cause analysis in cloud datacenters. In: 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016-06-28 - 2016-07-01, École Nationale de l’Aviation Civile (ENAC).

PDF (Reducing Late Timing Failure)
Reducing_Late_Timing_Failure.pdf - Accepted Version
Available under License Creative Commons Attribution.


Abstract

Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution offer limited effectiveness due to (i) a lack of precise knowledge of straggler root causes and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.
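
As a rough illustration of the kind of analysis the abstract describes, the sketch below flags stragglers within a single job and correlates straggler occurrence with one candidate factor. It is not the paper's method: the 1.5x-median straggler threshold, the use of Pearson correlation, the function names, and the sample values are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, median


def flag_stragglers(durations, threshold=1.5):
    """Mark a task as a straggler (1) if it runs longer than
    threshold x the median task duration of its job, else 0.
    The 1.5x threshold is an assumption, not taken from the paper."""
    cutoff = threshold * median(durations)
    return [1 if d > cutoff else 0 for d in durations]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Returns 0.0 when either sequence has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


# Hypothetical measurements for the tasks of one job (made-up values):
# task duration in seconds, and CPU utilization of the hosting server
# while the task ran, used here as a stand-in for resource contention.
durations = [10.0, 11.0, 9.5, 10.5, 25.0, 10.2, 30.0, 9.8]
cpu_util = [0.40, 0.45, 0.35, 0.50, 0.95, 0.42, 0.97, 0.38]

flags = flag_stragglers(durations)
r = pearson(flags, cpu_util)
print("straggler flags:", flags)
print("correlation with CPU utilization:", round(r, 3))
```

A coefficient near 1 on real trace data would suggest that stragglers tend to coincide with high utilization, but correlation alone does not establish a root cause. The same comparison could be repeated for other factors named in the abstract, such as task concurrency or server failures.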

Item Type:
Contribution to Conference (Paper)
Journal or Publication Title:
46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
ID Code:
82342
Deposited By:
Deposited On:
21 Oct 2016 14:00
Refereed?:
Yes
Published?:
Published
Last Modified:
10 Jan 2024 00:03