Ouyang, Xue and Garraghan, Peter and Wang, Changjian and Townend, Paul and Xu, Jie (2016) An approach for modeling and ranking node-level stragglers in cloud datacenters. In: 2016 IEEE International Conference on Services Computing (SCC). IEEE, pp. 673-680. ISBN 9781509026289
An_Approach_for_Modeling_and_Ranking_Node_level_Stragglers_in_Cloud_Datacenters.pdf - Accepted Version
Available under License Creative Commons Attribution.
Abstract
The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) cause their assigned tasks to become task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate with the manifestation of task stragglers. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. Using a production Cloud system as a case study, we demonstrate that node execution performance is driven by temporal changes in node operation rather than by node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that a node's ability to execute parallel tasks tends to follow a 3-parameter log-logistic distribution. Further statistical attributes such as the confidence interval, quantile value, probability of extreme cases, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation for enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.
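As an illustration of how a fitted 3-parameter log-logistic model could feed a quantile-based ranking, the following is a minimal sketch and not the authors' implementation. It assumes that each node's performance is summarized by a normalized task slowdown (higher means slower), and that the scale, shape and location parameters have already been estimated elsewhere. The class name `LogLogistic3`, the function `rank_nodes`, the node names and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogLogistic3:
    """3-parameter log-logistic distribution (hypothetical helper class)."""
    alpha: float  # scale, > 0
    beta: float   # shape, > 0
    gamma: float  # location (lower bound of the support)

    def cdf(self, x: float) -> float:
        # F(x) = 1 / (1 + ((x - gamma) / alpha) ** (-beta)) for x > gamma
        if x <= self.gamma:
            return 0.0
        z = (x - self.gamma) / self.alpha
        return 1.0 / (1.0 + z ** (-self.beta))

    def quantile(self, p: float) -> float:
        # Inverse CDF: Q(p) = gamma + alpha * (p / (1 - p)) ** (1 / beta)
        if not 0.0 < p < 1.0:
            raise ValueError("p must lie strictly between 0 and 1")
        return self.gamma + self.alpha * (p / (1.0 - p)) ** (1.0 / self.beta)

    def tail_probability(self, threshold: float) -> float:
        # Probability of an extreme case: slowdown above the threshold
        return 1.0 - self.cdf(threshold)


def rank_nodes(models: dict, p: float = 0.95) -> list:
    """Return node names ordered from slowest to fastest by their p-quantile."""
    return sorted(models, key=lambda name: models[name].quantile(p), reverse=True)


# Made-up parameters for three illustrative nodes (assumed, not from the paper).
NODES = {
    "node-a": LogLogistic3(alpha=1.0, beta=4.0, gamma=0.0),
    "node-b": LogLogistic3(alpha=1.5, beta=4.0, gamma=0.2),
    "node-c": LogLogistic3(alpha=1.0, beta=2.0, gamma=0.0),
}

if __name__ == "__main__":
    for name in rank_nodes(NODES):
        model = NODES[name]
        print(name, round(model.quantile(0.95), 3), round(model.tail_probability(3.0), 4))
```

In this sketch a node with a heavier tail (smaller shape parameter) ranks as slower even when its median matches that of other nodes, which is one reason a high quantile or a tail probability may be more informative for spotting stragglers than the median alone.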