Meng, Yuan and Zhang, Shenglin and Sun, Yongqian and Zhang, Ruru and Hu, Zhilong and Zhang, Yiyin and Jia, Chenyang and Wang, Zhaogang and Pei, Dan (2020) Localizing Failure Root Causes in a Microservice through Causality Inference. In: 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS). IEEE, pp. 1-10. ISBN 9781728168883
Abstract
An increasing number of Internet applications are adopting the microservice architecture because of its flexibility and clear logic. The stability of microservices is thus vitally important for these applications' quality of service. Accurate failure root cause localization can help operators quickly recover from microservice failures and mitigate losses. Although cross-microservice failure root cause localization has been well studied, how to localize failure root causes within a single microservice, so that its failures can be quickly mitigated, has not yet been studied. In this work, we propose a framework, MicroCause, to accurately localize the root cause monitoring indicators in a microservice. MicroCause combines a simple yet effective path condition time series (PCTS) algorithm, which accurately captures the sequential relationship of time series data, with a novel temporal cause oriented random walk (TCORW) method that integrates the causal relationship, temporal order, and priority information of monitoring data. We evaluate MicroCause on 86 real-world failure tickets collected from a top-tier global online shopping service. Our experiments show that the top-5 accuracy (AC@5) of MicroCause for intra-microservice failure root cause localization is 98.7%, which is substantially higher (by 33.4%) than that of the best baseline method.
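The AC@5 figure reported above is a top-k accuracy. As a rough illustration only, the following Python sketch computes a simplified top-k accuracy under the assumption that each failure case has exactly one labeled root cause indicator; the paper's exact AC@k definition may differ (for example, in how it handles several root causes per case). The function name `accuracy_at_k` and the indicator names in the usage note are hypothetical and do not come from the paper.

```python
from typing import Hashable, Sequence


def accuracy_at_k(
    rankings: Sequence[Sequence[Hashable]],
    true_causes: Sequence[Hashable],
    k: int = 5,
) -> float:
    """Fraction of failure cases whose labeled root cause is in the top k.

    Simplifying assumption (not taken from the paper): every failure case
    has exactly one labeled root cause indicator.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(rankings) != len(true_causes):
        raise ValueError("need one labeled root cause per ranking")
    if not rankings:
        raise ValueError("need at least one failure case")

    # A case counts as a hit when its labeled cause is among the first k
    # indicators of the ranked list produced by the localization method.
    hits = sum(
        1
        for ranked, truth in zip(rankings, true_causes)
        if truth in list(ranked)[:k]
    )
    return hits / len(rankings)
```

For instance, given three hypothetical cases with rankings `[["cpu", "mem", "io"], ["io", "net", "cpu"], ["mem", "cpu", "net"]]` and labeled causes `["mem", "cpu", "disk"]`, calling `accuracy_at_k(..., k=2)` counts only the first case as a hit, while `k=3` also counts the second.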