Su, Ya and Zhao, Youjian and Xia, Wentao and Liu, Rong and Bu, Jiahao and Zhu, Jing and Cao, Yuanpu and Li, Haibin and Niu, Chenhao and Zhang, Yiyin and Wang, Zhaogang and Pei, Dan (2020) CoFlux: Robustly Correlating KPIs by Fluctuations for Service Troubleshooting. In: 2019 IEEE/ACM 27th International Symposium on Quality of Service (IWQoS). IEEE.
Abstract
Internet-based service companies monitor a large number of KPIs (Key Performance Indicators) to ensure their service quality and reliability. Correlating KPIs by their fluctuations reveals how KPIs interact under anomalous conditions and can be extremely useful for service troubleshooting. However, such KPI flux-correlation has so far received little study in the domain of Internet service operations management. A major challenge is how to automatically and accurately separate fluctuations from normal variations across a large number of KPIs with different structural characteristics (such as seasonal, trend and stationary). In this paper, we propose CoFlux, an unsupervised approach that automatically (without manual algorithm selection or parameter tuning) determines whether two KPIs are correlated by fluctuations, in what temporal order they fluctuate, and whether they fluctuate in the same direction. CoFlux's robust feature engineering and robust correlation score computation enable it to work well across diverse KPI characteristics. Our extensive experiments demonstrate that, in answering these three questions, CoFlux achieves the best F1-scores of 0.84 (0.90), 0.92 (0.95) and 0.95 (0.99), respectively, on two real datasets from a top global Internet company, with the values in parentheses corresponding to the second dataset. Moreover, we show that CoFlux is effective in assisting service troubleshooting through three applications: alert compression, recommending the top-N causes, and constructing fluctuation propagation chains.
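The abstract does not spell out CoFlux's algorithm, but the three questions it answers map naturally onto a normalized cross-correlation between two fluctuation series: whether the peak correlation magnitude is large, at which lag the peak occurs (temporal order), and the sign of the peak (direction). The following is a minimal, generic sketch under stated assumptions, not CoFlux itself. Fluctuations are approximated as residuals from a trailing moving average, a much simpler stand-in for CoFlux's robust feature engineering. The function names (`fluctuations`, `cross_correlation`, `correlate_by_fluctuation`) and the `window`, `max_lag` and `threshold` values are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def fluctuations(series, window=5):
    """Residual of each point from the mean of the preceding `window` points.
    Hypothetical stand-in for CoFlux's robust feature engineering."""
    return [series[i] - mean(series[i - window:i]) for i in range(window, len(series))]


def zscore(xs):
    """Standardize to zero mean and unit (population) std; constant input maps to zeros."""
    mu, sd = mean(xs), pstdev(xs)
    if sd == 0:
        return [0.0] * len(xs)
    return [(x - mu) / sd for x in xs]


def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation for lags in [-max_lag, max_lag].
    A positive lag pairs x[i] with y[i + lag], i.e. y fluctuates after x.
    Dividing by n keeps every value within [-1, 1]."""
    x, y = zscore(x), zscore(y)
    n = len(x)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            pairs = zip(x[:n - lag], y[lag:])
        else:
            pairs = zip(x[-lag:], y[:n + lag])
        result[lag] = sum(a * b for a, b in pairs) / n
    return result


def correlate_by_fluctuation(kpi_a, kpi_b, window=5, max_lag=3, threshold=0.5):
    """Answer the three questions for two equal-length KPI series (hypothetical API)."""
    if len(kpi_a) != len(kpi_b) or len(kpi_a) <= window + max_lag:
        raise ValueError("KPIs must have equal length, longer than window + max_lag")
    cc = cross_correlation(fluctuations(kpi_a, window), fluctuations(kpi_b, window), max_lag)
    best_lag = max(cc, key=lambda k: abs(cc[k]))  # lag with the strongest correlation
    score = cc[best_lag]
    return {
        "correlated": abs(score) >= threshold,  # are they correlated by fluctuations?
        "lag": best_lag,                        # > 0 means kpi_b fluctuates after kpi_a
        "same_direction": score > 0,            # do they move the same way?
        "score": score,
    }
```

For example, if one KPI spikes at step 15 and another dips at step 17, the sketch should report a strong negative correlation at lag 2. A fixed threshold and a single moving-average baseline are exactly the kind of hand-tuned choices the paper says CoFlux avoids, so this only illustrates the shape of the output, not the method's robustness.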