Garraghan, Peter and Perks, Stuart and Ouyang, Xue and McKee, David and Moreno, Ismael Solis (2016) Tolerating transient late-timing faults in cloud-based real-time stream processing. In: 2016 IEEE 19th International Symposium on Real-Time Distributed Computing (ISORC). IEEE, pp. 108-115. ISBN 9781467390323
Submitted_ISORC_Real_time_Stream_Processing.pdf - Accepted Version
Available under License Creative Commons Attribution.
Abstract
Real-time stream processing is frequently deployed within Cloud datacenters and is required to deliver high levels of performance and reliability. Numerous fault-tolerant approaches have been proposed to achieve this objective in the presence of crash failures. However, such systems struggle with transient late-timing faults, a class of fault that is challenging to tolerate effectively and that manifests increasingly within large-scale distributed systems. Such faults pose a significant threat to minimizing the soft real-time execution time of streaming applications in the presence of failures. This work proposes a fault-tolerant approach that uses QoS-aware data prediction to tolerate transient late-timing faults. At run-time, the approach determines the most effective data prediction algorithm for the QoS constraints imposed on a failed stream processor. We integrated our approach into Apache Storm; experimental results show that it reduces stream processor end-to-end execution time by 61% compared to other fault-tolerant approaches. The approach incurs 12% additional CPU utilization while reducing network usage by 44%.
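The idea of choosing a data prediction algorithm under QoS constraints at run-time can be illustrated with a minimal sketch. This is not the paper's implementation and does not use the Apache Storm API. The predictors, the `cost_ms` figures, the `budget_ms` parameter, and the back-testing error metric are all hypothetical assumptions chosen for illustration. The sketch picks, among the predictors whose assumed cost fits the remaining time budget, the one with the lowest error on recent history, and uses it to stand in for a late value.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Predictor:
    """A candidate prediction algorithm (hypothetical, for illustration)."""
    name: str
    cost_ms: float  # assumed time needed to produce one prediction
    predict: Callable[[Sequence[float]], float]


def last_value(history: Sequence[float]) -> float:
    # Repeat the most recent observation.
    return history[-1]


def moving_average(history: Sequence[float], window: int = 3) -> float:
    # Mean of the last `window` observations.
    tail = history[-window:]
    return sum(tail) / len(tail)


def linear_trend(history: Sequence[float]) -> float:
    # Extrapolate the last observed step; fall back to last value.
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


def backtest_error(p: Predictor, history: Sequence[float], warmup: int = 3) -> float:
    """Mean absolute error of one-step-ahead predictions over the history."""
    errors = [
        abs(p.predict(history[:i]) - history[i])
        for i in range(warmup, len(history))
    ]
    # Without enough history there is no evidence; treat error as unknown.
    return sum(errors) / len(errors) if errors else float("inf")


def select_predictor(
    predictors: Sequence[Predictor],
    history: Sequence[float],
    budget_ms: float,
) -> Optional[Predictor]:
    """Most accurate predictor whose assumed cost fits the time budget."""
    feasible = [p for p in predictors if p.cost_ms <= budget_ms]
    if not feasible:
        return None  # no predictor can meet the constraint
    return min(feasible, key=lambda p: backtest_error(p, history))


# Illustrative candidates; the costs are made-up numbers, not measurements.
CANDIDATES = [
    Predictor("last_value", 0.1, last_value),
    Predictor("moving_average", 0.5, moving_average),
    Predictor("linear_trend", 2.0, linear_trend),
]

if __name__ == "__main__":
    recent = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # values seen before the late tuple
    chosen = select_predictor(CANDIDATES, recent, budget_ms=5.0)
    if chosen is not None:
        print(chosen.name, chosen.predict(recent))
```

With a generous budget the sketch favours the most accurate candidate on the sample history; tightening `budget_ms` forces a cheaper, less accurate one. That is the accuracy-versus-timeliness trade-off a QoS-aware selection has to make.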