Many of the ideas in my current research originated in the Recovery-Oriented Computing project. One avenue we explored in that project was the construction of software building blocks in which common operations, such as failure recovery, scaling up/down, or reprovisioning, can be achieved by rebooting a machine (or its dual, adding a new machine and killing the faulty one). The
Recovery-Oriented Computing: Cheap, Simple Recovery Meets Statistical Machine Learning
Recovery-Oriented Computing (ROC) observes that a service whose mean time to failure is MTTF and whose mean time to recovery from failure is MTTR experiences an availability A = MTTF / (MTTF + MTTR), with A approaching 1 (i.e., MTTF >> MTTR) corresponding to the ideal of zero downtime. Because A depends only on the ratio MTTR/MTTF, reducing MTTR is just as effective as increasing MTTF for improving availability; it is also an under-explored research approach, despite being more consistent with the practical experience that failures and bugs will continue to be a “fact of life” rather than a problem that can be completely eliminated.
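As a small worked illustration of the availability formula, the sketch below compares two ways of improving a service: making failures ten times rarer versus making recovery ten times faster. The function name and all of the numbers are hypothetical and chosen only for illustration; they are not measurements from the ROC project.

```python
def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


# Hypothetical baseline: one failure per 1000 hours, 10 hours to recover.
baseline = availability(mttf=1000.0, mttr=10.0)

# Option 1: make failures ten times rarer (raise MTTF).
longer_mttf = availability(mttf=10000.0, mttr=10.0)

# Option 2: make recovery ten times faster (lower MTTR).
shorter_mttr = availability(mttf=1000.0, mttr=1.0)

# Both options leave the same MTTR/MTTF ratio, so they yield
# the same availability (up to floating-point rounding).
print(f"baseline:     {baseline:.6f}")
print(f"longer MTTF:  {longer_mttf:.6f}")
print(f"shorter MTTR: {shorter_mttr:.6f}")
```

Since the formula can be rewritten as A = 1 / (1 + MTTR/MTTF), any improvement that changes the ratio by the same factor has the same effect on availability, whichever term it acts on.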
ROC for Application Servers: Crash-Only Software & Microreboots
The ROC lesson was that a sufficient reduction in recovery time enabled the use of SML for problem determination in novel ways, detecting problems that do not generally lead to “hard failures” and are therefore often missed by traditional techniques. To demonstrate the potential of SML, we applied path-based analysis, a family of techniques from the natural language processing literature, to the detection of failures in Java enterprise (J2EE) applications. The technique required no source code changes to or other knowledge of the application. Path-based analysis was found to be 1.5x to 4x more sensitive than existing techniques, but it exhibited false positive rates of up to 20%. However, we had equipped our J2EE application server with our
ROC for Storage: Crash-Only Storage Systems
After demonstrating the success of combining microrebooting with machine learning for stateless application servers, we next demonstrated its feasibility for persistent storage systems by building two special-purpose prototypes for storing Web application data. By using quorums and relaxing consistency, we were able to design these systems to tolerate crashes of any machine at any time with no data loss and minimal performance loss, and with all provisioning and maintenance operations recast as rebooting or adding/subtracting machines. We concluded that if recovery is sufficiently cheap, it leads to a qualitative change in thinking from “normal-mode vs. recovery-mode” to “always adapting, always recovering”. In other words, while “Reduce recovery time to improve availability” and “build systems to be reboot-safe” may amount to codification of sound design practices, their
Problem Diagnosis as SML-Assisted Information Retrieval
While path-based analysis was a first step in applying SML to systems problems, we next pushed the state of the art by attempting to reduce problem diagnosis to information retrieval. Our idea was to identify those specific measurable aspects of a running system that were highly correlated with undesirable system behavior, such as violation of its service-level performance agreement, over short time intervals. By capturing the most important measurements as a “signature” of 3 to 8 low-level metrics within each window, we could maintain a database of signatures of known problems and essentially model the system as going through a sequence of operational states captured by their respective signatures. When a new problem occurred, we would compute its signature and use classic information retrieval techniques and metrics to compare it to the signatures of known problems in our database. When tested on a real workload containing partially-labeled and some incorrectly-labeled training data, we found one real problem missed by weeks of human diagnosis, and corrected an expert’s misdiagnosis of another problem in a similar system.
In both previous and ongoing work, we consistently find that straightforward, naive approaches to problem detection give poor results, motivating the investigation of more sophisticated SML techniques and algorithms. In the signatures work we demonstrated the need to use tree-augmented Bayes networks, which are more sophisticated representations of conditional distributions than Naive Bayes; we also found that a single simple model failed to capture the relationship between individual system performance metrics and overall SLA compliance/violation.
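The retrieval step described above can be sketched as a nearest-neighbor lookup over signature vectors. This is a minimal illustration only: the metric names, the numeric encoding of a signature, the problem labels, and the choice of cosine similarity are all assumptions made for the example, not the representation or distance function used in the published work.

```python
import math

# Hypothetical metric order for every signature vector:
# [cpu_util, disk_queue, db_latency, net_retransmits]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, top_k=1):
    """Return the top_k (label, similarity) pairs most similar to the query signature."""
    scored = [(label, cosine_similarity(query, sig)) for label, sig in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical database of signatures for previously diagnosed problems.
known_problems = {
    "db_contention": [0.0, 1.0, 1.0, 0.0],
    "cpu_saturation": [1.0, 0.0, 0.0, 0.0],
    "network_loss": [0.0, 0.0, 0.0, 1.0],
}

# Signature computed for a newly observed problem (made-up values).
new_signature = [0.1, 0.9, 0.8, 0.0]

best_label, best_score = retrieve(new_signature, known_problems)[0]
print(best_label, round(best_score, 3))
```

In a real deployment the similarity threshold below which a problem counts as "previously unseen" would itself need tuning, which is one of the thresholding and scoring questions raised later in this section.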
Based on this experience, we suggested several areas of future research in applying SML to systems that we believe will be fundamental problems, including management of models’ lifecycles, model and process stationarity, thresholding/scoring/distance functions, dealing with false positives, and the challenges of combining supervised with unsupervised learning.
The approach enabled by ROC could therefore be summarized as:
- Make recovery fast.
- Recast recovery and other operational tasks in terms of a small repertoire of simple operations, such as “reboot”, “microreboot”, or “add or remove a machine”. These should incur minimal performance cost, so as to tolerate false positives.
- Apply SML to identify problems, predict performance, etc., in support of operational decisions that can be automated even if the SML algorithms are imperfect (e.g., they produce false positives).
- For non-automatable operational tasks or problem resolution, combine SML and visualization to build better tools to support human operators’ data exploration (as we did in ).
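To illustrate why cheap recovery makes imperfect detectors usable, the sketch below applies a simple expected-cost rule: act on an alarm when the expected cost of ignoring it exceeds the cost of the recovery action. The rule, the function name, and every number are hypothetical assumptions for illustration; they are not a policy taken from the ROC systems.

```python
def should_recover(p_real: float, recovery_cost: float, failure_cost: float) -> bool:
    """Decide whether to act on an alarm.

    p_real:        estimated probability that the alarm reflects a real problem
    recovery_cost: cost of performing the recovery action (e.g. seconds of lost work)
    failure_cost:  cost of leaving a real problem unhandled, in the same units
    """
    if not 0.0 <= p_real <= 1.0:
        raise ValueError("p_real must be a probability")
    # Acting always costs recovery_cost; ignoring costs failure_cost only
    # when the alarm is real, so its expected cost is p_real * failure_cost.
    return p_real * failure_cost > recovery_cost


# Hypothetical alarm from a noisy detector: only 30% likely to be real.
p_real = 0.3
failure_cost = 30.0  # made-up cost of an unhandled failure

# A cheap action (e.g. a microreboot) is worth taking despite the noise...
act_with_microreboot = should_recover(p_real, recovery_cost=1.0, failure_cost=failure_cost)

# ...whereas an expensive action (e.g. a full restart) is not.
act_with_full_restart = should_recover(p_real, recovery_cost=60.0, failure_cost=failure_cost)

print(act_with_microreboot, act_with_full_restart)
```

Under this rule, lowering the cost of the recovery action lowers the confidence a detector needs before automation is worthwhile, which is the sense in which fast recovery tolerates false positives.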
Links to references
- G. Candea, A. Brown, A. Fox, D. Patterson. Building multi-tier dependability. IEEE Computer 37(11), Nov. 2004
- G. Candea, A. Fox. Crash-Only Software. Proc. 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003
- E. Kiciman, A. Fox. Detecting and localizing application-level failures in Internet services. IEEE Transactions on Neural Networks, Spring 2005
- M. Y. Chen, A. Accardi, E. Kiciman, A. Fox, D. Patterson, E. Brewer. Path-Based Failure and Evolution Management. Proceedings of the 1st USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI 2004), San Francisco, CA, March 29-31, 2004
- G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, A. Fox. Microreboot: A Technique for Cheap Recovery. Proceedings of the Fifth Intl. Conference on Operating Systems Design and Implementation (OSDI ‘04), San Francisco, CA, December 2004
- A. Huang, A. Fox. Cheap Recovery: A Key to Self-Managing State. ACM Trans. on Storage 1(1), 2004
- B. Ling, E. Kiciman, A. Fox. Session State: Beyond Soft State. Proceedings of the 1st USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI 2004), San Francisco, CA, March 29-31, 2004
- S. Zhang, I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, A. Fox. Capturing, Indexing, Clustering, and Retrieving System History. Proc. 20th Usenix/ACM Symposium on Operating Systems Principles (SOSP ‘05), Brighton, UK, October 2005
- S. Zhang, I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, A. Fox. Ensembles of models for automated diagnosis of system performance problems. Proceedings of Intl. Conference on Dependable Systems and Networks (DSN 2005), Yokohama, Japan, June 2005
- M. Goldszmidt, I. Cohen, S. Zhang, A. Fox. Three challenges at the intersection of machine learning, statistical induction, and systems. Proc. 10th Workshop on Hot Topics in Operating Systems (HotOS-X), Santa Fe, NM, June 2005
- P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, A. Fox, D. Patterson, M. Jordan. Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization. Proc. Second Intl. Conference on Autonomic Computing (ICAC 2005), Seattle, WA, June 2005