Many of the ideas in my current research originated in the Recovery-Oriented Computing project. One avenue we explored in that project was the construction of software building blocks in which common operations, such as failure recovery, scaling up/down, or reprovisioning, can be achieved by rebooting a machine (or its dual, adding a new machine and killing the faulty one). The policy for deciding when to apply these operations relies on statistical machine learning (SML) techniques that automatically identify and react to problems that would take too long for a human operator to diagnose manually. The ideal of 99.999% service availability corresponds to just 5 minutes of service downtime per year, which cannot be achieved if humans must participate in every operational decision [1].

Two main themes of my previous work on Recovery-Oriented Computing (ROC) have influenced the design of recent commercial and research systems. The first is the design stance of crash-only software [2]: since robust software must survive unexpected crashes anyway, the crash recovery code should be the only recovery code, and any non-crash problem (slowdown, anomalous behavior, etc.) observed during operation should be immediately coerced to a crash failure. This is a radical design simplification that allows designers to focus on optimizing the performance of the one and only recovery path. The second theme is exploiting this fast recovery by applying SML problem detection techniques that, while more sensitive than non-SML state-of-the-art methods, have nontrivial false positive rates. The observation is that because recovery is so cheap, overall availability may still improve from using SML, despite the false positives.

Engineers and researchers at Amazon, Oracle, eBay, Microsoft and Google have told us they were strongly influenced by the demonstration of these techniques, and Aster Data Systems (founded 2005) is designing its parallel clustered database as crash-only from the ground up.
Hewlett-Packard is already putting some of the SML problem detection and diagnosis technology into its system monitoring products. The combination of SML for analyzing log data and visualization to draw the human operator’s attention to interesting patterns in the data was demonstrated on real failure log data from and remains an area of active research.

Recovery-Oriented Computing: Cheap, Simple Recovery Meets Statistical Machine Learning

Recovery-Oriented Computing (ROC) [1] observes that a service whose mean time to failure is MTTF and whose mean time to recovery from failure is MTTR experiences an availability A = MTTF / (MTTF + MTTR), with A approaching 1 (i.e., MTTF >> MTTR) corresponding to the ideal of zero downtime. Because A depends only on the ratio MTTR/MTTF, reducing MTTR is just as effective as increasing MTTF to improve availability. Yet reducing MTTR is an under-explored research approach, despite being more consistent with the practical experience that failures and bugs will continue to be a “fact of life” rather than a problem that can be completely eliminated.
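
As a quick check of the availability formula above, the following minimal Python sketch computes A and the implied downtime per year. The function names and the MTTF/MTTR figures are illustrative assumptions, not measurements from the ROC project.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)


def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(a: float) -> float:
    """Expected minutes of downtime per year at availability a."""
    return (1.0 - a) * MINUTES_PER_YEAR


# Hypothetical baseline: one failure per 1000 hours, 1 hour to recover.
baseline = availability(1000.0, 1.0)

# Two ways to improve it by the same factor of ten:
longer_mttf = availability(10_000.0, 1.0)   # fail ten times less often
shorter_mttr = availability(1000.0, 0.1)    # recover ten times faster

# Both give the same availability, since A depends only on MTTR / MTTF.
same = math.isclose(longer_mttf, shorter_mttr)

print(f"baseline downtime: {downtime_minutes_per_year(baseline):.1f} min/year")
print(f"improved downtime: {downtime_minutes_per_year(shorter_mttr):.1f} min/year")
print(f"five nines allows: {downtime_minutes_per_year(0.99999):.2f} min/year")
```

The two improved configurations are interchangeable from an availability standpoint, which is the sense in which reducing MTTR is "just as effective" as increasing MTTF.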

ROC for Application Servers: Crash-Only Software & Microreboots

The ROC lesson was that a sufficient reduction in recovery time enabled the use of SML for problem determination in novel ways, detecting problems that do not generally lead to “hard failures” and are therefore often missed by traditional techniques. To demonstrate the potential of SML, we applied path-based analysis, a family of techniques from the natural language processing literature, to the detection of failures in Java enterprise (J2EE) applications. The technique required no changes to the application’s source code and no other knowledge of the application. Path-based analysis was found to be 1.5x to 4x more sensitive than existing techniques [3][4], but it exhibited false positive rates of up to 20%. However, we had equipped our J2EE application server with our microreboot capability [5], which allows restarting only certain parts of a failing Web application rather than the entire application, reducing recovery time by 1–2 orders of magnitude for many common transient failures. With recovery so inexpensive, the overall availability of the application in this scenario improved by 53%. Path-based analysis detected, and microrebooting recovered from, problems that other techniques would have missed; because recovery was extremely fast, that benefit outweighed the cost of the false positives.
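
The trade-off described above can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the evaluation methodology of [3][4][5]: every number below is hypothetical, the "false alarm fraction" is assumed to mean the fraction of alarms that are false, and detection delay is ignored.

```python
def expected_downtime(n_failures: float, detect_frac: float,
                      false_alarm_frac: float, recovery_s: float,
                      undetected_s: float) -> float:
    """Toy model of total downtime (seconds) over some period.

    detect_frac      fraction of real failures the detector catches
    false_alarm_frac fraction of all alarms that are false (assumption)
    recovery_s       cost of one recovery action, true or false alarm
    undetected_s     downtime caused by each failure the detector misses
    """
    detected = n_failures * detect_frac
    missed = n_failures - detected
    # If a fraction f of alarms is false, false = true * f / (1 - f).
    false_alarms = detected * false_alarm_frac / (1.0 - false_alarm_frac)
    return (detected + false_alarms) * recovery_s + missed * undetected_s


# Hypothetical detectors: a conservative one with no false alarms, and a
# sensitive one that catches twice as much but raises 20% false alarms.
CONSERVATIVE = dict(detect_frac=0.4, false_alarm_frac=0.0)
SENSITIVE = dict(detect_frac=0.8, false_alarm_frac=0.2)

# Hypothetical costs: full restart 60 s, microreboot 1 s, missed failure 80 s.
results = {}
for recovery_name, recovery_s in [("full restart", 60.0), ("microreboot", 1.0)]:
    for det_name, det in [("conservative", CONSERVATIVE), ("sensitive", SENSITIVE)]:
        results[(recovery_name, det_name)] = expected_downtime(
            n_failures=100, recovery_s=recovery_s, undetected_s=80.0, **det)

for key, seconds in results.items():
    print(key, f"{seconds:.0f} s")
```

With these made-up numbers the sensitive detector is the worse choice when every alarm costs a full restart, and the better choice once recovery is cheap enough that false alarms cost almost nothing.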

ROC for Storage: Crash-Only Storage Systems

After demonstrating the success of combining microrebooting with machine learning for stateless application servers, we next showed that the approach is feasible for persistent storage systems by building two special-purpose prototypes for storing Web application data [6][7]. By using quorums and relaxing consistency, we were able to design these systems to tolerate crashes of any machine at any time with no data loss and minimal performance loss, and with all provisioning and maintenance operations recast as rebooting or adding/subtracting machines. We concluded that if recovery is sufficiently cheap, it leads to a qualitative change in thinking from “normal-mode vs. recovery-mode” to “always adapting, always recovering”. In other words, while “reduce recovery time to improve availability” and “build systems to be reboot-safe” may amount to codification of sound design practices, their combination has been instrumental in bringing SML techniques to bear on systems operational problems.
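
To make the quorum idea concrete, here is a generic in-memory sketch of quorum reads and writes that keeps working when any single replica is down. It is not the design of the prototypes in [6][7]: the class names, the N=3/W=2/R=2 configuration, and the version counter are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    alive: bool = True
    data: dict = field(default_factory=dict)  # key -> (version, value)


class QuorumStore:
    """Toy replicated store with n replicas, write quorum w, read quorum r."""

    def __init__(self, n: int = 3, w: int = 2, r: int = 2):
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0  # hypothetical global version counter

    def _live(self):
        return [rep for rep in self.replicas if rep.alive]

    def write(self, key, value):
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum unavailable")
        self.version += 1
        for rep in live:  # crashed replicas simply miss this write
            rep.data[key] = (self.version, value)

    def read(self, key):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        # Simplification: ask every live replica; the newest version wins,
        # so a replica that missed writes while down cannot mask them.
        answers = [rep.data.get(key, (0, None)) for rep in live]
        return max(answers, key=lambda answer: answer[0])[1]


store = QuorumStore()
store.write("cart", "v1")
store.replicas[0].alive = False   # crash one machine
store.write("cart", "v2")         # still succeeds with 2 of 3 replicas
store.replicas[0].alive = True    # "reboot": comes back holding stale v1
store.replicas[1].alive = False   # crash a different machine
latest = store.read("cart")       # stale replica 0 is outvoted by replica 2
print(latest)
```

Because W + R > N here, any read quorum overlaps the most recent write quorum, which is why crashing or rebooting a single replica at any point does not lose the latest value in this sketch.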

Problem Diagnosis as SML-Assisted Information Retrieval

While path-based analysis was a first step in applying SML to systems problems, we next pushed the state of the art by attempting to reduce problem diagnosis to information retrieval [8][9]. Our idea was to identify those specific measurable aspects of a running system that were highly correlated with undesirable system behavior, such as violation of its service-level performance agreement, over short time intervals. By capturing the most important measurements as a “signature” of 3 to 8 low-level metrics within each window, we could maintain a database of “signatures” of known problems and essentially model the system as going through a sequence of operational states captured by their respective signatures. When a new problem occurred, we would compute its signature and use classic information retrieval techniques and metrics to compare it to the signatures of known problems in our database. When tested on a real workload containing partially-labeled and some incorrectly-labeled training data, we found one real problem missed by weeks of human diagnosis, and corrected an expert’s misdiagnosis of another problem in a similar system.

In both previous and ongoing work, we consistently find that straightforward, naive approaches to problem detection give poor results, motivating the investigation of more sophisticated SML techniques and algorithms. In the signatures work we demonstrated the need to use tree-augmented Bayes networks, which are more sophisticated representations of conditional distributions than Naive Bayes; we also found that a single simple model failed to capture the relationship between individual system performance metrics and overall SLA compliance/violation. Based on this experience, we suggested [10] several areas of future research in applying SML to systems that we believe pose fundamental problems, including management of models’ lifecycles, model and process stationarity, thresholding/scoring/distance functions, dealing with false positives, and the challenges of combining supervised with unsupervised learning.
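
The signature-retrieval idea in this section can be sketched as a nearest-neighbor lookup. This is only an illustration: the metric names, weights, and problem labels are invented, and cosine similarity stands in as one classic information retrieval measure; the construction of signatures in [8][9] (via tree-augmented Bayes networks) is not reproduced here.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse signatures (metric name -> weight)."""
    dot = sum(weight * b.get(metric, 0.0) for metric, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: dict, database: list, k: int = 1) -> list:
    """Return the k (label, signature) entries most similar to the query."""
    ranked = sorted(database, key=lambda entry: cosine(query, entry[1]),
                    reverse=True)
    return ranked[:k]


# Hypothetical database of previously diagnosed problems. Each signature
# holds a handful of low-level metrics, as in the 3-to-8 range noted above.
known_problems = [
    ("db connection pool exhausted",
     {"db_wait_ms": 1.0, "app_threads_busy": 0.8, "cpu_user": -0.2}),
    ("disk contention",
     {"disk_queue_len": 1.0, "io_wait": 0.9, "cpu_user": -0.5}),
    ("network saturation",
     {"net_tx": 1.0, "net_retransmits": 0.7, "app_latency_ms": 0.4}),
]

# Signature computed for a newly observed SLA violation (also hypothetical).
new_signature = {"db_wait_ms": 0.9, "app_threads_busy": 0.7, "net_tx": 0.1}

best_label, best_signature = retrieve(new_signature, known_problems)[0]
print(best_label)
```

A ranked list (k > 1) rather than a single best match is the more natural output when labels may be partial or wrong, since it lets an operator inspect several candidate diagnoses.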


The approach enabled by ROC could therefore be summarized as:

  1. Make recovery fast.
  2. Recast recovery and other operational tasks in terms of a small repertoire of simple operations, such as “reboot”, “microreboot”, or “add or remove a machine”. These should incur minimal performance cost, so as to tolerate false positives.
  3. Apply SML to identify problems, predict performance, etc., in support of operational decisions that can be automated, even if the SML algorithms are imperfect (e.g., produce false positives).
  4. For non-automatable operational tasks or problem resolution, combine SML and visualization to build better tools to support human operators’ data exploration (as we did in [11]).

Links to references

  1. G. Candea, A. Brown, A. Fox, D. Patterson. Building multi-tier dependability. IEEE Computer 37(11), Nov. 2004
  2. G. Candea, A. Fox. Crash-Only Software. Proc. 9th Workshop on Hot Topics in Operating Systems (HotOS IX), Lihue, HI, May 2003
  3. E. Kiciman, A. Fox. Detecting and localizing application-level failures in Internet services. IEEE Transactions on Neural Networks, Spring 2005
  4. M. Y. Chen, A. Accardi, E. Kiciman, A. Fox, D. Patterson, E. Brewer. Path-Based Failure and Evolution Management. Proc. 1st USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI 2004), San Francisco, CA, March 29-31, 2004
  5. G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, A. Fox. Microreboot: A Technique for Cheap Recovery. Proc. Fifth Intl. Conference on Operating Systems Design and Implementation (OSDI ’04), San Francisco, CA, December 2004
  6. A. Huang, A. Fox. Cheap Recovery: A Key to Self-Managing State. ACM Trans. on Storage 1(1), 2004
  7. B. Ling, E. Kiciman, A. Fox. Session State: Beyond Soft State. Proc. 1st USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI 2004), San Francisco, CA, March 29-31, 2004
  8. S. Zhang, I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, A. Fox. Capturing, Indexing, Clustering, and Retrieving System History. Proc. 20th Usenix/ACM Symposium on Operating Systems Principles (SOSP ’05), Brighton, UK, October 2005
  9. S. Zhang, I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, A. Fox. Ensembles of models for automated diagnosis of system performance problems. Proc. Intl. Conference on Dependable Systems and Networks (DSN 2005), Yokohama, Japan, June 2005
  10. M. Goldszmidt, I. Cohen, S. Zhang, A. Fox. Three challenges at the intersection of machine learning, statistical induction, and systems. Proc. 10th Workshop on Hot Topics in Operating Systems (HotOS-X), Santa Fe, NM, June 2005
  11. P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, A. Fox, D. Patterson, M. Jordan. Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization. Proc. Second Intl. Conference on Autonomic Computing (ICAC 2005), Seattle, WA, June 2005