Armando Fox
July 2010
The RAD Lab (Reliable Adaptive Distributed Systems Laboratory) was launched in January 2006 as a five-year effort with the mission statement: ``Enable one person to develop, deploy and operate a next-generation Internet application at scale, without first having to build a company on the scale of eBay or Google to do so.'' Given the scale of contemporary Internet services (thousands of machines, tens of millions of users) and the timescales on which management actions would be required to maintain the high responsiveness users have come to expect on a 24x7 basis, a key element of the RAD Lab vision is the use of Statistical Machine Learning (SML) techniques to facilitate all aspects of developing, deploying and operating such applications. The interdisciplinary faculty represented expertise in systems, networks, and machine learning. Most of my PhD students were advised jointly with Dave Patterson, Mike Jordan or Mike Franklin, and much of the work was done in close collaboration with one or more industrial partners of the RAD Lab.
The five-year meta-contributions made by this subgroup of the lab can be summarized as:
The following project highlights capture specific research contributions.
Automatic Workload Evaluation focused on the use of a less-than-decade-old SML technique, kernel canonical correlation analysis (KCCA) [3], as a different approach to problems involving performance prediction. The insight is that both offered workload and measured performance can be captured as multidimensional vectors; for example, database queries can be characterized by a vector capturing the number and types of operations in a query plan, and performance of a query by a multidimensional vector whose elements include running time, number of interprocessor messages on parallel hardware, number of disk I/Os, etc. A similarity kernel function can then be defined over each dataset; this is necessary because simple geometric differencing of the vectors may fail to capture information the system designer believes is important in measuring the similarity between, say, the structure of given pair of queries. It then remains to find a basis of correlation between these two sets of differences, which the KCCA algorithm does. We applied this method to predict the running time and resource consumption of long-running database queries [9], achieving accuracies beyond those of the built-in query planner. This success led us to apply a similar technique to identify the most promising autotuning parameters for compiling code on multicore processors [8] and for predicting the performance of large Hadoop (map/reduce) cluster jobs based on the sampled performance of a smaller problem [7].
Core parts of the AWE work were done in conjunction with HP Labs.
Free-text console logs (fprintf(STDERR,...) or console.write(...)) are ubiquitous in large software systems, yet they are not machine-friendly because of their free-text format and they are not human-friendly because the volume of log messages is so large that finding ``interesting'' patterns manually is impossible. A better approach than ad-hoc shell scripts is Console Log Mining: combining techniques from SML (data mining, anomaly detection, principal component analysis), information retrieval (bag of words analysis), and static program analysis (basic type inference) to distill millions of lines of free-text unstructured logs to concise operator-friendly visualizations of operational problems or interesting conditions.
The key idea behind the work is to look not for specific events, but sequences of events that are unusual. Principal component analysis (PCA) is used for this step. However, PCA requires features to operate on, and in programs, the features of interest tend to be entitites such as disk block numbers, file descriptor values, integers or strings indicating progress through a state machine (file_opened, file_changed, file_closed, ...), and so on, and in console logs, these identifiers are buried in free-text log messages in unsystematic ways. To recover the underlying ``schema'' of the logs and successfully extract these features, source code parsing is used to infer all messages that could appear in the log, and type inference and abstract syntax tree traversal on the log printing statements is used to determine which parts of the message correspond to identifiers that could become a feature. This comprehensive parsing technique finds identifiers that are missed by existing analysis methods and without which specific problems would not be revealed by the subsequent PCA analysis. The final step is to present a digest of the results in an operator-friendly decision tree indicating the correlations between the presence or absence of certain sets of messages and the existence of an ``unusual'' condition.
We developed both an offline method [11] and later an online method based on Frequent Pattern Mining [12], validating all results on millions of lines of open-source applications (Hadoop, Darkstar) using cloud computing and finding both new bugs and existing inconsistencies between the log messages and actual operation that had been the subject of much confusion on the developer blogs for those applications.
The techniques were developed in collaboration with Intel Research, and have recently been applied to a production application at Google [13]. The work is also the subject of an invited presentation at ICML 2010.
Dealing with such spikes and hotspots requires sound yet tractable and easy-to-work-with models of such behavior. We have brought our machine learning and statistics expertise to bear in creating such models based on a number of real spikes observed in the wild [4]. Our seven-parameter model captures not only spike volume and steepness, but skewness of distribution to data accesses (i.e. it can model hotspots). In particular, we observed that contrary to widespread belief, hotspot distributions do follow a power law, but the Zipf distribution does not capture them accurately. One component of our spike model uses two techniques from SML--the Chinese Restaurant Process and the stick-breaking process--to more accurately capture the hotspot behavior of real observed spikes. We also provide a workload generator that can be used to generate workloads in accordance with our model, and validate the generator against the real spikes we observed.
A common tale of fast-growing Web sites, including eBay, Facebook and others, is the need to rebuild the storage layer of the site as the site grows. Relational databases have long been the staple technology for persistent data in Web services, but the performance opacity of declarative query languages such as SQL complicates scaling: it is very easy to write an unaffordably-expensive SQL query that should not be allowed in an interactive-response setting, because it is too slow or too resource-intensive. Yet the alternative is hardly better: directly coding database query plans in terms of key/value operations makes performance explicit, but undoes the decades of progress of declarative relational queries.
The SCADS system (Scalable Consistency-Adjustable Data Storage) [1] attempts to provide the best of both worlds. By using compile-time analysis on queries to be performed, automatically analyzing which indices are needed to perform updates, and using machine learning techniques to estimate the cost of performing the overall query based on the underlying query plan, SCADS aims to provide an elastic yet SLO-compliant solution for structured data storage that ``looks and feels'' much like traditional relational databases. The Performance-Insightful Query Language, PIQL [2], allows only performance-safe queries to be expressed: those queries whose cost per user will not increase as the number of users increases. SCADS relies on an underlying key/value storage layer that can maintain relatively stable performance for low-level operations such as Get and Put; while many are available, no existing ones are designed to take advantage of the elasticity of cloud computing by gracefully scaling down as well as up. To that end, we are also working on an elastic storage layer that allows the storage system to scale up and down while complying with strict service-level objectives [10]. The idea is to use machine learning to model the costs of scale-up and scale-down operations and use model-predictive control to drive a policy that decides when to do this. Early results suggest that a single set of mechanisms can be used to harness elasticity both to deal with spikes (generating workload spikes using the techniques described above) and to save money during regular diurnal variation in usage, all while maintaining SLO compliance at a level of two nines or better over typical time intervals (e.g. 99% of requests complete within a tight latency bound during any 10-minute interval). We expect to report on this work in Fall 2010.
Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We proposed and evaluated a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach [6] is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. This work was a major refinement of much earlier work on metric selection [14] and benefited from a fresh look at metric selection techniques for such scenarios [5], results that were already in use at Microsoft to decide what forensic telemetry to store permanently. Our evaluation uses 4 months of trouble-ticket data from a production datacenter at Microsoft with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
This work was performed in collaboration with Microsoft and Microsoft Research Silicon Valley.
This document was generated using the LaTeX2HTML translator Version 2008 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 RADLabRetro.tex
The translation was initiated by Armando Fox on 2010-08-04