Mining Console Logs to Help Debug Distributed Systems. Can statistical machine learning (SML) techniques applied to console debugging logs be combined with source code analysis to identify hard-to-reproduce bugs in distributed systems?
- PhD Students: Wei Xu
- Collaborators: Dr. Ling Huang, Intel Research Berkeley
Recent papers:
- Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan. Online System Problem Detection by Mining Patterns of Console Logs. Proc. ICDM 2009.
- Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan. Detecting Large-Scale System Problems by Mining Console Logs. Proc. SOSP 2009.
- Wei Xu, Ling Huang, Armando Fox, David Patterson, Michael Jordan. Mining Console Logs for Large-Scale System Problem Detection. Proc. SysML 2008.
As observed by Aaron Brown and others involved with our Recovery-Oriented Computing (ROC) project, in many systems the largest single contributor to downtime was human operator error or the lack of proper tools for diagnosing problems that could not be addressed automatically. To create better tools, we used SML to pre-analyze data and draw the operator’s attention to unusual patterns, and combined this with visualization techniques that exploit the built-in parallel processing of the human visual system. The combination helps operators quickly spot problems in large data sets and “grounds” their understanding of how the SML algorithms work, leading over time to increased trust in the automated algorithms.
Our ongoing work in this area combines text mining of applications’ console logs, analysis of the source code, and visualization to help spot rarely occurring patterns or events in the logs that might indicate a failure or provide useful forensic evidence for tracking down intermittent failures. For our initial efforts we are using real console logs from a Java-based production search engine. In one instance our prototype helped identify the cause of a bug that took weeks of manual debugging. In another instance text mining of the logs would have focused human attention on the subsystem containing the actual bug, whereas without this information the operator’s intuition had led him to focus on a different subsystem that turned out not to be faulty. We are in the process of applying this approach to other large-scale back-end services such as text search and extending the techniques to languages other than Java.
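To make the idea of spotting rarely occurring log patterns concrete, here is a minimal Python sketch. It is not the project's actual method. It substitutes a simple regular expression for source code analysis (to guess which parts of a message are variable) and a plain frequency count for the statistical models. The function names, the masking rule, the 1% threshold, and the sample log lines are all hypothetical and invented for illustration.

```python
import re
from collections import Counter

# Assumed masking rule (hypothetical): hex ids, dotted numbers such as IP
# addresses, and plain integers are treated as variable fields.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+(?:\.\d+)*")


def to_template(line: str) -> str:
    """Collapse variable fields so lines from the same print statement match."""
    return _VARIABLE.sub("<*>", line.strip())


def rare_templates(lines, max_fraction=0.01):
    """Return (template, count) pairs whose share of all lines is at most max_fraction."""
    counts = Counter(to_template(line) for line in lines if line.strip())
    total = sum(counts.values())
    if total == 0:
        return []
    rare = [(t, c) for t, c in counts.items() if c / total <= max_fraction]
    # Rarest first; ties broken alphabetically for a stable order.
    return sorted(rare, key=lambda tc: (tc[1], tc[0]))


# Invented sample log: two common message types and one rare one.
sample_log = (
    [f"Received block blk_{i} of size {1024 + i} from 10.0.0.{i % 5}" for i in range(200)]
    + [f"Served query {i} in {i % 7} ms" for i in range(200)]
    + ["Exception while writing block blk_17 to mirror 10.0.0.3"]
)

for template, count in rare_templates(sample_log):
    print(count, template)
```

In this sketch only the single exception line falls under the threshold, so it is the one message type surfaced for an operator to inspect. A real deployment would need a more reliable way to separate constant text from variable fields, which is where analysis of the source code comes in.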