Automatic Workload Evaluation (AWE). AWE uses modern statistical machine learning (SML) techniques to understand and characterize complex workloads and their performance on distributed systems.

Our goal is to simultaneously predict several aspects of a system's performance under a previously unseen workload. We use Kernel Canonical Correlation Analysis (KCCA) to predict message counts, running time, and disk operations for a database business-intelligence workload, after showing that simpler prediction techniques give poor results. Given two data spaces (in this case, the space of database query features and the space of measured performance characteristics of each query), KCCA finds maximally correlated subspaces of fixed dimension embedded in those spaces. We use these subspaces to predict the performance of previously unseen queries via interpolation. Our approach achieves predictions within 20% of measured values more than 80% of the time on a real customer workload, even in cases where the database’s built-in query optimizer gives poor estimates.

We’re now working on applying this approach to predict the performance of Hadoop (i.e. MapReduce-style) batch jobs and the performance of automatically tuned scientific codes on multicore parallel processors.
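The KCCA-plus-interpolation idea described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in the papers: the Gaussian kernel, the regularization value, the kernel scales, the use of k-nearest-neighbor averaging as the interpolation step, and all function and variable names are assumptions made for illustration, and the data is synthetic.

```python
import numpy as np

def gaussian_kernel(A, B, scale):
    # Pairwise Gaussian similarities between rows of A and rows of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / scale)

def center_train(K):
    # Center the training kernel matrix in feature space.
    return K - K.mean(0, keepdims=True) - K.mean(1, keepdims=True) + K.mean()

def center_test(Kt, K_train_raw):
    # Center a test-vs-train kernel consistently with the training kernel.
    return (Kt - Kt.mean(1, keepdims=True)
            - K_train_raw.mean(0, keepdims=True) + K_train_raw.mean())

def fit_kcca(X, Y, dims=2, reg=1e-2, scale_x=1.0, scale_y=1.0):
    """Fit regularized KCCA between feature rows X and performance rows Y."""
    n = len(X)
    Kx_raw = gaussian_kernel(X, X, scale_x)
    Kx = center_train(Kx_raw)
    Ky = center_train(gaussian_kernel(Y, Y, scale_y))
    Rx = Kx + reg * np.eye(n)  # regularized kernels (assumed form)
    Ry = Ky + reg * np.eye(n)
    # M = Rx^-1 Kx Ky Ry^-1; its singular vectors give the paired directions.
    left = np.linalg.solve(Rx, Kx @ Ky)
    M = np.linalg.solve(Ry, left.T).T
    U, S, _ = np.linalg.svd(M)
    alpha = np.linalg.solve(Rx, U[:, :dims])  # dual weights for the X side
    return {
        "X": X, "Y": Y, "Kx_raw": Kx_raw, "alpha": alpha,
        "train_proj": Kx @ alpha, "correlations": S[:dims],
        "scale_x": scale_x,
    }

def predict(model, X_new, k=3):
    """Project new feature rows, then average the measured performance
    of the k nearest training points in the projected space."""
    Kt = gaussian_kernel(X_new, model["X"], model["scale_x"])
    Kt = center_test(Kt, model["Kx_raw"])
    proj = Kt @ model["alpha"]
    dist = ((proj[:, None, :] - model["train_proj"][None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return model["Y"][nearest].mean(axis=1)

# Synthetic stand-in for (query features, measured performance) pairs.
rng = np.random.default_rng(0)
features = rng.normal(size=(80, 4))
performance = np.column_stack([
    features[:, 0] ** 2 + features[:, 1] ** 2,  # e.g. a running-time-like metric
    np.abs(features[:, 2]) + 1.0,               # e.g. a message-count-like metric
])
model = fit_kcca(features[:70], performance[:70], dims=2)
estimates = predict(model, features[70:], k=3)  # one row of metrics per new query
```

Because each prediction is an average of measured training values, every metric is predicted at once and stays within the range observed during training. In practice the features would normally be standardized and the kernel scales and regularization tuned on held-out data.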

Recent papers: (PDF files and abstracts can be found here)

  • Archana Ganapathi, Yanpei Chen, Randy Katz, Armando Fox, David Patterson. Statistics-Driven Workload Modeling for the Cloud. Proc. Workshop on Self-Managing Database Systems (SMDB 2010), to appear.
  • Archana Ganapathi, Kaushik Datta, Armando Fox, David Patterson. Using Machine Learning to Auto-tune a Stencil Code on a Multicore Architecture. Proc. HotPar 2009.
  • Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, David Patterson. Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning. Proc. ICDE 2009.

More Detail:

Much of this work is about recasting performance prediction, scheduling, and related tasks as problems in correlation analysis. KCCA (Kernel Canonical Correlation Analysis) is a recent and fairly sophisticated SML technique; we turned to it after finding that simpler SML methods such as regression do a poor job of prediction. So far we have used it to predict the performance of a multi-query database workload, to assist in autotuning computations on parallel hardware, and to improve the scheduling of batch (MapReduce-style) jobs on cloud computing platforms.