- Write a SCADS client app in RoR—a clone of eBay, or some other interesting big-data app (Lead: Amber or Allen)
- Get a Rails environment running on the JRuby interpreter, with the ability to call existing SCADS client library functions, so RoR apps can run in-process with SCADS (Lead: Marcelo?)
- Devise a Ruby gem that encapsulates SCADS functionality to wrap the above (Lead: Brandon)
- Write a crawler for Twitter data and metadata; collect a bunch of it, then create some MapReduce jobs to find statistics like density of friendships, things about structure of followers graph, etc., as well as to have tweet data with which to populate SCADr database (Lead: Aaron or Tim)
Relative to our current work on SEJITS, autotuning, and “frictionless high performance software”:
- Start an autotuning DB for use by SEJITS as well as manual use. The challenge is to determine a schema for this info that could be used for both human queries and machine queries (e.g. via XML-RPC). Each time an autotuning parameter set is determined, add it to the DB.
- Use Archana’s and Kristal’s KCCA algorithms as a test case for “frictionless”. They are sparse-matrix eigenvalue solver problems.
- SEJITS: take Andrew Ng et al’s paper on mapping a variety of SML algorithms to “summation form” for GPU execution, and apply SEJITS to those computations.
- SEJITS: look at LAWN 223 (Cholesky factorization on GPU) and encapsulate it in a specializer.
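As a starting point for the autotuning DB item above, here is a minimal sketch of one possible schema, using SQLite from Python. Everything in it is an assumption made for illustration: the table name `tuning_results`, the column names, and the choice to store the problem description and tuned parameters as JSON text. A machine client (for example, behind an XML-RPC endpoint) could call `best_params`, while a human could query the same table with plain SQL.

```python
import json
import sqlite3

# Hypothetical schema: one row per measured (kernel, platform, problem, params) run.
# "problem" and "params" are JSON strings so that arbitrary tuning knobs fit.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tuning_results (
    id        INTEGER PRIMARY KEY,
    kernel    TEXT NOT NULL,
    platform  TEXT NOT NULL,
    problem   TEXT NOT NULL,
    params    TEXT NOT NULL,
    runtime_s REAL NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open (or create) the tuning database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, kernel, platform, problem, params, runtime_s):
    """Add one autotuning result; dicts are serialized with sorted keys."""
    conn.execute(
        "INSERT INTO tuning_results (kernel, platform, problem, params, runtime_s) "
        "VALUES (?, ?, ?, ?, ?)",
        (kernel, platform,
         json.dumps(problem, sort_keys=True),
         json.dumps(params, sort_keys=True),
         runtime_s),
    )
    conn.commit()


def best_params(conn, kernel, platform, problem):
    """Return the fastest recorded parameter set for this exact problem, or None."""
    row = conn.execute(
        "SELECT params FROM tuning_results "
        "WHERE kernel = ? AND platform = ? AND problem = ? "
        "ORDER BY runtime_s ASC LIMIT 1",
        (kernel, platform, json.dumps(problem, sort_keys=True)),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Matching on the exact serialized problem description is the simplest lookup rule; a real schema would probably need a notion of "nearby" problems, which this sketch does not attempt.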
A HotOS 2009 talk and paper discussed “wave computing” on batch jobs (MapReduce style). The problem is that batch jobs often do wasteful I/O or computation when multiple workers solve identical subproblems. For example, “top 10 daily files” and “top 10 weekly files” are separate jobs, even though they process overlapping input.
They propose specific solutions to identify optimization opportunities, but the more general opportunity is supporting dynamic programming in the cloud. In their approach they look at the actual queries to automatically determine what the common subtasks might be, but in some dynamic programming problems you can express these explicitly.
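To make the "explicit subproblem" idea concrete, here is a toy, single-process sketch (not the approach from the HotOS paper): the per-day count is declared as a shared subproblem and memoized, so the daily and weekly "top k" queries both reuse it. The `ACCESS_LOG` data, the function names, and the in-memory cache are all hypothetical; in a cloud setting the cache would be a shared store rather than a Python dict.

```python
from collections import Counter

# Hypothetical access log: day index -> file names accessed on that day.
ACCESS_LOG = {
    0: ["a", "b", "a"],
    1: ["b", "c", "b"],
    2: ["a", "a"],
}

_daily_cache = {}    # memo table for the shared subproblem
computed_days = []   # records which days were actually scanned


def daily_counts(day):
    """Shared subproblem: per-file access counts for one day, computed once."""
    if day not in _daily_cache:
        computed_days.append(day)
        _daily_cache[day] = Counter(ACCESS_LOG.get(day, []))
    return _daily_cache[day]


def top_daily(day, k=10):
    """Top-k files for a single day."""
    return daily_counts(day).most_common(k)


def top_weekly(first_day, k=10):
    """Top-k files for a 7-day window, built from the cached daily counts."""
    total = Counter()
    for day in range(first_day, first_day + 7):
        total.update(daily_counts(day))
    return total.most_common(k)
```

Calling `top_daily(0)` and then `top_weekly(0)` scans day 0 only once; the weekly job picks up the cached result instead of re-reading the log.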
Parallelizing/decomposing big models and trading off accuracy, precision, etc. (Similar to trading consistency for availability/scalability in storage.) E.g., in EM training with Markov models, you have a single big data structure (the translation table) that everyone uses and that then has to be globally updated (in the M-step). A NIPS paper (described as “hacky” by Alex Smola) partitions the model and uses peer-to-peer anti-entropy to periodically try to sync models.
In general, one avenue of opportunity is to improve the performance or power of the most sophisticated models. But another avenue is: what can we do with yesterday’s/less sophisticated models, which may be perfectly adequate for some app domains, esp. if they could run in real time or be portable, and/or they could be used in a layered approach with more sophisticated models within a particular domain?
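For intuition about anti-entropy syncing, here is a minimal sketch of pairwise gossip averaging. It is a generic technique chosen for illustration, not the method from the NIPS paper: each worker keeps its own copy of a parameter vector, and each round two randomly chosen workers replace their copies with the average. The function names and the list-of-lists model representation are assumptions.

```python
import random


def gossip_round(models, rng):
    """Pick two distinct workers and replace both models with their average."""
    i, j = rng.sample(range(len(models)), 2)
    avg = [(a + b) / 2.0 for a, b in zip(models[i], models[j])]
    models[i] = list(avg)
    models[j] = list(avg)


def spread(models):
    """Largest per-coordinate disagreement across all workers."""
    return max(max(col) - min(col) for col in zip(*models))


def mean_model(models):
    """Coordinate-wise mean over workers; pairwise averaging preserves it."""
    n = len(models)
    return [sum(col) / n for col in zip(*models)]
```

Each round leaves the global mean unchanged and never increases the spread, so the replicas drift toward a common model without any global update step. The trade-off is that between rounds workers compute with slightly different models, which is the accuracy-for-scalability trade described above.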
Given that interesting apps will use multiple languages/frameworks (if not at the productivity layer, then at the efficiency layer), we should be working on portable in-memory and on-disk data formats for various types of ML models (and fast swizzling/unswizzling). Use Google Code Protocol Buffers and define some standard schemata?
At the Cloud Computing Workshop this month we’ll be presenting Cloudstone, a Web 2.0 “social events” app in 2 implementations (Rails & PHP) complete with a workload generator and test automation scripts. The idea is that it can be used as a realistic Web 2.0 app with realistic workloads for benchmarking cloud computing, recovery/scaling scenarios, etc.
A great addition would be to add scripts that can inject various kinds of failures—both app-level (e.g. DB timeout or connection reset) and machine-level (machine shuts down unexpectedly, or has a lot of dropped packets or other I/O interference, etc.)—to test datacenter automation scenarios designed to deal with these problems under load.
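As a rough illustration of the app-level half of this idea, here is a small sketch of a wrapper that makes a call fail with a configurable probability. It is not part of Cloudstone; the `FaultInjector` class, its parameters, and the use of Python's built-in `TimeoutError` to stand in for a DB timeout are all assumptions. Machine-level faults (shutdowns, dropped packets) would need OS or network tooling and are not covered.

```python
import random


class FaultInjector:
    """Wraps calls and raises a simulated DB timeout with a given probability."""

    def __init__(self, failure_rate, seed=None):
        self.failure_rate = failure_rate   # fraction of calls that should fail
        self.rng = random.Random(seed)     # seedable for repeatable test runs
        self.injected = 0                  # number of faults injected so far

    def call(self, fn, *args, **kwargs):
        """Either raise an injected fault or run fn normally."""
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise TimeoutError("injected DB timeout")
        return fn(*args, **kwargs)


def fetch_event(event_id):
    """Hypothetical stand-in for a real database query."""
    return {"id": event_id, "title": "example event"}
```

A test harness could route every database call through `injector.call(fetch_event, 42)` and raise `failure_rate` while the workload generator is running, to see how the app and the automation around it behave under load.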
Email me if you want to work on this!