
What will make grad students want to come into the lab, v2.0

In the days before the RAD Lab was started, Soda Hall was deserted during much of the day because students were working from home rather than coming into their labs.  One hypothesized reason was that the lab had lost its technology edge. Previously, the technology available in the lab was far beyond what students could afford at home: when I was a grad student at Cal in 1994, my household had ISDN service (128 Kbps), which we could afford only because we had a partial lab subsidy, and if you had a 17” CRT, you were stylin’. But by the early 2000s, you could get a killer PC with a huge display and infinite hard disk space very affordably, and everyone had broadband at home. So the RAD Lab’s revolutionary idea was that students might come in if they knew with high confidence that their fellow students and even (gasp) their faculty advisors were likely to be there. And the idea worked.

Now I’m sitting at home (it’s the weekend! otherwise I’d be in the lab) with my 24” display etc., watching my network backup proceed at 1 Mbps.  (We have cable modem “service” from Comcast, and like most residential broadband technologies, it’s an asymmetric link.  We get anywhere from 6 to 10 Mbps down, but only 1 to 3 Mbps up.)

In the coming days of big data, I wonder whether we’ll again see students coming into the lab to work because that’s where the 100 Mbps intranet and 10-100 Mbps uplinks are.

New RAD Lab papers

We continue to make progress on applying machine learning to problems in deploying and operating datacenter-scale systems…

  • Peter Bodik’s paper on “Fingerprinting the Datacenter” (joint work with Moises Goldszmidt at Microsoft Research Silicon Valley and Dawn Woodard at Cornell) was accepted to EuroSys 2010, where I’ll also be giving a tutorial on Web 2.0 applications;
  • Wei Xu presented an online version of his work on data mining of console logs (joint with Ling Huang at Intel Research Berkeley) at ICDM 2009 last month;
  • Dr. Archana Ganapathi filed her PhD dissertation (yay!!) and just had a paper accepted to the Self-Managing Database Systems workshop (SMDB 2010) on statistics-driven workload modeling for cloud jobs like Hadoop (joint work with Yanpei Chen);
  • The RAD Lab will be featured in the VMware GoVirtual webzine later this month. Stay tuned!

…and of course we are planning submissions to SOCC and WebApps as well.  See the students’ pages or my project pages for more details!

I’d like to disabuse early-career grad students of certain misconceptions…

  1. You are rarely the best judge of the most important material or best presentation strategy for your talk. Corollary: Give one or more practice talks.
  2. Writing is much harder than you think. Corollary 1: You are not that great a writer. Corollary 2: If you don’t have a solid draft 1-2 weeks before the conference deadline, you’re starting with 2 strikes.
  3. 80% or more of submitted papers are rejected. Corollary: You need feedback from colleagues and outsiders to improve your paper. A poor way to get feedback is to submit the paper, wait 6 months, and get a rejection with cryptic reviews. A better way is left as an exercise to the reader. (Thanks to Mike Franklin for this particular way of looking at the “get feedback” issue.)
  4. When you write up your work, remember that nobody cares what you did, only why it advances the state of the art. Edit accordingly. Corollary: Edit an outline and paragraph map before you start writing. It’s much easier to rearrange/eliminate at this level than at the prose level.
  5. The reviewer has 20 other papers waiting to be reviewed and is looking for a reason to set yours aside and move on. Corollary: Your job is to ensure no such opening is provided, whether by unsupported statements, poor writing, rambling style, etc.
  6. Your goal is to win the approval not of your advisor but of the research community, as represented by the (usually anonymous) reviewers who will be evaluating your paper. Your advisor can bring her/his experience to bear and give you advice (hence “advisor”) on how to maximize the likelihood of this, but don’t mislead yourself into thinking that your goal should be to please your advisor.  If the community is pleased with your work, chances are excellent your advisor will be too.  Corollary: Get lots of feedback on a paper from people other than your advisor (i.e., people representative of the reviewers who’ll evaluate it) before submitting it.

E-filing your PhD thesis? Why not file your VM as well?

UC Berkeley has finally started accepting electronic (PDF) thesis filing. The trees thank them. I remember, though, that shortly after I filed my (hardcopy) thesis, I lost the ability to even regenerate the PDF from LaTeX sources: I didn’t have the right packages, some figures didn’t get tarred up properly, etc., etc.  And as far as trying to run the sizable chunks of software that I and others built and reported on…fuhggedaboudit.

But hey, with disk space being free now, if I were graduating now I would also “file” a copy of the VM images used to format my thesis and run the experiments. Some of my students are doing cloud computing research, so some of their VMs are already being stored as Amazon AMIs, but why not snapshot a VM image of their laptop as well? We’d be one step closer to truly reproducible results in CS research.

Undergrad projects in cloud computing

  • Write a SCADS client app in RoR—a clone of eBay, or some other interesting big-data app  (Lead: Amber or Allen)
  • Get a Rails environment running on the JRuby interpreter, with the ability to call existing SCADS client library functions, so RoR apps can run in-process with SCADS (Lead: Marcelo?)
  • Devise a Ruby gem that encapsulates SCADS functionality to wrap the above (Lead: Brandon)
  • Write a crawler for Twitter data and metadata; collect a bunch of it, then create some MapReduce jobs to compute statistics such as the density of friendships and properties of the follower graph’s structure, and to produce tweet data with which to populate the SCADr database (Lead: Aaron or Tim)

Dynamic programming in the cloud

A HotOS 2009 talk and paper described “wave computing” on batch jobs (MapReduce style). The problem is that batch jobs often do wasteful I/O or computation when multiple workers solve identical subproblems. For example, “top 10 daily files” and “top 10 weekly files” are separate jobs, even though the weekly job could presumably reuse the per-day work.

They propose specific solutions to identify optimization opportunities, but the more general opportunity is supporting dynamic programming in the cloud. In their approach they look at the actual queries to automatically determine what the common subtasks might be, but in some dynamic programming problems you can express these explicitly.
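
As a rough illustration of expressing the shared subproblem explicitly, the sketch below memoizes per-day counts so that a "top daily" job and a "top weekly" job scan each day's log only once. This is not the HotOS paper's mechanism: it is a single-process toy, and the log data, function names, and scan counter are all made up for the example.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical access log: day -> names of files accessed that day.
ACCESS_LOG = {
    "mon": ["a", "b", "a"],
    "tue": ["b", "c", "b"],
    "wed": ["a", "a", "c"],
}

# Counts how often each day's raw log is actually scanned.
SCAN_COUNT = Counter()


@lru_cache(maxsize=None)
def daily_counts(day):
    """Shared subproblem: per-file access counts for one day (memoized)."""
    SCAN_COUNT[day] += 1
    # Return an immutable value so callers cannot modify the cached result.
    return tuple(sorted(Counter(ACCESS_LOG[day]).items()))


def top_daily(day, k):
    """'Top k daily files' job."""
    return Counter(dict(daily_counts(day))).most_common(k)


def top_weekly(days, k):
    """'Top k weekly files' job, built from the same daily subproblems."""
    total = Counter()
    for day in days:
        total.update(dict(daily_counts(day)))
    return total.most_common(k)


if __name__ == "__main__":
    days = ("mon", "tue", "wed")
    for day in days:
        print(day, top_daily(day, 1))
    print("week", top_weekly(days, 2))
    print("scans per day:", dict(SCAN_COUNT))
```

In a real cloud setting the memo table would have to live in shared storage and survive worker failures, which is where the interesting systems problems start.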

Add failure injection to Cloudstone

At the Cloud Computing Workshop this month we’ll be presenting Cloudstone, a Web 2.0 “social events” app in 2 implementations (Rails & PHP) complete with a workload generator and test automation scripts. The idea is that it can be used as a realistic Web 2.0 app with realistic workloads for benchmarking cloud computing, recovery/scaling scenarios, etc.

A great addition would be scripts that inject various kinds of failures, both app-level (e.g. DB timeout or connection reset) and machine-level (machine shuts down unexpectedly, or has a lot of dropped packets or other I/O interference, etc.), to test datacenter automation scenarios designed to deal with these problems under load.
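
To give a flavor of the app-level half, here is a minimal sketch of a failure injector in Python. It is not part of Cloudstone (whose implementations are Rails and PHP); the class name, the `query_db` stand-in, and the fault rate parameter are all hypothetical, and machine-level faults such as shutdowns or dropped packets would need OS- or network-level tooling instead.

```python
import functools
import random


class FailureInjector:
    """Wraps a callable and raises an injected fault with a given probability."""

    def __init__(self, faults, rate, seed=None):
        self.faults = faults            # exception classes to choose from
        self.rate = rate                # probability of a fault per call
        self.rng = random.Random(seed)  # seeded so runs can be repeated
        self.injected = 0               # number of faults injected so far

    def wrap(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if self.rng.random() < self.rate:
                self.injected += 1
                raise self.rng.choice(self.faults)("injected fault")
            return func(*args, **kwargs)
        return wrapper


def query_db(sql):
    """Hypothetical stand-in for a real database call."""
    return f"rows for {sql}"


if __name__ == "__main__":
    injector = FailureInjector([TimeoutError, ConnectionResetError], rate=0.3, seed=42)
    flaky_query = injector.wrap(query_db)
    for i in range(10):
        try:
            flaky_query(f"SELECT {i}")
        except (TimeoutError, ConnectionResetError) as exc:
            print(f"call {i}: {type(exc).__name__}")
    print("faults injected:", injector.injected)
```

Seeding the generator makes a failure schedule repeatable, which matters when comparing how two recovery strategies behave under the same faults and the same load.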

Email me if you want to work on this!
