Wednesday, October 14, 2009

SOSP 2009, Day Three

Final day here at SOSP 2009. Unfortunately my shuttle back to the airport leaves in the middle of the first session, so I am only able to go to the first couple of talks.

Last night at the SIGOPS business meeting, Doug Terry gave an overview of SIGOPS' finances; apparently they are making money hand over fist, and are actually looking for ways to spend down some of the savings. My suggestion: host SOSP 2011 in the ideal location for the conference and don't worry about breaking even.

Speaking of SOSP 2011, Peter Druschel was announced as the Program Chair. There were a few proposals for sites to host the conference. The one that received the most applause was Madeira Island, which is some 500 miles off the coast of Portugal. The site looks fantastic, though I'm not sure about travel connections. Lisbon was the other Portuguese option, which looks great. There were proposals for Dresden, China, Brazil, New Hampshire, and Puerto Rico as well.

The rest of the business meeting included discussions of ACM's copyright policy (and whether SIGOPS should put pressure on them to change it to permit more open distribution of papers); a rehashing of the old single-blind-vs-double-blind controversy (thankfully short); and some discussion of whether the SIGOPS Hall of Fame award should be opened up to non-peer-reviewed publications (there was widespread support for this).

The only talk I have time to write up today is...

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

This paper deals with data-parallel computation on clusters, as popularized by systems such as MapReduce and Hadoop. Microsoft's DryadLINQ system automatically generates distributed queries for clusters from application code (e.g., in C#). In this paper, the authors explore programming interfaces for distributed aggregation within the cluster: rather than sticking to the conventional two-phase "map-reduce" paradigm, partial aggregation can be performed at the source nodes and at intermediate nodes, which can greatly reduce disk and network I/O. The talk emphasized that the choice of programming interface for defining these aggregation functions can have a substantial impact on performance, and the authors experimented with a range of aggregation and pipelining strategies.
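
To make the partial-aggregation idea concrete, here is a minimal sketch (my own illustration, not the DryadLINQ API) of a word-count-style job where each "source node" pre-aggregates its local records before anything crosses the network:

```python
from collections import Counter

# Hypothetical illustration of partial (local) aggregation before the final
# reduce; node_partitions stands in for the raw records held at each node.
node_partitions = [
    ["cat", "dog", "cat"],
    ["dog", "dog", "fish"],
    ["cat", "fish"],
]

def initial_reduce(records):
    """Run at each source node: aggregate locally before sending anything."""
    return Counter(records)

def combine(partials):
    """Run at intermediate nodes and at the root: merge partial aggregates."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Only the small per-node Counters travel over the network, instead of every
# raw record -- which is where the disk and network I/O savings come from.
partials = [initial_reduce(p) for p in node_partitions]
print(combine(partials))  # e.g. Counter({'cat': 3, 'dog': 3, 'fish': 2})
```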

I have to run to the airport, so I don't have time to write up the Quincy presentation in detail, but kudos to the authors for the Zero Punctuation-themed talk. I'm bummed that I have to miss the UpRight talk as well as the entire last session.

Tuesday, October 13, 2009

SOSP 2009, Day Two

It's Day Two here in Big Sky and we have a few fresh inches of snow on the ground. Apparently Mike Freedman hit an elk while driving here from Salt Lake City on Sunday (he is OK, the elk is likely not).

Last night at the banquet and awards ceremony, Eric Brewer won the Mark Weiser Award -- well deserved! Eric represents exactly the kind of person this award was designed for, and just before they made the announcement, I was thinking, "Why hasn't Eric won this yet?" Three papers won the SIGOPS Hall of Fame Award, given to papers at least 10 years old that have had a substantial influence on the community. I don't recall the exact papers (they are not posted on the above page yet), but David Cheriton, Butler Lampson, and Hank Levy were all coauthors of the awarded papers.

A few general comments on the SOSP program, and some trends. The hot topics this year continue to be in the realm of debugging and preventing or managing faults; roughly half of the papers fall into this broad category. There are essentially no papers on distributed systems, at least not at large scale. (The Microsoft paper on WER is more about debugging than about the distributed-systems aspects.) Peer-to-peer, DHTs, etc. are so 2003. I am also quite surprised that there are no papers on virtual machines or cloud computing; this is an area that seems to be getting a lot of attention lately, but it's not represented at SOSP this year. Finally, only one BFT paper (gasp!).

Some highlights from Day Two.

Better I/O Through Byte-Addressable, Persistent Memory

This paper from MSR is about rethinking the design of filesystems around the concept of a (hypothetical) byte-addressable, persistent memory, which the authors call BPRAM. Phase Change Memory is now in large-scale production and could provide this capability at reasonable cost within the next few years. BPRAM is expected to be much more durable than flash and much faster (with access times on the order of nanoseconds). The paper proposes BPFS, a new filesystem design that takes advantage of these properties. The key contribution is an atomicity and durability technique called short-circuit shadow paging, which is similar to shadow paging but adds a number of optimizations, since BPRAM allows one to partially update blocks in place (e.g., one can update just one pointer in an indirect block without rewriting the whole block). The paper also proposes hardware support for atomicity and durability: a new kind of barrier for caches, and capacitors on the DIMM so that pending writes complete even if power is lost. What I liked about this paper is that it looks beyond existing commercially-available hardware and considers how BPRAM would impact storage system design.
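
My rough mental model of short-circuit shadow paging, as a toy sketch of my own (not the BPFS implementation): if an update fits within the hardware's atomic write size, commit it in place; otherwise shadow-copy the affected block and commit by flipping a single pointer.

```python
# Toy model of the short-circuit shadow paging idea (my own simplification,
# not BPFS): small updates that fit in one atomic write happen in place;
# larger updates are shadow-copied and committed with one pointer swap.

ATOMIC_WRITE_BYTES = 8  # assume the hardware can write 8 bytes atomically

class Block:
    def __init__(self, data: bytes):
        self.data = bytearray(data)

class File:
    def __init__(self, blocks):
        self.blocks = blocks  # "indirect block": a list of block pointers

    def write(self, block_idx: int, offset: int, payload: bytes):
        if len(payload) <= ATOMIC_WRITE_BYTES:
            # Short-circuit: update in place with a single atomic write --
            # no copying of the rest of the block or of its ancestors.
            self.blocks[block_idx].data[offset:offset + len(payload)] = payload
        else:
            # Shadow paging: copy the block, modify the copy, then commit
            # by atomically replacing the single pointer to it.
            shadow = Block(bytes(self.blocks[block_idx].data))
            shadow.data[offset:offset + len(payload)] = payload
            self.blocks[block_idx] = shadow  # the one atomic pointer swap

f = File([Block(b"\x00" * 4096)])
f.write(0, 0, b"hdr")        # 3 bytes: committed in place
f.write(0, 0, b"x" * 100)    # 100 bytes: shadow copy + pointer swap
```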

Work-in-Progress Talks

The WIPs are always a lot of fun. Some highlights:

John Ousterhout from Stanford gave an overview of a new project called RAMCloud, which proposes to reduce the latency of datacenter storage by orders of magnitude by building storage servers with very fast RPC (5-10 usec) that hold all of their data in RAM rather than on disk.

My own student Geoffrey Challen gave a talk on Integrated Distributed Energy Awareness in sensor networks. The idea is to provide a distributed service allowing sensor nodes to learn of resource availability on other nodes and make informed, coordinated resource allocation decisions. This is the first step towards a distributed OS for sensor nets, something my group is currently working on.

Benjamin Reed from Yahoo! Research gave a talk on "BFT for the skeptics." He made the case that industry is already very good at handling crash failures, and handles many non-crash failures with simple techniques like checksums. Essentially, it's not clear that BFT solves a real problem that industry has today; he argued that unless BFT solutions are easy to configure and address a real need, industry will be slow to roll them out.

Eran Tromer from MIT talked about the problem of architectural side-channels allowing information to leak between VMs running on the same physical machine in a cloud. He proposed a tool, called DynamoREA, that rewrites x86 binaries to eliminate these side-channel attacks; for example, by adding noise or delays.

seL4: Formal Verification of an OS Kernel

This paper won one of the best paper awards. The team from NICTA and UNSW formally verified the seL4 microkernel -- around 8,700 lines of C code -- using roughly 200,000 lines of handwritten, machine-checked proof in Isabelle/HOL. This is a tremendous amount of work and is the largest verification effort on a complex piece of systems code that I am aware of. They formally proved that the implementation matches its specification, which rules out whole classes of bugs: no null pointer dereferences, no memory leaks, no buffer overflows, no infinite loops, and so forth. To make the problem tractable they had to make a number of assumptions -- for example, that the compiler and the hardware are correct. This seems reasonable to me, although we all know that in reality there are bugs in those components as well. Of course, going through this process also uncovered bugs in the specification itself. Very nice work.
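
For readers who haven't seen a machine-checked proof before, here is a toy example in Lean (nothing to do with seL4's actual Isabelle/HOL development, and nowhere near its scale; it just gives the flavor): the proof assistant refuses to accept a theorem until every step is justified.

```lean
-- Toy machine-checked proofs (purely illustrative; seL4's proofs are in
-- Isabelle/HOL and are vastly larger).

-- A property the checker verifies by definitional computation.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- A property discharged by an existing library lemma; deleting the proof
-- term (or replacing it with `sorry`) makes the checker complain.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```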

Surviving Sensor Network Software Faults

Phil Levis from Stanford gave this talk on Neutron, a new version of TinyOS that adds support for recovery units within the OS, which can be independently rebooted and reinitialized following a memory fault. The idea is to statically determine self-contained groupings of TinyOS components that can be rebooted without impacting each other. The TinyOS kernel itself forms its own recovery unit, which can be rebooted while allowing application code to resume when the reboot is complete. Recovery units can also mark data as "precious," allowing that state to persist across reboots. The authors show that this can be accomplished with low overhead and that it reduces the impact of reboots on overall network availability.
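
A toy sketch of the recovery-unit idea (my own simplification in Python; Neutron itself lives inside TinyOS and is written in nesC): components are grouped into units, a fault reboots only the offending unit, and state marked "precious" survives the reboot.

```python
# Toy model of Neutron-style recovery units (my own simplification, not the
# real nesC implementation). A memory fault in one unit reboots only that
# unit; state marked "precious" survives the reboot.

class RecoveryUnit:
    def __init__(self, name, precious_keys=()):
        self.name = name
        self.precious_keys = set(precious_keys)
        self.state = {}
        self.init()

    def init(self):
        # Component-specific (re)initialization would go here.
        self.state.setdefault("boot_count", 0)
        self.state["boot_count"] += 1

    def reboot(self):
        # Discard everything except state explicitly marked precious.
        kept = {k: v for k, v in self.state.items()
                if k in self.precious_keys or k == "boot_count"}
        self.state = kept
        self.init()

routing = RecoveryUnit("routing", precious_keys={"neighbor_table"})
sensing = RecoveryUnit("sensing")

routing.state["neighbor_table"] = {3: -71, 9: -80}  # link estimates (dBm)
routing.state["pending_packets"] = ["pkt1", "pkt2"]

# A memory fault is detected in the routing unit: reboot only that unit.
routing.reboot()
assert routing.state["neighbor_table"] == {3: -71, 9: -80}  # precious: kept
assert "pending_packets" not in routing.state               # volatile: reset
assert sensing.state["boot_count"] == 1                     # untouched
```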

I actually have my doubts about this approach; I think it will make it harder, not easier, for developers to reason about the effect of memory bugs on their code. The advantage of a hard reboot is that it returns the entire node to a known-good state. Having deployed sensor nets with strange bugs impacting data quality, I don't know whether this technique would actually make my life any easier. Damn cool idea though.

Tomorrow I'll blog about the business meeting (tonight) and whatever talks I make it to tomorrow; I have to skip out on the last session to catch my flight home.

Monday, October 12, 2009

SOSP 2009, Day One

I'm in Big Sky, Montana for SOSP 2009 -- widely considered the premier conference on systems. The location is stunning, although it's bitterly cold here; I wasn't expecting this kind of weather for another couple of months. This seems to be the biggest SOSP ever, with more than 500 people registered; the last SOSP had a mere 471 attendees. I am not sure what this means. The number of paper submissions has not been going up dramatically, so it's hard to tell whether this represents increasing interest in systems as a field. This year there are a number of co-located workshops, and perhaps with the weak economy people are concentrating their conference travel on fewer high-impact events.

Best paper awards went to three papers: FAWN, RouteBricks, and seL4. These papers all represented a substantial amount of work and have a remarkable number of authors; they are not the kind of papers that were dashed off in two weeks by one or two grad students!

Below are some of the highlights from Day One.

Barbara Liskov - ACM Turing Award Lecture, "The Power of Abstraction"

Barbara used her Turing Lecture as an opportunity to survey the development of Abstract Data Types, CLU, and type hierarchy. It is remarkable how much impact these ideas have had on modern programming languages and practice; Java, C#, Python, and many other languages are direct descendants of this work. She spent a considerable amount of time on the context of programming practice before ADTs were developed, and the issues that computer scientists were concerned with at the time (program structure, control flow, and global state). She surveyed a number of classic papers -- by Dijkstra, Wirth, Parnas, and others -- that led to the development of ADTs.

After talking about the mechanisms in CLU, Barbara spent some time reflecting on the current state of programming practice. In her view, modern languages (like Java and C#) are good, but the state of programming is "lousy" -- many untrained programmers developing Internet and browser-based systems rife with things like global state. She wrapped up the lecture by pointing to some future opportunities in large multicore systems and in reasoning about the semantics of programs running across the Internet.

FAWN: A Fast Array of Wimpy Nodes

The premise of this paper is that it is possible to reduce the power consumption of data centers by an order of magnitude by leveraging much more power-efficient hardware. The idea is to focus on queries per joule as the metric, rather than simply peak performance. They propose a cluster of inexpensive, power-efficient nodes (based on the AMD Geode platform with a small amount of memory and Compact Flash rather than hard disks) and develop a key-value store application on top of it. The system has to address the limited memory on the nodes, the need to minimize random writes to the flash (using a log-structured data store), and the need to add or remove nodes from the cluster (using consistent hashing for lookups and migrating data in background tasks). Among other things, the paper shows that the FAWN prototype can sustain around 350 queries per second per Watt, compared to around 50 queries/sec/Watt for conventional hardware. They also do a nice analysis of total cost of ownership and talk about the regime (in terms of load and data set size) where FAWN makes the most sense.
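
Here is a minimal sketch of the two data-path ideas as I understood them (my own approximation, not FAWN-KV's code): consistent hashing to pick the node that owns a key, and an append-only, log-structured store on each node so that flash only ever sees sequential writes.

```python
import hashlib
from bisect import bisect_right

def _h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent hashing: adding or removing a node only moves nearby keys."""
    def __init__(self, nodes):
        self.points = sorted((_h(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        hashes = [h for h, _ in self.points]
        i = bisect_right(hashes, _h(key)) % len(self.points)
        return self.points[i][1]

class LogStore:
    """Per-node store: values are appended to a log, never overwritten."""
    def __init__(self):
        self.log = bytearray()  # stand-in for the on-flash log file
        self.index = {}         # small in-memory index: key -> (offset, len)

    def put(self, key: str, value: bytes):
        off = len(self.log)
        self.log += value       # sequential append only; no random writes
        self.index[key] = (off, len(value))

    def get(self, key: str) -> bytes:
        off, length = self.index[key]
        return bytes(self.log[off:off + length])

ring = Ring(["wimpy-0", "wimpy-1", "wimpy-2"])
stores = {n: LogStore() for n in ("wimpy-0", "wimpy-1", "wimpy-2")}
stores[ring.owner("user:42")].put("user:42", b"some value")
print(ring.owner("user:42"), stores[ring.owner("user:42")].get("user:42"))
```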

The Multikernel: A New OS Architecture for Scalable Multicore Systems

This paper is about revamping the design of OS kernels to account for massively multicore systems with heterogeneous hardware (CPUs, GPUs, programmable NICs, and FPGAs). The key idea is to shift away from a single kernel with shared memory and locks on shared data structures, toward running a separate kernel on each core and using message passing as the only coordination primitive between cores. This approach allows one to design interprocessor communication protocols that leverage the topology of the underlying interconnect, which can vary substantially between platforms. In the multikernel design, all shared state is treated as replicated across cores, and mechanisms for explicit replica management (of the kind typical in distributed systems) become necessary. The authors have implemented a prototype of this design, called Barrelfish, and demonstrated its scalability as the number of cores increases. This is a really nice piece of systems work and a classic example of "from scratch" OS design that leverages new hardware trends.
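
Here is a toy illustration of the "shared state as replicas maintained by messages" idea (my own sketch using Python threads, nothing like Barrelfish's actual implementation): each "core" keeps a private replica of a table and learns about updates only via messages, never through shared locks.

```python
import queue
import threading

# Toy illustration of the multikernel idea (my sketch, not Barrelfish):
# each "core" runs its own loop with a private replica of OS state, and the
# only coordination primitive between cores is message passing.

N_CORES = 4
inboxes = [queue.Queue() for _ in range(N_CORES)]

def core(core_id: int, results: list):
    replica = {}  # this core's private copy of the "shared" table
    while True:
        msg = inboxes[core_id].get()
        if msg[0] == "update":       # apply a replicated state change
            _, key, value = msg
            replica[key] = value
        elif msg[0] == "stop":
            results[core_id] = replica
            return

def broadcast_update(key, value):
    # Instead of taking a lock on shared memory, send the change to every
    # core; each one updates its own replica when it processes the message.
    for box in inboxes:
        box.put(("update", key, value))

results = [None] * N_CORES
threads = [threading.Thread(target=core, args=(i, results)) for i in range(N_CORES)]
for t in threads:
    t.start()

broadcast_update("page_table[0x1000]", "mapped")
for box in inboxes:
    box.put(("stop",))
for t in threads:
    t.join()

assert all(r == {"page_table[0x1000]": "mapped"} for r in results)
```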

Both the multikernel and FAWN papers were presented in earlier form at HotOS 2009, which I blogged about previously.

Debugging in the (Very) Large: Ten Years of Implementation and Experience

This paper from Microsoft deals with Windows Error Reporting, which is the largest client-server system in the world by number of installs. WER receives 100 million error reports a day, across 17 million different programs, across a billion or so installed clients. It's been in operation for 10 years and has led to thousands of bugs being fixed, and has fundamentally changed how Microsoft approaches software development. Galen Hunt from MSR gave a great talk on this system.

The system automatically collects "minidumps" from clients when errors occur, which are then classified into buckets by the back-end servers using a variety of statistical analyses. The bucketing strategy takes into account the program name, version, and the symbol in which the program crashed. Doing statistical analysis over so many error reports allows them to zero in on the bugs that affect the most users, and therefore to concentrate bug-fixing effort where it matters. Galen talked about some of the successes of the system over the years, including identifying five-year-old heisenbugs in the Windows kernel, detecting widespread malware attacks, and even finding bugs in hardware (such as a broken USB controller that did not implement DMA correctly). This is a great paper on a large, real-world system.
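
A crude sketch of the bucketing idea (my own approximation; the real system applies many more client- and server-side heuristics): collapse each crash report into a small signature, count reports per signature, and send debugging effort to the buckets that hit the most users.

```python
from collections import Counter

# Crude approximation of WER-style bucketing (my own sketch, not the real
# heuristics): collapse each report into a coarse signature, then rank
# buckets by report volume.

def bucket(report: dict) -> tuple:
    return (
        report["program"],           # e.g. "excel.exe"
        report["version"],           # e.g. "12.0"
        report["module"],            # module the crash occurred in
        report["symbol_or_offset"],  # crashing function (or raw offset)
        report["exception"],         # e.g. access violation
    )

def rank_buckets(reports):
    counts = Counter(bucket(r) for r in reports)
    return counts.most_common()  # fix the bugs hitting the most users first

reports = [
    {"program": "excel.exe", "version": "12.0", "module": "mso.dll",
     "symbol_or_offset": "DoAutoSave", "exception": "ACCESS_VIOLATION"},
    {"program": "excel.exe", "version": "12.0", "module": "mso.dll",
     "symbol_or_offset": "DoAutoSave", "exception": "ACCESS_VIOLATION"},
    {"program": "notepad.exe", "version": "6.0", "module": "ntdll.dll",
     "symbol_or_offset": "RtlFreeHeap", "exception": "HEAP_CORRUPTION"},
]
for signature, count in rank_buckets(reports):
    print(count, signature)
```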



I'll blog about the BOFs, banquet, and award ceremony tomorrow.

Sunday, October 4, 2009

Black Box Parenting

I happen to be the proud father of a three-month old baby boy named Sidney. He is a great little guy but as a first-time parent I have often been frustrated with the lack of a clear, coherent instruction manual for taking care of newborns. One would think that after a few million years of raising children, someone would have documented how these things work. Of course, there are tons of baby books out there, but few of them are based on scientific principles. Most are based on anecdotal evidence and often the "parenting philosophy" gets in the way of empirical, pragmatic solutions. Searching for answers on the Internet just drives you crazy as you realize that nobody has any idea what they're talking about.

As a scientist and engineer, my only model for approaching a new, complex system with unknown behavior is to treat it as a black box. In the case of babies, the inputs are food, love, and environment; the outputs are behavior (some desirable, some less so), occasional vomit, and dirty diapers. My rational mind tells me that there must be an algorithm at work inside, and my goal is to perform system identification and tweak my feedback control loop to maximize desirable behaviors while minimizing undesirable ones. It seems very simple.

Much of my prior experience in this domain comes from raising a puppy, using positive-reinforcement-based operant conditioning as the primary approach. This was largely effective: complex behaviors (e.g., "roll over") can be taught through gradual shaping (i.e., when the subject approximates the desired behavior, issue a reward; the subject performs multiple rounds of exploitation vs. exploration to maximize reward potential until the objective is reached). Mechanistically, positive reinforcement works with dogs, which is remarkable given that they appear to have far less cognitive capacity and language ability than humans.

Unfortunately, my experiments to date suggest that the strict engineering approach is not as effective on human infants. First, it is unclear how to provide a positive stimulus in response to desired behavior; one does not (say) offer a squirt of formula as a reward. Likewise, negative punishment is unlikely to be effective at this early developmental stage. How, then, does an infant learn new behaviors?

The biggest challenge so far has been teaching Sidney to fall asleep when he is tired. I previously assumed that falling asleep was a natural, almost automatic response to fatigue. But it turns out we actually have to learn how to fall asleep -- that the feeling of exhaustion is best relieved by finding a dark, quiet place, lying down on a flat, soft surface (ideally with a pillow and optional teddy bear), closing one's eyes, and relaxing. (Personally, I cannot fall asleep without several conditions being met; imagine how hard it is to sleep in coach class on a long flight when it's noisy, you can't get horizontal, and the drink cart keeps slamming into your knee.) Sidney has no way of knowing that fairly obvious set of facts, and even worse, I have no direct way of teaching him those skills. Sidney's only available learning algorithm is random search -- effectively, trial and error. It must be extremely frustrating to be an infant.

What's frustrating to me is that none of the baby books I have read so far appear to be rooted in any theories of infant learning. I am currently reading Ferber's classic work on sleep training, although the book itself does not offer theories as to how the learning process is supposed to work in a four-to-six month old infant (although there is a lot of good information on infant sleep patterns). Perhaps it is time to start reading some of the scientific studies. Although, perhaps one does not need to be conversant in the medical literature in order to be a good parent?

Next article in the series: Black Box Graduate Student Advising.
