The Evolution of Chaos: PRES

Debugging tools have come a long way. Who would have thought that, with the push of a button, you could step through code written in different languages and running across multiple machines in one seamless environment? These tools are invaluable, but what do you do before you have the bug captured in a configured environment? Enter the nasty no-repro bug. These are the gremlins that all software developers dread: an issue that has been reported or even confirmed, but whose details we don't know. How do we handle this? If the error is reproducible in a certain environment, we might have the luxury of temporarily instrumenting the code and incurring the associated high overhead, or we can just start guessing (see Speculation pattern ;-).

Uniprocessor debugging is hard because the source of an issue can lie anywhere in space (code) and time. A multiprocessor system adds further dimensions in which errors can occur, so we end up with something that feels intuitively like geometric growth in debugging complexity.
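To put a rough number on that intuition, here is a small Python sketch that counts the possible interleavings of a few threads. It assumes a deliberately simplified model that is not from the PRES work: each thread runs a fixed number of atomic steps, and only the order of steps across threads varies. The function name `interleavings` is just for illustration.

```python
from math import factorial


def interleavings(threads: int, steps: int) -> int:
    """Count the orderings of `threads` threads that each run `steps`
    atomic steps, keeping every thread's own steps in program order."""
    # Multinomial coefficient: (threads * steps)! / (steps!) ** threads
    return factorial(threads * steps) // factorial(steps) ** threads


if __name__ == "__main__":
    for t in (1, 2, 3, 4):
        print(f"{t} thread(s) x 5 steps: {interleavings(t, 5)} interleavings")
```

With five steps per thread, one thread has a single possible ordering, two threads have 252, and three already have 756,756. Real programs have far more steps than that, which is why exhaustively trying schedules is not an option.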

A tool like PRES is a welcome addition to the developer's troubleshooting arsenal. I agree that bugs don't *need* to be reproduced on the very first attempt in the lab. Especially if the replays are entirely automated, a small number of failed attempts can easily be tolerated (and really, what other choice do you have if you can't withstand the overhead of exhaustive logging?).
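As a toy illustration of why a handful of automated attempts is tolerable, the sketch below retries a tiny simulated two-thread program under different scheduler seeds until a lost-update bug shows up. This is emphatically not the PRES algorithm, only a minimal stand-in for the idea of automated replay attempts; the names `run_once` and `reproduce` and the seeded-scheduler model are my own assumptions.

```python
import random


def run_once(seed: int) -> bool:
    """Simulate two threads that each do a non-atomic counter += 1.

    A seeded random scheduler picks which thread runs its next step.
    Returns True if the lost-update bug occurred on this run.
    """
    rng = random.Random(seed)
    counter = 0
    # Each thread reads the shared counter, then writes back read + 1.
    pending = {0: ["read", "write"], 1: ["read", "write"]}
    local = {}  # per-thread copy of the value that thread read
    while any(pending.values()):
        runnable = [tid for tid, ops in pending.items() if ops]
        tid = rng.choice(runnable)
        op = pending[tid].pop(0)
        if op == "read":
            local[tid] = counter
        else:
            counter = local[tid] + 1
    return counter != 2  # one increment was lost


def reproduce(max_attempts: int = 100):
    """Try seeds 1..max_attempts; return the first seed that shows the bug."""
    for seed in range(1, max_attempts + 1):
        if run_once(seed):
            return seed
    return None


if __name__ == "__main__":
    seed = reproduce()
    print("bug reproduced with scheduler seed:", seed)
```

Once a failing seed is found, calling `run_once` with that seed again replays the same schedule, so the bug can be stepped through as often as needed. The attempts that came before it cost nothing but machine time.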

The sources of nondeterminism that make this whole process difficult can themselves be hard to reproduce, because some of them come from low-level constructs like interrupts. Virtual machine technology can help alleviate this by virtualizing things that were once purely hardware-controlled, with little scope for outside control. With a VM in the picture, a tool like PRES could decide when things like interrupts are generated.

Wednesday, November 11, 2009
