Tuesday, November 3, 2009

Armstrong Ch5

AND supervisors make sense when a bunch of processes are collaborating and building a collective state towards some goal.  In these cases, the supervisor couldn't restart just one of these processes, because it would introduce inconsistent state, and hence all children must be restarted if any need to be.

OR supervisors make sense when each processes being supervised is essentially independent, or at least where when processes communicate, they don't rely on any monotonic progression of state.

Erlang's method for restartability does not necessarily make restarts cheap, but it certainly provides a good framework through which the programmer can attempt to do so.  Providing the AND/OR hierarchical granularity certainly helps isolate the scope of restarts, but there could certainly be situations where a cascade of restarts occur, each at a larger granularity, until a total subsystem is restarted, where in a different model the programmer may have been able to know that an entire subsystem restart would be necessary, and therefore bypassed all of the cascades of restart.

Using the process/supervisor model keeps the programmer aware of the unit of restart that their code may be subjected to, and therefore it allows for the system to be implemented in such a way that restarts can be enacted at multiple levels--more naive approaches wouldn't necessarily have the clarity of separation to support this.  Because of this, faults can be isolated to a certain process and its supervisor, and therefore not unwind the stack to some arbitrary point, and hence fault tolerance is greatly enhanced.

A rejuvenation model for fault recovery could certainly reduce some of the overhead associated with restarting a big AND group, but it's possible that any benefit to be gained by this would be nullified by the additional complexity involved in supporting the rejuvenation process.  Also, I could imagine that rejuvenation isn't as robust as a restart, and therefore additional errors may be introduced by trying to support such a model.

Erlang prescribes an adherence to specification such that if any deviation is detected, and exception should be thrown.  This is contrary to the all too common reality that a programmer might guess and/or impose their own personal beliefs of how the system should operate, complicating the matter.

Providing a good mental framework, even if not a concrete software framework, for managing the supervisor/task relationship goes a long way to influencing the way a piece of software is built.  As such, even though the supervisor logic doesn't come "out of the box", adopting this paradigm still provides a solid separation of concerns which is well suited to the task at hand, and therefore is superior to more traditional organizations.

I don't see that Erlang's model for restarts is that much different from the crash-only model, except for the granularity involved--whereas the crash-only model is defined at the "component" level, Erlang's model decomposes this into a granular hierarchy, each piece of which does essentially the same thing.  The Erlang model benefits and suffers from the phenomenon I described earlier, where a cascade of restarts may be undertaken, ultimately resulting in whole-subsystem restart.  In this case, the crash-only system would have been faster, but in most cases, there is some distinct scope beyond which an error doesn't persist, and thus the Erlang model provides a better way of handling things.

No comments:

Post a Comment