Joe Armstrong's definition of what descriptions a software architecture should be composed of is quite good at characterizing a system. One thing that I think should be moved out of the problem domain section and into its own section is performance constraints and requirements. In this case, the problem domain clearly states that a telecom system should exhibit certain behavior. But in other cases, the problem domain might not be so explicit--a system designed to revolutionize an industry by bringing previously unheard-of speed to solving problems might not characterize the performance requirements in the problem domain. That would be more an attribute of the system we are trying to build to suit the problem domain than an intrinsic property of it. Required performance guidelines are certainly a central tenet of how a system should be architected.
Messaging for parallelism makes a lot of sense. It helps to reduce the effect of unnecessary assumptions that shared-memory systems can easily impose. I have worked with a number of web-services-oriented applications which essentially use messaging, and it certainly does enforce strong separation. However, in most of the systems I have worked on with web services, the calls have been blocking in nature, so no parallelism gains were realized.
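As a rough illustration of that last point, here is a minimal Python sketch (the service, its 0.1-second latency, and the worker count are my own assumptions, not anything from the paper) contrasting blocking calls with queue-based message dispatch, where the parallelism gains actually show up:

```python
import queue
import threading
import time

def slow_service(request):
    """Stand-in for a remote call; the 0.1s delay is an arbitrary assumption."""
    time.sleep(0.1)
    return request * 2

# Blocking style: each call waits for the previous one, so ten calls take ~1s.
blocking_results = [slow_service(i) for i in range(10)]

# Messaging style: requests are posted to a queue and handled by worker threads,
# so independent calls overlap and the total time is closer to 0.1s.
requests, results = queue.Queue(), queue.Queue()

def worker():
    while True:
        item = requests.get()
        if item is None:          # sentinel: no more work
            break
        results.put(slow_service(item))

workers = [threading.Thread(target=worker) for _ in range(10)]
for w in workers:
    w.start()
for i in range(10):
    requests.put(i)
for _ in workers:
    requests.put(None)
for w in workers:
    w.join()

async_results = sorted(results.get() for _ in range(10))
print(blocking_results == async_results)   # same answers, very different wall-clock time
```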
"Fail fast" kind of scares me. The idea that an entire process should terminate its execution upon reaching an error case leads me to believe that performance could be detrimentally affected. For example, in the case that a process uses a great deal of cached data, assumedly this cache would be flushed when the process "fast fails", and therefore there could be quite a bit of a performance hit by reloading this cache each time the process restarts.
I think that concurrency-oriented programming makes sense in certain problem domains. In others, I could see this style as an impediment to getting the job done. I suppose that as the supporting tools for this type of programming evolve, the barrier to entry of this programming style will be reduced, but it seems to me that unless concurrency is priority one in the system, adopting this model would potentially make more work for you. Having said this, more and more systems are setting concurrency as priority one (especially in the server software world), so I am by no means discrediting this style. Rather, I am proposing that it should be adopted judiciously--only when concurrency is of high enough priority to warrant the overhead.
An unreliable messaging system makes a ton of sense to me. Just look at how the internet has evolved and worked well (arguably, I guess) for a wide variety of applications. To impose a reliable messaging system would be to burden certain types of communication with overhead they don't require. Furthermore, since reliable messaging can be built on top of unreliable messaging with proven techniques, I believe that this is the best choice for a messaging system.
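As a sketch of the "build reliability on top" idea--a generic stop-and-wait acknowledgement scheme of my own, not anything specific to Erlang--retries plus acknowledgements recover reliable delivery from a lossy channel:

```python
import random

def unreliable_send(mailbox, message, loss_rate=0.3):
    """Drop the message some of the time; the loss rate is an arbitrary assumption.
    The boolean return value stands in for an acknowledgement message."""
    if random.random() > loss_rate:
        mailbox.append(message)
        return True
    return False

def reliable_send(mailbox, message, max_retries=10):
    """Stop-and-wait style: resend until an acknowledgement arrives."""
    for attempt in range(max_retries):
        if unreliable_send(mailbox, message):
            return attempt + 1            # number of tries it took
    raise RuntimeError("gave up after %d retries" % max_retries)

mailbox = []
tries = reliable_send(mailbox, {"id": 1, "body": "hello"})
print("delivered in", tries, "attempt(s):", mailbox)
```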
Monday, October 19, 2009
Map Reduce
The Map Reduce pattern is quite similar to the Fork/Join pattern. However, whereas the Fork/Join pattern recursively subdivides one problem into smaller ones and combines the results as the stack unwinds, the Map Reduce pattern operates over a large list of independent problems, providing a mechanism to gather results. The primary difference, then, is whether the units of work to be completed are decompositions of a large monolithic problem, or samples in a search space--independent from one another, but culminating in a solution.
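A minimal Python sketch of the shape being described (not Google's MapReduce API, just the map-then-gather structure, using word counting as the stock example):

```python
from collections import Counter
from multiprocessing import Pool

def map_fn(document):
    """Map step: each document is an independent unit of work."""
    return Counter(document.split())

def reduce_fn(counts):
    """Reduce step: gather the independent results into one answer."""
    total = Counter()
    for c in counts:
        total.update(c)
    return total

documents = ["the cat sat", "the dog sat", "the cat ran"]

if __name__ == "__main__":
    with Pool(processes=3) as pool:        # each document is mapped independently
        mapped = pool.map(map_fn, documents)
    print(reduce_fn(mapped))               # e.g. Counter({'the': 3, 'sat': 2, 'cat': 2, ...})
```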
Errors should be handled in a configurable way, allowing the application developer to specify behavior on a case-by-case basis. On some problems, the failure of any subproblem may prevent the problem from being solved, whereas in, say, a Monte Carlo simulation, the failure of one task may be essentially inconsequential, presuming that the failure isn't related to a flaw in the underlying model. As such, the application developer would like to either ignore the error, schedule the task for re-execution (in case the error was transient), or terminate the computation. We certainly wouldn't want the framework deciding to terminate the computation, potentially losing all existing results, due to a small anomaly in one task. Hence, a configurable error handling system is the only way to make a framework general-purpose enough.
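Something like the following is what I have in mind for leaving the decision to the application developer (the policy names and the interface are purely illustrative, not from any particular framework):

```python
from enum import Enum

class OnError(Enum):
    IGNORE = 1    # drop the failed task's result (e.g., one Monte Carlo sample)
    RETRY = 2     # reschedule, in case the failure was transient
    ABORT = 3     # the whole computation is meaningless without this task

def run_tasks(tasks, policy=OnError.RETRY, max_retries=3):
    results = []
    for task in tasks:
        for attempt in range(max_retries):
            try:
                results.append(task())
                break
            except Exception as exc:
                if policy is OnError.ABORT:
                    raise RuntimeError("aborting computation") from exc
                if policy is OnError.IGNORE:
                    break                 # silently move on to the next task
                # OnError.RETRY: fall through and try the task again;
                # after max_retries the task's result is simply dropped.
    return results

print(run_tasks([lambda: 1, lambda: 2], policy=OnError.IGNORE))   # [1, 2]
```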
I haven't ever used this pattern, and I actually had to do a little bit of Wikipedia-ing in order to get a better idea of real applications.
This pattern excels when the majority of the work to be performed is in the "map" function, to be executed independently. When there are many interdependencies in the "reduce" function, the pattern won't scale well, as reductions may need to be deferred until other mappings have been performed. This case may be outside the scope of the pattern, but if so, it should be added to the list of the pattern's weaknesses.
Event Based, Implicit Invocation Pattern
The "Event Based, Implicit Invocation" (EBII) Pattern is one that is so incredibly common, it's almost redundant to document it as a pattern, but nonetheless, for completeness, it can prove useful to understand the differences and constraints in implementing it.
The key difference between this pattern and the Observer pattern is the cardinality and loose coupling between the sender and receiver. Whereas in the Observer pattern the "publisher" knows its "subscribers" and issues notifications, in the EBII pattern, the publisher knows nothing about its "subscribers", but rather just knows its own notification mechanisms. These mechanisms interface directly only with a manager component, which is responsible for distributing these notifications. Additionally, whereas the Observer pattern states a "one-to-many" cardinality, the EBII pattern says that any object can publish events and any object can subscribe to events.
When it comes to dispatch methods, implicit invocation provides greater decoupling than explicit invocation. With explicit invocation, the pattern is considerably closer to the Observer pattern. An additional dimension of decoupling comes from using nonblocking calls. If blocking calls were used, the notifying object's timing would be affected by the receiving object's handler execution to a considerably greater degree than in the nonblocking case. Obviously, the event handler will still use system resources, but this won't affect the timing semantics of the program as fundamentally.
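A small Python sketch of that structure--a toy manager component with nonblocking publication, my own illustration rather than anything from the pattern text:

```python
import queue
import threading

class EventManager:
    """Toy manager component: publishers only know the manager, never the subscribers."""
    def __init__(self):
        self._subscribers = {}                  # event name -> list of handlers
        self._events = queue.Queue()
        threading.Thread(target=self._dispatch, daemon=True).start()

    def subscribe(self, event, handler):
        self._subscribers.setdefault(event, []).append(handler)

    def publish(self, event, payload):
        self._events.put((event, payload))      # nonblocking from the publisher's view

    def flush(self):
        self._events.join()                     # demo helper: wait for pending dispatches

    def _dispatch(self):
        while True:
            event, payload = self._events.get()
            for handler in self._subscribers.get(event, []):
                handler(payload)                # handler failures belong to the receiver
            self._events.task_done()

manager = EventManager()
manager.subscribe("article.published", lambda p: print("feed reader saw:", p))
manager.publish("article.published", {"title": "New post"})   # returns immediately
manager.flush()
```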
Applications of this pattern are so prolific (as previously mentioned) that the author names entire classes of programs that use it (such as all clients registering over a network to receive some sort of notification), and it's hard to think of an example that doesn't fit under the umbrella provided. RSS feeds would fall under this categorization as a use of this pattern.
As is explained in this pattern, the manager component has everything to do with how the event notification system scales. As such, it would be inappropriate to implement a single notification broker for all types of events. Imagine the same system being responsible for handling I/O interrupts as well as RSS notifications--the performance and complexity requirements are so divergent that one system could not be expected to span that range.
Error handling in an event based system is the responsibility of the event receiver. If an event sender *really* needs to know about a failure in another component, it should register a listener for the receiver's error event. This introduces a bit of a bidirectional constraint, but still maintains a great deal of decoupling.
Sunday, October 18, 2009
Chess
I haven't had a great deal of experience with testing parallel programs. As I believe I've stated in my prior blogs, most of the applications that I've built have used totally isolated processing threads, accessing common resources only through a database server--where all but a general awareness of concurrency is abstracted away. In the couple of instances where I have tested the concurrent nature of these programs, I've generally only used stress testing.
While Moore's Law may be one of the most widely known conjectures related to computing, I'd argue that it is still trumped by Murphy's Law: "Anything that can go wrong, will go wrong". When we write test suites for an application, we're hoping that the "will go wrong" will rear its ugly head in the very finite amount of time allotted. Even stress tests running on large farms of machines pale in comparison to the diversity of executions that will happen "in the wild", across an expansive space of hardware and software platforms, user interactions, environmental factors, and configurations, and possibly across years or even decades of deployment. Hence, even a stress test that has run for weeks without failing in a test environment cannot purport to truly capture all behaviors. Additionally, since any code change, either in the application or in the platform, can totally invalidate a stress test, even the most exhaustive practical stress test is only good for as long as the identical bits are in play--and we all know how quickly requirements change and bugs are found.
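To make the contrast with stress testing concrete, here is a toy Python enumeration--my own illustration of systematic interleaving exploration, not CHESS's actual machinery--that deterministically finds the lost-update interleaving a stress run might never happen to hit:

```python
from itertools import permutations

# Each thread performs two atomic steps on a shared counter: a read, then a
# write of read_value + 1. This is a toy model of interleaving exploration.
def run(schedule):
    counter = 0
    local = {}
    for thread, step in schedule:
        if step == "read":
            local[thread] = counter
        else:  # "write"
            counter = local[thread] + 1
    return counter

steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# A schedule is valid if each thread's read comes before its own write.
def valid(schedule):
    return all(schedule.index((t, "read")) < schedule.index((t, "write"))
               for t in ("A", "B"))

outcomes = {run(s) for s in set(permutations(steps)) if valid(s)}
print(outcomes)   # {1, 2}: the exhaustive search finds the lost-update interleaving
```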
The small scope hypothesis is quite an interesting proposition. To be able to bound the complexity of test cases is certainly a desirable trait from the perspective of a real-world tool such as CHESS. I don't know that I can offer any specific critical analysis on this subject, but would rather argue that regardless of any formal proof of such a concept, code is written by people, and to the extent that they are unrestricted in the concurrency structure of the code they write, they will exercise that freedom (Murphy's Law again!). Hence, it's my belief that only through empirical study will we find out what percentage of bugs meet the criteria of this hypothesis with given parameters.
I could envision a monitor on top of CHESS that would profile the efficiency of various interleavings of execution, such that if certain interleavings are found to be particularly efficient, design work can be undertaken to try to guide the execution towards such interleavings (such as monitoring how locking patterns occur, the duration of thread sleeps, and the relative priority of threads).
The risk of the semantics of synchronization primitives being misunderstood is minimized to the extent that the conservative option is chosen (in the happens-before graph).
Saturday, October 17, 2009
OPL Iterative Refinement
I don't believe that I have ever used this pattern before. This is probably due to my professional experience being mainly confined to data management type applications, but perhaps because of this I'm an ideal candidate to critique the understandability of this pattern.
I understand what this pattern proposes, but what I am less clear on is when I would use it. The pattern talks in abstract terms about its applicability, but this is perhaps a pattern that would benefit from a more concrete example. Certainly, if I understood my problem domain to look like one of the examples, it would be obvious that this pattern could apply, but my concern is that in applicable cases described in terms different from those of the pattern, it might not be obvious that this pattern could be of use.
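For my own benefit, here is the kind of concrete instance I would have liked to see--a minimal Jacobi-style relaxation that repeatedly refines a whole solution until it stops changing (my example, not one from the OPL text):

```python
def refine(values, tol=1e-6, max_sweeps=10_000):
    """Repeatedly sweep over the whole solution, replacing each interior point
    with the average of its neighbours, until no point changes by more than tol."""
    current = list(values)
    for sweep in range(max_sweeps):
        proposed = current[:]
        for i in range(1, len(current) - 1):
            proposed[i] = 0.5 * (current[i - 1] + current[i + 1])
        change = max(abs(a - b) for a, b in zip(proposed, current))
        current = proposed
        if change < tol:                  # converged: stop refining
            return current, sweep + 1
    return current, max_sweeps

# Fixed endpoints 0 and 1; the interior relaxes toward the straight line between them.
solution, sweeps = refine([0, 5, -3, 7, 1])
print(sweeps, [round(x, 3) for x in solution])
```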
OPL Layered Systems
Similar to the OPL Pipes & Filters pattern, the OPL Layered Systems pattern is considerably simpler and quicker to grasp than the previously presented pattern. The previous pattern pointed out (and the OPL pattern omitted) that designing a good error handling system can be difficult in layered systems. The OPL pattern took more explicit note of the performance implications and was more prescriptive in defining how the number of layers should be managed, whereas the previous presentation simply stated that "crossing component boundaries may impede performance". I'm not sure if it's a strength or a weakness of the OPL pattern, but the previous presentation details how to go about deriving a layered architecture, whereas the OPL pattern does not.
I guess this comes down to an argument about what the contents of a pattern should be. It's been my feeling, however, that a pattern should read like a short encyclopedia entry--giving enough information to understand the basics, and deferring specific non-core details to other texts. The OPL pattern does this, whereas the previous presentation goes into considerably more depth. This may be due to the difference between an online library of patterns and a chapter in a book, but for getting a grasp of a large number of patterns, the former is the more expeditious.
OPL Pipes & Filters
The OPL version of the Pipes & Filters pattern is definitely simpler and easier to understand than the previous description. Part of this is due to it having fewer detailed examples. The first presentation of this pattern uses a rather complex example of building a programming language, which in my opinion clutters the essence of the pattern. The few short examples in OPL, presented *after* we understand the pattern to a large degree, provide enough detail to grasp the purpose and the types of applications for this pattern, and I believe that this is the extent of what a pattern should be--it's not supposed to be an exhaustive reference for the entire body of knowledge relating to the subject, but rather concise enough to be "thumbed through" when searching for a pattern to fit a design need.
The OPL pattern ignores the detail of "push" vs. "pull" pipelines, which in my opinion is bordering on being too implementation specific to be included in the high level pattern. It excels, however, at describing how the granularity of pipe data should be managed to exploit optimal buffering and potentially, concurrency.
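As a quick sketch of the granularity point--a pull-style pipeline built from Python generators, with stages of my own invention rather than examples from either text--each downstream filter pulls one item at a time, so the pipes never buffer more than a single element:

```python
def numbers(limit):
    """Source filter: emit items one at a time rather than as one big batch."""
    for n in range(limit):
        yield n

def squares(stream):
    for n in stream:
        yield n * n

def running_total(stream):
    total = 0
    for n in stream:
        total += n
        yield total

# A pull pipeline: each stage pulls the next item on demand, so the "pipe"
# holds at most one item; coarser granularity would mean batching items here.
pipeline = running_total(squares(numbers(5)))
print(list(pipeline))   # [0, 1, 5, 14, 30]
```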