Wednesday, September 30, 2009

Fork/Join

Multi-core processors are here to stay.  While they aren't really general purpose, I was amazed by the 216 cores (stream processors, in this case) on each of the two new NVIDIA GTX 260 cards I just got...this just goes to show that the ability to decompose algorithms for parallel execution will separate the men from the boys over the next few years.  On dual or even quad core machines, a programmer could generally get away with a single-threaded algorithm for all but very large computations, but as we're driven towards massively increasing numbers of processing cores, the experience users will demand won't be achievable in a single-threaded manner.

There is always a chasm between ease of implementation and maximum performance with such technologies, but I believe that user-friendly, general-purpose frameworks with pluggable/configurable performance enhancements will make it easy to start with a simple implementation and then, as requirements demand, move towards performance tweaks at the framework configuration level.

One type of system I can think of that does something like this is a SQL database platform.  The tools/language present an interface conducive to parallelizing a workload through the use of non-prescriptive, declarative SQL statements.  Later on, users can optimize performance by tweaking indexes as well as providing execution hints.  While it is true that a totally custom solution to the same problem could achieve better performance than the best possible implementation in a SQL database, the performance is usually close enough to be acceptable given the reduced programmer effort.  One drawback of ANSI SQL is the inability to explicitly invoke recursive calculations/logic, though Microsoft has started to address this, albeit only in very specific and simplistic situations, through the use of Common Table Expressions for recursive JOINs.

Moving back to the Java Fork/Join framework, I can see that this would be useful when an algorithm can be decomposed into the divide-and-conquer semantics the framework supports.  I don't have much experience with parallel algorithms, so I don't really know what percentage of algorithms are efficiently expressible in this manner, but that would be a major factor in the widespread adoption of such a framework.  Even if an algorithm is expressible in this form, being considerably less intuitive and therefore harder to maintain could be another barrier to adoption.
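To make that decomposition concrete, here is a minimal sketch of the kind of algorithm I have in mind: summing an array by recursively splitting it in half.  The imports use the java.util.concurrent package names planned for Java 7; the early-access builds currently ship the framework under a different package (jsr166y), so they may need adjusting.  The SumTask name and the THRESHOLD cutoff are just illustrative choices on my part, not anything the framework prescribes.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a long[] by recursively splitting the range in half until the
// pieces are small enough to sum sequentially.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10000; // hypothetical tuning knob
    private final long[] data;
    private final int lo, hi;

    public SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            // Small enough: just do the work directly on this thread.
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                     // hand the left half to the pool
        long rightSum = right.compute(); // work on the right half ourselves
        return rightSum + left.join();   // wait for the forked (possibly stolen) half
    }

    public static void main(String[] args) {
        long[] data = new long[1000000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println(pool.invoke(new SumTask(data, 0, data.length)));
    }
}

Note that compute() forks only one half and processes the other half itself; the forked halves sit in the owning worker's deque, where idle worker threads can steal them.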

I would cite the widespread popularity of high-level, garbage-collected languages as reason enough not to concern oneself too much with language-level performance constraints.  Inevitably, if such languages stay popular, the overheads incurred will be worth the benefit derived; otherwise the community will either optimize the language (or VM) or move to another solution.  I don't think that this particular framework is so different from any other performance consideration as to warrant special concern.

The work scheduling/stealing algorithm could lead to suboptimal performance if a stolen task requires a great deal of data to do its work but not a great deal of computational time, particularly on a Non-Uniform Memory Access (NUMA) system.  In that scenario, a disproportionate penalty is paid to move the data from one thread/processor's context to another, without the desired boost in parallel execution performance.
