Sunday, April 05, 2009

Classes of MapReduce Applications and Different Runtimes

In my experiance with MapReduce runtiems and the applications, I have noticed few classes of applications where the users can benifit from the available MapReduce runtimes. Let me list them below with their common characteristics.

1. Pleasingly Parallel Applications.
E.g. Processing a set of bilogy data files using Cap3 program.

This could be the most common class of parallel applications that many users needs. The common characteristic is the application of a function or a program on a collection of data files producing another set of data files. Processing a set of medical images is also another type of similar application.

When using MapReduce runtimes for these type of applications, we can simply use "map only"
mode of the runtimes. For example, in Hadoop we can use 0 reduce tasks as follows.

JobConf jc = new JobConf(getConf(), Cap3Analysis.class);
jc.setNumReduceTasks(numReduceTasks);


I impletented the "map only" support for CGL-MapReduce as well and it greatly increase the usability of the runtime to another large class of parallel applications.

More information about Cap3 and the performance of different runtimes such as Hadoop, Dryad, and CGL-MapReduce can be found in the CGL-MapReduce web page.

This class of applications also suite best for the Azure applications written using Worker Role where a set of workers are assinged to process a computation tasks available in a Queue.

2. Typical MapReduce Applications

E.g. High Energy Physics Data Analysis, where a set of analysis functions are executed on each data file of a collection of data files during the "map" stage producing a collection of intermediate histograms and in the "reduce" stage these histograms are merged together to produce the final histogram.

Histogramming words and Distributed Grep are very common examples used in most tutorials.

This class utilizes complete execution model of MapReduce. The application of the "reduce" stage and the function performed by that stage depend on the "associative" and "commutative" properties of the application itself.

3. Iterative MapReduce Applications.

This is a complex class of applications where we need to apply mutliple stages of MapReduce computations to accomplish the data analysis task. Kmeans clustering is a simple application of this nature, where each iteration of MapReduce cluster a set of data points to a given number of cluster centers. The next iteration takes the previous cluster centers as the input and calculate another set of cluster centers minimizing the difference.

Matrix Multiplication and Multi Dimensional Scaling aretwo more similar applications of this class and many machine learning algorithms also fit to this class of applications.

However, the applicability of the current runtimes such as Hadoop and Dryad for this class of applications is questionable. According to our results shown in the following graph we can see the overhead added by runtimes such as Dryad and Hadoop for such data analysis. Also note how the light weight MapReduce runtime -CGL-MapReduce - and its support for iterative applications have enabled it to produce close results with MPI.

Figure 1. Performance of different parallel run times in executing Kmeans Clustering
Figure 2. Overhead of different parallel run times in executing Kmeans Clustering

Simple PasswordLess SSHTutorial

I was configuring Xen virtual machines to run MPI programs. Running MPI,Hadoop, or even CGL-MapReduce require password less login between the nodes of a cluster.

It is the same old steps that we need to perform, but found this simple tutorial which list the process very succinctly.

To get MPI to work you also need to add the following line to your bash (or your shell) profile.

export LAMRSH="ssh -x"