Thursday, March 18, 2010

What is the concept of "static data" in Twister?

Many iterative applications we analyzed show a common characteristic of operating on two types of data products called static and variable data. Static data (most of the time the largest of the two) is used in each iteration and remain fixed throughout the computation whereas the variable data is the computed results in each iteration and typically consumed in the next iteration in many expectation maximization (EM) type algorithms.

If a set of data is read by the application but do not get changed, then this set of data can be considered "static" in Twister. Matrix blocks, points in clustering, and web graph in pagerank are all examples of static data.

How does Twister differ from MRnet?

There are many differences between MRnet and MapReduce and then with Twister. I will list some here.

In MapReduce the communication between map and reduce stage of the computation is a graph not a tree. A map can produce (key,value) pairs that may end up in multiple reducers. In MapReduce the framework does not impose any communication topology or connection topology between map and reduce stages of the computation. It is purely the intermediate keys that determine the communication pattern. For example, in an associative and commutative operations such as sum or histogramming, how the intermediate keys are used to distribute intermediate results among the reduce tasks is not that important.However for operations such as sorting, or matrix operations, one can select the intermediate keys in such a way that specific keys goes to specific reduce tasks. Again this is not defined by the network, but the keys and the key selector functions.

Twister uses pub/sub messaging to implement a MapReduce runtime,especially to support iterative MapReduce computations. Similar to other MapReduce runtimes it gives more focus on processing data while maintain data- process affinity. Map and reduce functions in Twister are long running processes providing distinction between static data and variable data. It supports broadcast, scatter type data distributions and reading data via the local disks. I am not sure how the latter two functions can be handled using MRnet.

Comparing MRnet with MapReduce for data processing applications is an interesting analysis one can do though.