Friday, July 16, 2010

Azure Platform Appliance

Microsoft recently announced its next move with Azure: the "Azure Platform Appliance". The idea is to offer the Azure platform as an infrastructure that can be deployed on private clusters. From my perspective, this is somewhat similar to the idea of private clouds (infrastructure services) that one can deploy using runtimes such as Eucalyptus or Nimbus on local clusters. However, since Azure is not merely an infrastructure service, it will be more flexible for users. Unlike pure virtual machines, the Azure Platform Appliance will expose most of Azure's platform services as well, and migration between the private Azure and the public Azure will be seamless. Overall, I think the biggest advantage of this approach is the peace of mind it gives businesses: "I am in control of my data and it is local".

Thursday, March 18, 2010

What is the concept of "static data" in Twister?

Many iterative applications we analyzed show a common characteristic of operating on two types of data products, called static and variable data. Static data (most of the time the larger of the two) is used in each iteration and remains fixed throughout the computation, whereas the variable data consists of the results computed in each iteration, which are typically consumed in the next iteration, as in many expectation-maximization (EM) type algorithms.

If a set of data is read by the application but does not change, then that set of data can be considered "static" in Twister. Matrix blocks, the points in clustering, and the web graph in PageRank are all examples of static data.
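To make the distinction concrete, here is a minimal single-process sketch in plain Java (deliberately not the Twister API; all names here are my own) where the points play the role of static data, loaded once and reused across iterations, and the centroids play the role of variable data, recomputed in each iteration and consumed by the next:

```java
import java.util.Arrays;
import java.util.Random;

// A single-process sketch of the static/variable data split in an iterative
// computation (1-D k-means). In Twister, the static part would be cached
// inside long-running map tasks instead of living in a local array.
public class StaticVsVariableSketch {

    public static void main(String[] args) {
        Random rnd = new Random(42);

        // Static data: read once, never modified across iterations.
        double[] points = new double[10_000];
        for (int i = 0; i < points.length; i++) points[i] = rnd.nextDouble() * 100;

        // Variable data: the per-iteration result (the centroids).
        double[] centroids = {10.0, 50.0, 90.0};

        for (int iter = 0; iter < 20; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];

            // "Map": assign each static point to its nearest current centroid.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++)
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
                sums[best] += p;
                counts[best]++;
            }

            // "Reduce": recompute centroids; this iteration's output is the
            // next iteration's variable data, while the points stay untouched.
            for (int c = 0; c < centroids.length; c++)
                if (counts[c] > 0) centroids[c] = sums[c] / counts[c];
        }
        System.out.println("Final centroids: " + Arrays.toString(centroids));
    }
}
```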

How does Twister differ from MRnet?

There are many differences between MRnet and MapReduce in general, and Twister in particular. I will list some here.

In MapReduce, the communication between the map and reduce stages of the computation is a graph, not a tree. A map can produce (key, value) pairs that end up in multiple reducers. The framework does not impose any communication or connection topology between the map and reduce stages; it is purely the intermediate keys that determine the communication pattern. For example, in associative and commutative operations such as sums or histograms, how the intermediate keys distribute the intermediate results among the reduce tasks is not that important. However, for operations such as sorting or matrix operations, one can select the intermediate keys so that specific keys go to specific reduce tasks. Again, this is defined not by the network but by the keys and the key selector functions.
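As a concrete illustration of such a key selector, here is a sketch of a custom Hadoop-style Partitioner that routes contiguous key ranges to specific reduce tasks, the kind of key-to-reducer mapping one would use for a distributed sort. The split points are made up for illustration:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// A range partitioner: contiguous key ranges go to specific reduce tasks, so
// concatenating the reducer outputs in order yields a globally sorted result.
public class RangePartitioner extends Partitioner<IntWritable, Text> {

    // Hypothetical split points: keys < 1000 go to reducer 0, keys < 2000 to
    // reducer 1, and so on.
    private static final int[] SPLITS = {1000, 2000, 3000};

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        for (int i = 0; i < SPLITS.length && i < numPartitions - 1; i++) {
            if (key.get() < SPLITS[i]) return i;
        }
        // Keys beyond the last split point go to the last reducer.
        return Math.min(SPLITS.length, numPartitions - 1);
    }
}
```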

Twister uses pub/sub messaging to implement a MapReduce runtime, especially to support iterative MapReduce computations. Like other MapReduce runtimes, it focuses on processing data while maintaining data-process affinity. Map and reduce functions in Twister are long-running processes, which is what makes the distinction between static and variable data possible. It supports broadcast and scatter type data distributions as well as reading data via the local disks. I am not sure how the latter two functions could be handled using MRnet.
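To illustrate the broadcast pattern, here is a toy in-process pub/sub sketch in plain Java (not Twister's actual broker or API): the long-running "map tasks" subscribe to a topic once, and the driver publishes each iteration's variable data to all of them without re-spawning the tasks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A toy in-process broker, just to show the broadcast pattern: subscribers
// persist across iterations and receive every published message.
public class PubSubBroadcastSketch {

    static class Broker {
        private final List<Consumer<double[]>> subscribers = new ArrayList<>();
        void subscribe(Consumer<double[]> task) { subscribers.add(task); }
        void publish(double[] message) { subscribers.forEach(s -> s.accept(message)); }
    }

    public static void main(String[] args) {
        Broker broker = new Broker();

        // Three long-running "map tasks"; each would keep its cached static
        // partition and only receive new variable data over the topic.
        for (int id = 0; id < 3; id++) {
            final int taskId = id;
            broker.subscribe(centroids ->
                System.out.println("map task " + taskId + " received "
                        + centroids.length + " centroids"));
        }

        // Driver side: broadcast this iteration's variable data to every task.
        broker.publish(new double[]{10.0, 50.0, 90.0});
    }
}
```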

Still, comparing MRnet with MapReduce for data processing applications would be an interesting analysis to do.

Tuesday, February 09, 2010

Twister: Iterative MapReduce

The MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many MapReduce implementations have attracted a lot of enthusiasm in the parallel computing community. From years of experience applying the MapReduce programming model to various scientific applications, we identified a set of extensions to the programming model, and improvements to its architecture, that expand the applicability of MapReduce to more classes of applications.

Twister is a lightweight MapReduce runtime we have developed by incorporating these enhancements. We have published several scientific papers [1-5] explaining the key concepts and comparing it with other MapReduce implementations such as Hadoop and DryadLINQ. Today we would like to announce its first release.

Key Features of Twister are:
  • Distinction between static and variable data
  • Configurable, long-running (cacheable) map/reduce tasks
  • Pub/sub messaging based communication/data transfers
  • Combine phase to collect all reduce outputs (see the sketch after this list)
  • Efficient support for iterative MapReduce computations (considerably faster than Hadoop or DryadLINQ for iterative applications)
  • Data access via local disks
  • Lightweight (~5,600 lines of code)
  • Tools to manage data
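As a rough illustration of the combine phase listed above, the following sketch (illustrative types only, not Twister's API) gathers the outputs of all reduce tasks into a single object at the driver, where, in an iterative computation, a convergence test would run before starting the next iteration:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A sketch of the combine idea: merge every reducer's partial output into one
// object visible to the driver program.
public class CombineSketch {

    static Map<Integer, Double> combine(List<Map<Integer, Double>> reduceOutputs) {
        Map<Integer, Double> combined = new TreeMap<>();
        for (Map<Integer, Double> partial : reduceOutputs) {
            combined.putAll(partial);   // keys are disjoint across reducers
        }
        return combined;                // driver tests convergence on this
    }

    public static void main(String[] args) {
        // Stand-in outputs from three reduce tasks.
        List<Map<Integer, Double>> reduceOutputs = List.of(
                Map.of(0, 9.8), Map.of(1, 51.2), Map.of(2, 88.7));
        System.out.println("combined = " + combine(reduceOutputs));
    }
}
```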
For more details, please visit www.iterativemapreduce.org and let us know your thoughts and your experience using Twister.

Thank you,

SALSAHPC Team.

[1] Jaliya Ekanayake (advisor: Geoffrey Fox), "Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing," Doctoral Showcase, SuperComputing 2009.
[2] Jaliya Ekanayake, Atilla Soner Balkir, Thilina Gunarathne, Geoffrey Fox, Christophe Poulain, Nelson Araujo, Roger Barga, "DryadLINQ for Scientific Analyses," Fifth IEEE International Conference on e-Science (eScience 2009), Oxford, UK.
[3] Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," Technical Report, August 25, 2009; to appear as a book chapter.
[4] Geoffrey Fox, Seung-Hee Bae, Jaliya Ekanayake, Xiaohong Qiu, Huapeng Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," High Performance Computing and Grids workshop, 2008; an extended version of this paper appears as a book chapter.
[5] Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox, "MapReduce for Data Intensive Scientific Analyses," Fourth IEEE International Conference on eScience, 2008, pp. 277-284.

Sunday, January 03, 2010

Almost MapReduce in 1992

Paralex: An Environment for Parallel Programming in Distributed Systems (1992)

by Özalp Babaoğlu, Lorenzo Alvisi, Alessandro Amoroso, Renzo Davoli, Luigi Alberto Giachini

I just found this paper and read it to the end, since I noticed some similarities between what they proposed in 1992 and the current MapReduce programming model; some of their observations still hold true today. I will list a few of the observations/assumptions they made that show the similarity between their work and the current MapReduce programming model.

  • Large-grain data flow model suitable for high-latency, low-bandwidth networks
  • Only by keeping the communication-to-computation ratio at reasonable levels can we expect reasonable performance from parallel applications in such systems. – We noticed a similar thing when performing parallel computing on Cloud infrastructures [paper]
  • Paralex functions must be "pure" – no side effects
  • Nodes correspond to computations (functions, procedures, programs) and links indicate the flow of typed data – Compare this with Microsoft Dryad's DAG-based programming model.
The Dryad paper actually cites their work - I hadn't noticed that before ;)

Some of their performance measurements ran into trouble with a 16 MB data set because one of their machines had only 16 MB of memory. Today we have the luxury of large memories, but our data sets have also grown into petabytes. What they did with NFS is now done with HDFS in Hadoop.