Thursday, September 17, 2009

Tips for MapReduce with Hadoop

I found this nice set of tips for fine-tuning MapReduce programs on Hadoop on the Cloudera web site.

Friday, September 11, 2009

MSR Internship is over - Going back to IU

Today I finished my three-month internship at Microsoft Research. It was quite a wonderful experience for me, and I was able to accomplish most of my internship goals.

At the beginning of my internship I was given the following goals:
Evaluate the usability of DryadLINQ for scientific analyses
– Develop a series of scientific applications using DryadLINQ
– Compare them with similar MapReduce implementations (e.g. Hadoop)
– Run the above DryadLINQ applications on the cloud

During the internship, I developed four DryadLINQ applications, optimized them for performance, and identified several improvements to the current DryadLINQ code base.

I did a detailed performance analysis of the Cap3, HEP, and Kmeans applications developed using DryadLINQ, comparing them with Hadoop implementations of the same applications. The performance of the pairwise distance calculation application was compared with an MPI implementation of the same application. These findings are all included in the following two papers.
DryadLINQ for Scientific Analyses
Cloud Technologies for Bioinformatics Applications

We (my fellow intern Atilla Balkir and I) were able to deploy a Windows HPC cluster on the GoGrid cloud. I was able to run the Cap3 application on the cloud, but the other applications did not work due to limitations of the GoGrid infrastructure.

Overall, we reached the following conclusions regarding the DryadLINQ runtime.
  • We developed six DryadLINQ applications with various computation, communication, and data access requirements
    – All DryadLINQ applications work, and in many cases perform better than Hadoop
  • We can definitely use DryadLINQ for scientific analyses
  • We did not implement (or find)
    – Applications that can be implemented using DryadLINQ but not with typical MapReduce
  • The current release of DryadLINQ has some performance limitations
  • DryadLINQ hides many aspects of parallel computing from the user
    – Coding is much simpler in DryadLINQ than in Hadoop (provided that the performance issues are fixed)
  • More simplicity comes with less control, and sometimes it is hard to fine-tune
  • We showed that it is possible to run DryadLINQ on the cloud

I got all the necessary support from my mentor (Nelson Araujo), Christophe, and the ARTS team @ MSR in accomplishing the objectives of my internship. I would also like to thank the Dryad team at Silicon Valley for their dedicated support. Last but not least, the support from my advisor (Prof. Geoffrey Fox) and the SALSA team at Pervasive Technology Labs was a tremendous encouragement to me.

On Sunday we are planning to head back to Indiana with a two-week-old baby - our small miracle - in our hands.

Monday, September 07, 2009

DryadLINQ for Scientific Analyses

I spent the last three months at Microsoft Research as an intern doing research on DryadLINQ. Our goal (mine and that of another intern, Atilla Soner Balkir) was to evaluate the usability of DryadLINQ for scientific applications.

We selected a series of scientific applications, developed DryadLINQ programs for them, and evaluated their performance. We compared the performance of the DryadLINQ applications against Hadoop and, in some cases, MPI versions of the same applications.

We identified several possible improvements to DryadLINQ and its software stack, found workarounds for the inefficiencies we ran into, and were able to run most applications at 100% CPU utilization.

We compiled a paper with our findings regarding DryadLINQ and submitted it to the eScience09 conference. You can find a draft of this technical paper here.

I hope this will be useful to those of you who are developing applications using DryadLINQ.