Tuesday, December 01, 2009

Dynamic Provisioning of Virtual Clusters

Here I will present the details of the demonstration that we (the SALSA team) presented at the Supercomputing 09 (SC09) conference in Portland.

Deploying virtual/bare-system clusters on demand is an emerging requirement in many HPC centers. Tools such as xCAT and MOAB can be used to provide these capabilities on top of a physical hardware infrastructure.

In this demonstration we coupled the idea of provisioning clusters with parallel runtimes. In other words, we showed that people can switch a given set of hardware nodes between different operating systems (either virtual or non-virtual) and run different applications written using various cloud technologies. Specifically, we showed that it is possible to provision clusters with Hadoop on both Linux bare-system and virtual machines, and with Microsoft DryadLINQ on Windows Server 2008.

The following diagram shows the operating systems and software stacks we used in our demonstrations. We were able to demonstrate all configurations except Windows HPC on Xen VMs, which does not work well because para-virtualized drivers are not available for Windows guests.


We set up our demonstration on 4 clusters, each with 64 CPU cores (8 nodes with 8 CPU cores each). The first 3 clusters were configured with Red Hat Linux bare-system, Red Hat Linux on Xen, and Windows HPC Server 2008, respectively. The last cluster was configured to dynamically switch between any of the above configurations. On the Linux clusters (both bare-system and Xen) we ran a Smith-Waterman dissimilarity calculation using Hadoop as our demo application, while on Windows we ran a DryadLINQ implementation of the same application.

We developed a performance monitoring infrastructure based on pub-sub messaging to collect and summarize the CPU and memory utilization of the individual clusters, and a performance visualization GUI (thanks Saliya for the nice GUI) as our front-end demonstration component. The following two diagrams show the monitoring architecture and the GUI of our demo.
With the 3 static clusters we were able to demonstrate and compare the performance of Hadoop (on both Linux bare-system and Xen) and DryadLINQ for the audience, which also provided a way to show the overhead of virtualization. The dynamic cluster demonstrated the applicability of dynamically provisionable virtual/physical clusters with parallel runtimes for scientific research.



It was wonderful teamwork involving all the members of the SALSA team and the members of IU UITS. What is more amazing is that we got this very successful demonstration built from scratch in less than one month.
We will upload a video of the demonstration soon.

Here is a photo of most of the members of our group.
Front row (left to right): Jaliya Ekanayake, Dr. Judy Qiu, Thilina Gunarathne, Scott Beason, Jong Choi, Saliya Ekanayake, Li Hui.
Second row (left to right): Prof. Geoffrey Fox, Joe Rinkovsky, and Jenett Tillotson.

Sunday, November 01, 2009

CloudComp09 Presentation

I shared the slides presented at CloudComp09 via SlideShare.

Windows Server 2008 on Xen

As part of our ongoing research, we started running DryadLINQ applications on a virtual cluster of Windows Server 2008 VMs on top of Xen. Although we saw acceptable performance degradation with para-virtualized Linux guests (in our previous research), with Windows we saw much higher performance degradation. We also noticed that whenever we run a DryadLINQ job with a considerable amount of communication, the VM that runs the DryadLINQ job manager crashes. We had a very similar experience with Windows VMs provided by GoGrid earlier; however, those VMs had only 1 CPU core, whereas the VMs we just ran have 8 CPU cores. (Note: this effect has nothing to do with DryadLINQ, but with the handling of I/O operations by the guest and host OSs.)

Overall, the full virtualization approach does not seem to work well for parallel applications, especially for applications with considerable inter-process communication requirements. Following is a brief analysis I wrote.

===========================
If we look at the development of the virtualization technologies, we can see these three virtualization techniques.

Full virtualization (e.g., VMware Server) -> Para-virtualization (e.g., Xen, Hyper-V) -> Hardware-assisted virtualization (e.g., Hyper-V, Virtual Iron, VMware Workstation 64-bit)

So far, para-virtualization seems to be the best approach for achieving good performance. However, it requires modified guest operating systems. This is the problem for Windows guests, and that is why we cannot run Windows Server 2008 on Xen as a para-virtualized guest. According to [1], the hardware-assisted approach is still in its early stages and not performing well, but I think it will catch up soon.

Hyper-V, coming from Microsoft, may provide a better virtualization solution for Windows guests, and it currently supports both para-virtualization and hardware-assisted virtualization.

Given these observations and the results of our tests, it currently does not seem feasible to have one virtualization layer (like Xen) on a cluster and provide both Windows and Linux VMs for parallel computations. (Here I assume that Linux on Xen performs better than Linux on Hyper-V; we need to verify this.)

However, if we go for a hybrid approach, such as using technologies like xCAT to boot Windows and Linux bare-metal host environments and then providing Linux virtualization using Xen and Windows virtualization using Hyper-V, we may be able to utilize the best technologies from both worlds.

The next thing to figure out is the performance, and how we can run Dryad, Hadoop, and MPI in these virtual environments.

Although we did not see the “mythical” 7% virtualization overhead, the overheads are not too bad either (on private clouds/clusters at least – we saw 15% to 40% performance degradation).

However, we need to figure out ways of handling large data sets in VMs in order to use Hadoop and Dryad for a wide variety of problems. The main motivation of MapReduce is “moving computation to data”. In virtualized environments, we currently store only an image of the VM without data. If the data is large, we have to either move it to the local disks of the virtual resources before we run computations, or attach blobs/block storage devices to the VMs. The first approach does not preserve anything once the virtual resource is rebooted, whereas the second approach adds network links between the data and the computations (moving data). If our problems are not so data-intensive, we can simply ignore these aspects for the moment, but I think this is something worth investigating.
==========================
Hope to add more results soon.

Thursday, September 17, 2009

Tips for MapReduce with Hadoop

I found this nice set of tips for fine-tuning MapReduce programs using Hadoop on the Cloudera web site.

Friday, September 11, 2009

MSR Internship is over - Going back to IU

Today I finished my 3-month internship at Microsoft Research. It was quite a wonderful experience for me, and I was able to accomplish most of my internship goals.

At the beginning of my internship I was given the following goals.
Evaluate the usability of DryadLINQ for scientific analyses
– Develop a series of scientific applications using DryadLINQ
– Compare them with similar MapReduce implementations (e.g. Hadoop)
– Run the above DryadLINQ applications on the cloud

During the internship, I developed four DryadLINQ applications, optimized them for performance, and identified several improvements to the current DryadLINQ code base.

I did a detailed performance analysis of the Cap3, HEP, and Kmeans applications developed using DryadLINQ, comparing them with Hadoop implementations of the same applications. The performance of the pairwise distance calculation application was compared with an MPI implementation of the same application. These findings are all included in the following two papers.
DryadLINQ for Scientific Analyses
Cloud Technologies for Bioinformatics Applications

We (my colleague intern Atilla Balkir and I) were able to deploy a Windows HPC cluster on the GoGrid cloud. I was able to run the Cap3 application on the cloud, but the other applications did not work due to the limitations of the GoGrid infrastructure.

Overall, we have the following conclusions regarding the DryadLINQ runtime.
  • We developed six DryadLINQ applications with various computation, communication, and data access requirements
    – All DryadLINQ applications work, and in many cases perform better than Hadoop
  • We can definitely use DryadLINQ for scientific analyses
  • We did not implement (or find)
    – Applications that can only be implemented using DryadLINQ but not with typical MapReduce
  • The current release of DryadLINQ has some performance limitations
  • DryadLINQ hides many aspects of parallel computing from the user
    – Coding is much simpler in DryadLINQ than in Hadoop (provided that the performance issues are fixed)
  • More simplicity comes with less control, and sometimes it is hard to fine-tune
  • We showed that it is possible to run DryadLINQ on the cloud

I got all the necessary support from my mentor (Nelson Araujo), Christophe, and the ARTS team @ MSR in accomplishing the objectives of my internship. I would also like to thank the Dryad team at Silicon Valley for their dedicated support. Last but not least, the support from my advisor (Prof. Geoffrey Fox) and the SALSA team at Pervasive Technology Labs was a tremendous encouragement to me.

On Sunday we are planning to head back to Indiana with a two-week-old baby - our small miracle - in our hands.

Monday, September 07, 2009

DryadLINQ for Scientific Analyses

I spent the last 3 months at Microsoft Research as an intern doing research on DryadLINQ. Our goal (mine and another intern's - Atilla Soner Balkir) was to evaluate the usability of DryadLINQ for scientific applications.

We selected a series of scientific applications, developed DryadLINQ programs for them, and evaluated their performance. We compared the performance of the DryadLINQ applications against Hadoop and, in some cases, MPI versions of the same applications.

We identified several improvements to DryadLINQ and its software stack, found workarounds for these inefficiencies, and were able to run most applications with 100% CPU utilization.

We compiled a paper including our findings regarding DryadLINQ and submitted it to the eScience09 conference. You can find a draft of this technical paper here.

Hope this will be useful to some of you who are developing applications using DryadLINQ.

Friday, June 19, 2009

High Performance Parallel Computing with Clouds and Cloud Technologies

We compiled the latest results/findings of our research as a paper and submitted it to CloudComp2009.
Following is the abstract of the paper.

Infrastructure services (Infrastructure-as-a-service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data/compute intensive analyses requires employing different parallel runtimes to implement such applications. Among many parallelizable problems, most “pleasingly parallel” applications can be performed using MapReduce technologies such as Hadoop, CGL-MapReduce, and Dryad, in a fairly easy manner. However, many scientific applications, which require complex communication patterns, still require optimized runtimes such as MPI. We first discuss large scale data analysis using different MapReduce implementations and then, we present a performance analysis of high performance parallel applications on virtualized resources.

You can find the draft of the paper here.

Thursday, May 14, 2009

How to control the number of tasks per node when you run your jobs in Hadoop cluster?

One of my colleagues asked me the following question.

How to control the number of tasks per node when you run your jobs in Hadoop cluster?

We can do this by modifying hadoop-site.xml. The exact XML for these properties can be found in hadoop-default.xml.

So here is the method.

Modify the $HADOOP_HOME/conf/hadoop-site.xml and add the following lines.

<property>
  <!-- maximum number of map tasks a single tasktracker (i.e., one node) runs simultaneously -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

<property>
  <!-- maximum number of reduce tasks a single tasktracker (i.e., one node) runs simultaneously -->
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

You can find these properties in hadoop-default.xml, but it is better not to modify them there.
Instead, copy the properties to hadoop-site.xml and change the values there; the defaults will then be overridden by the properties in hadoop-site.xml.

The third knob is JobConf.setNumMapTasks(), which defines how many map tasks Hadoop should execute for the entire job. This simply determines the data-splitting factor.

With all three parameters we can better control Hadoop's job parallelism.
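To see where each knob fits, here is a minimal driver sketch (the class name and input/output paths are hypothetical, not from an actual job of ours). The two tasktracker properties above are per-node limits read from the configuration files, whereas setNumMapTasks() and setNumReduceTasks() are per-job settings made in the driver code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ParallelismExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ParallelismExample.class);
    conf.setJobName("parallelism-example");

    // Per-job settings: setNumMapTasks() is only a hint about the data splitting
    // factor, while setNumReduceTasks() is honored exactly.
    conf.setNumMapTasks(64);
    conf.setNumReduceTasks(16);

    // Hypothetical HDFS paths; mapper and reducer default to the identity classes.
    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    JobClient.runJob(conf);
  }
}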

Wednesday, May 13, 2009

High Energy Physics Data Analysis using Microsoft Dryad

A demo of the Dryad version of the HEP data analysis can be found here.

Sunday, April 05, 2009

Classes of MapReduce Applications and Different Runtimes

In my experience with MapReduce runtimes and applications, I have noticed a few classes of applications where users can benefit from the available MapReduce runtimes. Let me list them below with their common characteristics.

1. Pleasingly Parallel Applications.
E.g. processing a set of biology data files using the Cap3 program.

This could be the most common class of parallel applications that many users need. The common characteristic is the application of a function or a program to a collection of data files, producing another set of data files. Processing a set of medical images is another similar application.

When using MapReduce runtimes for these types of applications, we can simply use the "map only" mode of the runtimes. For example, in Hadoop we can use 0 reduce tasks as follows.

// Configure a "map only" Hadoop job by setting the number of reduce tasks to 0
JobConf jc = new JobConf(getConf(), Cap3Analysis.class);
jc.setNumReduceTasks(0);


I implemented "map only" support for CGL-MapReduce as well, and it greatly increases the usability of the runtime for another large class of parallel applications.

More information about Cap3 and the performance of different runtimes such as Hadoop, Dryad, and CGL-MapReduce can be found on the CGL-MapReduce web page.

This class of applications is also best suited for Azure applications written using the Worker Role, where a set of workers is assigned to process the computation tasks available in a Queue.

2. Typical MapReduce Applications

E.g. High Energy Physics data analysis, where a set of analysis functions is executed on each data file of a collection during the "map" stage, producing a collection of intermediate histograms, and in the "reduce" stage these histograms are merged together to produce the final histogram.

Histogramming words and Distributed Grep are very common examples used in most tutorials.

This class utilizes the complete execution model of MapReduce. The application of the "reduce" stage and the function performed by that stage depend on the "associative" and "commutative" properties of the application itself.
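To make the merge step concrete, here is a minimal sketch of such a "reduce" function in Hadoop's old mapred API (this is not our actual HEP code; the class name and the bin-index/bin-content representation are assumptions for illustration). Each map task emits (bin index, bin content) pairs for its intermediate histogram, and the reducer sums the contributions per bin, which is valid precisely because histogram addition is associative and commutative:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HistogramMergeReducer extends MapReduceBase
    implements Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

  public void reduce(IntWritable bin, Iterator<DoubleWritable> contributions,
      OutputCollector<IntWritable, DoubleWritable> output, Reporter reporter)
      throws IOException {
    double sum = 0.0;
    while (contributions.hasNext()) {
      sum += contributions.next().get();   // add this data file's contribution to the bin
    }
    output.collect(bin, new DoubleWritable(sum));  // one bin of the final merged histogram
  }
}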

3. Iterative MapReduce Applications.

This is a complex class of applications where we need to apply multiple stages of MapReduce computations to accomplish the data analysis task. Kmeans clustering is a simple application of this nature, where each MapReduce iteration clusters a set of data points around a given number of cluster centers. The next iteration takes the previous cluster centers as input and calculates another set of cluster centers, minimizing the difference.

Matrix multiplication and multi-dimensional scaling are two more applications of this class, and many machine learning algorithms also fit into this class of applications.
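With Hadoop, the only way to express the Kmeans iteration described above is to drive a chain of separate jobs from the client, reloading the data points every time. A minimal driver sketch is shown below (the class name, paths, and configuration key are hypothetical, and the mapper/reducer classes for the assignment and center-recomputation steps are omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class KmeansDriver {
  public static void main(String[] args) throws Exception {
    Path dataPoints = new Path(args[0]);   // the (static) data points, re-read every iteration
    String centersDir = args[1];           // directory holding the cluster centers per iteration
    int maxIterations = 20;                // safety cap; real code would also test convergence

    for (int i = 0; i < maxIterations; i++) {
      JobConf conf = new JobConf(KmeansDriver.class);
      conf.setJobName("kmeans-iteration-" + i);
      // setMapperClass()/setReducerClass() for the point-assignment and
      // center-recomputation steps are omitted in this sketch.

      // Tell the tasks where the previous iteration's centers live
      // (they could also be shipped via the distributed cache).
      conf.set("kmeans.centers.path", centersDir + "/iter-" + i);

      FileInputFormat.setInputPaths(conf, dataPoints);
      FileOutputFormat.setOutputPath(conf, new Path(centersDir + "/iter-" + (i + 1)));

      JobClient.runJob(conf);              // one complete MapReduce pass per iteration

      // A real driver would read the old and new centers here and stop once the
      // total movement falls below a threshold.
    }
  }
}

Each iteration pays the full job startup and data loading cost again, which is largely where the overheads discussed below come from.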

However, the applicability of current runtimes such as Hadoop and Dryad to this class of applications is questionable. Our results, shown in the following graphs, illustrate the overhead added by runtimes such as Dryad and Hadoop for such data analyses. Also note how the lightweight MapReduce runtime - CGL-MapReduce - and its support for iterative applications have enabled it to produce results close to MPI.

Figure 1. Performance of different parallel runtimes in executing Kmeans clustering
Figure 2. Overhead of different parallel runtimes in executing Kmeans clustering

Simple Passwordless SSH Tutorial

I was configuring Xen virtual machines to run MPI programs. Running MPI, Hadoop, or even CGL-MapReduce requires passwordless login between the nodes of a cluster.

These are the same old steps that we need to perform, but I found this simple tutorial which lists the process very succinctly.

To get MPI to work you also need to add the following line to your bash (or other shell) profile.

export LAMRSH="ssh -x"

Friday, February 27, 2009

Eucalyptus without a DNS

It has been a while since I last blogged about my research. I went to my home country (Sri Lanka) for the winter break and came back in mid January. After coming back I realized that I needed a big vacation again :). I had to finish a lot of pending work, but now I'm back on track and can blog again. Still, it is 1.56 am in the morning :)

I just got a set of Xen VMs running with Eucalyptus after a few hours of debugging to fix the following warning, which I got when I tried to ssh between the VM instances.
-----

get_socket_address: getnameinfo 8 failed: Name or service not known
userauth_hostbased: cannot get local ipaddr/name
-----

Actually, it has nothing to do with the VM usage, but with something I had missed in configuring host-based authentication using ssh.

Since it could be helpful to someone else, let me describe the problem and the solution.
I was trying to start a set of VM instances using a Eucalyptus cloud that has been set up on an iDataPlex cluster here at Indiana. I was able to follow the guidelines and start the VMs.

My goal was then to run a set of MPI applications on these VMs. To get MPI working, I need passwordless login between the nodes. Since I had previously configured this in the VM image, things worked fine, but ssh gave the above warning when I tried to log in between nodes, which then stopped the MPI daemons from starting.

When I used these VM images with the Eucalyptus public cloud, every VM instance was assigned a static IP, so setting up a DNS was not required. This time, however, we use only one static IP for one of the VM instances while the rest have dynamic IPs, and there is no DNS set up for these VM instances. This causes the above warning when I try to log in between the nodes using ssh.

Joe Rinkovsky, a Unix Systems Specialist @ Indiana, explained the fix to me: I need to add (IP, name) pairs to the /etc/hosts file of the head node of the VM instances. The solution is explained here under "make sure name resolution works", which I hadn't noticed until today :).
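For anyone hitting the same warning, the entries look like the following (the addresses and host names here are made-up examples, not our actual instances); once every node can be resolved by name, the getnameinfo error goes away:

# example /etc/hosts on the head node (hypothetical IPs and names)
127.0.0.1      localhost
10.0.0.10      vm-head
10.0.0.11      vm-node01
10.0.0.12      vm-node02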