Open Notebook [my notebook in the cloud]
Jaliya Ekanayake

Offline Disk - Make It Online (2014-02-14)

I noticed that one of the disks in my machine was set to offline, and the Disk Management tool was not giving me an option to bring it back online. Thanks to the following blog post, I was able to fix the issue:

<em><a href="http://www.happysysadm.com/2010/11/disk-is-offline-because-of-policy-set.html">The disk is offline because of policy set by an administrator</a></em>
Azure Platform Appliance (2010-07-16)

Microsoft recently announced its next move with Azure: the <a href="http://www.microsoft.com/windowsazure/appliance/">Azure Platform Appliance</a>. The idea is to offer the Azure platform as an infrastructure that can be deployed on private clusters. From my perspective, this is somewhat similar to the private clouds (infrastructure services) that one can deploy on local clusters using runtimes such as Eucalyptus or Nimbus. However, since Azure is a platform rather than a pure infrastructure service, it will be more flexible for users: unlike pure virtual machines, the Azure Platform Appliance will expose most of Azure's platform services as well, and migration between a private Azure and the public Azure will be seamless. Overall, I think the biggest advantage of this approach is the peace of mind it gives businesses: "I am in control of my data, and it is local."

Azure Storage Browser in Visual Studio (2010-06-24)

Now <a href="http://blogs.msdn.com/b/jnak/archive/2010/06/10/windows-azure-storage-browser-in-the-visual-studio-server-explorer.aspx">this</a> is a handy tool.

Hadoop Application Wiki (2010-06-11)

A nice collection of Hadoop apps: <a href="http://wiki.apache.org/hadoop/PoweredBy">http://wiki.apache.org/hadoop/PoweredBy</a>

What is the concept of "static data" in Twister? (2010-03-18)

Many iterative applications we analyzed share a common characteristic: they operate on two types of data products, called static and variable data. Static data (most of the time the larger of the two) is used in each iteration and remains fixed throughout the computation, whereas variable data is the result computed in each iteration, typically consumed by the next iteration in many expectation maximization (EM) type algorithms.

If a set of data is read by the application but never modified, then it can be considered "static" in Twister. Matrix blocks, the points in a clustering computation, and the web graph in PageRank are all examples of static data.
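To make the distinction concrete, here is a small, purely illustrative Java sketch of the PageRank case; it is not the Twister API, and every name in it is hypothetical.

<pre>
// Illustrative only -- not the Twister API. The web graph is "static"
// (read in every iteration, never modified); the rank vector is "variable"
// (produced by one iteration and consumed by the next).
public class StaticVsVariableData {
    public static void main(String[] args) {
        int[][] webGraph = loadAdjacencyLists();           // static data
        double[] ranks   = uniformRanks(webGraph.length);  // variable data

        for (int iter = 0; iter < 30; iter++) {
            // map: each task works on a cached partition of the static graph
            //      and receives only the current rank vector;
            // reduce: sums the rank contributions per page.
            ranks = mapReduceIteration(webGraph, ranks);
        }
    }

    // Hypothetical helpers, omitted for brevity.
    static int[][] loadAdjacencyLists() { return new int[0][0]; }
    static double[] uniformRanks(int n) { return new double[n]; }
    static double[] mapReduceIteration(int[][] g, double[] r) { return r; }
}
</pre>

A runtime that knows which input is static can cache it in long-running map/reduce tasks instead of re-reading it in every iteration.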
How does Twister differ from MRNet? (2010-03-18)

There are many differences between <a href="http://www.paradyn.org/mrnet/">MRNet</a> and MapReduce in general, and Twister in particular. I will list some here.

In MapReduce, the communication between the map and reduce stages of the computation forms a graph, not a tree: a map task can produce (key, value) pairs that end up in multiple reducers. The framework does not impose any communication or connection topology between the map and reduce stages; it is purely the intermediate keys that determine the communication pattern. For associative and commutative operations such as sums or histograms, how the intermediate keys distribute the intermediate results among the reduce tasks is not that important. For operations such as sorting or matrix operations, however, one can select the intermediate keys so that specific keys go to specific reduce tasks. Again, this is defined not by the network but by the keys and the key selector functions.
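To illustrate that it is the keys, not a fixed topology, that determine the routing: in Hadoop a custom partitioner decides which reduce task receives each intermediate key. Below is a minimal sketch, assuming the classic org.apache.hadoop.mapred API of that era; the key range used for the split is made up for illustration.

<pre>
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes keys to reducers by key range, e.g. for a sort-style job where
// reducer i must receive a contiguous slice of the key space.
public class RangePartitioner implements Partitioner<IntWritable, Text> {
    public void configure(JobConf job) { }  // no per-job setup needed here

    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // Assumes keys fall in [0, 1000); each reducer gets an equal-width
        // range. A hash partitioner would instead compute
        // (key.hashCode() & Integer.MAX_VALUE) % numPartitions.
        int bucket = key.get() * numPartitions / 1000;
        return Math.min(bucket, numPartitions - 1);
    }
}
</pre>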
Twister uses pub/sub messaging to implement a MapReduce runtime, especially to support iterative MapReduce computations. Like other MapReduce runtimes, it focuses on processing data while maintaining data-process affinity. Map and reduce functions in Twister are long-running processes, which makes the distinction between static data and variable data possible. Twister supports broadcast and scatter type data distributions as well as reading data via the local disks; I am not sure how these latter two functions could be handled using MRNet.

Comparing MRNet with MapReduce for data processing applications would be an interesting analysis to do, though.

Twister: Iterative MapReduce (2010-02-09)

The MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce have attracted a lot of enthusiasm from the parallel computing community. From years of experience applying the MapReduce programming model to various scientific applications, we identified a set of extensions to the programming model and improvements to its architecture that expand the applicability of MapReduce to more classes of applications.

Twister is a lightweight MapReduce runtime we have developed by incorporating these enhancements. We have published several scientific papers [1-5] explaining the key concepts and comparing it with other MapReduce implementations such as Hadoop and DryadLINQ. Today we would like to announce its first release.

Key features of Twister:
<ul><li>Distinction between static and variable data</li><li>Configurable long-running (cacheable) map/reduce tasks</li><li>Pub/sub messaging based communication/data transfers</li><li>Combine phase to collect all reduce outputs</li><li>Efficient support for iterative MapReduce computations (much faster than Hadoop or DryadLINQ)</li><li>Data access via local disks</li><li>Lightweight (5600 lines of code)</li><li>Tools to manage data</li></ul>

For more details please visit <a href="http://www.iterativemapreduce.org">www.iterativemapreduce.org</a> and let us know your thoughts and experience using Twister.

Thank you,

<a href="http://salsaweb.ads.iu.edu/salsa/">SALSAHPC Team</a>

[1] Jaliya Ekanayake (advisor: Geoffrey Fox), <a href="http://grids.ucs.indiana.edu/ptliupages/publications/SC09-abstract-jaliya-ekanayake.pdf">Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing</a>, Doctoral Showcase, SuperComputing 2009.
[2] Jaliya Ekanayake, Atilla Soner Balkir, Thilina Gunarathne, Geoffrey Fox, Christophe Poulain, Nelson Araujo, Roger Barga, <a href="http://grids.ucs.indiana.edu/ptliupages/publications/eScience09-camera-ready-submission.pdf">DryadLINQ for Scientific Analyses</a>, Fifth IEEE International Conference on e-Science (eScience 2009), Oxford, UK.
[3] Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox, <a href="http://grids.ucs.indiana.edu/ptliupages/publications/cloud_handbook_final-with-diagrams.pdf">High Performance Parallel Computing with Clouds and Cloud Technologies</a>, technical report, August 25, 2009; to appear as a book chapter.
[4] Geoffrey Fox, Seung-Hee Bae, Jaliya Ekanayake, Xiaohong Qiu, Huapeng Yuan, <a href="http://grids.ucs.indiana.edu/ptliupages/publications/CetraroWriteupJune11-09.pdf">Parallel Data Mining from Multicore to Cloudy Grids</a>, High Performance Computing and Grids workshop, 2008; an extended version appears as a book chapter.
[5] Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox, <a href="http://grids.ucs.indiana.edu/ptliupages/publications/ekanayake-MapReduce.pdf">MapReduce for Data Intensive Scientific Analyses</a>, Fourth IEEE International Conference on eScience, 2008, pp. 277-284.
Almost MapReduce in 1992 (2010-01-03)

Paralex: An Environment for Parallel Programming in Distributed Systems (1992)

<em>by Özalp Babaoglu, Lorenzo Alvisi, Alessandro Amoroso, Renzo Davoli, and Luigi Alberto Giachini</em>

I just found this paper and read it to the end, since I noticed some similarities between what they proposed in 1992 and the current MapReduce programming model; some of their observations still hold today. Let me list a few of the observations/assumptions they made that show the similarity between their work and the current MapReduce programming model.

<ul><li><em>A large-grain data flow model suitable for high-latency, low-bandwidth networks.</em></li><li><em>"Only by keeping the communication-computation ratio to reasonable levels can we expect reasonable performance from parallel applications in such systems."</em> We noticed a similar thing when performing parallel computing on cloud infrastructures [<a href="http://grids.ucs.indiana.edu/ptliupages/publications/CGLCloudReview.pdf">paper</a>].</li><li><em>Paralex functions must be "pure", with no side effects.</em></li><li><em>Nodes correspond to computations (functions, procedures, programs) and links indicate the flow of typed data.</em> Compare this with Microsoft Dryad's DAG-based programming model.</li></ul>

The <a href="http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf">Dryad paper</a> actually cites their work; I hadn't noticed it before ;)

Some of their performance measurements had issues with a 16 MB data set because one of their machines had only 16 MB of memory. Today we have the luxury of large memories, but our data sets have also grown into petabytes. What they did with NFS is now done with HDFS in Hadoop.

Dynamic Provisioning of Virtual Clusters (2009-12-01)

Here I will present the details of the demonstration that we (the <a href="http://www.infomall.org/salsa/">SALSA</a> team) presented at the SuperComputing 09 conference in Portland.

Deploying virtual or bare-system clusters on demand is an emerging requirement in many HPC centers. Tools such as <a href="http://xcat.sourceforge.net/">xCAT</a> and <a href="http://www.clusterresources.com/products/moab-cluster-suite.php">MOAB</a> can provide these capabilities on top of a physical hardware infrastructure.

In this demonstration we coupled the idea of provisioning clusters with parallel runtimes. In other words, we showed that a given set of hardware nodes can be switched between different operating systems (virtual or non-virtual) to run applications written using various cloud technologies. Specifically, we showed that it is possible to provision clusters with Hadoop on both bare-system Linux and virtual machines, and with Microsoft DryadLINQ on Windows Server 2008.

The following diagram shows the operating systems and software stacks we used in our demonstration. We were able to demonstrate all configurations except Windows HPC on Xen VMs, which does not work well due to the absence of para-virtualized drivers.

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8Xzz3TtLcbc4macezz2_BQXFhbi349adnzpQqMX2NQRIoWzQfw0w1DhwjIWqaQ0CbUUC5SylPYVnFZbNpAEyIayRcibQLX_EeiIEme_ZsGBywknYoqQISR7GynDA0OZEsiXXQpw/s1600/architecture.png">[Figure: operating system and software stacks used in the demonstration]</a>

We set up our demonstration on 4 clusters, each with 64 CPU cores (8 nodes with 8 CPU cores each). The first 3 clusters were configured with bare-system Red Hat Linux, Red Hat Linux on Xen, and Windows HPC 2008, respectively. The last cluster was configured to switch dynamically between any of the above configurations. On the Linux clusters (both bare-system and Xen) we ran a Smith-Waterman dissimilarity calculation using Hadoop as our demo application, while on Windows we ran a DryadLINQ implementation of the same application.

We developed a performance monitoring infrastructure based on pub/sub messaging to collect and summarize the CPU and memory utilization of individual clusters, and a performance visualization GUI (thanks, Saliya, for the nice GUI) as our front-end demonstration component.
The following two diagrams show the monitoring architecture and the GUI of our demo. With the 3 static clusters we were able to demonstrate and compare the performance of Hadoop (on both bare-system Linux and Xen) and DryadLINQ, which also gave us a way to show the overhead of virtualization. The dynamic cluster demonstrated the applicability of dynamically provisionable virtual/physical clusters with parallel runtimes for scientific research.

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBcVahc9jrHsC3b1NaSfDUPlWn6eTVDeJnMcfnG4LvDVyYEvXSKkrMAvrxtKyYDjzk_9czBP25l0jHBPma-IFA6ZrUWxLNdmat4Nq8ZGycisvlt0DdpK26iMt5GCk1YD9JX73k5Q/s1600/monitoring_architecture.png">[Figure: monitoring architecture]</a>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUJ6wbuZjOc-bXab1SroECz_yN75EA45aDSoMDwQHDKp1qswOTcd3bR3AWSe8F8k2yF-aeRN9ni36hMeRpwYTCFqVvBWELJcO7_Cps7ARyiAiPAL5CPqsEEz4VPkDLF3Po2DkGMw/s1600/monitor_full.png">[Figure: performance monitoring GUI]</a>

It was a wonderful team effort involving all the members of the SALSA team and the members of IU UITS. What is more amazing is that we got this very successful demonstration built from scratch in less than one month. We will upload a video of the demonstration soon.

Here is a photo of most of the members of our group.

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFVCKbCDEjKMyEMxq6ceVn7cguAxBVw741KK46JKHMJYArN6MVXcz0NI6efK1EdhtnYfTrna-mbLEKBHACDhVaYGdeSV50N0i3NOG9bhy9ypkpUh70OUPkqoM1xnSKN2Ef18fTvA/s1600/100_8343.JPG">[Photo: members of the SALSA team]</a>

Front row (left to right): Jaliya Ekanayake, Dr. Judy Qiu, Thilina Gunarathne, Scott Beason, Jong Choi, Saliya Ekanayake, Li Hui. Second row (left to right): Prof.
Geoffrey Fox, Joe Rinkovsky, and Jenett Tillotson.

CloudComp09 Presentation (2009-11-01)

I shared the slides I presented at CloudComp09 on <a href="http://www.slideshare.net/jaliyae/high-performance-parallel-computing-with-clouds-and-cloud-technologies">SlideShare</a>.

Windows Server 2008 on Xen (2009-11-01)

As part of our ongoing research, we started running DryadLINQ applications on a virtual cluster of Windows Server 2008 VMs on top of Xen. Although we saw acceptable performance degradation with para-virtualized Linux guests in our previous research, with Windows we noticed much higher performance degradation. We also noticed that whenever we run a DryadLINQ job with a considerable amount of communication, the VM that runs the DryadLINQ job manager crashes. We had exactly the same experience with the Windows VMs provided by GoGrid earlier; however, those VMs had only 1 CPU core, whereas the VMs we just ran have 8 CPU cores. (Note: this effect has nothing to do with DryadLINQ, but with the handling of I/O operations by the guest and host OSs.)

Overall, the full virtualization approach does not seem to work well for parallel applications, especially applications with considerable inter-process communication requirements. Following is a brief analysis I wrote.

===========================

If we look at the development of virtualization technologies, we can see three techniques:

Full virtualization (e.g., VMware Server) -> Para-virtualization (e.g., Xen, Hyper-V) -> Hardware-assisted virtualization (e.g., Hyper-V, Virtual Iron, 64-bit VMware Workstation)

So far, para-virtualization seems to be the best approach for achieving good performance from virtualization. However, it requires modified guest operating systems. This is the problem for Windows guests, and it is why we cannot run Windows Server 2008 on Xen as a para-virtualized guest. According to [1], the hardware-assisted approach is still in its early stages and not performing well yet, but I think it will catch up soon.

Hyper-V, coming from Microsoft, may provide better virtualization for Windows guests; it currently supports both para-virtualization and hardware-assisted virtualization.

Given these observations and the observations from our tests, it currently does not seem feasible to have one virtualization layer (like Xen) on a cluster and provide both Windows and Linux VMs for parallel computations.
(Here I assume that Linux on Xen performs better than Linux on Hyper-V; we need to verify this.)

However, if we go for a hybrid approach, using technologies like xCAT to boot Windows and Linux bare-metal host environments and providing Linux virtualization with Xen and Windows virtualization with Hyper-V, we may be able to utilize the best technologies from both worlds.

The next thing to figure out is the performance, and how we can run Dryad, Hadoop, and MPI on these virtual environments.

Although we did not see the <b>"mythical" 7%</b> virtualization overhead, the numbers are not so bad either (on private clouds/clusters at least, we saw 15% to 40% performance degradation).

However, we need to figure out ways of handling large data sets in VMs in order to use Hadoop and Dryad for a wide variety of problems. The main motivation of MapReduce is "moving computation to data". In virtualized environments, we currently store only an image of the VM, without data. If the data is large, we have to either move it to the local disks of the virtual resources before we run computations, or attach blob/block storage devices to the VMs. The first approach does not preserve anything once the virtual resource is rebooted, whereas the second approach adds network links between the data and the computation (moving data). If our problems are not very data intensive, we can simply ignore these aspects for the moment, but I think this is something worth investigating.

==========================

Hope to add more results soon.

Tips for MapReduce with Hadoop (2009-09-17)

I found <a href="http://www.cloudera.com/blog/2009/05/18/10-mapreduce-tips/">this nice set of tips</a> for fine-tuning MapReduce programs with Hadoop on the <a href="http://www.cloudera.com/">Cloudera</a> web site.

MSR Internship is Over - Going Back to IU (2009-09-11)

Today I finished my 3-month internship at Microsoft Research. It was a wonderful experience for me, and I was able to accomplish most of my internship goals.

At the beginning of my internship I was given the following goals:

Evaluate the usability of DryadLINQ for scientific analyses:
<ul><li>Develop a series of scientific applications using DryadLINQ</li><li>Compare them with similar MapReduce implementations (e.g., Hadoop)</li><li>Run the above DryadLINQ applications on the cloud</li></ul>

During the internship, I developed four DryadLINQ applications, optimized them for performance, and identified several improvements to the current DryadLINQ code base.

I did a detailed performance analysis of the Cap3, HEP, and Kmeans applications developed using DryadLINQ, comparing them with Hadoop implementations of the same applications. The performance of the pairwise distance calculation application was compared with an MPI implementation of the same application.
These findings were all included in the following two papers:
<a href="http://grids.ucs.indiana.edu/ptliupages/publications/DryadLINQ_for_Scientific_Analyses.pdf">DryadLINQ for Scientific Analyses</a>
<a href="http://grids.ucs.indiana.edu/ptliupages/publications/MTAGS09-23.pdf">Cloud Technologies for Bioinformatics Applications</a>

We (my colleague intern Atilla Balkir and I) were able to deploy a Windows HPC cluster on the GoGrid cloud. I was able to run the Cap3 application on the cloud, but the other applications did not work due to limitations of the GoGrid infrastructure.

Overall, we reached the following conclusions regarding the DryadLINQ runtime:
<ul><li>We developed six DryadLINQ applications with various computation, communication, and data access requirements. All the DryadLINQ applications work, and in many cases perform better than Hadoop.</li><li><span style="color:#cc0000;">We can definitely use DryadLINQ for scientific analyses.</span></li><li>We did not implement (or find) applications that can only be implemented using DryadLINQ but not with typical MapReduce.</li><li>The current release of DryadLINQ has some performance limitations.</li><li>DryadLINQ hides many aspects of parallel computing from the user; coding is much simpler in DryadLINQ than in Hadoop (provided that the performance issues are fixed).</li><li>More simplicity comes with less control, and sometimes it is hard to fine-tune.</li><li>We showed that it is possible to run DryadLINQ on the cloud.</li></ul>

I got all the necessary support from my mentor (Nelson Araujo), Christophe, and the ARTS team at MSR in accomplishing the objectives of my internship. I would also like to thank the Dryad team in Silicon Valley for their dedicated support. Last but not least, the support from my advisor (<a href="http://www.informatics.indiana.edu/research/profiles/gcf.asp">Prof. Geoffrey Fox</a>) and the <a href="http://www.infomall.org/multicore/index.php/Main_Page">SALSA team</a> at Pervasive Technology Labs was a tremendous encouragement to me.

Sunday we are planning to head back to Indiana with a two-week-old baby, our small miracle, in our hands.

DryadLINQ for Scientific Analyses (2009-09-07)

I spent the last 3 months at Microsoft Research as an intern doing research on DryadLINQ. Our goal (mine and another intern's, <a href="http://www.cs.uchicago.edu/people/soner">Atilla Soner Balkir</a>) was to evaluate the usability of DryadLINQ for scientific applications.

We selected a series of scientific applications, developed DryadLINQ programs for them, and evaluated their performance. We compared the performance of the DryadLINQ applications against Hadoop and, in some cases, MPI versions of the same applications.

We identified several improvements to DryadLINQ and its software stack, found workarounds for these inefficiencies, and were able to run most applications at 100% CPU utilization.

We compiled a paper including our findings regarding DryadLINQ and submitted it to the eScience 2009 conference.
You can find a draft of this technical paper <a href="http://grids.ucs.indiana.edu/ptliupages/publications/DryadLINQ_for_Scientific_Analyses.pdf">here</a>.

I hope this will be useful to those of you who are developing applications using DryadLINQ.

Microsoft Released Dryad and DryadLINQ for Academic Use (2009-07-13)

<a href="http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx">http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx</a>

High Performance Parallel Computing with Clouds and Cloud Technologies (2009-06-19)

We compiled the latest results/findings of our research into a paper and submitted it to CloudComp 2009. Following is the abstract of the paper.

Infrastructure services (Infrastructure-as-a-Service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data/compute intensive analyses requires employing different parallel runtimes to implement such applications. Among many parallelizable problems, most "pleasingly parallel" applications can be performed using MapReduce technologies such as Hadoop, CGL-MapReduce, and Dryad in a fairly easy manner. However, many scientific applications, which require complex communication patterns, still require optimized runtimes such as MPI. We first discuss large scale data analysis using different MapReduce implementations, and then we present a performance analysis of high performance parallel applications on virtualized resources.

You can find the draft of the paper <a href="http://grids.ucs.indiana.edu/ptliupages/publications/cloudcomp_submission.pdf">here</a>.

How to Control the Number of Tasks per Node When You Run Your Jobs on a Hadoop Cluster (2009-05-14)

One of my colleagues asked me the following question:

<b>How do you control the number of tasks per node when you run your jobs on a Hadoop cluster?</b>

We can do this by modifying hadoop-site.xml.
However, the exact XML for the properties is in hadoop-default.xml. So here is the method: modify $HADOOP_HOME/conf/hadoop-site.xml and add the following lines.

<pre>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
</pre>

You can find these properties in hadoop-default.xml, but it is better not to modify them there. Instead, copy the properties into hadoop-site.xml and change the values; the defaults will then be overridden by the properties in hadoop-site.xml.

<code>JobConf.setNumMapTasks()</code> defines how many map tasks Hadoop should execute for the entire job; this simply determines the data splitting factor.

With all three parameters we can better control a Hadoop job's parallelism.
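For completeness, here is a minimal driver sketch assuming the classic org.apache.hadoop.mapred API; the class and job names are hypothetical, and the input/output setup is omitted. Note that the two tasktracker maxima above are cluster-side settings read by each TaskTracker at startup, so only the job-level counts are set in code.

<pre>
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        JobConf jc = new JobConf(ParallelismDemo.class);
        jc.setJobName("parallelism-demo");

        // Job-level knobs: a hint for the number of input splits (map tasks)
        // and the exact number of reduce tasks for this job.
        jc.setNumMapTasks(40);
        jc.setNumReduceTasks(2);

        // Per-node concurrency (mapred.tasktracker.*.tasks.maximum) comes
        // from hadoop-site.xml on each node, not from this per-job object.
        // Mapper/reducer classes and input/output paths omitted for brevity.
        JobClient.runJob(jc);
    }
}
</pre>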
High Energy Physics Data Analysis Using Microsoft Dryad (2009-05-13)

A demo of the Dryad version of the HEP data analysis can be found <a href="http://cs.indiana.edu/%7Ejekanaya/papers/demos/hep_demo.html">here</a>.

Classes of MapReduce Applications and Different Runtimes (2009-04-05)

In my experience with MapReduce runtimes and applications, I have noticed a few classes of applications whose users can benefit from the available MapReduce runtimes. Let me list them below with their common characteristics.

<b>1. Pleasingly parallel applications</b>

E.g., processing a set of biology data files using the <a href="http://pbil.univ-lyon1.fr/cap3.php">Cap3</a> program.

This may be the most common class of parallel applications that many users need. The common characteristic is the application of a function or a program to a collection of data files, producing another set of data files. Processing a set of medical images is another application of this type.

When using MapReduce runtimes for these applications, we can simply use the "map only" mode of the runtimes. For example, in Hadoop we can use 0 reduce tasks as follows:

<pre>
JobConf jc = new JobConf(getConf(), Cap3Analysis.class);
jc.setNumReduceTasks(0);  // map-only: map outputs go straight to the job output
</pre>

I implemented "map only" support in CGL-MapReduce as well, and it greatly increases the applicability of the runtime to another large class of parallel applications.

More information about Cap3 and the performance of different runtimes such as Hadoop, Dryad, and CGL-MapReduce can be found on the <a href="http://www.cs.indiana.edu/%7Ejekanaya/cglmr.html">CGL-MapReduce web page</a>.

This class of applications is also a good fit for <a href="http://www.microsoft.com/azure/default.mspx">Azure</a> applications written using the <a href="http://msdn.microsoft.com/en-us/library/dd179341.aspx">Worker Role</a>, where a set of workers is assigned to process the computation tasks available in a queue.

<b>2. Typical MapReduce applications</b>

E.g., High Energy Physics data analysis, where a set of analysis functions is executed on each file of a collection of data files during the "map" stage, producing a collection of intermediate histograms, and in the "reduce" stage these histograms are merged to produce the final histogram.

Histogramming words and distributed grep are very common examples used in most tutorials.

This class utilizes the complete execution model of MapReduce. The application of the "reduce" stage, and the function performed by that stage, depend on the "<em>associative</em>" and "<em>commutative</em>" properties of the application itself.
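As a concrete instance of this class, here is the standard word-histogram example in the classic (0.17-era) Hadoop API; only the map and reduce classes are shown, with the driver omitted.

<pre>
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordHistogram {
    // Map: emit (word, 1) for every word in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce: counting is associative and commutative, so partial sums
    // can be merged in any order, by any reducer holding the key.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
</pre>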
<b>3. Iterative MapReduce applications</b>

This is a complex class of applications where we need to apply multiple stages of MapReduce computation to accomplish the data analysis task. Kmeans clustering is a simple application of this nature: each MapReduce iteration clusters a set of data points around a given number of cluster centers, and the next iteration takes the previous cluster centers as input and calculates another set of cluster centers, minimizing the difference.

<em>Matrix multiplication</em> and <em>multi-dimensional scaling</em> are two more applications of this class, and many machine learning algorithms fit it as well.

However, the applicability of the current runtimes such as Hadoop and Dryad to this class of applications is questionable. Our results, shown in the two figures following the sketch below, illustrate the overhead added by runtimes such as Dryad and Hadoop for such data analyses. Also note how the lightweight MapReduce runtime, CGL-MapReduce, and its support for iterative applications enable it to produce results close to MPI.
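To see where the overhead comes from, here is a purely illustrative sketch (all names are hypothetical, and this is not any runtime's real API) of Kmeans driven by a stock MapReduce runtime: every iteration is submitted as a fresh job, so the static data points are re-read and the map/reduce tasks are re-initialized each time. A runtime with long-running, cacheable tasks keeps the points in place across iterations instead.

<pre>
// Illustrative sketch, not a real API: Kmeans as repeated MapReduce jobs.
public class IterativeKmeansDriver {
    public static void main(String[] args) throws Exception {
        double[][] centers = readInitialCenters(args[0]);

        for (int iter = 0; iter < 50; iter++) {
            // Each pass launches a brand-new job: the runtime re-reads the
            // full point set from disk and restarts every map/reduce task,
            // even though only `centers` changed since the last pass.
            double[][] newCenters = submitClusteringJob(args[1], centers);

            if (maxCenterShift(centers, newCenters) < 1e-6) {
                break;  // converged
            }
            centers = newCenters;
        }
    }

    // Hypothetical helpers, omitted for brevity.
    static double[][] readInitialCenters(String path) { return new double[0][0]; }
    static double[][] submitClusteringJob(String inputPath, double[][] c) { return c; }
    static double maxCenterShift(double[][] a, double[][] b) { return 0; }
}
</pre>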
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbjFfwbr8-mgd02xJQw9EMDEYVSDES9JwbZApjH75s_0JBEGZb6VQKSsSJcQ_StDq-0mQ2-VfgBddc4s_ZPb0qGDyoXltJqF2VtIO551QJdLKkxCrPtOfk7mF69BLdIz26p2CuhA/s1600-h/kmeans_perf_dry_had_cgl_mpi.png">[Figure 1]</a>
Figure 1. Performance of different parallel runtimes executing Kmeans clustering

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPkdI8RDI5897BZuzmBN-iuTPjqsOFGmacV6nxagTDjSDjZNN8KyUuNEtPX8G0EXkPUMtqoNgGRz_sbNd61z7W24EKIgN8f4pFjOweMf6vh48PkVWlSoAU0oYTIVotQpyDqPmqdg/s1600-h/kmeans_oh_dry_had_cgl_mpi.png">[Figure 2]</a>
Figure 2. Overhead of different parallel runtimes executing Kmeans clustering

Simple Passwordless SSH Tutorial (2009-04-05)

I was configuring Xen virtual machines to run MPI programs. Running MPI, <a href="http://hadoop.apache.org/core/">Hadoop</a>, or even <a href="http://www.cs.indiana.edu/%7Ejekanaya/cglmr.html">CGL-MapReduce</a> requires passwordless login between the nodes of a cluster.

These are the same old steps we always need to perform, but I found this <a href="http://blogs.translucentcode.org/mick/archives/000230.html">simple tutorial</a>, which lists the process very succinctly.

To get MPI to work you also need to add the following line to your bash (or other shell) profile:

<pre>
export LAMRSH="ssh -x"
</pre>

Eucalyptus Without a DNS (2009-02-27)

It has been a while since I last blogged about my research. I went to my home country (Sri Lanka) for the winter break and came back in mid January. After coming back I realized that I need a big vacation again :). I had to finish a lot of pending work, but now I am back on track and can blog again. Still, it is 1.56 am in the morning :)

I just got a set of Xen VMs running with <a href="http://eucalyptus.cs.ucsb.edu/">Eucalyptus</a> after a few hours of debugging to fix the following warning, which appeared when I tried to ssh between the VM instances:

<pre>
get_socket_address: getnameinfo 8 failed: Name or service not known
userauth_hostbased: cannot get local ipaddr/name
</pre>

Actually, it has nothing to do with the VM usage; it was something I had missed when configuring host-based authentication with ssh. Since it could be helpful to someone else, let me describe the problem and the solution.

I was trying to start a set of VM instances on a Eucalyptus cloud that had been set up on an iDataplex cluster here at Indiana. I was able to follow the guidelines and start the VMs.

My goal was then to run a set of MPI applications on these VMs. To get MPI working, I need passwordless login between the nodes. As I had previously configured this in the VM image, things worked fine, but ssh gave the above warning whenever I logged in between nodes, which stopped the MPI daemons from starting.

When I used these VM images with the Eucalyptus public cloud, every VM instance was assigned a static IP, so no DNS setup was required. This time, however, we use only one static IP for one of the VM instances while the rest have dynamic IPs, and there is no DNS set up for these VM instances. This causes the above warning when logging in between the nodes using ssh.

Joe Rinkovsky, a Unix systems specialist at Indiana, explained the fix to me: I need to add (IP, name) pairs to the /etc/hosts file of the head node of the VM instances. The solution is explained <a href="http://users.telenet.be/mydotcom/howto/linux/sshpasswordless.htm">here</a> under "make sure name resolution works", which I hadn't noticed until today :).
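A quick way to check whether forward and reverse resolution actually work on a node (before blaming ssh) is a few lines of Java using the standard InetAddress API; this is just a convenience sketch.

<pre>
import java.net.InetAddress;

// Prints what this node believes its own name and address are, and whether
// a peer's name resolves. If either lookup fails or returns a bare IP,
// /etc/hosts (or DNS) still needs fixing.
public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        InetAddress self = InetAddress.getLocalHost();
        System.out.println("local name: " + self.getCanonicalHostName());
        System.out.println("local addr: " + self.getHostAddress());

        if (args.length > 0) {  // optionally check a peer, e.g. a worker node
            InetAddress peer = InetAddress.getByName(args[0]);
            System.out.println("peer addr:  " + peer.getHostAddress());
            System.out.println("peer name:  " + peer.getCanonicalHostName());
        }
    }
}
</pre>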
MapReduce for Data Intensive Scientific Analyses (2008-11-17)

Most scientific data analyses comprise analyzing voluminous data collected from various instruments. Efficient parallel/concurrent algorithms and frameworks are the key to meeting the scalability and performance requirements entailed in such scientific data analyses. The recently introduced MapReduce technique has gained a lot of attention from the scientific community for its applicability in large parallel data analyses. Although there are many evaluations of the MapReduce technique using large textual data collections, there have been only a few evaluations for scientific data analyses. The goals of this paper are twofold. First, we present our experience in applying the MapReduce technique to two scientific data analyses: (i) High Energy Physics data analysis; (ii) Kmeans clustering. Second, we present CGL-MapReduce, a stream-based MapReduce implementation, and compare its performance with Hadoop.

The <a href="http://www.cs.indiana.edu/%7Ejekanaya/papers/eScience-final.pdf">paper</a> was accepted at eScience 2008.

Multi Dimensional Scaling (MDS) Using MapReduce (2008-10-10)

As part of my ongoing research on various parallelization techniques, I implemented a MapReduce version of an MDS program. My colleague <a href="http://grids.ucs.indiana.edu/ptliupages/publications/hpcsApril12-08.pdf">Seung-Hee Bae</a> developed the in-memory implementation of MDS in C#, which runs on multi-core computers. As the first step in converting that algorithm to a MapReduce implementation, I implemented a single-threaded version of the same program in Java (since I am planning to use CGL-MapReduce as my MapReduce runtime). I ran this program to reduce a 1024-point four-dimensional (4D) data set to a 3D data set and measured the execution time of the different parts of the program. The result showed that <b>97.7%</b> of the overall execution time of this program is spent in matrix multiplication (both square-matrix and matrix-vector multiplications).

This observation motivated me to first implement matrix multiplication in MapReduce and to parallelize the matrix multiplication part of the entire computation using the MapReduce technique. I developed both square-matrix and matrix-vector multiplication algorithms using CGL-MapReduce and incorporated them into the MDS algorithm. (A sketch of the row-partitioned decomposition appears at the end of this post.)

I tested this MapReduce version of the MDS program using 1024- and 4096-point data sets (4-dimensional data reduced to 3D) and obtained the following results. The data I used has a predefined structure, and the results show the exact structure of the data. In addition, the results tally with the results of the single-threaded program, so I can reasonably assume that the program performs as expected.
The program performed MDS on the <b>1024</b>-point data set in nearly <b>11 minutes</b> and on the <b>4096</b>-point data set in <b>2 hours and 40 minutes</b>.

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsu02GsOBeG40q97K2aXs-clOnB19X2p5FySKRA8z5vLCaNZTXUvtBW9gNK2x2JDgU_qnFqAwZOZsRElmnZal6688vjMjTwp5VqzDgk8rG9sxTfng1WcR-jUZw4AeCbvuHIAYBkg/s1600-h/GD_4D_496_1-24_all.png">[Figure 1]</a>
Figure 1. Resulting clusters of multi-dimensional scaling. The left-hand images show the clusters visualized using the <a href="http://www.cats.rwth-aachen.de/software/meshing/meshview">MeshView</a> software, and the right-hand images illustrate the predefined structure expected in the data set using lines drawn on the same image.

<b>Memory size limitation</b>

The current MapReduce implementation of the MDS algorithm handles only the matrix multiplication in MapReduce style. However, I believe that we can come up with a different algorithm that allows most of the computation performed in MDS to be done using MapReduce. This would reduce the amount of data that needs to be held in memory by the master program, allowing the MDS program to handle larger problems by utilizing the total memory available in the distributed setting.
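Here is the row-partitioned idea reduced to a sequential Java sketch: each "map" task owns a block of matrix rows (static data) and multiplies it by the broadcast vector (variable data), and the "reduce"/combine step only concatenates the partial results. This is illustrative only, not the CGL-MapReduce API.

<pre>
// Row-partitioned matrix-vector multiply, written sequentially to show the
// map/reduce decomposition. Illustrative only -- not the CGL-MapReduce API.
public class BlockMatVec {
    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};  // 4x2 matrix
        double[] x = {1, 1};                               // broadcast vector
        int numMapTasks = 2;                               // row blocks
        double[] y = new double[a.length];

        int rowsPerTask = a.length / numMapTasks;          // assumes even split
        for (int task = 0; task < numMapTasks; task++) {
            // "map": each task multiplies its cached row block by x.
            for (int i = task * rowsPerTask; i < (task + 1) * rowsPerTask; i++) {
                double sum = 0;
                for (int j = 0; j < x.length; j++) {
                    sum += a[i][j] * x[j];
                }
                y[i] = sum;  // "reduce"/combine: concatenate partial results
            }
        }
        System.out.println(java.util.Arrays.toString(y));  // [3.0, 7.0, 11.0, 15.0]
    }
}
</pre>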
Hadoop as a Batch Job Using PBS (2008-08-20)

During my previous data analyses using Hadoop and CGL-MapReduce, I had to use compute resources accessible via a job queue. For this purpose I used the Quarry cluster at Indiana University, which supports batch job submission via the <a href="http://en.wikipedia.org/wiki/Portable_Batch_System">Portable Batch System</a> (PBS). I contacted one of Quarry's system administrators (George Wm Turner), and he pointed me to the Hadoop On Demand (<a href="http://hadoop.apache.org/core/docs/r0.16.4/hod.html">HOD</a>) project of Apache, which mainly tries to solve the problem I was facing.

We gave HOD a serious try but could not get it working the way we wanted. What we tried was to install HOD on a set of nodes and let users use it on demand via a job queue. This option simply did not work for multiple users, since the file system configuration options overlap between users, allowing only one user to use Hadoop at a given time.

Given this situation, I decided to try starting Hadoop dynamically using PBS. The tasks the script should perform are as follows:

1. Identify the master node
2. Identify the slave nodes
3. Update the $HADOOP_HOME/conf/masters and $HADOOP_HOME/conf/slaves files
4. Update $HADOOP_HOME/conf/hadoop-site.xml
5. Clean up any Hadoop file system directories created in previous runs
6. Format a new Hadoop Distributed File System (HDFS)
7. Start the Hadoop daemons
8. Execute the MapReduce computation
9. Stop the daemons
10. Stop HDFS

Although the list is a bit long, most of these tasks are straightforward to perform in a shell script.

The first problem I faced was finding the IP addresses of the nodes. PBS passes this information via the variable PBS_NODEFILE. However, on multi-core and multi-processor systems, PBS_NODEFILE contains multiple entries for the same node, depending on the number of processors requested on each node. So I had to find the SET of IP addresses, eliminating the duplicates, and then update the configuration files with this information. I decided to use a Java program to do the job (with my shell scripting knowledge, I could not find an easy way to do a "set" operation).
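The linked HadoopConfig.java (below) does the real work; the following is only a hypothetical reconstruction of its deduplication step, using a LinkedHashSet so that the first node in PBS_NODEFILE (the master) stays first.

<pre>
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical reconstruction of the deduplication step: read PBS_NODEFILE
// (one hostname per requested processor) and keep each node once, in order.
public class NodeFileDedup {
    public static void main(String[] args) throws Exception {
        Set<String> nodes = new LinkedHashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().length() > 0) {
                nodes.add(line.trim());  // duplicates are ignored by the set
            }
        }
        in.close();

        // The first unique entry becomes the master; all entries are slaves.
        // Writing conf/masters, conf/slaves, and hadoop-site.xml is omitted.
        String master = nodes.iterator().next();
        System.out.println("master: " + master);
        System.out.println("slaves: " + nodes);
    }
}
</pre>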
Once I had this simple Java program to perform steps 1-4, the rest was straightforward. Here is my PBS script; the link after the script points to the Java program that updates the configuration files.

<pre>
#!/bin/bash
#PBS -l nodes=5:ppn=8
#PBS -l walltime=01:00:00
#PBS -N hdhep
#PBS -q hod

java -cp ~/hadoop_root/bin HadoopConfig $PBS_NODEFILE ~/hadoop-0.17.0/conf

# Clean up HDFS directories left behind by previous runs.
for line in `cat $PBS_NODEFILE`; do
    echo $line
    ssh $line rm -rf /tmp/hadoop*
done

# The first node in PBS_NODEFILE becomes the master.
var=`head -1 $PBS_NODEFILE`
echo $var

ssh -x $var ~/hadoop-0.17.0/bin/hadoop namenode -format
ssh -x $var ~/hadoop-0.17.0/bin/start-dfs.sh
sleep 60
ssh -x $var ~/hadoop-0.17.0/bin/start-mapred.sh
sleep 60
ssh -x $var ~/hadoop-0.17.0/bin/hadoop jar ~/hadoop-0.17.0/hep.jar hep ~/hadoop_root /N/dc/scratch/jaliya/alldata 40 2
ssh -x $var ~/hadoop-0.17.0/bin/stop-mapred.sh
ssh -x $var ~/hadoop-0.17.0/bin/stop-dfs.sh
</pre>

[<a href="http://www.cs.indiana.edu/%7Ejekanaya/HadoopConfig.java">HadoopConfig.java</a>]

As you can see in the script, I use ssh to log into each node of the cluster and clean up the HDFS-specific directories. Then I use ssh to log into the master node and start the Hadoop daemons. Next comes the actual execution of the data analysis task, which I have coded in hep.jar. After the MapReduce computation is over, the remaining commands simply stop the daemons.

This method gives more flexibility to the user and requires no changes to the batch job scheduling system. It also serves as an easy option when the number of MapReduce computations is smaller than that of the other batch jobs.

However, if the cluster is dedicated to running MapReduce computations, then everybody starting and stopping their own HDFS does not make sense. Ideally, Hadoop should be started by the administrators, and users should then be allowed to simply execute MapReduce computations on it.

Your comments are welcome!
High Energy Physics Data Analysis Using Hadoop and CGL-MapReduce (2008-08-20)

After the previous set of tests with parallel Kmeans clustering using CGL-MapReduce, Hadoop, and MPI, I shifted the testing in another direction. This time, the test was to process a large number (and volume) of High Energy Physics data files and produce a histogram of interesting events. The amount of data to be processed is 1 terabyte.

Converting this data analysis into a MapReduce version is straightforward. First the data is split into manageable chunks; each map task processes some of these chunks and produces histograms of the interesting events. Reduce tasks merge the resulting histograms, producing more concentrated histograms. Finally, a merge operation combines all the histograms produced by the reduce tasks. (The merging step is sketched at the end of this post.)

I performed the above test, increasing the data size on a fixed set of compute nodes, using both CGL-MapReduce and Hadoop. To see the scalability of the MapReduce approach and of the two MapReduce implementations, I performed another test, fixing the amount of data at 100 GB and varying the number of compute nodes used. Figures 1 and 2 show my findings.

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3I9wmh4Z4CaQlAjItmoT25hqjTBkQY7wBGeh6YTZjwKT7AoHO9dc8v2SlNoa9ew89ZhWz9EKey0C8umiXqyxJu7t9tAkV3b0MyWVeRgHlEtJYh7RVj0zJOiHhN70PvaKUQ572NA/s1600-h/es_hep.PNG">[Figure 1]</a>
Figure 1. HEP data analysis: execution time vs. the volume of data (fixed compute resources)

<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFaQm7ZoqbI61uXZrjCG1KgVFPQ7_iGewJ20EQVzi2YCGyvOJMsSzpgPFayB78ZSNUOP4CRaIq98xphbup2zdBX_zlIujY0tggHlp9ahFI-1WiFUzVcd6gcuXT1oHSaZzOt_qBdg/s1600-h/es_scale.PNG">[Figure 2]</a>
Figure 2. Total time vs. the number of compute nodes (fixed data)

Hadoop and CGL-MapReduce both show similar performance. The amount of data accessed in each analysis is extremely large, and hence the performance is limited by the I/O bandwidth of a given node rather than by the total number of processor cores; the overhead induced by the MapReduce implementations has a negligible effect on the overall computation.

The results in Figure 2 show the scalability of the MapReduce technique and of the two implementations. They also show how the performance gain obtained from parallelism diminishes beyond a certain number of compute nodes for this particular data set.
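For illustration, here is what the merging step amounts to once each map task has produced a fixed-bin histogram of its chunk: the reduce simply adds histograms bin by bin, so any grouping of partial results yields the same final histogram. This is a plain Java sketch, not the actual analysis code.

<pre>
// Plain Java sketch of the reduce/merge step: partial histograms with the
// same binning are combined by bin-wise addition. Because addition is
// associative and commutative, the reduce tasks and the final merge can
// combine the partial results in any order.
public class HistogramMerge {
    static double[] merge(double[] h1, double[] h2) {
        double[] merged = new double[h1.length];
        for (int bin = 0; bin < h1.length; bin++) {
            merged[bin] = h1[bin] + h2[bin];
        }
        return merged;
    }

    public static void main(String[] args) {
        double[] mapOutput1 = {1, 0, 4, 2};  // histogram from one data chunk
        double[] mapOutput2 = {0, 3, 1, 5};  // histogram from another chunk
        double[] finalHist = merge(mapOutput1, mapOutput2);
        System.out.println(java.util.Arrays.toString(finalHist));  // [1.0, 3.0, 5.0, 7.0]
    }
}
</pre>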