Thursday, May 14, 2009

How to control the number of tasks per node when you run your jobs in a Hadoop cluster?

One of my colleagues asked me the following question.

How to control the number of tasks per node when you run your jobs in a Hadoop cluster?

We can do this by modifying hadoop-site.xml. The exact XML for the relevant properties, along with their default values, is in hadoop-default.xml.

So here is the method.

Edit $HADOOP_HOME/conf/hadoop-site.xml and add the following lines inside the <configuration> element.

<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>8</value>
</property>

<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>

You can find these properties in hadoop-default.xml, but it is better not to modify them there.
Instead, copy the properties to hadoop-site.xml and change the values. The values in hadoop-site.xml then override the defaults.

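The two properties above cap how many map and reduce tasks each node (TaskTracker) runs at the same time. The third parameter is JobConf.setNumMapTasks(), which defines how many map tasks Hadoop should execute for the entire job. Hadoop treats this number as a hint; in effect it determines the data splitting factor.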

With all three parameters we can better control the parallelism of a Hadoop job.
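
To see how the three settings interact, here is a rough back-of-the-envelope sketch in plain Java. It does not use any Hadoop API. The class TaskSlotEstimate, the 10-node cluster size, and the 200 map tasks are made-up values for illustration only; the 8 and 2 are the per-node limits from the XML above. Since the value given to JobConf.setNumMapTasks() is only a hint, the real number of map tasks may differ.

```java
// Hypothetical helper for illustration; not part of Hadoop.
final class TaskSlotEstimate {

    // Tasks that can run at the same time across the whole cluster.
    static int concurrentTasks(int nodes, int slotsPerNode) {
        return nodes * slotsPerNode;
    }

    // Number of "waves" needed to run all tasks, rounded up.
    static int waves(int totalTasks, int concurrentTasks) {
        return (totalTasks + concurrentTasks - 1) / concurrentTasks;
    }

    public static void main(String[] args) {
        int nodes = 10;         // assumed cluster size
        int mapSlots = 8;       // mapred.tasktracker.map.tasks.maximum
        int reduceSlots = 2;    // mapred.tasktracker.reduce.tasks.maximum
        int numMapTasks = 200;  // assumed value given to JobConf.setNumMapTasks()

        int mapCapacity = concurrentTasks(nodes, mapSlots);
        int reduceCapacity = concurrentTasks(nodes, reduceSlots);

        System.out.println("Concurrent map tasks:    " + mapCapacity);     // 80
        System.out.println("Concurrent reduce tasks: " + reduceCapacity);  // 20
        System.out.println("Map waves for the job:   "
                + waves(numMapTasks, mapCapacity));                        // 3
    }
}
```

With these numbers the cluster runs at most 80 map tasks at once, so 200 map tasks need about three waves. Raising the per-node maximum or lowering the number of map tasks changes that ratio.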

Wednesday, May 13, 2009

High Energy Physics Data Analysis using Microsoft Dryad

A demo of the Dryad version of the HEP data analysis can be found here.