Monday, December 31, 2007

December 26th Report

Scalability of the Rootlet Architecture:

For the last two weeks I have been working on improving the HEP (High Energy Physics) data processing implementation so that I can benchmark the scalability of the proposed architecture. As the first step, I benchmarked the NaradaBrokering C++ client that I wrote. The following graph compares the performance of NaradaBrokering's Java client vs. the C++ client.


The graph measures the time for two hops (in milliseconds) for various message sizes. The step-wise increase shown by the Java client is mainly due to the buffer allocation strategy in Java sockets. During the benchmark, a message rate of approximately 50 messages per second was maintained.
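For reference, here is a minimal sketch of how such a two-hop measurement can be made. The BrokerClient interface and the in-process LoopbackBroker below are hypothetical placeholders (they are not the NaradaBrokering API); they only illustrate the idea of timestamping a message, letting an echo subscriber send it back on a second topic, and holding the message rate steady while recording the two-hop time.

// Minimal sketch of a two-hop latency benchmark.
// BrokerClient and LoopbackBroker are hypothetical placeholders, not the NaradaBrokering API;
// the loopback broker only exists so the sketch can run end to end in one process.
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class TwoHopBenchmark {

    /** Hypothetical publish/subscribe client interface. */
    interface BrokerClient {
        void publish(String topic, byte[] payload);
        void subscribe(String topic, Consumer<byte[]> listener);
    }

    /** In-process stand-in for the broker so that the sketch is runnable. */
    static class LoopbackBroker implements BrokerClient {
        private final Map<String, Consumer<byte[]>> listeners = new ConcurrentHashMap<>();
        private final ExecutorService delivery = Executors.newSingleThreadExecutor();

        public void publish(String topic, byte[] payload) {
            Consumer<byte[]> listener = listeners.get(topic);
            if (listener != null) delivery.submit(() -> listener.accept(payload));
        }

        public void subscribe(String topic, Consumer<byte[]> listener) {
            listeners.put(topic, listener);
        }

        void shutdown() { delivery.shutdown(); }
    }

    public static void main(String[] args) throws Exception {
        LoopbackBroker client = new LoopbackBroker();
        int messageSize = 100 * 1024;      // e.g. 100 KB payload
        int messagesPerSecond = 50;        // rate held steady during the test
        int warmup = 50, samples = 200;

        // Echo side: forward every request back on the reply topic (the second hop).
        client.subscribe("echo/request", msg -> client.publish("echo/reply", msg));

        // Measuring side: record the arrival time of each reply.
        BlockingQueue<Long> replies = new ArrayBlockingQueue<>(1);
        client.subscribe("echo/reply", msg -> replies.offer(System.nanoTime()));

        byte[] payload = new byte[messageSize];
        long intervalNanos = 1_000_000_000L / messagesPerSecond;
        double totalMillis = 0;

        for (int i = 0; i < warmup + samples; i++) {
            long start = System.nanoTime();
            client.publish("echo/request", payload);
            long end = replies.take();                     // blocks until the message hops back
            if (i >= warmup) totalMillis += (end - start) / 1e6;

            // Sleep off the rest of the interval to keep the message rate constant.
            long left = intervalNanos - (System.nanoTime() - start);
            if (left > 0) Thread.sleep(left / 1_000_000L, (int) (left % 1_000_000L));
        }
        System.out.printf("mean two-hop time: %.3f ms%n", totalMillis / samples);
        client.shutdown();
    }
}

In the actual benchmark the loopback broker is of course replaced by real Java and C++ clients talking to a NaradaBrokering broker over the network; only the measurement loop is meant to carry over.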

Next, I measured the time for two hops for a 100KB message with increasing message rates. The results show that both the Java and C++ implementations maintain stable performance up to the measured rate of 1000 messages per second. According to the results, the C++ client performs better than the Java client at higher message rates (please see the graph below).

Next Step:
The next task is to measure the scalability of the HEP data processing implementation as a whole. For this, I am processing a large amount of HEP data while increasing the number of processing nodes working on the same data, so that we can measure the performance gained by splitting the computation among multiple processing entities.
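As a rough illustration of the kind of splitting I have in mind, the sketch below divides a fixed set of fake events among N workers and times the run as N grows. The "nodes" here are local threads and processEvent() is a hypothetical placeholder; in the real setup the processing entities are separate machines fed through the brokers.

// Minimal sketch of splitting a fixed amount of work among N processing entities.
// processEvent() is a hypothetical stand-in for the actual HEP analysis step.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitBenchmark {

    /** Placeholder per-event computation (hypothetical). */
    static double processEvent(double[] event) {
        double sum = 0;
        for (double v : event) sum += Math.sqrt(Math.abs(v));
        return sum;
    }

    public static void main(String[] args) throws Exception {
        int totalEvents = 100_000;
        double[][] data = new double[totalEvents][32];   // fake "events"

        for (int nodes : new int[]{1, 2, 4, 8}) {
            ExecutorService pool = Executors.newFixedThreadPool(nodes);
            long start = System.nanoTime();

            // Split the same data set into one contiguous chunk per node.
            List<Future<Double>> partials = new ArrayList<>();
            int chunk = (totalEvents + nodes - 1) / nodes;
            for (int n = 0; n < nodes; n++) {
                int lo = n * chunk, hi = Math.min(totalEvents, lo + chunk);
                partials.add(pool.submit(() -> {
                    double s = 0;
                    for (int i = lo; i < hi; i++) s += processEvent(data[i]);
                    return s;
                }));
            }

            // Gather the partial results; the elapsed time should drop as nodes increase.
            double total = 0;
            for (Future<Double> f : partials) total += f.get();
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%d node(s): %d ms (result %.1f)%n", nodes, millis, total);
            pool.shutdown();
        }
    }
}

The interesting question for the architecture is, of course, how far this scaling holds once the chunks travel over the broker network instead of a shared address space.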

MapReduce:

Prof. Fox pointed me to a few interesting papers (listed below) that discuss a technique for parallelizing large data processing tasks, named MapReduce, which has its roots in functional programming. I am reading the papers right now, and I was simply amazed by the similarity between the work we have done so far and the technique described in these papers (a small sketch of the programming model follows the reference list):

J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI’04: Sixth Symposium on Operating System Design and Implementation, December 2004.

R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with Sawzall,” Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, vol. 13, no. 4, pp. 277–298, 2005.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in European Conference on Computer Systems (EuroSys), March 2007.

H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-Reduce-Merge: Simplified relational data processing on large clusters,” in Proc. SIGMOD, 2007.
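To make the connection concrete, here is a tiny, purely local sketch of the MapReduce programming model described in the Dean and Ghemawat paper: the user supplies a map function that emits key/value pairs and a reduce function that merges the values collected for each key, while the framework takes care of grouping. The in-memory run() helper and the histogram example are toy illustrations of the model, not Google's implementation or our system.

// Minimal, in-memory sketch of the MapReduce programming model:
// map(record) -> list of (key, value) pairs; reduce(key, values) -> result.
// run() only groups the emitted pairs by key locally; the systems in the papers
// distribute the map and reduce invocations across a cluster.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class MiniMapReduce {

    /** Apply map to every record, group the emitted pairs by key, then reduce each group. */
    static <R, K, V, O> Map<K, O> run(List<R> records,
                                      Function<R, List<Map.Entry<K, V>>> map,
                                      BiFunction<K, List<V>, O> reduce) {
        Map<K, List<V>> groups = new HashMap<>();
        for (R record : records) {
            for (Map.Entry<K, V> pair : map.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<K, O> results = new HashMap<>();
        groups.forEach((key, values) -> results.put(key, reduce.apply(key, values)));
        return results;
    }

    public static void main(String[] args) {
        // Toy example in the spirit of HEP histogramming: bin some fake "event energies"
        // into bins of width 2 and count how many events fall into each bin.
        List<Double> energies = Arrays.asList(1.2, 7.8, 3.3, 7.1, 0.4, 3.9, 8.6);

        Map<Integer, Integer> histogram = MiniMapReduce.<Double, Integer, Integer, Integer>run(
                energies,
                e -> List.of(Map.entry((int) (e / 2.0), 1)),                         // map: emit (bin, 1)
                (bin, counts) -> counts.stream().mapToInt(Integer::intValue).sum()); // reduce: sum counts

        System.out.println(histogram);   // e.g. {0=2, 1=2, 3=2, 4=1}
    }
}

In the real frameworks the map and reduce invocations run on many machines and the grouping step becomes a large-scale shuffle over intermediate files, but the user-visible programming model is essentially this pair of functions, which is what makes the similarity to our data processing pipeline so striking.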

I hope to discuss them more in my next blog post.

2 comments:

pingooo said...

Hi,

I'm a high energy physicist turned Google software engineer. I'm promoting MapReduce and the Google File System at universities as part of the Google Academic Cloud Computing Initiative in Taiwan. I think it is really possible to use this paradigm in high energy physics: Monte Carlo generation, event reconstruction + splitting, skimming, or making analysis samples. I'd love to talk with you if you are interested.

cheers,
Ping

Jaliya Ekanayake said...

Hi Ping,

Nice to hear that you are also interested in the same field of research.

Yes, we can definitely apply the MapReduce paradigm to most data-parallel applications, and also to the interesting class of applications that can tolerate communication latencies in the range of milliseconds rather than the typical latencies (microseconds) expected in shared-memory parallelization techniques.

I would like to talk with you and share my thoughts and experience. Please send me an email at jekanaya AT cs DOT indiana DOT edu.

Thanks,
-jaliya