Monday, December 31, 2007

December 26th Report

Scalability of the Rootlet Architecture:

For the last two weeks I have been working on improving the HEP (High Energy Physics) data processing implementation so that I can benchmark the scalability of the proposed architecture. As a first step, I benchmarked the NaradaBrokering C++ client that I wrote. The following graph compares the performance of NaradaBrokering's Java client and C++ client.


The graph measures the time for two hops (in milliseconds) for various message sizes. The stepwise increase that the Java client demonstrates is caused mainly by the buffer allocation strategy in Java sockets. During the benchmark, a message rate of approximately 50 messages per second was maintained.
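The measurement itself is a simple send-and-wait loop. Below is a minimal sketch of that kind of timing harness in Java; the EchoClient is a hypothetical stand-in for the real broker client (NaradaBrokering's actual API is not shown here), simulated with a local echo thread so the sketch runs on its own:

// Sketch of a two-hop timing harness. EchoClient is a hypothetical
// stand-in for the broker client: send() hands a payload to a broker,
// which routes it back to this client (two hops). Simulated here with
// a background echo thread so the harness runs stand-alone.
import java.util.concurrent.SynchronousQueue;
import java.util.function.Consumer;

public class TwoHopBenchmark {

    static class EchoClient {
        private Consumer<byte[]> listener;
        void setListener(Consumer<byte[]> l) { listener = l; }
        void send(byte[] payload) {
            new Thread(() -> listener.accept(payload)).start();
        }
    }

    public static void main(String[] args) throws Exception {
        EchoClient client = new EchoClient();
        SynchronousQueue<Long> arrival = new SynchronousQueue<>();
        client.setListener(p -> {
            try { arrival.put(System.nanoTime()); } catch (InterruptedException e) { }
        });

        for (int sizeKB : new int[] {1, 10, 100, 1000}) {
            byte[] payload = new byte[sizeKB * 1024];
            int samples = 100;
            long total = 0;
            for (int i = 0; i < samples; i++) {
                long start = System.nanoTime();
                client.send(payload);     // out to the broker and back
                total += arrival.take() - start;
                Thread.sleep(20);         // keeps the rate near 50 messages per second
            }
            System.out.printf("%4d KB: %8.3f ms for two hops%n",
                              sizeKB, total / (double) samples / 1e6);
        }
    }
}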

Next, I measured the time for two hops for a 100KB message at increasing message rates. The results show that both the Java and C++ implementations maintain stable performance up to the measured rate of 1000 messages per second. According to the results, the C++ client performs better than the Java client at higher message rates. (Please see the graph below.)

Next Step:

The next task is to measure the scalability of the HEP data processing implementation as a whole. For this, I am processing a large, fixed amount of HEP data while increasing the number of processing nodes, so that we can see how much performance improves as the computation is split among multiple processing entities.
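As a rough sketch of how that strong-scaling measurement could look, the harness below processes a fixed workload with an increasing number of workers and reports the speedup. The processBlock workload is a CPU-bound placeholder of my own, not the actual HEP analysis code:

// Sketch of the planned strong-scaling measurement: a fixed workload is
// processed with 1, 2, 4, ... workers and the wall time is recorded, so
// speedup = T(1) / T(n). processBlock() is a CPU-bound placeholder, not
// the actual HEP event processing.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScalingBenchmark {

    // placeholder for processing one block of HEP events
    static double processBlock(int block) {
        double acc = 0;
        for (int i = 0; i < 2_000_000; i++) acc += Math.sqrt(i + block);
        return acc;
    }

    static long runWithWorkers(int workers, int blocks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        long start = System.nanoTime();
        List<Future<Double>> results = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            final int block = b;
            results.add(pool.submit(() -> processBlock(block)));
        }
        for (Future<Double> f : results) f.get();   // wait for every sub-task
        pool.shutdown();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        int blocks = 32;                            // fixed total workload
        long base = runWithWorkers(1, blocks);      // baseline: one worker
        for (int workers : new int[] {1, 2, 4, 8}) {
            long t = runWithWorkers(workers, blocks);
            System.out.printf("%d workers: %7.1f ms, speedup %.2f%n",
                              workers, t / 1e6, base / (double) t);
        }
    }
}

If the architecture scales well, the reported speedup should stay close to the number of workers for as long as the workload can be split evenly.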

MapReduce:

Prof. Fox pointed me to a few interesting papers (listed below) which discuss MapReduce, a technique for parallelizing large data processing tasks that has its roots in functional programming. I am reading the papers now, and I was amazed by the similarity between the work we have done so far and the technique described in these papers (a minimal sketch of the idea follows the list):

J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in OSDI '04: Sixth Symposium on Operating System Design and Implementation, December 2004.

R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the data: Parallel analysis with Sawzall," Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, vol. 13, no. 4, pp. 277–298, 2005.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," in European Conference on Computer Systems (EuroSys), March 2007.

H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, "Map-Reduce-Merge: Simplified relational data processing on large clusters," in Proc. SIGMOD, 2007.
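To make the MapReduce idea concrete, here is a minimal word count in plain Java, following the canonical example from the Dean and Ghemawat paper. The user supplies only the map and reduce functions; the sequential driver below stands in for the distributed grouping and scheduling machinery that the papers describe:

// Illustrative map/reduce decomposition in plain Java. The user writes
// only map() (record -> key/value pairs) and reduce() (key + values ->
// result); grouping by key sits between the two phases. In the systems
// described by the papers, both phases run distributed across a cluster.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // map: one input line -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // map phase: each line could be processed on a different node
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }

        // reduce phase: each key could be reduced on a different node
        groups.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}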

I hope to discuss them further in my next blog post.

Friday, December 14, 2007

December 12th Report

TCSC Symposium Proposal:

After SC07, my main target was to write a paper for the above symposium. According to their website:
"The IEEE TCSC Doctoral Symposium provides a forum for students in the area of Scalable Computing to obtain feedback on their dissertation topics and advice on initiating a research career."

I was able to draft the proposal document, and with a lot of help from Prof. Fox and Dr. Shrideep we were able to submit it before the deadline.

I learnt a lot about writing papers, especially about presenting ideas. Coming from a programming background, I always tend to go straight into the details; Dr. Shrideep helped me correct this in the paper.

The paper presents our plans for designing a "Scalable Framework for Collaborative Analysis of Scientific Data," especially for data with the "composition" property. That is, the data analysis task can be broken down into a set of sub-analyses which can be executed concurrently, and the results of these sub-analyses can then be merged or combined to form the final result. The sketch below illustrates this split/merge structure.
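To make the composition property concrete, here is a small sketch in Java: each sub-analysis summarizes one partition of the data (a simple sum stands in for a real scientific analysis), the sub-analyses run concurrently on a thread pool, and their partial results are merged into the final result. All names here are illustrative, not taken from the proposed framework:

// Sketch of the "composition" property: the analysis is split into
// sub-analyses that run concurrently, and their partial results are
// merged into the final result. A sum over random numbers stands in
// for a real scientific sub-analysis.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ComposableAnalysis {

    // sub-analysis: summarize one partition of the data
    static double analyze(List<Double> partition) {
        double sum = 0;
        for (double v : partition) sum += v;
        return sum;
    }

    // merge: combine the partial results into the final result
    static double merge(List<Double> partials) {
        double total = 0;
        for (double p : partials) total += p;
        return total;
    }

    public static void main(String[] args) throws Exception {
        // toy dataset split into four equal partitions
        List<List<Double>> partitions = new ArrayList<>();
        for (int p = 0; p < 4; p++) {
            List<Double> part = new ArrayList<>();
            for (int i = 0; i < 1000; i++) part.add(Math.random());
            partitions.add(part);
        }

        // run the sub-analyses concurrently
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<Double>> futures = new ArrayList<>();
        for (List<Double> part : partitions) {
            futures.add(pool.submit(() -> analyze(part)));
        }
        List<Double> partials = new ArrayList<>();
        for (Future<Double> f : futures) partials.add(f.get());
        pool.shutdown();

        System.out.println("final result: " + merge(partials));
    }
}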