For the last two weeks I have been working on improving the HEP (High Energy Physics) data processing implementation so that I can benchmark the scalability of the proposed architecture. As a first step, I benchmarked the NaradaBrokering C++ client that I wrote. The following graph compares the performance of NaradaBrokering's Java client vs. the C++ client.
The graph shows the time for two hops (in milliseconds) for various message sizes. The step-wise increase shown by the Java client is mainly due to the buffer allocation strategy of Java sockets. During the benchmark a message rate of approximately 50 messages per second was maintained.
Next, I measured the time for two hops for a 100 KB message at increasing message rates. The results show that both the Java and C++ implementations maintain stable performance up to the measured rate of 1000 messages per second, with the C++ client performing better than the Java client at the higher rates. (Please see the graph below.)
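To give an idea of how these numbers were collected, the sketch below (in Java) shows the basic measurement loop: send a message of a given size, wait for it to come back after two hops, record the elapsed time, and pace the sends to hold a target message rate. The BrokerClient interface and its send()/awaitEcho() methods are placeholders for illustration only, not the actual NaradaBrokering client API.

```java
// Minimal sketch of the two-hop latency measurement. BrokerClient, send()
// and awaitEcho() are placeholders, not the real NaradaBrokering API.
public class TwoHopBenchmark {

    interface BrokerClient {
        void send(byte[] payload);   // publish a message through the broker
        byte[] awaitEcho();          // block until the echoed message returns (two hops)
    }

    static double averageTwoHopMillis(BrokerClient client, int messageSize,
                                      int messagesPerSecond, int count) throws InterruptedException {
        byte[] payload = new byte[messageSize];
        long pauseMillis = 1000L / messagesPerSecond;  // pacing to hold the target rate
        long totalNanos = 0;

        for (int i = 0; i < count; i++) {
            long start = System.nanoTime();
            client.send(payload);
            client.awaitEcho();                        // message arrives back after two hops
            totalNanos += System.nanoTime() - start;
            Thread.sleep(pauseMillis);                 // maintain ~messagesPerSecond
        }
        return totalNanos / (count * 1_000_000.0);     // average time in milliseconds
    }
}
```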
The next task is to measure the scalability of the HEP data processing implementation as a whole. For this, I am processing a large amount of HEP data while increasing the number of processing nodes, so that we can gain performance by splitting the same computation across multiple processing entities.
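As a rough illustration of that decomposition, the sketch below splits a set of HEP data files among a configurable number of workers and processes them in parallel. It is only a sketch: threads stand in for separate processing nodes, and processFile() is a placeholder for the actual HEP analysis step.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch only: splits the same set of HEP data files across N workers.
// Threads stand in for processing nodes; processFile() is a placeholder.
public class SplitProcessing {

    static void processAll(List<String> dataFiles, int numWorkers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        for (String file : dataFiles) {
            pool.submit(() -> processFile(file));   // each file is handled independently
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);   // wait for all partitions to finish
    }

    static void processFile(String file) {
        // placeholder for the actual HEP data processing of one file
    }
}
```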
MapReduce:
Prof. Fox pointed me to a few interesting papers (listed below) that discuss a technique for parallelizing large data processing tasks, named MapReduce, which has its roots in functional programming. I am reading the papers right now and was simply amazed by the similarity between the work we have done so far and the technique described in these papers (a toy sketch of the programming model follows the list):
J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI ’04: Sixth Symposium on Operating System Design and Implementation, December 2004.

R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with Sawzall,” Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, vol. 13, no. 4, pp. 227–298, 2005.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in European Conference on Computer Systems (EuroSys), March 2007.

H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-Reduce-Merge: Simplified relational data processing on large clusters,” in Proc. SIGMOD, 2007.
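To make the connection concrete, here is a toy, single-machine sketch of the map/reduce programming model these papers describe, using the usual word-count example. It only illustrates the map, group, and reduce phases; it is not the Google MapReduce or Hadoop API.

```java
import java.util.*;

// Toy single-machine illustration of the map/reduce model: word count.
// map() emits (key, value) pairs; reduce() combines all values for one key.
public class MapReduceSketch {

    // Map phase: split each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hep data processing", "hep data");

        // Shuffle/group step: collect all values emitted for the same key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        // Apply the reduce function to each key's grouped values.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> " + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

In a real MapReduce runtime, the grouping step above is the distributed shuffle, and the map and reduce calls run on many machines over partitioned data rather than in a single loop.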
I hope to discuss them more in my next blog post.