Monday, December 31, 2007

December 26th Report

Scalability of the Rootlet Architecture:

For the last two weeks I have been improving the HEP (High Energy Physics) data processing implementation so that I can benchmark the scalability of the proposed architecture. As a first step, I benchmarked the Naradabrokering C++ client that I wrote. The following graph compares the performance of Naradabrokering's Java client and C++ client.


The graph shows the time for two hops (in milliseconds) for various message sizes. The stepwise increase that the Java client shows is mainly due to the buffer allocation strategy in Java sockets. During the benchmark a rate of approximately 50 messages per second was maintained.

Next, I measured the time for two hops for a 100KB message with increasing message rates. The results show that both the Java and C++ implementations perform stably up to the highest measured rate of 1000 messages per second. According to the results, the C++ client performs better than the Java client at higher message rates. (Please see the graph below.)

Next Step:
The next task is to measure the scalability of the HEP data processing implementation as a whole. For this I am trying to process a large, fixed amount of HEP data while increasing the number of processing nodes, so that we can measure the performance gained by splitting the computation among multiple processing entities.

MapReduce:

Prof. Fox pointed me to a few interesting papers (listed below) that discuss MapReduce, a technique for parallelizing large data processing tasks that has its roots in functional programming. I am reading the papers right now and was simply amazed by the similarity between the work we have done so far and the technique described in these papers:

J. Dean and S. Ghemawat, “MapReduce: Simplified data processing
on large clusters,” in OSDI’04: Sixth Symposium on Operating System
Design and Implementation, December 2004.

R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the
data: Parallel analysis with Sawzall,” Scientific Programming Journal
Special Issue on Grids and Worldwide Computing Programming Models
and Infrastructure, vol. 13, no. 4, pp. 277–298, 2005.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad:
Distributed data-parallel programs from sequential building blocks,” in
European Conference on Computer Systems (EuroSys), March 2007.

H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-reduce-merge:
Simplified relational data processing on large clusters,” in Proc.
SIGMOD, 2007.

I hope to discuss them more in my next blog post.
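As a rough illustration of why the technique looks so familiar, here is a minimal Java sketch of the two steps: a map function applied to each data file independently, and a reduce function that folds the partial results together. This is not code from the papers or from our implementation; the class name, the bin count and the value range are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration of map and reduce over per-file histograms (all names hypothetical). */
class HistogramMapReduce {
    static final int BINS = 4;        // assumed number of bins
    static final double MAX = 100.0;  // assumed upper edge of the value range

    /** Map step: turn the values of one data file into a partial histogram. */
    static int[] map(double[] values) {
        int[] hist = new int[BINS];
        for (double v : values) {
            int bin = (int) (v / MAX * BINS);
            // Clamp so that out-of-range values land in the edge bins.
            bin = Math.max(0, Math.min(BINS - 1, bin));
            hist[bin]++;
        }
        return hist;
    }

    /** Reduce step: merge two partial histograms bin by bin. */
    static int[] reduce(int[] a, int[] b) {
        int[] merged = new int[BINS];
        for (int i = 0; i < BINS; i++) {
            merged[i] = a[i] + b[i];
        }
        return merged;
    }

    /** Apply map to every file independently, then fold the results together. */
    static int[] run(List<double[]> files) {
        return files.stream()
                .map(HistogramMapReduce::map)
                .reduce(new int[BINS], HistogramMapReduce::reduce);
    }

    public static void main(String[] args) {
        List<double[]> files = Arrays.asList(
                new double[] {5.0, 30.0, 80.0},
                new double[] {10.0, 60.0, 99.0});
        System.out.println(Arrays.toString(run(files)));
    }
}
```

Because each call to map touches only one file, the calls could run on different nodes; only the small histograms have to travel to wherever reduce runs.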

Friday, December 14, 2007

December 12th Report

TCSC Symposium Proposal:

After SC07 my main target was to write a paper for the above symposium. According to their website:
"The IEEE TCSC Doctoral Symposium provides a forum for students in the area of Scalable Computing to obtain feedback on their dissertation topics and advice on initiating a research career."

I was able to draft a proposal document, and then with a lot of help from Prof. Fox and Dr. Shrideep we were able to submit it before the deadline.

I learnt a lot about writing papers, especially about presenting ideas. Coming from a programming background, I always tend to go into details straight away. Shrideep helped me to correct this in the paper.

The paper presents our plans for designing a "Scalable Framework for Collaborative Analysis of Scientific Data", especially for data with the "composition" property. That is, the data analysis task can be broken down into a set of sub-analyses that can be executed concurrently, and the results of these sub-analyses can be merged or combined to form the final result.

Saturday, November 24, 2007

November 28th Report

After a month of silence:
During the last month I was completely engaged in getting the ROOT client working with Naradabrokering and Clarens so that physicists can submit and monitor analysis jobs collaboratively. It was not just another implementation problem, but an implementation and integration task that requires different heterogeneous components to work together. The use case that we tried to achieve:
  • A physicist identifies a dataset from a particle physics experiment.
  • He then writes an analysis script based on some analysis criteria using the ROOT language and tests it using a sample data file on his computer.
  • Now he needs to execute this analysis on all the data files available for a particular experiment.
  • While the jobs are being processed, he should be able to monitor the result of each analysis sub-task, which is a histogram.
  • The client program that needs to be developed should be able to display and merge the resulting histograms in real time.
  • Also, any other physicist who would like to see the results of the analysis as and when they are produced should be able to connect to the same experiment and see the results getting merged one by one in his client software.
With a month of work I was able to implement client software, written in the ROOT language, that meets all of the above requirements. We were able to show this demonstration at the Supercomputing Conference 07 in Reno, Nevada. The following image shows the software while executing an analysis on files located on three different servers.

Some explanation about the software:
  • The main canvas shows the histogram of the results. The histograms generated by all analysis sub-tasks are merged and displayed to the user.
  • The panel at the top right shows the connected servers. In this example, the client is connected to three Clarens servers running on three different machines.
  • The panel below that shows the data files available at each server. This panel also shows the status of each file by changing its color. Grey indicates that the file has not yet been processed, red indicates that the resulting histogram for that file has been received and is being merged with the results available so far, and blue indicates that the file has been processed and the resulting histogram has been merged with the existing results.

A Few Implementation Details:
The GUI is written completely in the interpreted language provided by the ROOT framework.
Internally it uses a C++ bridge for Naradabrokering and the C++ client library for the Clarens server.
It also uses a Python script to submit analysis tasks to multiple Rootlets (a concept similar to a Servlet) concurrently. This is required because the interpreted ROOT language does not support multi-threading.

Monday, October 01, 2007

October 3rd Report

Had a meeting with Prof. Geoffrey about the Ph.D. topic that I should select. He advised me to find more use cases of data analysis tasks that are similar to the particle physics data analysis that we are doing using Clarens and ROOT.

The data analysis tasks that we handled have the following characteristics:

  • Data is in large files and these files are distributed across the globe.
  • One or more analysis techniques (in our case, one analysis script) can be applied to all the data to identify patterns.
  • The outcome of an analysis of a single file is a histogram.
  • These outcomes (histograms) can be merged to produce the final results.

So far I have found one strong use case of this nature:
Astronomical Image Processing - mainly for identifying features in astronomical images.

There are a few candidate areas that I found interesting:
Analysis of Earthquake Data
Microarray Analysis for Genes
Pattern Matching in Financial Data

Currently I am reading to find out the exact data analysis requirements of these fields. The target is to find more use cases for "Distributed Composable Data Analysis"

September 19th Report

Conrad tested the demo from CERN and it worked well, so now I can focus on the next step of the project.
Conrad also sent me a link to more ROOT data files, so as the next step I will test the demo with those new files. The first demo uses only a single rootlet, mainly because of the way the Clarens client is written: each analysis request is processed synchronously, and hence the client sends requests to the server one by one, one for each ROOT data file to be analyzed.

As the next step of the project, I am planning to run the client with multiple processes and allow it to create multiple rootlets so that the analyses can be performed simultaneously, utilizing the full CPU power. I hope to get the results soon.

Thursday, September 13, 2007

September 5th Report

Demo is ready.

ROOT application development supports both interpreted and compiled code. Interpreted code (mainly the analysis code written by physicists) can use compiled shared libraries, including those provided by ROOT and other user-written libraries. Interpretation helps the user to debug and fix code easily and to write the necessary analysis as a function.

As explained in my previous blog (August 22nd Report), the plan is to let the rootlet publish the locations of the generated histogram files to the subscribed clients. Clients that receive those notifications then use these files to update their results.

As the second phase, I developed ROOT-compliant wrapper classes for the C++ client of Naradabrokering. These allow ROOT clients to utilize the publish/subscribe capabilities of Naradabrokering.
The steps for developing the ROOT-compliant classes are explained in the following tutorial.
Part 1, Part 2 and Part 3

I was able to get the publishing of messages working from interpreted code and thought that the subscription would work in the same manner. After a few days of trying I found that I was wrong.
To get the subscription to work, the interpreted code should pass a function pointer as the callback to the compiled code, and on receiving a notification the compiled code should call this callback (which is in the interpreted code). Passing a function pointer to the compiled code is easy and straightforward; however, when the compiled code tried to call through that pointer I got the error *****Segmentation Violation*********

This happens mainly because of a limitation of the ROOT interpreter in resolving function pointers across the interpreted/compiled code boundary. After a few days of searching and querying Conrad, we decided to ask the question on the ROOT mailing list. They replied very quickly and helped us to solve the problem. The solution came in the form of a reflection-type call interface, supported by ROOT, for calling interpreted code from compiled code.
Here is the mail thread.
http://root.cern.ch/phpBB2/viewtopic.php?t=5408

So with that help, I was able to get the demonstration working, and it is so nice to see it working. The results of the remote analysis trigger notifications to the subscribed users, and their histograms get updated with the remote results.

Some things are too good to be true!

Wednesday, August 22, 2007

August 22nd Report

Last week Conrad helped us set up a Clarens server on gridfarm003, and after resolving a few issues with our certificates I was able to use it.

Usage Scenario : Big Picture

  • The user writes client code in C++ which uses the services of the Clarens server. Let's call this ClientCode.C.
  • She has also written the analysis code for the ROOT data. Let's call these files Analysis.C and Analysis.H.
  • She then executes ClientCode.C using the C++ interpreter provided by ROOT. (ROOT has a built-in C++ interpreter.)
  • ClientCode.C uses the built-in ROOT libraries to locate data files on the server and also to upload the above two files to the Clarens server.
  • After discovering the ROOT data files (by polling for files), ClientCode.C sends one or more requests to the Clarens server to create rootlets that execute the analysis code it uploaded.
  • For every rootlet request, the Clarens server creates a wrapper script for the rootlet and executes it with the input and output files. This wrapper script is called rootlet_wrapper.sh.
  • Finally, ClientCode.C polls for output files and displays the results of the analysis in a histogram generated on the user's machine.

Incorporating Naradabrokering

As the first step, I changed rootlet_wrapper.sh to publish a message after finishing the analysis, using the nbclient program we wrote in C++. This works fine, and it removes the need for ClientCode.C to poll for the results.

The next task is to incorporate the subscriber functionality into the visualization part of ClientCode.C.
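The difference between polling and being notified can be sketched with a tiny in-memory broker. This is only an illustration in Java and is not the Naradabrokering API; the ToyBroker class, the topic name and the file names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** In-memory stand-in for a broker; this is NOT the Naradabrokering API. */
class ToyBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    /** Register a callback that is invoked for every message published on the topic. */
    void subscribe(String topic, Consumer<String> callback) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(callback);
    }

    /** Deliver the message to all callbacks registered for the topic. */
    void publish(String topic, String message) {
        for (Consumer<String> callback : subscribers.getOrDefault(topic, new ArrayList<>())) {
            callback.accept(message);
        }
    }
}

class NotificationDemo {
    /** Simulates rootlets announcing finished histogram files to a subscribed client. */
    static List<String> run() {
        ToyBroker broker = new ToyBroker();
        List<String> received = new ArrayList<>();
        // Client side: react to each notification instead of polling the server.
        broker.subscribe("analysis/results", received::add);
        // Rootlet side: the wrapper script would publish once an analysis finishes.
        broker.publish("analysis/results", "output/hist_001.root");
        broker.publish("analysis/results", "output/hist_002.root");
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The client never asks whether a result is ready; it only supplies a callback, which is the role the subscriber part of the visualization code would play.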

Tuesday, July 31, 2007

August 8th Report

Secure Message Transfer Between Java and C++

Scenario:

We are developing an application which requires secure message transfer between Naradabrokering (a Java-based messaging substrate) and a C++ client application. The communication between the entities uses a custom publish/subscribe messaging protocol to get better performance (no XML processing).

The JDK has built-in support for security features such as certificate handling, encryption and signing. However, to get this functionality in C++ a separate library needs to be installed. For this we used OpenSSL (http://www.openssl.org/). To develop applications, the OpenSSL development files are required, and the installation differs according to the underlying operating system. For my machine running Ubuntu (kernel 2.6.15-28-386) it is simply:

apt-get install libssl0.9.7
apt-get install libssl-dev

The following sections show code fragments that we can use to encrypt/decrypt messages (bytes) in both Java and C++. The algorithm used for the encryption is AES (http://en.wikipedia.org/wiki/Advanced_Encryption_Standard).


Encryption in Java

In Java the encryption is handled by the javax.crypto.Cipher class. The following code fragment shows the encryption in Java.

// requires: import javax.crypto.Cipher;
byte[] bytesToEncrypt = msg.getBytes(); /* Bytes to be encrypted */
byte[] encBytes = null; /* Encrypted bytes */

/*
 * Create a Cipher by specifying the algorithm name (here it is AES),
 * initialize it with the shared secret key, and encrypt the bytes.
 */
Cipher aesCipher;
try {
    aesCipher = Cipher.getInstance(Constants.AES_ALGO);
    aesCipher.init(Cipher.ENCRYPT_MODE, secretKey);
    encBytes = aesCipher.doFinal(bytesToEncrypt);
} catch (Exception e) {
    throw new ClarensException("Error encrypting message using secret key", e);
}

These bytes are then transferred to the C++ client using a socket-based communication channel.

Decryption in C++

OpenSSL provides a set of libraries for handling the decryption, and the following utility function shows how we can use them to decrypt the received set of bytes.

// Requires <openssl/evp.h>, <iostream> and <string>. Uses the pre-1.1.0
// OpenSSL API, in which EVP_CIPHER_CTX can live on the stack.
bool
SecurityUtil::decryptAES(const unsigned char *in, int inputLength, unsigned char *out, int &outputLength, string aesKey){

    int olen = 0, tlen = 0;
    EVP_CIPHER_CTX ctx;
    EVP_CIPHER_CTX_init(&ctx);
    // AES-128 in ECB mode; the first 16 bytes of aesKey are used as the key.
    EVP_DecryptInit(&ctx, EVP_aes_128_ecb(), (unsigned char *)aesKey.c_str(), NULL);

    if (EVP_DecryptUpdate(&ctx, out, &olen, in, inputLength) != 1)
    {
        cerr << "error in decrypt update" << endl;
        EVP_CIPHER_CTX_cleanup(&ctx);
        return false;
    }

    // Handles the final block and strips the padding.
    if (EVP_DecryptFinal(&ctx, out + olen, &tlen) != 1)
    {
        cerr << "error in decrypt final" << endl;
        EVP_CIPHER_CTX_cleanup(&ctx);
        return false;
    }

    outputLength = olen + tlen;

    EVP_CIPHER_CTX_cleanup(&ctx);
    return true;
}


Ok, now let's see the other side of the story, from C++ to Java

Encryption in C++

Again, OpenSSL provides a set of library functions for encryption as well. Following is the utility function for the encryption.

// Requires <openssl/evp.h>, <iostream> and <string>; pre-1.1.0 OpenSSL API.
bool
SecurityUtil::encryptAES(const unsigned char *in, int inputLength, unsigned char *out, int &outputLength, string aesKey){

    int olen = 0, tlen = 0;
    EVP_CIPHER_CTX ctx;
    EVP_CIPHER_CTX_init(&ctx);
    // AES-128 in ECB mode; the first 16 bytes of aesKey are used as the key.
    EVP_EncryptInit(&ctx, EVP_aes_128_ecb(), (unsigned char *)aesKey.c_str(), NULL);

    if (EVP_EncryptUpdate(&ctx, out, &olen, in, inputLength) != 1)
    {
        cerr << "error in encrypt update" << endl;
        EVP_CIPHER_CTX_cleanup(&ctx);
        return false;
    }

    // Writes the final padded block; out needs room for inputLength + 16 bytes.
    if (EVP_EncryptFinal(&ctx, out + olen, &tlen) != 1)
    {
        cerr << "error in encrypt final" << endl;
        EVP_CIPHER_CTX_cleanup(&ctx);
        return false;
    }

    outputLength = olen + tlen;

    EVP_CIPHER_CTX_cleanup(&ctx);
    return true;
}

Decrypting the bytes received from the C++ client in Java

Decryption in Java is also handled by the javax.crypto.Cipher class and is fairly straightforward. Here is the code segment.

// requires: import javax.crypto.Cipher;
byte[] decBytes = null; /* Decrypted bytes */

Cipher aesCipher;
try {
    // Same algorithm and secret key as on the encrypting side.
    aesCipher = Cipher.getInstance(Constants.AES_ALGO);
    aesCipher.init(Cipher.DECRYPT_MODE, secretKey);
    decBytes = aesCipher.doFinal(msgBytes);
} catch (Exception e) {
    throw new ClarensException("Error decrypting message using secret key", e);
}
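One detail that matters for interoperability is that both sides agree on the mode and the padding. OpenSSL's EVP interface applies PKCS padding by default, and EVP_aes_128_ecb expects a 16-byte key, so the matching Java transformation is AES/ECB/PKCS5Padding. The following self-contained sketch spells that out; the class name and the example key are assumptions, and it is not the project's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

/** Java side of an AES-128-ECB exchange (hypothetical helper, for illustration). */
class AesInterop {
    // Spelled out in full so the Java side does not depend on provider defaults.
    static final String TRANSFORMATION = "AES/ECB/PKCS5Padding";

    static byte[] encrypt(byte[] key16, byte[] plain) {
        return apply(Cipher.ENCRYPT_MODE, key16, plain);
    }

    static byte[] decrypt(byte[] key16, byte[] encrypted) {
        return apply(Cipher.DECRYPT_MODE, key16, encrypted);
    }

    private static byte[] apply(int mode, byte[] key16, byte[] data) {
        if (key16.length != 16) {
            throw new IllegalArgumentException("AES-128 needs a 16-byte key");
        }
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(mode, new SecretKeySpec(key16, "AES"));
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES operation failed", e);
        }
    }

    public static void main(String[] args) {
        // Example key only; a real key would come from the key exchange.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        byte[] enc = encrypt(key, "hello rootlet".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, enc), StandardCharsets.UTF_8));
    }
}
```

Note that ECB encrypts identical plaintext blocks to identical ciphertext blocks, so a mode with an IV (for example CBC) is usually preferred when both sides can support it.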

Simple, right? The main problem I faced when developing the above application was the lack of documentation in this regard. There is plenty of documentation on how to handle encryption/decryption in Java, but very little for the same in C++. How about encryption/decryption between Java and C++? I could not find anything of this sort. OpenSSL has good documentation on the various functions and data structures it offers for encryption/decryption, but the main problem is that there are very few code examples which show the exact usage. The following are some of the resources that I used to come up with this implementation; I hope someone will find them helpful.

http://www.openssl.org/docs/
http://www.madboa.com/geek/openssl/#cert-self
http://www.ibm.com/developerworks/linux/library/l-openssl.html
http://www.mail-archive.com/openssl-users@openssl.org/msg40449.html
http://www.mail-archive.com/openssl-users@openssl.org/msg23119.html
http://www.fortrel.net/blog/index.php?title=encryption_java_c&more=1&c=1&tb=1&pb=1
http://www.adp-gmbh.ch/cpp/common/base64.html

Next Blog: Signing and Verifying between Java and C++

July 25th Report

During the last two weeks I was mainly focusing on two things.
1. Qualifying Exam (on 19th of July)
2. Finishing the Service Discovery Framework for Clarens

Results:
I passed the Oral Qualifiers. However, from the questions I was asked, I realized that there is a lot more that I should learn.

The Service Discovery Framework needs some more work. Getting the encryption (using both symmetric keys and RSA keys) working between C++ and Java seems a bit tough. The main reason for this is the lack of documentation.

Wednesday, July 11, 2007

July 11th Report

Last week I was able to complete the agent discovery functionality for the C++ client. After that I put the client through a load test and found that it crashes after a few (correct) discovery cycles.

Since it works correctly for at least one discovery, the error is not in the logic but in the implementation details. It is implemented in C++, and debugging is not that easy when there are threads in the code. After a lot of debugging (a lot, actually) I found the problem.
It was just an update of a boolean variable that I had missed. (My eyes could not catch it during the first few debug cycles.)

After that I went through the code again and did a complete cleanup. Then I let the test run for a few hours and found that it runs without any problems and, most importantly, without any memory leaks.

Thursday, June 21, 2007

June 27th Report

Axis2 Hackathon June 11th to June 16th.

We started the hackathon with the intention of cleaning up the bug list and coming up with some API fixes. It was a successful hackathon, and we (Srinath Perera, Eran Chinthaka, Deepal Jayasinghe, Amila Suriyaarchhi, Ajith Ranabahu, Glen Daniels, and myself) were able to fix most of the bugs which were labeled as blockers and do some cleanup in the API as well. Axis2 has now grown into a large project, and API changes are not as straightforward as they were a year ago. It is now a process of improvement plus deprecation.

An excellent presentation on API Design.
http://www.infoq.com/presentations/effective-api-design

C++ Agent/Service Discovery for Clarens

I was able to get the C++ client working for the Agent Discovery. The *next step* is to get the service discovery working, which requires both encryption and signing of messages.

Getting the exchange of messages between Java and C++ working was not easy with security in place. I am using the OpenSSL libraries for the C++ client, and on the Java side it is the built-in cryptographic extension of the JDK. However, there are compatibility issues when it comes to different versions of certificates and padding schemes. So I will write a blog post with complete code samples regarding these issues after getting that *next step* completed.

June 13th Report

Started working on the C++ version of the Clarens implementation. I need to implement the client code of the proposed architecture in C++ so that both C++ and Python clients can use it.

The initial plan was to develop a version without security, but it seems like Caltech needs security in all their applications. It will not be easy to inject security related code once developed without it, so I decided to implement the complete version of the client code with security similar to the java implementation.

Currently I am implementing the above code, and it is a bit of a slow process because I have to learn how to use the OpenSSL libraries for the security functions.

In the meantime I was able to schedule my Oral exam with my professors, and the date I could get was the 19th of July. I am currently preparing for the exam as well.

May 30th Report

Continued working on the Clarens Project :

I created a Python service "ping.ping", which responds to an agent's requests to confirm that the service is up and running, and deployed it in Clarens.

And also tested the following scenario;

First, a RootletSimulator(java) registers the rootlet service with an agent. (In the actual implementation this should be handled by the initialization method of the rootlet service)

After the registration, the agent will keep on pinging the rootlet service (python running on Clarens) to make sure that the service is up.

Now, if we use the ClientSimulator (java), it can discover the service and maybe use it.

Then we can stop the Clarens server or remove the rootlet service from the web-frontend (need to find out how to do this)

Now if we use the ClientSimulator(java) again, it won't discover the service, because the agent has removed the service from its list of available services.
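The behaviour tested in this scenario can be sketched as a small registry that drops services which stop answering pings. This is only a toy model in Java; the ToyAgent class, the endpoint URL and the liveness check passed in as a predicate are assumptions for the example and are not the real agent implementation.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Toy model of an agent that tracks registered services (hypothetical). */
class ToyAgent {
    // Service name -> endpoint that the agent pings.
    private final Map<String, String> services = new ConcurrentHashMap<>();

    /** Called by a service (here: the rootlet service) when it starts up. */
    void register(String name, String endpoint) {
        services.put(name, endpoint);
    }

    /** One ping round: drop every service whose endpoint fails the liveness check. */
    void pingAll(Predicate<String> isAlive) {
        services.entrySet().removeIf(entry -> !isAlive.test(entry.getValue()));
    }

    /** What a client sees when it asks the agent for available services. */
    Set<String> discover() {
        return new TreeSet<>(services.keySet());
    }

    public static void main(String[] args) {
        ToyAgent agent = new ToyAgent();
        agent.register("rootlet", "https://clarens.example.org/ping");
        agent.pingAll(endpoint -> true);   // server is up
        System.out.println(agent.discover());
        agent.pingAll(endpoint -> false);  // server was stopped
        System.out.println(agent.discover());
    }
}
```

In the real setup the predicate would be the XML-RPC call to ping.ping, and pingAll would run periodically on a timer.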

Security Considerations:

Agent discovery request/response uses signed messages.
Service discovery request/response uses encrypted and signed messages.
*pings* use XML-RPC over HTTP and the messages are secured using a grid-proxy.

Tuesday, May 15, 2007

May 15th Report

Started working on the C++ Bridge for Naradabrokering.
Current work includes adding a layer of functions to the bridge so that it can be used to discover agents and rootlet services in an architecture comprising multiple agents and rootlet services.

Started developing the C++ client and then started implementing the agent (java) as it is required to test the C++ client code.

To provide a secure agent/rootlet discovery mechanism, we discussed an architecture that utilizes both symmetric and asymmetric keys. The proposed message exchange pattern for discovering an agent and a rootlet service is as follows.

Client ------------------------------------------- Agent

->DiscoverAgentRequest with Client’s credentials.

<- DiscoverAgentResponse encrypted using a secret key
Secret key is encrypted using the Client’s public key
Message is signed using Agent’s Private key
Message contains Agent’s credentials.

->DiscoverRootletRequest with Client’s credentials
Message encrypted using the secret key (Shared during the previous step)
Signed using Client’s Private key

<- DiscoverRootletResponse Message encrypted using the secret key
Signed using the Agent’s Private key
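
The DiscoverAgentResponse step of this pattern can be sketched in Java using only the JDK's built-in cryptography. This is a sketch under stated assumptions, not the project's implementation: the Response fields, the choice of RSA with PKCS#1 padding, AES-128, SHA256withRSA, and signing the encrypted body (rather than the plain text) are all made up for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.Signature;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only: algorithm names and field layout are assumptions. */
class DiscoverySecuritySketch {
    /** Hypothetical contents of a DiscoverAgentResponse. */
    static class Response {
        byte[] wrappedKey;     // secret key encrypted with the client's public key
        byte[] encryptedBody;  // agent's credentials encrypted with the secret key
        byte[] signature;      // agent's signature over encryptedBody
    }

    static KeyPair newRsaKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Agent side: encrypt the body, wrap the secret key, sign the result. */
    static Response respond(KeyPair agent, PublicKey clientPublic, String body) {
        try {
            KeyGenerator keyGenerator = KeyGenerator.getInstance("AES");
            keyGenerator.init(128);
            SecretKey secret = keyGenerator.generateKey();

            Response response = new Response();
            Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
            rsa.init(Cipher.ENCRYPT_MODE, clientPublic);
            response.wrappedKey = rsa.doFinal(secret.getEncoded());

            Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
            aes.init(Cipher.ENCRYPT_MODE, secret);
            response.encryptedBody = aes.doFinal(body.getBytes(StandardCharsets.UTF_8));

            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(agent.getPrivate());
            signer.update(response.encryptedBody);
            response.signature = signer.sign();
            return response;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Client side: verify the signature, unwrap the secret key, decrypt the body. */
    static String open(Response response, KeyPair client, PublicKey agentPublic) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(agentPublic);
            verifier.update(response.encryptedBody);
            if (!verifier.verify(response.signature)) {
                throw new SecurityException("signature check failed");
            }
            Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
            rsa.init(Cipher.DECRYPT_MODE, client.getPrivate());
            SecretKey secret = new SecretKeySpec(rsa.doFinal(response.wrappedKey), "AES");

            Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
            aes.init(Cipher.DECRYPT_MODE, secret);
            return new String(aes.doFinal(response.encryptedBody), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair agent = newRsaKeyPair();
        KeyPair client = newRsaKeyPair();
        Response response = respond(agent, client.getPublic(), "agent-1 credentials");
        System.out.println(open(response, client, agent.getPublic()));
    }
}
```

After this step both sides hold the same secret key, so the later DiscoverRootletRequest and DiscoverRootletResponse only need the cheaper symmetric encryption plus a signature.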

Since we cannot exchange objects between Java and C++ in their native serialization form, a custom serialization mechanism is designed to transfer data between the clients and agents.
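
To give an idea of what such a mechanism can look like, here is a minimal length-prefixed encoding in Java. The frame layout (a 4-byte message type, a 4-byte length, then UTF-8 bytes) and the WireCodec name are assumptions for illustration and do not describe the format actually used. Java's DataOutputStream writes integers in big-endian order, so a C++ reader can decode the two header fields with ntohl.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical language-neutral frame: int32 type | int32 length | UTF-8 payload. */
class WireCodec {
    static byte[] encode(int messageType, String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(messageType);  // big-endian (network byte order)
            out.writeInt(body.length);  // lets the reader know how many bytes follow
            out.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String decodePayload(byte[] frame) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame))) {
            in.readInt();                          // message type, ignored in this sketch
            byte[] body = new byte[in.readInt()];  // length prefix
            in.readFully(body);
            return new String(body, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] frame = encode(1, "rootlet");
        System.out.println(frame.length + " bytes, payload = " + decodePayload(frame));
    }
}
```

Keeping the format to fixed-width integers and raw bytes avoids any dependence on Java object serialization, which the C++ side could not read.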

Currently working on getting the security framework working in both java and C++.

Friday, May 04, 2007

My Version of the Spring Semester

January - Diarrhea + Vomiting
- My wife got Chickenpox
February - Chickenpox
- Reactive Arthritis due to chickenpox
- Enlargement in my thyroid gland
March - Went back home
April - Thyroid Surgery
May - Came back to US

Monday, January 22, 2007

January 24th Report

During the last two weeks I completed the required benchmarks for the paper "A Scalable Approach for the Secure and Authorized Tracking of the Availability of Entities in Distributed Systems" by Shrideep Pallickara, Jaliya Ekanayake and Geoffrey Fox, which was accepted for IPDPS 2007.

In addition, I was researching the Hessian library, which can be used in the data transformation layer of the C++ bridge for Naradabrokering.