Posted to common-user@hadoop.apache.org by David Alves <da...@dei.uc.pt> on 2009/01/19 18:30:17 UTC

Java RMI and Hadoop RecordIO

Hi,
	I've been testing different serialization techniques as part of a
research project.
	I know that the motivation behind Hadoop's serialization mechanism
(e.g. Writable), and its enhancement through Record I/O, is not only
performance but also control over the input/output.
	Still, I've been running some simple tests and have found that plain
RMI beats Hadoop RecordIO almost every time (14-16% faster).
	In my test I have a simple Java class with 14 int fields and 1 long
field, and I'm serializing around 35000 instances.
	Am I doing anything wrong? Are there ways to improve performance in
RecordIO? Have I got the use case wrong?

Regards
David Alves

Re: Java RMI and Hadoop RecordIO

Posted by Steve Loughran <st...@apache.org>.

- Any speedups are welcome; people are looking at Protocol Buffers and
Thrift.
- Are you also measuring packet size and deserialization costs?
- Add a string or two,
- and references to other instances,
- then try pushing a few million around the network using the same
serialization stream instance.
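
As a rough illustration of the measurements suggested above, here is a minimal sketch that uses only the JDK's built-in object serialization (the mechanism RMI relies on), not Hadoop RecordIO. The class `SerializationMeasureSketch`, the record type `Sample`, and its field names are all hypothetical; the record has only three int fields for brevity, plus a string and a reference to another instance. It reports the serialized size and times the write and read phases separately.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SerializationMeasureSketch {
    // Hypothetical record: a few ints and a long (the class in this thread had 14 ints),
    // plus a string and a reference to another instance.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        int f1, f2, f3;
        long stamp;
        String label;
        Sample previous;
    }

    // Builds `count` records, each pointing at the one created before it.
    static List<Sample> makeSamples(int count) {
        List<Sample> samples = new ArrayList<>(count);
        Sample previous = null;
        for (int i = 0; i < count; i++) {
            Sample s = new Sample();
            s.f1 = i;
            s.f2 = i + 1;
            s.f3 = i + 2;
            s.stamp = 1000L * i;
            s.label = "sample-" + i;
            s.previous = previous;
            samples.add(s);
            previous = s;
        }
        return samples;
    }

    // Writes every record to ONE stream instance and returns the bytes produced.
    static byte[] serialize(List<Sample> samples) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            for (Sample s : samples) {
                out.writeObject(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads `count` records back and returns how many were decoded.
    static int deserialize(byte[] data, int count) {
        int read = 0;
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < count; i++) {
                Sample s = (Sample) in.readObject();
                if (s != null) {
                    read++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return read;
    }

    public static void main(String[] args) {
        List<Sample> samples = makeSamples(35_000);
        long t0 = System.nanoTime();
        byte[] data = serialize(samples);
        long t1 = System.nanoTime();
        int read = deserialize(data, samples.size());
        long t2 = System.nanoTime();
        System.out.printf("records=%d bytes=%d write=%dus read=%dus%n",
                read, data.length, (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}
```

A single timed pass like this is dominated by JIT warm-up, so the numbers are only indicative; repeating the run several times and discarding the first iterations gives a fairer comparison.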

I do use RMI a lot at work. Once you come up with a plan to deal with
its brittleness against change (we keep the code in the cluster up to
date and make no guarantees about compatibility across versions), it is
easy to use. But it has many, many problems, and if you hit one it is
very hard to deal with, because the code is deep in the JVM. One
example: when RMI sends an object graph over, it checks that it hasn't
already pushed a copy of each object earlier. The longer you keep a
serialization stream up, the slower it gets.
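
The bookkeeping behind that last point can be seen with plain `java.io.ObjectOutputStream`, which remembers every object it has written and emits a short back-reference when it sees the same object again. The sketch below is only an illustration under that assumption: the class `StreamResetSketch` and the record type `Sample` are hypothetical, and it shows the effect on stream size, not the slowdown itself. Calling `reset()` discards the stream's memory of earlier objects, so each write sends the full object again.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class StreamResetSketch {
    // Hypothetical record type, loosely modelled on the test class in this thread.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        int a = 1, b = 2, c = 3;
        long id = 42L;
    }

    // Writes the SAME instance `count` times to one stream and returns the bytes produced.
    static int bytesWritten(int count, boolean resetBetweenWrites) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            Sample sample = new Sample();
            for (int i = 0; i < count; i++) {
                out.writeObject(sample);
                if (resetBetweenWrites) {
                    out.reset(); // forget all objects written so far
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        // Without reset, repeats are written as short back-references.
        System.out.println("no reset:   " + bytesWritten(1000, false) + " bytes");
        // With reset, the full object and its class description are written every time.
        System.out.println("with reset: " + bytesWritten(1000, true) + " bytes");
    }
}
```

When a long-lived stream carries many distinct objects rather than one repeated instance, the stream keeps a reference to each of them until `reset()` is called, so a periodic `reset()` (or `writeUnshared`) is one way to bound that table, at the cost of re-sending class descriptions.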

-steve