Posted to common-user@hadoop.apache.org by David Alves <da...@dei.uc.pt> on 2009/01/19 18:30:17 UTC
Java RMI and Hadoop RecordIO
Hi
I've been testing some different serialization techniques to go
along with a research project.
I know the motivation behind the Hadoop serialization mechanism (e.g.
Writable), and the enhancement of this feature through Record I/O, is
not only performance but also control of the input/output.
Still, I've been running some simple tests and I've found that plain
RMI beats Hadoop RecordIO almost every time (14-16% faster).
In my test I have a simple Java class that has 14 int fields and 1
long field, and I'm serializing around 35000 instances.
Am I doing anything wrong? Are there ways to improve performance in
RecordIO? Have I got the use case wrong?
Regards
David Alves
Re: Java RMI and Hadoop RecordIO
Posted by Steve Loughran <st...@apache.org>.
David Alves wrote:
> Hi
> I've been testing some different serialization techniques to go
> along with a research project.
> I know the motivation behind the Hadoop serialization mechanism (e.g.
> Writable), and the enhancement of this feature through Record I/O, is
> not only performance but also control of the input/output.
> Still, I've been running some simple tests and I've found that plain
> RMI beats Hadoop RecordIO almost every time (14-16% faster).
> In my test I have a simple Java class that has 14 int fields and 1
> long field, and I'm serializing around 35000 instances.
> Am I doing anything wrong? Are there ways to improve performance in
> RecordIO? Have I got the use case wrong?
>
> Regards
> David Alves
>
- Any speedups are welcome; people are looking at Protocol Buffers and
Thrift.
- Are you also measuring packet size and deserialization costs?
- Add a string or two,
- and references to other instances,
- then try pushing a few million round the network using the same
serialization stream instance.
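The packet-size question above can be checked with a small sketch like the following. It is not the benchmark from this thread: the Sample class, its field names, and the SizeCheck helper are all hypothetical, and the hand-written DataOutputStream encoding is only a stand-in in the spirit of Writable, not Hadoop's actual Record I/O wire format. It compares the bytes produced by plain Java serialization against a fixed-width field-by-field encoding for the same records.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SizeCheck {
    // Hypothetical record with the shape described in the thread: 14 ints and 1 long.
    static class Sample implements Serializable {
        private static final long serialVersionUID = 1L;
        int a, b, c, d, e, f, g, h, i, j, k, l, m, n;
        long stamp;

        Sample(int seed) {
            a = b = c = d = e = f = g = seed;
            h = i = j = k = l = m = n = seed + 1;
            stamp = seed * 1000L;
        }

        // Hand-written encoding (assumption: a stand-in, not Hadoop's real format).
        void write(DataOutputStream out) throws IOException {
            int[] ints = {a, b, c, d, e, f, g, h, i, j, k, l, m, n};
            for (int v : ints) out.writeInt(v);
            out.writeLong(stamp);
        }
    }

    // Bytes produced by writing `count` records through one ObjectOutputStream.
    static int javaSerializedSize(int count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int s = 0; s < count; s++) out.writeObject(new Sample(s));
        }
        return bytes.size();
    }

    // Bytes produced by the fixed-width encoding: 14 * 4 + 8 = 64 per record.
    static int fixedWidthSize(int count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int s = 0; s < count; s++) new Sample(s).write(out);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        int count = 35000;
        System.out.println("ObjectOutputStream bytes: " + javaSerializedSize(count));
        System.out.println("DataOutputStream bytes:   " + fixedWidthSize(count));
    }
}
```

Size is only one part of the picture; timing the read side as well, and adding strings or references between instances, would follow the same pattern.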
I do use RMI a lot at work. Once you come up with a plan to deal with
its brittleness against change (we keep the code in the cluster up to
date and make no guarantees about compatibility across versions), it is
easy to use. But it has many, many problems, and if you hit one, it is
very hard to deal with because the code is deep in the JVM. One example:
when RMI sends an object graph over, it checks that it hasn't already
pushed a copy of each object earlier. The longer you keep a
serialization stream up, the slower it gets.
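The "hasn't pushed a copy over earlier" behaviour can be seen directly in java.io.ObjectOutputStream, which remembers every object it has written and emits a back-reference on a repeat write. A minimal sketch, assuming a hypothetical Point class and HandleTableDemo helper (neither is from this thread); calling reset() is shown as one possible way to clear that memory, not as a recommendation made here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class HandleTableDemo {
    // Hypothetical mutable object used only for illustration.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x;
        Point(int x) { this.x = x; }
    }

    // Writes the same object twice, mutating it in between, and returns
    // the x values seen by the reader for the first and second read.
    static int[] roundTrip(boolean resetBetweenWrites)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            Point p = new Point(1);
            out.writeObject(p);
            p.x = 2;
            if (resetBetweenWrites) {
                out.reset(); // forget previously written objects
            }
            out.writeObject(p); // without reset, only a back-reference is written
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Point first = (Point) in.readObject();
            Point second = (Point) in.readObject();
            return new int[] {first.x, second.x};
        }
    }

    public static void main(String[] args) throws Exception {
        int[] kept = roundTrip(false);
        int[] cleared = roundTrip(true);
        System.out.println("no reset:   " + kept[0] + ", " + kept[1]);
        System.out.println("with reset: " + cleared[0] + ", " + cleared[1]);
    }
}
```

Without the reset, the reader gets the original state twice because the second write is only a reference to the first; the stream's record of written objects also keeps growing for as long as the stream stays open.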
-steve