Posted to common-user@hadoop.apache.org by Curt Cox <cu...@gmail.com> on 2006/09/25 04:54:00 UTC

Why not use Serializable?

Hi,

I'm curious why the new Writable interface was chosen rather than
using Serializable.  Handling version changes can be quite painful
with standard serialization, but nothing seems to implement
VersionedWritable.  Introducing a new interface makes interoperation
with other code more clumsy, since it means converting back and forth
between standard and writable representations.

Was it for speed?  If so, I have implemented something similar to
Writable that has the advantage that no particular interface needs to
be implemented.  Consequently, the standard datatypes
(java.lang.Integer, etc.) can be used.  The idea is that IO proxies
register for each class.
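
A minimal sketch of that proxy idea (the names here are hypothetical,
not from any actual Hadoop API):

import java.io.*;
import java.util.*;

interface IoProxy<T> {
  void write(T value, DataOutput out) throws IOException;
  T read(DataInput in) throws IOException;
}

final class IoRegistry {
  private static final Map<Class<?>, IoProxy<?>> proxies =
      new HashMap<Class<?>, IoProxy<?>>();

  static <T> void register(Class<T> type, IoProxy<T> proxy) {
    proxies.put(type, proxy);
  }

  @SuppressWarnings("unchecked")
  static <T> IoProxy<T> lookup(Class<T> type) {
    return (IoProxy<T>) proxies.get(type);
  }

  static {
    // A plain java.lang.Integer works; no new interface is required.
    register(Integer.class, new IoProxy<Integer>() {
      public void write(Integer v, DataOutput out) throws IOException {
        out.writeInt(v.intValue());
      }
      public Integer read(DataInput in) throws IOException {
        return Integer.valueOf(in.readInt());
      }
    });
  }
}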

Whatever answers result from this thread should be incorporated into
the Javadoc.

Thanks,
Curt

Re: Why not use Serializable?

Posted by Doug Cutting <cu...@apache.org>.
Feng Jiang wrote:
> In my implementation, I still permit out-of-order RPC calls in the same
> way.  The only differences between my impl and your previous impl are:
> 1. I used a thread pool (JDK 1.5) to replace the "Handler" threads.  I
> believe the JDK's impl should not be worse than our own, and the thread
> pool can grow or shrink dynamically; moreover, I can control the pool
> size, core size, etc.
> 2. Since I put the connection into the thread pool directly, I removed
> the queue, and that is why I believe my impl should use less memory than
> your previous impl.  Because you have only 100 threads, if the 101st
> client sends a request to you, you have to put the connection into the
> queue, so the queue may become bigger and bigger.  In my impl, if the
> 101st client comes, the listener is blocked until the thread pool is
> available (if I limit the thread pool's max size; if not, the listener
> is never blocked, and each request always has a chance to be executed).

To me it seems that ThreadPoolExecutor must still incur a thread switch 
per request, so I don't understand why it should be much faster.  And 
the cost of the queue should not be great.  Its intent was to absorb 
small bursts of traffic without increasing the number of worker threads.

If you believe that Java 1.5's ThreadPoolExecutor would improve things 
significantly, please submit a patch to Server.java, changing it to use 
this.  Please attach the patch to a bug in Jira.  Even if it does not 
prove significantly faster, making the Hadoop code smaller could alone 
justify the change.

Thanks,

Doug

Re: Why not use Serializable?

Posted by Feng Jiang <fe...@gmail.com>.
On 9/27/06, Doug Cutting <cu...@apache.org> wrote:
>
> Feng Jiang wrote:
> > As for the IPC implementation (it was called RPC about one year ago),
> > I think it has a performance problem.  I don't know why the Listener
> > has to read the data, prepare the Call instance, and then put the Call
> > instance into a queue.  The reading process may take a long time, and
> > other Calls are blocked.  I think that is the reason why you have to
> > catch the OutOfMemoryError.
>
> I don't completely understand your concern.  The queue is used to limit
> the number of requests that will be processed simultaneously by worker
> threads.  A single thread reads requests off the socket and queues them
> for worker threads to process and write results.  Unless we break
> requests and responses into chunks and re-assemble them, the reading and
> writing of individual requests and responses must block other requests
> and responses on that connection.
>
> Previously there was a server thread per connection, reading requests.
> (Responses are written directly by worker threads.)  Recently this has
> been re-written so that a single thread uses nio selectors to read
> requests from all connections.  Calls are still queued for worker
> threads to process and reply.
>
> > I once implemented another RPC server and client based on your code,
> > simply using Java's thread pool; the performance was about 10-20%
> > higher than that of the implementation of one year ago.  The main
> > algorithm is:
> > 1. The server has a thread pool; initially it is small, and it grows
> > as needed.
> > 2. The listener accepts a connection, then puts the connection into
> > the thread pool.
> > 3. The threads in the pool handle everything for the connection.
>
> This does not permit a slow request from a given client to be passed by
> a fast request from that client, i.e., out-of-order responses.  The IPC
> protocol was originally designed for search, where some queries take
> significantly longer than others.  All front-ends send each query to all
> backends.  In this case we do not want queries which are fast to process
> to be stuck behind those which are slow to process.
>
> Permitting out-of-order responses does increase the base latency a bit,
> since a thread switch is required between reading the request and
> writing the response.  Perhaps that accounts for the 10-20% difference
> you saw.  It is possible to combine the approaches.  Worker threads
> could use synchronization to directly listen on the socket and process
> requests as soon as they are complete.  That would avoid the thread
> switch, but I think it would be complicated to develop.


I understand the importance of out-of-order RPC calls.  Actually, the "id"
attribute in the "Call" class is used to identify the Call object on the
client side: since the Client object may send many requests, it needs the
"id" to know which Call has been completed.

In my implementation, I still permit out-of-order RPC calls in the same
way.  The only differences between my impl and your previous impl are:
1. I used a thread pool (JDK 1.5) to replace the "Handler" threads.  I
believe the JDK's impl should not be worse than our own, and the thread
pool can grow or shrink dynamically; moreover, I can control the pool
size, core size, etc.
2. Since I put the connection into the thread pool directly, I removed
the queue, and that is why I believe my impl should use less memory than
your previous impl.  Because you have only 100 threads, if the 101st
client sends a request to you, you have to put the connection into the
queue, so the queue may become bigger and bigger.  In my impl, if the
101st client comes, the listener is blocked until the thread pool is
available (if I limit the thread pool's max size; if not, the listener
is never blocked, and each request always has a chance to be executed).

> > This way it saves memory, and the thread pool's scheduling should be
> > better than our own.
>
> We are using a thread pool for worker threads.  The difference is that
> worker threads call wait() to wait for a request, the listener
> thread calls notify() to tell the worker thread that a request is ready,
> and the scheduler must switch from running the listener to the worker.
>
> Doug
>

Re: Why not use Serializable?

Posted by Doug Cutting <cu...@apache.org>.
Feng Jiang wrote:
> As for the IPC implementation (it was called RPC about one year ago), I
> think it has a performance problem.  I don't know why the Listener has
> to read the data, prepare the Call instance, and then put the Call
> instance into a queue.  The reading process may take a long time, and
> other Calls are blocked.  I think that is the reason why you have to
> catch the OutOfMemoryError.

I don't completely understand your concern.  The queue is used to limit 
the number of requests that will be processed simultaneously by worker 
threads.  A single thread reads requests off the socket and queues them 
for worker threads to process and write results.  Unless we break 
requests and responses into chunks and re-assemble them, the reading and 
writing of individual requests and responses must block other requests 
and responses on that connection.

Previously there was a server thread per connection, reading requests. 
(Responses are written directly by worker threads.)  Recently this has 
been re-written so that a single thread uses nio selectors to read 
requests from all connections.  Calls are still queued for worker 
threads to process and reply.

> I once implemented another RPC server and client based on your code,
> simply using Java's thread pool; the performance was about 10-20%
> higher than that of the implementation of one year ago.  The main
> algorithm is:
> 1. The server has a thread pool; initially it is small, and it grows as needed.
> 2. The listener accepts a connection, then puts the connection into the thread pool.
> 3. The threads in the pool handle everything for the connection.

This does not permit a slow request from a given client to be passed by 
a fast request from that client, i.e., out-of-order responses.  The IPC 
protocol was originally designed for search, where some queries take 
significantly longer than others.  All front-ends send each query to all 
backends.  In this case we do not want queries which are fast to process 
to be stuck behind those which are slow to process.

Permitting out-of-order responses does increase the base latency a bit, 
since a thread switch is required between reading the request and 
writing the response.  Perhaps that accounts for the 10-20% difference 
you saw.  It is possible to combine the approaches.  Worker threads 
could use synchronization to directly listen on the socket and process 
requests as soon as they are complete.  That would avoid the thread 
switch, but I think it would be complicated to develop.

> This way it saves memory, and the thread pool's scheduling should be
> better than our own.

We are using a thread pool for worker threads.  The difference is that 
worker threads call wait() to wait for a request, the listener 
thread calls notify() to tell the worker thread that a request is ready, 
and the scheduler must switch from running the listener to the worker.
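
In outline, that hand-off looks something like this (a simplified
sketch of the pattern, not the actual Server.java):

import java.util.LinkedList;

// Bounded queue between the single listener thread and the worker
// (handler) threads.  Each put()/take() pair costs a thread switch.
class CallQueue {
  private final LinkedList<Object> queue = new LinkedList<Object>();
  private final int maxSize;

  CallQueue(int maxSize) { this.maxSize = maxSize; }

  // Called by the listener thread after it has read a full request.
  synchronized void put(Object call) throws InterruptedException {
    while (queue.size() >= maxSize) wait();   // absorb bursts, then block
    queue.addLast(call);
    notifyAll();                              // wake a waiting worker
  }

  // Called by each worker thread in its loop.
  synchronized Object take() throws InterruptedException {
    while (queue.isEmpty()) wait();
    Object call = queue.removeFirst();
    notifyAll();                              // unblock the listener
    return call;
  }
}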

Doug

Re: Why not use Serializable?

Posted by Feng Jiang <fe...@gmail.com>.
On 9/26/06, Doug Cutting <cu...@apache.org> wrote:
>
> Curt Cox wrote:
> > Let me restate, so you can tell me if I'm wrong.  "Writable is used
> > instead of Serializable because it provides a more compact stream
> > format and allows easier random access.  They have different
> > semantics, but don't have a major impact on versioning."
>
> Serialization's formats are also somewhat more complex for
> interoperation with other programming languages.  Hadoop, long-term,
> would like to provide easy data interoperability.  The current attempt
> at this is the record i/o package:
>
>
> http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/record/package-summary.html
>
> Java's Serialization protocol would complicate things somewhat:
>
>
> http://java.sun.com/j2se/1.5.0/docs/guide/serialization/spec/protocol.html#10258
>
> This grammar would need to be re-implemented for each language.
>
> > In my experience, using Serialization instead of DataInput/DataOutput
> > streams has a major impact on versioning.  Serialization keeps a lot
> > of metadata in the stream.  This makes detecting format changes very
> > easy, but can really complicate backward compatibility.  Also,
> > serialization is geared toward preserving the connections of an object
> > graph, which is behind a lot of the differences you mentioned.
>
> That all sounds correct.
>
> > You didn't address the interoperability advantage of using standard
> > Java classes instead of Writables.  As I mentioned, while using
> > serialization would provide this benefit, it isn't necessary for it.
> > You could provide a mechanism for Writers to be registered for
> > classes.  So, instead of IntWritable, users could just use a normal
> > Integer.  The stream would be byte-for-byte identical to what it is
> > now, but users could work with standard types.
>
> You're right, that is an advantage of Serialization.  Each class has a
> default (if big & slow) serialized form.  The record package (linked
> above) is Hadoop's current plan to provide more efficient automatic
> serialization and language interoperability at the same time.
>
> (A side note: Integer isn't the best example.  In Hadoop's RPC we use
> ObjectWritable, so RPC protocols can already pass and return some
> primitive types like int, long and float.  Since these types are not
> classes, IntWritable isn't much more awkward than Integer: values must
> still be explicitly wrapped.)
>
> But we could also extend ObjectWritable to use introspection and
> automatically serialize arbitrary objects.
>
> ----
>
> Let me try to answer the original question again: Why didn't I use
> Serialization when we first started Hadoop?  Because it looked
> big-and-hairy and I thought we needed something lean-and-mean, where we
> had precise control over exactly how objects are written and read, since
> that is central to Hadoop.  With Serialization you can get some control,
> but you have to fight for it.
>
> The logic for not using RMI was similar.  Effective, high-performance
> inter-process communications are critical to Hadoop.  I felt like we'd
> need to precisely control how things like connections, timeouts and
> buffers are handled, and RMI gives you little control over those.


As for the IPC implementation (it was called RPC about one year ago), I
think it has a performance problem.  I don't know why the Listener has
to read the data, prepare the Call instance, and then put the Call
instance into a queue.  The reading process may take a long time, and
other Calls are blocked.  I think that is the reason why you have to
catch the OutOfMemoryError.

I once implemented another RPC server and client based on your code,
simply using Java's thread pool; the performance was about 10-20%
higher than that of the implementation of one year ago.  The main
algorithm is:
1. The server has a thread pool; initially it is small, and it grows as needed.
2. The listener accepts a connection, then puts the connection into the thread pool.
3. The threads in the pool handle everything for the connection.

This way it saves memory, and the thread pool's scheduling should be
better than our own.
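
The shape of that server loop is roughly the following (a hypothetical
sketch, not a patch; PooledServer and handle() are made-up names):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// The listener hands each accepted connection straight to a JDK 1.5
// thread pool; there is no separate call queue.
class PooledServer {
  private final ExecutorService pool =
      Executors.newCachedThreadPool();     // grows and shrinks on demand

  void serve(int port) throws IOException {
    ServerSocket listener = new ServerSocket(port);
    while (true) {
      final Socket connection = listener.accept();
      pool.execute(new Runnable() {
        public void run() {
          handle(connection);  // read request, process, write response
        }
      });
    }
  }

  void handle(Socket connection) {
    // application-specific request handling goes here
  }
}

With a bounded pool instead, the submission could be made to block the
listener when all threads are busy (for example, with a custom
RejectedExecutionHandler), matching the 101st-client behavior described
earlier in the thread.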

Feng


> A quick search turns up the following on the mail archives:
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200508.mbox/%3C20050808215612.B75C010FB2B4@asf.osuosl.org%3E
>
> http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200602.mbox/%3C43ED0500.7000500@apache.org%3E
>
> The latter ends with:
>
> > In summary, I don't think I'd reject a patch that makes this change, but
> > I also would not personally wish to spend a lot of effort implementing
> > it, since I don't see a huge value.
>
> That remains my opinion.  One could probably migrate Hadoop to use
> Serializable and/or Externalizable, and possibly do so without a huge
> performance impact, and the system might even be easier to use.  If
> someone achieved that, I'd say, "Bravo!" and commit the patch.
>
> Doug
>

Re: Why not use Serializable?

Posted by Doug Cutting <cu...@apache.org>.
Curt Cox wrote:
> Let me restate, so you can tell me if I'm wrong.  "Writable is used
> instead of Serializable because it provides a more compact stream
> format and allows easier random access.  They have different
> semantics, but don't have a major impact on versioning."

Serialization's formats are also somewhat more complex for 
interoperation with other programming languages.  Hadoop, long-term, 
would like to provide easy data interoperability.  The current attempt 
at this is the record i/o package:

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/record/package-summary.html

Java's Serialization protocol would complicate things somewhat:

http://java.sun.com/j2se/1.5.0/docs/guide/serialization/spec/protocol.html#10258

This grammar would need to be re-implemented for each language.

> In my experience, using Serialization instead of DataInput/DataOutput
> streams has a major impact on versioning.  Serialization keeps a lot
> of metadata in the stream.  This makes detecting format changes very
> easy, but can really complicate backward compatibility.  Also,
> serialization is geared toward preserving the connections of an object
> graph, which is behind a lot of the differences you mentioned.

That all sounds correct.

> You didn't address the interoperability advantage of using standard
> Java classes instead of Writables.  As I mentioned, while using
> serialization would provide this benefit, it isn't necessary for it.
> You could provide a mechanism for Writers to be registered for
> classes.  So, instead of IntWritable, users could just use a normal
> Integer.  The stream would be byte-for-byte identical to what it is
> now, but users could work with standard types.

You're right, that is an advantage of Serialization.  Each class has a 
default (if big & slow) serialized form.  The record package (linked 
above) is Hadoop's current plan to provide more efficient automatic 
serialization and language interoperability at the same time.

(A side note: Integer isn't the best example.  In Hadoop's RPC we use 
ObjectWritable, so RPC protocols can already pass and return some 
primitive types like int, long and float.  Since these types are not 
classes, IntWritable isn't much more awkward than Integer: values must 
still be explicitly wrapped.)

But we could also extend ObjectWritable to use introspection and 
automatically serialize arbitrary objects.
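
A rough sketch of that introspection idea (not actual ObjectWritable
code; the field handling is illustrative and incomplete):

import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Walk an object's fields reflectively and write each one, so that
// arbitrary classes need no hand-written write() method.
class ReflectiveWriter {
  static void write(Object obj, DataOutput out) throws IOException {
    try {
      for (Field f : obj.getClass().getDeclaredFields()) {
        int mods = f.getModifiers();
        if (Modifier.isStatic(mods) || Modifier.isTransient(mods))
          continue;                          // skip non-data fields
        f.setAccessible(true);
        Class<?> t = f.getType();
        if (t == int.class)         out.writeInt(f.getInt(obj));
        else if (t == long.class)   out.writeLong(f.getLong(obj));
        else if (t == float.class)  out.writeFloat(f.getFloat(obj));
        else if (t == String.class) out.writeUTF((String) f.get(obj));
        else write(f.get(obj), out);         // recurse on nested objects
      }
    } catch (IllegalAccessException e) {
      throw (IOException) new IOException().initCause(e);
    }
  }
}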

----

Let me try to answer the original question again: Why didn't I use 
Serialization when we first started Hadoop?  Because it looked 
big-and-hairy and I thought we needed something lean-and-mean, where we 
had precise control over exactly how objects are written and read, since 
that is central to Hadoop.  With Serialization you can get some control, 
but you have to fight for it.

The logic for not using RMI was similar.  Effective, high-performance 
inter-process communications are critical to Hadoop.  I felt like we'd 
need to precisely control how things like connections, timeouts and 
buffers are handled, and RMI gives you little control over those.

A quick search turns up the following on the mail archives:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200508.mbox/%3C20050808215612.B75C010FB2B4@asf.osuosl.org%3E
http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200602.mbox/%3C43ED0500.7000500@apache.org%3E

The latter ends with:

> In summary, I don't think I'd reject a patch that makes this change, but 
> I also would not personally wish to spend a lot of effort implementing 
> it, since I don't see a huge value.

That remains my opinion.  One could probably migrate Hadoop to use 
Serializable and/or Externalizable, and possibly do so without a huge 
performance impact, and the system might even be easier to use.  If 
someone achieved that, I'd say, "Bravo!" and commit the patch.

Doug

Re: Why not use Serializable?

Posted by Doug Cutting <cu...@apache.org>.
Curt Cox wrote:
> In my experience, using Serialization instead of DataInput/DataOutput
> streams has a major impact on versioning.  Serialization keeps a lot
> of metadata in the stream.  This makes detecting format changes very
> easy, but can really complicate backward compatibility.

FYI, Owen has just proposed a mechanism for object & API versioning that 
stores the versions separately from the data.

http://issues.apache.org/jira/browse/HADOOP-558

Doug

Re: Why not use Serializable?

Posted by Curt Cox <cu...@gmail.com>.
Doug,

Let me restate, so you can tell me if I'm wrong.  "Writable is used
instead of Serializable because it provides a more compact stream
format and allows easier random access.  They have different
semantics, but don't have a major impact on versioning."

In my experience, using Serialization instead of DataInput/DataOutput
streams has a major impact on versioning.  Serialization keeps a lot
of metadata in the stream.  This makes detecting format changes very
easy, but can really complicate backward compatibility.  Also,
serialization is geared toward preserving the connections of an object
graph, which is behind a lot of the differences you mentioned.

You didn't address the interoperability advantage of using standard
Java classes instead of Writables.  As I mentioned, while using
serialization would provide this benefit, it isn't necessary for it.
You could provide a mechanism for Writers to be registered for
classes.  So, instead of IntWritable, users could just use a normal
Integer.  The stream would be byte-for-byte identical to what it is
now, but users could work with standard types.

- Curt

Re: Why not use Serializable?

Posted by Doug Cutting <cu...@apache.org>.
Curt Cox wrote:
> I'm curious why the new Writable interface was chosen rather than
> using Serializable.

The Writable interface is subtly different from Serializable. 
Serializable does not assume the class of stored values is known.  So 
each instance is tagged with its class.  ObjectOutputStream and 
ObjectInputStream optimize this somewhat, so that 5-byte handles are 
written for instances of a class after the first.  But object sequences 
with handles cannot then be accessed randomly, since they rely on stream 
state.  This complicates things like sorting.

Writable, on the other hand, assumes that the application knows the 
expected class.  The application must be able to create an instance in 
order to call readFields().  So the class need not be stored with each 
instance.  This results in considerably more compact binary files, 
straightforward random access and generally higher performance.
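
The contract looks like this (the interface shape follows
org.apache.hadoop.io.Writable; HitCount is a made-up record class):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// The reader knows the class, creates the instance itself, and fills
// it in via readFields(), so no per-instance class tag is stored.
class HitCount implements Writable {
  int count;

  public void write(DataOutput out) throws IOException {
    out.writeInt(count);                     // 4 bytes, nothing else
  }

  public void readFields(DataInput in) throws IOException {
    count = in.readInt();
  }
}

Reading a stream of these is just "HitCount h = new HitCount();
h.readFields(in);" in a loop, and fixed-size records allow direct seeks.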

Arguably Hadoop could use Serializable.  One could override writeObject 
or writeExternal for each class whose serialization was performance 
critical.  (MapReduce is very i/o intensive, so nearly every class's 
serialization is performance critical.)  One could implement 
ObjectOutputStream.writeObjectOverride() and 
ObjectInputStream.readObjectOverride() to use a more compact 
representation that, e.g., did not need to tag each top-level instance 
in a file with its class.  This would probably require at least as much 
code as Hadoop has in Writable, ObjectWritable, etc., and the code would 
be a bit more complicated, since it would be trying to work around a 
different typing model.  But it might have the advantage of better 
built-in version control.  Or would it?
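
For reference, the Externalizable route looks like this (HitCount is a
made-up example class):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Externalizable gives a compact, hand-written field format, but the
// stream still tags each top-level object with class metadata unless
// the ObjectOutputStream/ObjectInputStream themselves are overridden.
class HitCount implements Externalizable {
  int count;

  public HitCount() {}                 // public no-arg ctor is required

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeInt(count);
  }

  public void readExternal(ObjectInput in) throws IOException {
    count = in.readInt();
  }
}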

Serializable's version mechanism is to have classes define a static 
field named serialVersionUID.  This permits one to protect against 
incompatible changes, but does not easily permit one to implement 
backward compatibility.  For that, the application must explicitly deal 
with versions.  It must reason, in a class-specific manner, about the 
version that was written while reading, to decide what to do.  But 
Serializable's version mechanism does not support this any more or less 
than Writable.

See, for example, the "Design Considerations" section of:

http://www.javaworld.com/javaworld/jw-02-2006/jw-0227-control_p.html
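
A concrete shape of that explicit, class-specific version handling (a
sketch with a made-up record; Hadoop's VersionedWritable embodies a
similar idea):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

class PersonRecord {
  static final byte VERSION = 2;
  String name;
  long birthday;                       // field added in version 2

  void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);            // writer stamps its version
    out.writeUTF(name);
    out.writeLong(birthday);
  }

  void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    name = in.readUTF();
    if (version >= 2) {
      birthday = in.readLong();        // newer data carries the field
    } else {
      birthday = 0L;                   // old data: field absent
    }
  }
}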

So, in summary, I don't think Serializable holds many advantages: it 
wouldn't substantially reduce the amount of code, and it wouldn't solve 
versioning.

Doug

Re: Why not use Serializable?

Posted by Feng Jiang <fe...@gmail.com>.
Hi Curt,

In my mind, the advantage of Writable is that when I read the code and
find a class implementing Writable, I know it must be used as an RPC
(MapReduce context) parameter.  Another advantage is that you can control
which attributes you want to output and which attributes you don't want
to output.  For example, say you have a class named "Person" that looks
like:

class Person {
  private String name;
  private Date birthday;
  private int age;
  //...
}

When you output the Person instance, you only need to output the name
and the birthday, because the age can be calculated from the birthday
and "now" (this is just a sample, but in the real world you really do
have to do things like this).  But if you use Java's Serializable, the
"age" attribute must be transient; if you forget to make it transient,
you may get into trouble.  Moreover, suppose you want to output the
"name" attribute as some other encoded string, such as GBK encoding
(%XX%XX), for another application written in C++.  How can you do that?

By using a distinct Writable interface, you can control everything.
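
For instance (a sketch using Writable's method shapes; the GBK step is
only illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Date;

class Person {
  private String name;
  private Date birthday;
  private int age;                     // derived; never serialized

  public void write(DataOutput out) throws IOException {
    byte[] gbkName = name.getBytes("GBK");   // encoding chosen for the
    out.writeInt(gbkName.length);            // C++ consumer
    out.write(gbkName);
    out.writeLong(birthday.getTime());
  }

  public void readFields(DataInput in) throws IOException {
    byte[] gbkName = new byte[in.readInt()];
    in.readFully(gbkName);
    name = new String(gbkName, "GBK");
    birthday = new Date(in.readLong());
    age = ageFrom(birthday);                 // recomputed, not stored
  }

  private static int ageFrom(Date b) {
    long ms = System.currentTimeMillis() - b.getTime();
    return (int) (ms / (365L * 24 * 60 * 60 * 1000));
  }
}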

Best wishes,
Feng

On 9/25/06, Curt Cox <cu...@gmail.com> wrote:
>
> "but when you develop a big project, you will find you need another
> way to help you identity which type is serializable in MR context just
> by a glance."
>
> Please explain what you mean by "serializable in MR context".  How is
> the fact that lots of stuff already serializable a reason to favor
> Writable?
>
> - Curt
>

Re: Why not use Serializable?

Posted by Curt Cox <cu...@gmail.com>.
"but when you develop a big project, you will find you need another
way to help you identity which type is serializable in MR context just
by a glance."

Please explain what you mean by "serializable in MR context".  How is
the fact that lots of stuff already serializable a reason to favor
Writable?

- Curt

Re: Why not use Serializable?

Posted by Feng Jiang <fe...@gmail.com>.
Using Java's serialization, almost every type in the Java language is
serializable.  But when you develop a big project, you will find you need
another way to help you identify which types are serializable in the MR
context just by a glance.  Then a distinct Writable interface may be a
good way, although defining your own IntWritable is a little bit painful.

On 9/25/06, Curt Cox <cu...@gmail.com> wrote:
>
> Hi,
>
> I'm curious why the new Writable interface was chosen rather than
> using Serializable.  Handling version changes can be quite painful
> with standard serialization, but nothing seems to implement
> VersionedWritable.  Introducing a new interface makes interoperation
> with other code more clumsy, since it means converting back and forth
> between standard and writable representations.
>
> Was it for speed?  If so, I have implemented something similar to
> Writable that has the advantage that no particular interface needs to
> be implemented.  Consequently, the standard datatypes
> (java.lang.Integer, etc.) can be used.  The idea is that IO proxies
> register for each class.
>
> Whatever answers result from this thread should be incorporated into
> the Javadoc.
>
> Thanks,
> Curt
>