Posted to common-dev@hadoop.apache.org by "Alex Loddengaard (JIRA)" <ji...@apache.org> on 2008/09/10 10:12:44 UTC

[jira] Updated: (HADOOP-3788) Add serialization for Protocol Buffers

     [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Loddengaard updated HADOOP-3788:
-------------------------------------

        Fix Version/s: 0.19.0
    Affects Version/s: 0.19.0
         Release Note: 
This is an in-progress patch.  It is being submitted with failing tests because I am seeking help and advice from others.

See my comment for more information.
         Hadoop Flags: [Incompatible change]
               Status: Patch Available  (was: Open)

Per Tom's advice, I created a Protocol Buffer (PB) serialization framework and tests to show its usage.  I used HADOOP-3787 as a guide while doing so.

I ran into a problem, though.  My test, _TestPBSerialization_, is precisely the same as the test in HADOOP-3787, except that it uses PBs instead of Thrift.  My test threw PB exceptions while deserializing.  I engaged in a dialog with a Google employee via the PB Google Group in hopes of diagnosing the problem.  Our thread can be found [here|http://groups.google.com/group/protobuf/browse_thread/thread/19ab6bbb364fef35].  The key point of the thread is that the _InputStream_ passed to a PB _Message_ instance during deserialization cannot have trailing binary data.  For example, if a _Message_ instance is serialized to "<binary>asdf", then giving "<binary>asdf<arbitrary binary>" to a PB deserializer will fail with a PB exception.  This is because serialized PB _Message_ instances are not self-delimiting, a design decision Google made to keep the serialized format small and fast.
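
To illustrate the failure mode, here is a minimal sketch.  It assumes a generated _StringMessage_ type with a single string field (the same shape as my test message); the type, its accessors, and the junk byte are illustrative:

{code:java}
import com.google.protobuf.InvalidProtocolBufferException;

public class TrailingBytesDemo {
  public static void main(String[] args) throws Exception {
    StringMessage msg = StringMessage.newBuilder().setValue("testKey").build();

    byte[] clean = msg.toByteArray();             // exactly one serialized message
    byte[] padded = new byte[clean.length + 1];
    System.arraycopy(clean, 0, padded, 0, clean.length);
    padded[clean.length] = 0x0F;                  // junk byte: field 1, invalid wire type 7

    System.out.println(StringMessage.parseFrom(clean).getValue());  // parses fine: "testKey"
    try {
      StringMessage.parseFrom(padded);            // tries to decode the junk as more fields
    } catch (InvalidProtocolBufferException e) {
      // Nothing in the encoding tells the parser where the message ends,
      // so it reads into the trailing byte and fails here.
      System.err.println("trailing data broke parsing: " + e.getMessage());
    }
  }
}
{code}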

I created a second test, _TestPBSerializationIsolated_, that demonstrates the correctness of _PBSerializer_ and _PBDeserializer_, the two classes that actually do the work of serializing and deserializing.  This test passed, hinting that the problem is an incompatibility between Hadoop's current stream handling and PBs rather than a bug in these classes.
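
For context, the two classes plug into Hadoop's _org.apache.hadoop.io.serializer_ interfaces roughly as follows.  This is a simplified sketch rather than the exact patch contents, and it hard-codes the illustrative _StringMessage_ type:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.Serializer;

class PBSerializer implements Serializer<StringMessage> {
  private OutputStream out;

  public void open(OutputStream out) throws IOException { this.out = out; }

  public void serialize(StringMessage m) throws IOException {
    m.writeTo(out);  // raw PB encoding: no length prefix, not self-delimiting
  }

  public void close() throws IOException { out.close(); }
}

class PBDeserializer implements Deserializer<StringMessage> {
  private InputStream in;

  public void open(InputStream in) throws IOException { this.in = in; }

  public StringMessage deserialize(StringMessage m) throws IOException {
    // parseFrom(InputStream) consumes the stream to end-of-file, so any
    // bytes Hadoop leaves after the message are treated as part of it.
    return StringMessage.parseFrom(in);
  }

  public void close() throws IOException { in.close(); }
}
{code}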

I then created a third test, _TestPBHadoopStreams_, which tries to understand the Hadoop stream that is given to _PBDeserializer_.  Though this test is somewhat silly (it always passes with an assertTrue(true) -- read the class comment for an explanation), its System.err output shows the makeup of the serialized _StringMessage_ and of the _InputStream_ given for deserialization.  I discovered that when Hadoop gives _PBDeserializer_ an _InputStream_, the stream contains arbitrary binary data after the serialized PB _Message_ instance.  This is problematic for the reasons discussed above.  I am confident that this extra data is not a result of using PBs but rather of a Hadoop implementation decision.
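
For anyone reproducing this, something like the following sketch makes the trailing bytes visible (this is illustrative, not the actual test code; hex output is easier to read than the _less_ paste below):

{code:java}
import java.io.IOException;
import java.io.InputStream;

class StreamDump {
  static void dump(InputStream in) throws IOException {
    int b;
    while ((b = in.read()) != -1) {
      System.err.printf("%02x ", b);  // print every byte as hex
    }
    System.err.println();
  }
}
{code}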

To be precise, below is a serialized _StringMessage_ with the value "testKey".  Note that the output below was copy-pasted from _less_, which is why strange ASCII characters are displayed:
{noformat}

^GtestKey
{noformat}

Below is the stream given to _PBDeserializer_ to deserialize:
{noformat}
^GtestKeyx���,I-.K�)M^E^@^S�^C�
{noformat}

Again, note the trailing bytes, starting with 'x'.

Can someone comment on this issue?  Is the trailing binary data a design decision?  Can it easily be avoided, or is there another way around it?

In the meantime, I plan to dig deeper into Hadoop's inner workings to understand why the _InputStream_ might have extra binary data.  Similarly, I plan to use Hadoop's default _Serialization_ to see if the _InputStream_ also has arbitrary trailing bytes.
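
For what it's worth, the usual workaround for non-self-delimiting messages, and the direction suggested in the PB thread linked above, is to add a length prefix so the deserializer reads exactly one message's worth of bytes and never touches what follows.  A minimal sketch, again using the illustrative _StringMessage_ type:

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class DelimitedPB {
  static void write(StringMessage m, DataOutputStream out) throws IOException {
    byte[] bytes = m.toByteArray();
    out.writeInt(bytes.length);   // 4-byte length prefix
    out.write(bytes);
  }

  static StringMessage read(DataInputStream in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);          // stop exactly at the message boundary
    return StringMessage.parseFrom(bytes);  // trailing stream data is never seen
  }
}
{code}

Whether a prefix like this belongs in the serializer or in Hadoop's stream handling is part of what I am asking for feedback on.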

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a compact binary format. This issue is to write a ProtocolBuffersSerialization to support using Protocol Buffers types in MapReduce programs, including an example program. This should probably go into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.