Posted to common-dev@hadoop.apache.org by "Alex Loddengaard (JIRA)" <ji...@apache.org> on 2008/09/10 10:12:44 UTC
[jira] Updated: (HADOOP-3788) Add serialization for Protocol Buffers
[ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Loddengaard updated HADOOP-3788:
-------------------------------------
Fix Version/s: 0.19.0
Affects Version/s: 0.19.0
Release Note:
The patch being submitted is an in-progress patch. It is being submitted with failing tests because I seek help and advice from others.
See my comment for more information.
Hadoop Flags: [Incompatible change]
Status: Patch Available (was: Open)
Per Tom's advice, I created a Protocol Buffer (PB) serialization framework and tests to show its usage. I used HADOOP-3787 as a guide while doing so.
I ran into a problem, though. My test, _TestPBSerialization_, is precisely the same as the test in HADOOP-3787, except that it uses PBs instead of Thrift. My test threw PB exceptions during deserialization. I engaged in a dialog with a Google employee via the PB Google Group in hopes of diagnosing my problem. Our thread can be found [here|http://groups.google.com/group/protobuf/browse_thread/thread/19ab6bbb364fef35]. The key point of the thread is that the _InputStream_ passed to a PB _Message_ instance during deserialization cannot have trailing binary data. For example, if a _Message_ instance is serialized to "<binary>asdf", then giving "<binary>asdf<arbitrary binary>" to a PB deserializer will fail with a PB exception. This is because serialized PB _Message_ instances are not self-delimiting, a design decision Google made to keep serialized data small and fast to process.
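Because PB messages are not self-delimiting, one common workaround is to frame each message with an explicit length prefix so the reader knows exactly where the message ends and can leave any trailing bytes untouched. Below is a minimal sketch of that idea; it is an assumption for illustration only, using raw bytes in place of a compiled PB _Message_ class, and the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class LengthPrefixSketch {

    // Frame a serialized message with a 4-byte length prefix, then append
    // junk bytes to mimic the trailing data Hadoop hands the deserializer.
    static byte[] frameWithTrailingJunk(byte[] message) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(message.length);              // length prefix
        out.write(message);                        // the serialized message itself
        out.write(new byte[] {0x78, 0x2C, 0x49}); // arbitrary trailing bytes
        out.flush();
        return bos.toByteArray();
    }

    // Read back exactly the prefixed length, leaving the trailing bytes
    // unread -- this is what would make the framing safe for PB parsing.
    static byte[] readFramed(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        int len = din.readInt();
        byte[] message = new byte[len];
        din.readFully(message);
        return message;
    }

    public static void main(String[] args) throws IOException {
        byte[] framed = frameWithTrailingJunk("testKey".getBytes("UTF-8"));
        byte[] recovered = readFramed(new ByteArrayInputStream(framed));
        System.out.println(new String(recovered, "UTF-8")); // prints "testKey"
    }
}
```

With this framing, the deserializer would hand the PB parser only the exact message bytes, never the trailing data.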
I created a second test, _TestPBSerializationIsolated_, that demonstrates the correctness of _PBSerializer_ and _PBDeserializer_, the two classes that actually do the work of serializing and deserializing. This test passed, hinting that there may be an incompatibility between Hadoop's current workings and PBs.
I then created a third test, _TestPBHadoopStreams_, which tries to understand the Hadoop stream that is given to _PBDeserializer_. Though this test is somewhat silly (it always passes with an assertTrue(true) -- read the class comment for an explanation), its System.err output shows the makeup of the serialized _StringMessage_ and the _InputStream_ given for deserialization. I discovered that when Hadoop gives _PBDeserializer_ an _InputStream_, the stream contains arbitrary binary data after the serialized PB _Message_ instance. This is problematic for the reasons discussed above. I am confident that this extra data is not a result of using PBs but rather a Hadoop implementation decision.
To be precise, below is a serialized _StringMessage_ with the value "testKey". Note that the below was copy-pasted from _less_, which is why strange ASCII characters are displayed:
{noformat}
^GtestKey
{noformat}
Below is the stream given to _PBDeserializer_ to deserialize:
{noformat}
^GtestKeyx���,I-.K�)M^E^@^S�^C�
{noformat}
Again, note the trailing bytes, starting with 'x'.
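The observation above can be reproduced with a small sketch that counts how many bytes remain in a stream after the expected message bytes have been consumed. The message and trailer bytes below are illustrative stand-ins, not the exact Hadoop output, and the class is hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TrailingBytesCheck {

    // Consume payloadLen bytes, then count whatever is left in the stream.
    // A PB parser handed this whole stream would choke on the leftovers.
    static int leftover(InputStream in, int payloadLen) throws IOException {
        for (int i = 0; i < payloadLen; i++) {
            in.read();   // consume the serialized message bytes
        }
        int extra = 0;
        while (in.read() != -1) {
            extra++;     // count the trailing bytes
        }
        return extra;
    }

    public static void main(String[] args) throws IOException {
        // "\u0007testKey" mirrors the "^GtestKey" dump (0x07 byte + the chars).
        byte[] message = "\u0007testKey".getBytes("ISO-8859-1");
        byte[] trailer = {0x78, (byte) 0x9C, 0x2C, 0x49}; // stand-in trailing bytes
        byte[] stream = new byte[message.length + trailer.length];
        System.arraycopy(message, 0, stream, 0, message.length);
        System.arraycopy(trailer, 0, stream, message.length, trailer.length);
        System.out.println(leftover(new ByteArrayInputStream(stream), message.length));
    }
}
```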
Can someone comment on this issue? Was having trailing binary data a design decision? Can it easily be avoided? Is there a way around this issue?
In the meantime, I plan to dig deeper into Hadoop's inner workings to understand why the _InputStream_ might have extra binary data. Similarly, I plan to use Hadoop's default _Serialization_ to see if the _InputStream_ also has arbitrary trailing bytes.
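For contrast, a self-delimiting format tolerates trailing bytes because each record reads back exactly the bytes it wrote. A rough sketch of that property using Java's writeUTF/readUTF, which stores a 2-byte length prefix before the string bytes (this is only an analogy, not Hadoop's actual _Writable_ code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SelfDelimitingSketch {

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF("testKey"); // stores a 2-byte length, then the string bytes
        out.write(0x78);         // a trailing byte, like the 'x' in the dump
        out.flush();

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        String value = in.readUTF(); // reads the length, then exactly that many bytes
        System.out.println(value);          // the trailing byte is never read
        System.out.println(in.available()); // one byte still waiting in the stream
    }
}
```

The read succeeds and the trailing byte simply stays in the stream, which is exactly the behavior a bare PB _Message_ parse does not give you.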
> Add serialization for Protocol Buffers
> --------------------------------------
>
> Key: HADOOP-3788
> URL: https://issues.apache.org/jira/browse/HADOOP-3788
> Project: Hadoop Core
> Issue Type: Wish
> Components: examples, mapred
> Affects Versions: 0.19.0
> Reporter: Tom White
> Assignee: Alex Loddengaard
> Fix For: 0.19.0
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a compact binary format. This issue is to write a ProtocolBuffersSerialization to support using Protocol Buffers types in MapReduce programs, including an example program. This should probably go into contrib.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.