Posted to common-dev@hadoop.apache.org by "Alex Loddengaard (JIRA)" <ji...@apache.org> on 2008/09/11 11:31:45 UTC

[jira] Issue Comment Edited: (HADOOP-3788) Add serialization for Protocol Buffers

    [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630139#action_12630139 ] 

alexlod edited comment on HADOOP-3788 at 9/11/08 2:30 AM:
-------------------------------------------------------------------

Attaching a new patch.  Changes:

 * Removed _*Tracker_ and _TestPBHadoopStreams_ because they aren't very useful now that we've established streams have trailing data
 * Did not keep a single Builder instance in _PBDeserializer_, because Builders cannot be reused once _build()_ has been called.  From the PB API: "[build()] Construct the final message. Once [build()] is called, the Builder is no longer valid, and calling any other method may throw a NullPointerException. If you need to continue working with the builder after calling build(), clone() it first."  I made the decision to just re-instantiate instead of clone, because I thought the performance differences were negligible.  Please argue with me if I'm wrong.
 * Changed SequenceFile.Reader#next(Object)
 * Changed _TestPBSerialization_ to just write a SequenceFile and then read it back.
 * Created a new test, _TestPBSerializationMapReduce_, that uses PBs in a MapReduce program
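
To illustrate the Builder point above, here is a minimal sketch of the "new builder per record" approach. It does not use the real PB classes: _SketchMessage_ and _SketchBuilder_ are hypothetical stand-ins that only mimic the behavior quoted from the PB API (the builder becomes invalid once build() is called), and _deserialize_ is an invented helper, not the code in the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FreshBuilderSketch {

    /** Hypothetical immutable message; stands in for a generated PB class. */
    public static final class SketchMessage {
        public final List<String> fields;

        SketchMessage(List<String> fields) {
            this.fields = fields;
        }
    }

    /** Hypothetical single-use builder: build() hands its state to the message. */
    public static final class SketchBuilder {
        private List<String> fields = new ArrayList<>();

        public SketchBuilder addField(String value) {
            // Throws NullPointerException if build() has already been called.
            fields.add(value);
            return this;
        }

        public SketchMessage build() {
            SketchMessage result = new SketchMessage(Collections.unmodifiableList(fields));
            fields = null; // the builder is no longer valid after this point
            return result;
        }
    }

    /** Deserializer-style helper: instantiate a new builder for every record. */
    public static SketchMessage deserialize(String[] rawFields) {
        SketchBuilder builder = new SketchBuilder();
        for (String field : rawFields) {
            builder.addField(field);
        }
        return builder.build();
    }

    /** Returns true if touching a builder after build() fails, as the quoted docs warn. */
    public static boolean reuseFails() {
        SketchBuilder builder = new SketchBuilder();
        builder.addField("a").build();
        try {
            builder.addField("b");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        SketchMessage message = deserialize(new String[] {"x", "y"});
        System.out.println("fields read: " + message.fields.size());
        System.out.println("reuse after build() fails: " + reuseFails());
    }
}
```

Whether cloning a builder would be measurably cheaper than re-instantiating is not something this sketch can answer; it only shows why a single cached instance does not work.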

_TestPBSerialization_ passes, but _TestPBSerializationMapReduce_ does not, which means you're right, Tom, that other code will need to change, though I'm not familiar enough with Hadoop to say more than that.  If we decide to move further along by changing Hadoop such that deserializers will never be given trailing data, then more guidance would be greatly appreciated :).

This patch breaks a few existing tests such as _org.apache.hadoop.fs.TestCopyFiles_ and _org.apache.hadoop.fs.TestFileSystem_.  It's unclear whether my change causes these failures or whether my lack of changes to other areas does.  Regardless, I think this shows that creating the contract of not having extra data in the _Deserializer_'s _InputStream_ would probably be a large change.

There is a discussion going on in the PB Google Group about possibly making PBs self-delimiting.  Take a look [here|http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en].  In summary, a few different people are trying to determine the best way to allow self-delimiting, though there hasn't been any talk about a schedule.
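
For context on what self-delimiting would buy us, here is a rough sketch of one common way to frame records so a reader never consumes trailing data: write a length prefix, then read exactly that many bytes. This is a generic illustration using only java.io, not what the PB group has settled on and not code from the patch; the class and method names (_LengthPrefixedFraming_, _writeFramed_, _readFramed_, _roundTrip_) are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedFraming {

    /** Writes a 4-byte length followed by the payload bytes. */
    static void writeFramed(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Reads one framed payload and leaves the stream positioned at the next record. */
    static byte[] readFramed(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload); // consumes exactly 'length' bytes, never the trailing data
        return payload;
    }

    /** Writes two records back to back into one stream, then reads them back in order. */
    public static String[] roundTrip(String first, String second) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            writeFramed(out, first.getBytes(StandardCharsets.UTF_8));
            writeFramed(out, second.getBytes(StandardCharsets.UTF_8));
            out.flush();

            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            String a = new String(readFramed(in), StandardCharsets.UTF_8);
            String b = new String(readFramed(in), StandardCharsets.UTF_8);
            return new String[] {a, b};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] records = roundTrip("key", "value");
        System.out.println(records[0] + " / " + records[1]);
    }
}
```

In a deserializer, the bytes returned by a helper like _readFramed_ would be the only input given to the message parser, so whatever follows in the stream is left untouched for the next record.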

      was (Author: alexlod):
    Attaching a new patch.  Changes:

 * Removed _*Tracker_ and _TestPBHadoopStreams_ because they weren't very useful now that we've established streams have trailing data
 * Did not keep a single Builder instance in _PBDeserializer_, because Builders need to be rebuilt once _build()_ has been called.  From the PB API: "[build()] Construct the final message. Once [build()] is called, the Builder is no longer valid, and calling any other method may throw a NullPointerException. If you need to continue working with the builder after calling build(), clone() it first."  I made the decision to just re-instantiate instead of clone, because I thought the performance differences were negligible.  Please argue with me if I'm wrong.
 * Changed SequenceFile.Reader#next(Object)
 * Changed _TestPBSerialization_ to just write a SequenceFile and then read it back.
 * Created a new test, _TestPBSerializationMapReduce_, that uses PBs in a MapReduce program

_TestPBSerialization_ passes, but _TestPBSerializationMapReduce_ does not, which means you're right, Tom, that other code will need to change, though I'm not familiar enough with Hadoop to say more than that.  If we decide to move further along by changing Hadoop such that deserializers will never be given trailing data, then more guidance would be greatly appreciated :).

This patch breaks a few existing tests such as _org.apache.hadoop.fs.TestCopyFiles_ and _org.apache.hadoop.fs.TestFileSystem_.  It's unclear whether my change causes these failures or whether my lack of changes to other areas does.  Regardless, I think this shows that creating the contract of not having extra data in the _Deserializer_'s _InputStream_ would probably be a large change.

There is a discussion going on in the PB Google Group about possibly making PBs self-delimiting.  Take a look [here|http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en].  In summary, a few different people are trying to determine the best way to allow self-delimiting, though there hasn't been any talk about a schedule.
  
> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3788-v1.patch, hadoop-3788-v2.patch, protobuf-java-2.0.1.jar
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a compact binary format. This issue is to write a ProtocolBuffersSerialization to support using Protocol Buffers types in MapReduce programs, including an example program. This should probably go into contrib. 
