Posted to common-dev@hadoop.apache.org by "Alex Loddengaard (JIRA)" <ji...@apache.org> on 2008/09/03 12:03:44 UTC

[jira] Commented: (HADOOP-3788) Add serialization for Protocol Buffers

    [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627948#action_12627948 ] 

Alex Loddengaard commented on HADOOP-3788:
------------------------------------------

After fiddling with Protocol Buffers (PBs) and reading the documentation around them, I believe that actually using PBs may not require the introduction of a new Serialization class.

PBs work in the following way:

First, the developer defines a .proto file, which is essentially a schema describing the data the user wishes to work with.  Below is an example addressbook.proto file taken from the PB documentation, located here: http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
{code:title=addressbook.proto|borderStyle=solid}
package tutorial;

option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
{code}

Once the user has defined their .proto file, they use PB's compiler, _protoc_, to generate an outer Java class containing a few nested classes.  Among the generated code are methods to serialize to an OutputStream and deserialize from an InputStream.
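
For illustration, here is a rough sketch of what using the generated code might look like.  It assumes the addressbook.proto above was compiled with something like _protoc --java_out=. addressbook.proto_; the class and file names are just taken from the tutorial example:
{code:title=AddressBookExample.java|borderStyle=solid}
import java.io.FileInputStream;
import java.io.FileOutputStream;

import com.example.tutorial.AddressBookProtos.AddressBook;
import com.example.tutorial.AddressBookProtos.Person;

public class AddressBookExample {
  public static void main(String[] args) throws Exception {
    // Build a Person with the protoc-generated builder.
    Person person = Person.newBuilder()
        .setName("Alice")
        .setId(1)
        .setEmail("alice@example.com")
        .build();

    AddressBook book = AddressBook.newBuilder()
        .addPerson(person)
        .build();

    // Serialize to an OutputStream with a generated method.
    FileOutputStream out = new FileOutputStream("addressbook.bin");
    book.writeTo(out);
    out.close();

    // Deserialize from an InputStream with a generated method.
    FileInputStream in = new FileInputStream("addressbook.bin");
    AddressBook copy = AddressBook.parseFrom(in);
    in.close();
  }
}
{code}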

Refactoring .proto files is somewhat tricky given the way PBs work (see their documentation for more info), so Google recommends that PBs be wrapped inside other classes and only used when serializing and deserializing.  This pattern fits perfectly with Hadoop's Writable interface.  That is, if users want to utilize PBs, they use _protoc_ to generate a Java class, which is essentially a Bean.  They can then define a new Writable implementation that uses the _protoc_-generated class to serialize and deserialize.  This is all possible without creating a ProtocolBuffersSerialization class, because the default Serialization, org.apache.hadoop.io.serializer.WritableSerialization, delegates its read and write methods to the Writable being serialized or deserialized.
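
To make the idea concrete, below is a minimal sketch of such a Writable wrapper.  It assumes the generated Person class from the tutorial above; the name PersonWritable and the length-prefixing scheme are my own choices, not anything prescribed by Hadoop or PBs:
{code:title=PersonWritable.java|borderStyle=solid}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

import com.example.tutorial.AddressBookProtos.Person;

/** Hypothetical Writable that delegates serialization to a protoc-generated class. */
public class PersonWritable implements Writable {
  private Person person = Person.getDefaultInstance();

  public Person get() { return person; }

  public void set(Person person) { this.person = person; }

  public void write(DataOutput out) throws IOException {
    // PB messages are not self-delimiting, so record the length first.
    byte[] bytes = person.toByteArray();
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    person = Person.parseFrom(bytes);
  }
}
{code}
The length prefix is needed because a PB message parsed straight off an InputStream would otherwise consume the whole stream; with write() and readFields() defined this way, WritableSerialization needs no changes at all.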

A general ProtocolBuffersSerialization class would not utilize PBs to their fullest, because it would have to use a very primitive, generalized .proto file (for example, a file with just one field: a large string).

With that said, a few things can be done with regard to this feature:
# I can create an example that extends Text and overrides its serialization methods to use a _protoc_-generated class
# I can begin extending Hadoop's Writable implementations to use PBs instead
# I can begin replacing Hadoop's Writable implementations to use PBs instead
# I can try to create a general ProtocolBuffersSerialization and see how it performs, though this solution seems to run against the premise of using PBs
# You can tell me that my understanding of PBs is completely wrong (please follow up with a more accurate description if this is the case :))

Before option 2 or 3 is decided on, profiling should be done to ensure that PBs are in fact faster than Java's built-in mechanism.  If profiling shows PBs are faster in all cases, then option 3 seems the most desirable.  However, perhaps more discussion is needed to determine whether 2, 3, or some other solution altogether is best.

Again, I'm totally new here, so please argue with me if I have misunderstood Hadoop's workings, PBs, or anything else.  While I'm waiting for responses, I can begin working on option 1 to prove my understanding of PBs is correct.

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a compact binary format. This issue is to write a ProtocolBuffersSerialization to support using Protocol Buffers types in MapReduce programs, including an example program. This should probably go into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.