Posted to common-issues@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2010/11/19 06:46:17 UTC

[jira] Issue Comment Edited: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map for configuration

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933686#action_12933686 ] 

Owen O'Malley edited comment on HADOOP-6685 at 11/19/10 12:45 AM:
------------------------------------------------------------------

There were a few driving goals:
* Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. Obviously this includes supporting SequenceFiles, which are where the bulk of Hadoop data is currently stored.
* Supporting context-specific serializations (input key, input value, shuffle key, shuffle value, output key, output value, etc.) so that different serialization options can be chosen depending on the application's requirements. (MAPREDUCE-1462)
* Using serialization of the "active" objects themselves (input format, mapper, reducer, output format, etc.) to simplify making compound objects. This will allow us to get rid of the static methods that define properties like the input and output directory, without pushing them into the framework. (MAPREDUCE-1183)
* Cleaning up the serialization interface to make it clear that each object has to be serialized independently. The current API gives the impression that the Serializer and Deserializer can hold state, which is incorrect, and led to a bug in the first implementation of the Java serialization plugin.

The first attempt to generalize the serialization metadata was done via string to string maps. Since MapReduce already has a configuration, which is a string to string map, they used that. However, they needed to nest the serialization maps into the configuration. So for each context there was a prefix string, and the values under that prefix were the metadata for that serialization. This worked, but was very ugly. It led to "stringly-typed" interfaces where you needed to read all of the code to figure out what the legal values for the configuration were. The code was full of static methods for each serialization in each context that updated the configuration. Further, since it was never clear what was intended to be "visible" versus "opaque", the user ended up being responsible for all of it.
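To make the "stringly-typed" pattern concrete, here is a minimal sketch (hypothetical class and key names, not the actual Hadoop API) of metadata nested into a string-to-string configuration under a per-context prefix, with one static helper per setting:

```java
import java.util.HashMap;
import java.util.Map;

public class StringlyTypedConfig {

    // The configuration is just a string-to-string map, like Hadoop's
    // Configuration class.
    static final Map<String, String> conf = new HashMap<>();

    // One static helper per serialization per context, each hard-coding its
    // prefix; the set of legal keys is only discoverable by reading the code.
    static void setShuffleKeySerializationClass(String className) {
        conf.put("mapreduce.shuffle.key.serialization.class", className);
    }

    static String getShuffleKeySerializationClass() {
        return conf.get("mapreduce.shuffle.key.serialization.class");
    }

    public static void main(String[] args) {
        setShuffleKeySerializationClass("org.apache.hadoop.io.Text");
        // Nothing prevents a mistyped or unknown key; errors only show up at
        // runtime, and every value is an untyped string.
        System.out.println(getShuffleKeySerializationClass());
    }
}
```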

Therefore, I decided to use another approach. Instead of using string to string maps, we use bytes to capture the metadata. The bytes are opaque except to the serialization itself. This allows the serialization to define what data is important to it and handle it in a type- and version-safe manner. It is also symmetric to the solution of MR-1183, where you use component-specific metadata to save their parameters. That is the framework that has been laid out in this patch. It includes the work on the container files to show that it can be used to write and read the different serializations. It includes the serializations to show that they work correctly when used together. By making the framework use typed metadata instead of the very generic, but type-less, string to string map, many user errors will be avoided.
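The opaque-bytes idea can be sketched like this (hypothetical method names, not the classes in the patch): only the serialization knows the layout of its metadata, so it can write a version number and typed fields, while the framework just carries the bytes around:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class OpaqueMetadata {

    // The serialization serializes its own typed configuration...
    static byte[] writeMetadata(int version, String className) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(version);    // version-safe: readers dispatch on this
        out.writeUTF(className);  // type-safe: a real field, not a loose map entry
        out.close();
        return buf.toByteArray();
    }

    // ...and is the only component that ever deserializes it again.
    static String readClassName(byte[] metadata) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(metadata));
        int version = in.readInt();
        if (version != 1) {
            throw new IOException("unknown metadata version " + version);
        }
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        byte[] opaque = writeMetadata(1, "org.apache.hadoop.io.Text");
        // The framework never inspects the bytes; it just stores them.
        System.out.println(readClassName(opaque));
    }
}
```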

Part of the lesson learned from the train wreck of MR-1126 was that implementing sweeping changes to the API and framework by writing and committing little patches spread out over 6 months is not a healthy way of working. The reviewer needs to understand why they care and how the parts are going to work together. I should have done this jira in a public development branch, but that wouldn't have lessened this debate. Doug and I just disagree about the design of the interface. The indication that he gave when I gave the presentation on my plan 5 months ago was that he didn't like it, but wouldn't block it. He reiterated that position on this jira 6 days ago. Have you changed your mind, Doug?

To Doug's specific points:

{quote}
inheritance is used in serialization implementations, and inheritance is harder to implement with binary objects
{quote}

Actually, handling extensions is quite easy using protocol buffers, which is part of why I chose to use them for storing the metadata. Inheritance in string to string maps is quite tricky and must be managed completely by the plugin writer.
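The reason tag-based binary encodings handle extension well can be shown with a toy sketch (a hypothetical wire format, deliberately much simpler than real protocol buffers): a reader skips any field tag it does not recognize, so a writer that adds extension fields stays compatible with an older reader:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class TagSkipping {

    // Toy wire format (not real protobuf): [tag byte][length byte][payload].
    static byte[] field(int tag, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tag);
        out.write(payload.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) out.write(p, 0, p.length);
        return out.toByteArray();
    }

    // An "old" reader that only knows tag 1: unknown tags are skipped by
    // their length, so messages from an extended writer still parse.
    static String readTag1(byte[] message) {
        int i = 0;
        while (i < message.length) {
            int tag = message[i++];
            int len = message[i++];
            if (tag == 1) {
                return new String(message, i, len, StandardCharsets.UTF_8);
            }
            i += len; // skip the unknown field
        }
        return null;
    }

    public static void main(String[] args) {
        // A newer writer adds an extension field with an unknown tag 7.
        byte[] msg = concat(field(7, "extension"), field(1, "base"));
        System.out.println(readTag1(msg)); // prints "base"
    }
}
```

Doing the same with string to string maps would require every plugin to invent its own namespacing and ignore-unknown-keys convention by hand.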

{quote}
binary encodings are less transparent and create binary serialization bootstrap problems
{quote}

I will grant you they are less transparent and require a tool to dump their contents. Bootstrapping wasn't a problem at all. (Granted, it would have been a problem with Avro, as I discussed here: http://bit.ly/cJ1tVp )

{quote}
serialization metadata is not large nor read/written in inner loops, so binary is not required
{quote}

It isn't required, but it isn't a problem either.

{quote}
using a binary encoding for serialization metadata will require substantial changes to serialization clients.
{quote}

The change to the clients is the same size, regardless of whether the metadata is encoded in binary or string to string maps. It is extra information that needs to be available. The data is smaller and type-safe if it is done in binary compared to string to string maps.


> Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the serialization specific configuration. Since this data is really internal to the specific serialization, I think we should change it to be an opaque binary blob. This will simplify the interface for defining specific serializations for different contexts (MAPREDUCE-1462). It will also move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.