Posted to common-issues@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2010/11/12 19:50:18 UTC

[jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931486#action_12931486 ] 

Doug Cutting commented on HADOOP-6685:
--------------------------------------

> It includes support for Avro, Thrift, ProtocolBuffers, Writables, Java serialization, and an adaptor for the old style serializations.
> All of the types can be put into SequenceFiles, MapFiles, BloomFilterMapFiles, SetFile, and ArrayFile.

Could you please explain the motivation for extending these file
formats to support all of these serialization systems?  The patch
changes the APIs for these classes, deprecating methods and adding new
methods to support new serializations.  We know from experience that
changing APIs has a cost, so we ought to justify that cost.

To my thinking, a priority for the project is to support file formats
that can be processed by other programming languages.  Avro, Thrift
and ProtocolBuffers are implemented in other languages, but
SequenceFile, MapFile, BloomFilterMapFile, SetFile, ArrayFile and
TFile are not. Unless we intend to implement these formats in a
variety of other programming languages, I don't see a big advantage of
supporting so many different serialization systems from Java only.  It
doesn't greatly increase the expressive power available to Java
developers, and the added variety introduces more potential support
issues.

It would be useful if the shuffle could process things besides
Writable (MAPREDUCE-1126) and it would be useful to have InputFormats
and OutputFormats for language-independent file formats like Avro's
(MAPREDUCE-815).  Much of this patch seems like it could help
implement these, but parts of it (e.g., the metadata serialization,
enhancements to SequenceFile, etc.) don't seem relevant to these
goals.  I don't see supporting multiple Java serialization APIs as a
goal in and of itself.


> Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: serial.patch
>
>
> Currently, the generic serialization framework uses Map<String,String> for the serialization-specific configuration. Since this data is really internal to the specific serialization, I think we should change it to be an opaque binary blob. This will simplify the interface for defining specific serializations for different contexts (MAPREDUCE-1462). It will also move us toward having serialized objects for Mappers, Reducers, etc. (MAPREDUCE-1183).
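
To make the proposed change concrete, here is a minimal Java sketch contrasting the two configuration styles the issue description mentions. Everything in it is hypothetical: MapConfigured, BlobConfigured, ExampleSerialization, the "example.*" keys, and the byte layout are illustrative names and choices, not the API in serial.patch or in Hadoop. It only shows the general idea that a string map exposes the configuration's structure to the framework, while an opaque blob leaves the encoding entirely to the serialization.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class SerializationConfigSketch {

    // Hypothetical map-based style: configuration is exposed as string pairs.
    interface MapConfigured {
        Map<String, String> toMetadata();
        void fromMetadata(Map<String, String> metadata);
    }

    // Hypothetical blob-based style: configuration is bytes that only
    // the serialization itself knows how to interpret.
    interface BlobConfigured {
        byte[] toBlob();
        void fromBlob(byte[] blob);
    }

    // Illustrative serialization whose configuration is a type name and a version.
    static final class ExampleSerialization implements MapConfigured, BlobConfigured {
        String typeName = "";
        int version = 0;

        @Override
        public Map<String, String> toMetadata() {
            Map<String, String> metadata = new HashMap<>();
            metadata.put("example.type", typeName);
            metadata.put("example.version", Integer.toString(version));
            return metadata;
        }

        @Override
        public void fromMetadata(Map<String, String> metadata) {
            typeName = metadata.get("example.type");
            version = Integer.parseInt(metadata.get("example.version"));
        }

        @Override
        public byte[] toBlob() {
            // Assumed layout: version (int), name length (int), UTF-8 name bytes.
            byte[] name = typeName.getBytes(StandardCharsets.UTF_8);
            ByteBuffer buf = ByteBuffer.allocate(4 + 4 + name.length);
            buf.putInt(version).putInt(name.length).put(name);
            return buf.array();
        }

        @Override
        public void fromBlob(byte[] blob) {
            ByteBuffer buf = ByteBuffer.wrap(blob);
            version = buf.getInt();
            byte[] name = new byte[buf.getInt()];
            buf.get(name);
            typeName = new String(name, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        ExampleSerialization original = new ExampleSerialization();
        original.typeName = "com.example.Record";
        original.version = 3;

        // The framework would only store and hand back the bytes.
        ExampleSerialization restored = new ExampleSerialization();
        restored.fromBlob(original.toBlob());
        System.out.println(restored.typeName + " v" + restored.version);
    }
}
```

In the blob style the framework never needs to agree on key names or value formats; the trade-off is that the configuration is no longer readable by anything other than the serialization that wrote it.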
