Posted to common-issues@hadoop.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2009/07/22 13:53:14 UTC

[jira] Updated: (HADOOP-6165) Add metadata to Serializations

     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Attachment: HADOOP-6165.patch

Here's a patch with some ideas about how to go about this. It is very preliminary.



1. One of the problems is that Serialization/Serializer/Deserializer are all interfaces, which makes it difficult to evolve them. One way to manage this is to introduce Base{Serialization,Serializer,Deserializer} abstract classes that implement the corresponding interfaces. SerializationFactory will read the io.serializations configuration property; if a serialization extends BaseSerialization it will be used directly, while a (legacy) Serialization will be wrapped in a BaseSerialization. The trick here is to put legacy Serializations at the end of the list, since they have less metadata and are therefore less discriminating.

The Serialization/Serializer/Deserializer interfaces are all deprecated and can be removed in a future release, leaving only Base{Serialization,Serializer,Deserializer}.
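
Here's a rough sketch of what the base classes might look like (the names and signatures are only illustrative; the attached patch may differ):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

// Illustrative sketch only -- not necessarily what the attached patch contains.
public abstract class BaseSerialization<T> {

  // Decide from the metadata whether this serialization applies. A legacy
  // Serialization wrapped by SerializationFactory would fall back to a
  // class check here.
  public abstract boolean accept(Map<String, String> metadata);

  public abstract BaseSerializer<T> getSerializer(Map<String, String> metadata);

  public abstract BaseDeserializer<T> getDeserializer(Map<String, String> metadata);
}

abstract class BaseSerializer<T> {
  public abstract void open(OutputStream out) throws IOException;
  public abstract void serialize(T t) throws IOException;
  public abstract void close() throws IOException;

  // Metadata that a container (e.g. SequenceFile.Writer) can record so it
  // can be handed back to the deserializer later (see point 4 below).
  public abstract Map<String, String> getMetadata();
}

abstract class BaseDeserializer<T> {
  public abstract void open(InputStream in) throws IOException;
  public abstract T deserialize(T t) throws IOException;
  public abstract void close() throws IOException;
}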

2. In addition to the Map<String, String> metadata, do we need to keep the class metadata? That is, do we need 

public abstract boolean accept(Class<?> c, Map<String, String> metadata);

or is the following sufficient?

public abstract boolean accept(Map<String, String> metadata); 

We could have a "class" entry in the map which stores this information, but we'd have to convert it to a Class object to do the isAssignableFrom check that some serializations need to do, e.g. Writable.class.isAssignableFrom(c). Perhaps this is OK.
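
For instance, with only the second form, WritableSerialization's accept could resolve the class itself (illustrative fragment; the "class" key name is an assumption):

// Illustrative fragment, assuming a "class" entry in the metadata map.
public boolean accept(Map<String, String> metadata) {
  String className = metadata.get("class");
  if (className == null) {
    return false;
  }
  try {
    Class<?> c = Class.forName(className); // or conf.getClassByName(className)
    return Writable.class.isAssignableFrom(c);
  } catch (ClassNotFoundException e) {
    return false;
  }
}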

3. Should we have a Metadata class to permit evolution beyond Map<String, String>? (E.g. to keep a Class property.)
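
If we did, a minimal version could just wrap the map and leave room for typed properties later (purely illustrative):

import java.util.HashMap;
import java.util.Map;

// Purely illustrative: behaves like Map<String, String> today, but can grow
// typed properties (such as a Class) without breaking callers.
public class Metadata {
  private final Map<String, String> props = new HashMap<String, String>();
  private Class<?> type;

  public String get(String key) { return props.get(key); }
  public void set(String key, String value) { props.put(key, value); }

  public Class<?> getType() { return type; }
  public void setType(Class<?> type) { this.type = type; }
}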

4. Where does the metadata come from? In the context of MapReduce, the answer depends on the stage of the job. (None of these changes have been implemented in the patch.)

i. Map input

The metadata comes from the container. For example, in SequenceFiles the metadata comes from the key and value class types, and from the SequenceFile metadata (a Map<Text, Text>, which is ideally suited to this scheme).

To support this, SequenceFile.Reader would pass its metadata to the deserializer. Similarly, SequenceFile.Writer would take the metadata supplied by the BaseSerializer and record it in the SequenceFile's metadata.
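
Something along these lines inside SequenceFile.Reader, say, when creating the key deserializer (illustrative only, not in the patch; the "class" key and the factory method are assumptions):

// Illustrative only: build a metadata map from what the SequenceFile already
// knows and hand it to the serialization framework.
Map<String, String> keyMetadata = new HashMap<String, String>();
keyMetadata.put("class", getKeyClassName());
for (Map.Entry<Text, Text> e : getMetadata().getMetadata().entrySet()) {
  keyMetadata.put(e.getKey().toString(), e.getValue().toString());
}
// Hypothetical factory method taking metadata instead of a Class:
// BaseDeserializer<Object> keyDeserializer =
//     serializationFactory.getDeserializer(keyMetadata);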

ii. Map output/Reduce input

The metadata would have to be supplied by the MapReduce framework. Just like we have mapred.mapoutput.{key,value}.class, we could have properties to specify extra metadata. Metadata is a map, so we could use properties of the form mapred.mapoutput.{key,value}.metadata.K, where K can be an arbitrary string.

For example, one might define mapred.mapoutput.key.metadata.avroSchema to be the Avro schema for map output key types. To get this to work we would need support from Configuration to get a Map from a property prefix. So conf.getMap("mapred.mapoutput.key.metadata") would return a Map<String, String> of all the properties under the mapred.mapoutput.key.metadata prefix.
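
A sketch of that helper (getMap is not an existing Configuration method; this is the proposed addition, written here as a static utility):

// Sketch of the proposed helper: collect all properties under a prefix into
// a map, with the prefix stripped from the keys.
public static Map<String, String> getMap(Configuration conf, String prefix) {
  String p = prefix.endsWith(".") ? prefix : prefix + ".";
  Map<String, String> result = new HashMap<String, String>();
  for (Map.Entry<String, String> entry : conf) { // Configuration is Iterable
    if (entry.getKey().startsWith(p)) {
      result.put(entry.getKey().substring(p.length()), entry.getValue());
    }
  }
  return result;
}

// e.g. getMap(conf, "mapred.mapoutput.key.metadata")
//   -> {"avroSchema" -> "...", ...}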

iii. Reduce output

The metadata would have to be supplied by the MapReduce framework. Just like the map output, we could have mapred.output.{key,value}.metadata.K properties.
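
A job would then set these up front, e.g. (property and variable names illustrative):

// Illustrative job setup; keySchema is assumed to be an org.apache.avro.Schema.
conf.set("mapred.output.key.metadata.serialization",
    "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization");
conf.set("mapred.output.key.metadata.avroSchema", keySchema.toString());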

5. Resolution process

To take an Avro example: AvroReflectSerialization's accept method would look for a "serialization" key with the value org.apache.hadoop.io.serializer.avro.AvroReflectSerialization. The nice thing about this is that we don't need a list of packages, or even a base type (AvroReflectSerializeable). This would only work if we had the mechanisms described in point 4, so that the metadata was passed around correctly.

Writables already have a Serialization, and its implementation has to be different because there is plenty of existing data with no extra metadata (in SequenceFiles, for instance). Its accept method would check whether the "serialization" key is set and, if it is, that its value is "org.apache.hadoop.io.serializer.WritableSerialization". If the key is not set, it would fall back to the existing check: Writable.class.isAssignableFrom(c).
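
Put together, the two accept methods might look like this (illustrative only; getClassFromMetadata is a hypothetical helper for resolving the "class" entry):

// In AvroReflectSerialization: rely purely on the "serialization" key.
public boolean accept(Map<String, String> metadata) {
  return "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization"
      .equals(metadata.get("serialization"));
}

// In WritableSerialization: prefer the "serialization" key, but fall back to
// the class check so existing data with no extra metadata still works.
public boolean accept(Map<String, String> metadata) {
  String serialization = metadata.get("serialization");
  if (serialization != null) {
    return "org.apache.hadoop.io.serializer.WritableSerialization"
        .equals(serialization);
  }
  Class<?> c = getClassFromMetadata(metadata); // hypothetical helper
  return c != null && Writable.class.isAssignableFrom(c);
}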


> Add metadata to Serializations
> ------------------------------
>
>                 Key: HADOOP-6165
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: contrib/serialization
>            Reporter: Tom White
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata. This assumes there is a one-to-one mapping between types and Serializations, which is overly restrictive. By permitting applications to pass arbitrary metadata to Serializations, they can get more control over which Serialization is used, and would also allow, for example, one to pass an Avro schema to an Avro Serialization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.