Posted to common-issues@hadoop.apache.org by "Tom White (JIRA)" <ji...@apache.org> on 2009/07/22 13:25:14 UTC

[jira] Created: (HADOOP-6165) Add metadata to Serializations

Add metadata to Serializations
------------------------------

                 Key: HADOOP-6165
                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
             Project: Hadoop Common
          Issue Type: New Feature
          Components: contrib/serialization
            Reporter: Tom White
            Priority: Blocker
             Fix For: 0.21.0


The Serialization framework only allows a class to be passed as metadata. This assumes there is a one-to-one mapping between types and Serializations, which is overly restrictive. Permitting applications to pass arbitrary metadata to Serializations would give them more control over which Serialization is used, and would also allow one, for example, to pass an Avro schema to an Avro Serialization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I've just committed this.

> Add metadata to Serializations
> ------------------------------
>
>                 Key: HADOOP-6165
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: contrib/serialization
>            Reporter: Tom White
>            Assignee: Tom White
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-6165-v2.patch, HADOOP-6165-v3.patch, HADOOP-6165-v4.patch, HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata. This assumes there is a one-to-one mapping between types and Serializations, which is overly restrictive. Permitting applications to pass arbitrary metadata to Serializations would give them more control over which Serialization is used, and would also allow one, for example, to pass an Avro schema to an Avro Serialization.



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742961#action_12742961 ] 

Tom White commented on HADOOP-6165:
-----------------------------------

bq. That way Avro data can be read whether or not a specific or reflect class is loaded.

So there'd just be a single AvroSerialization?



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750774#action_12750774 ] 

Hudson commented on HADOOP-6165:
--------------------------------

Integrated in Hadoop-Common-trunk-Commit #12 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk-Commit/12/])
    . Add metadata to Serializations.




[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Attachment: HADOOP-6165-v4.patch

bq. Since AvroGenericSerialization is added last, the fallback could simply be to change its accept method to accept anything that has AVRO_SCHEMA_KEY defined, no?

Done.

bq. One other thing: we should probably adopt a naming convention for metadata keys. Should they be Java-package-like strings, e.g., org.apache.hadoop.io.serialization.class, or HTTP/SMTP header-like things, e.g., Serialization-Class?

In this case I don't have a strong feeling either way, so I've changed the keys to be named as "header-like things".



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Assignee: Tom White
      Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734242#action_12734242 ] 

Doug Cutting commented on HADOOP-6165:
--------------------------------------

>  is the following sufficient?
> 
> public abstract boolean accept(Map<String, String> metadata); 

I think it is, since this is called once per container, not per object.  In some cases there may not be a more distinguished class than Object and/or the class may not be known.

> Should we have a Metadata class to permit evolution beyond Map<String, String>?

No, metadata should be trivially serializable.

> we could have properties to specify extra metadata. Metadata is a map, so something like mapred.mapoutput.{key,value}.metadata.K

Alternately we can have a single key with a complex value, e.g.:
  mapred.mapoutput.metadata.key="a=b,b=c"
We'd have to process escapes if we want values to be able to contain comma and equals.

Or we could extend Configuration to fully support nested maps, e.g., a nested configuration in a value's XML would create a Map value.

Or we could pass these through outside of the Configuration, e.g., in IFile.
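
As a rough illustration of the single-key option above, here is a minimal Java sketch that parses a value such as "a=b,b=c" into a map. The class name MetadataCodec and the choice of backslash as the escape character are assumptions made for this example; nothing like this is confirmed to exist in Configuration or in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for this sketch; not part of Hadoop's Configuration API. */
final class MetadataCodec {
  private MetadataCodec() {}

  /**
   * Parses "k1=v1,k2=v2" into a map. A backslash (an assumed escape
   * character) makes the following ',', '=' or '\' literal.
   */
  public static Map<String, String> parse(String encoded) {
    Map<String, String> result = new LinkedHashMap<>();
    StringBuilder key = new StringBuilder();
    StringBuilder value = new StringBuilder();
    StringBuilder current = key;
    for (int i = 0; i < encoded.length(); i++) {
      char c = encoded.charAt(i);
      if (c == '\\' && i + 1 < encoded.length()) {
        // Escaped character: take the next character literally.
        current.append(encoded.charAt(++i));
      } else if (c == '=' && current == key) {
        // First unescaped '=' switches from key to value.
        current = value;
      } else if (c == ',') {
        // Unescaped ',' ends the current entry; entries with empty keys are skipped.
        if (key.length() > 0) {
          result.put(key.toString(), value.toString());
        }
        key.setLength(0);
        value.setLength(0);
        current = key;
      } else {
        current.append(c);
      }
    }
    if (key.length() > 0) {
      result.put(key.toString(), value.toString());
    }
    return result;
  }
}
```

With this sketch, a value containing a comma or equals sign would be written with escapes, e.g. k=a\,b\=c for the value "a,b=c". The extra escaping step is the cost of this option compared with one property per metadata key.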





[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742878#action_12742878 ] 

Hadoop QA commented on HADOOP-6165:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12416454/HADOOP-6165-v2.patch
  against trunk revision 803296.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 169 javac compiler warnings (more than the trunk's current 116 warnings).

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/602/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/602/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/602/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/602/console

This message is automatically generated.



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Attachment: HADOOP-6165-v3.patch

Here's a new patch which adopts the names suggested by Doug. I've also added AvroGenericSerialization which looks for a schema in the metadata, and a test for it. I haven't added the fallback capability discussed, but it shouldn't be too hard to add.

I've also fixed the failing tests and reduced the number of deprecation warnings. I can't get rid of all of them until the deprecated interfaces are removed.



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743366#action_12743366 ] 

Hadoop QA commented on HADOOP-6165:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12416560/HADOOP-6165-v3.patch
  against trunk revision 804317.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

    -1 javac.  The applied patch generated 145 javac compiler warnings (more than the trunk's current 116 warnings).

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 112 release audit warnings (more than the trunk's current 111 warnings).

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/604/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/604/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/604/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/604/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/604/console

This message is automatically generated.



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743424#action_12743424 ] 

Doug Cutting commented on HADOOP-6165:
--------------------------------------

> I haven't added the fallback capability discussed, but it shouldn't be too hard to add. 

Since AvroGenericSerialization is added last, the fallback could simply be to change its accept method to accept anything that has AVRO_SCHEMA_KEY defined, no?

One other thing: we should probably adopt a naming convention for metadata keys.  Should they be Java-package-like strings, e.g., org.apache.hadoop.io.serialization.class, or HTTP/SMTP header-like things, e.g., Serialization-Class?



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Attachment: HADOOP-6165-v2.patch

This patch implements 1 and 4.i (for sequence files) above, but not 4.ii, 4.iii, or 5. I've changed the accept() method to just take the metadata map, and not the Class, following Doug's suggestion.

For the 0.21 release it is only necessary to make the API changes to the serialization framework. This means deprecating the Serialization/Serializer/Deserializer interfaces and introducing Base{Serialization,Serializer,Deserializer} (1).

Changes to take full advantage of the new framework in MapReduce (4.ii, 4.iii) can be introduced progressively in later JIRAs, since they will be additions and shouldn't affect backward compatibility.



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742947#action_12742947 ] 

Doug Cutting commented on HADOOP-6165:
--------------------------------------

This looks very nice!  A few nits:
 - BaseSerialization and BaseDeserialization might be instead called SerializationBase and DeserializationBase.
 - BaseSerializationWrapper might instead be called LegacySerialization.  Similarly for BaseDeserializationWrapper.
 - Should this patch update AvroSerialization too?  In this case we could use something like SpecificRecord.class.isAssignableFrom(Class.forName(meta.get("class"))) and, if that fails, use GenericDatumReader.  That way Avro data can be read whether or not a specific or reflect class is loaded.
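
The last suggestion might look roughly like the following Java sketch. To stay self-contained it does not depend on Avro: SpecificRecordLike is a stand-in for Avro's SpecificRecord marker interface, and ExampleRecord, ReaderChooser and ReaderKind are hypothetical names introduced only for illustration. The sketch decides which kind of reader to use and falls back to the generic one when the class named in the metadata is missing or cannot be loaded.

```java
import java.util.Map;

/** Stand-in for Avro's SpecificRecord marker interface (assumption for this sketch). */
interface SpecificRecordLike {}

/** Hypothetical generated record class. */
final class ExampleRecord implements SpecificRecordLike {}

final class ReaderChooser {
  private ReaderChooser() {}

  /** Which reader the serialization would construct. */
  public enum ReaderKind { SPECIFIC, GENERIC }

  /** Picks a specific reader only if the "class" entry names a loadable specific record. */
  public static ReaderKind choose(Map<String, String> meta) {
    String className = meta.get("class");
    if (className != null) {
      try {
        Class<?> c = Class.forName(className);
        if (SpecificRecordLike.class.isAssignableFrom(c)) {
          return ReaderKind.SPECIFIC;
        }
      } catch (ClassNotFoundException e) {
        // The class is not on the classpath: fall through to the generic reader.
      }
    }
    // No class entry, unknown class, or not a specific record.
    return ReaderKind.GENERIC;
  }
}
```

In a real serialization the GENERIC branch would presumably build a GenericDatumReader from a schema carried in the metadata, which this sketch leaves out.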



[jira] Updated: (HADOOP-6165) Add metadata to Serializations

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-6165:
------------------------------

    Attachment: HADOOP-6165.patch

Here's a patch with some ideas about how to go about this. It is very preliminary.



1. One of the problems is that Serialization/Serializer/Deserializer are all interfaces, which makes it difficult to evolve them. One way to manage this is to introduce Base{Serialization,Serializer,Deserializer} abstract classes that implement the corresponding interface. SerializationFactory will read the io.serializations configuration property and if a serialization implements BaseSerialization it will use that directly, while if it is a (legacy) Serialization it will wrap it in a BaseSerialization. The trick here is to put legacy Serializations at the end, since they have less metadata and are therefore less discriminating.

The Serialization/Serializer/Deserializer interfaces are all deprecated and can be removed in a future release, leaving only Base{Serialization,Serializer,Deserializer}.
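
To make the ordering idea in 1 concrete, here is a small Java sketch. It is not the code in the patch: ClassOnlySerialization stands in for the legacy Serialization interface, MetadataSerialization for the proposed BaseSerialization, and WrappedLegacy and FactorySketch are hypothetical names. The real types are generic and have more methods than accept().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Stand-in for the legacy interface, which can only select on a class. */
interface ClassOnlySerialization {
  boolean accept(Class<?> c);
}

/** Stand-in for the proposed abstract base class, which selects on metadata. */
abstract class MetadataSerialization {
  public abstract boolean accept(Map<String, String> metadata);
}

/** Adapts a legacy serialization by resolving the "class" metadata entry. */
final class WrappedLegacy extends MetadataSerialization {
  private final ClassOnlySerialization legacy;

  WrappedLegacy(ClassOnlySerialization legacy) {
    this.legacy = legacy;
  }

  @Override
  public boolean accept(Map<String, String> metadata) {
    String name = metadata.get("class");
    if (name == null) {
      return false;
    }
    try {
      return legacy.accept(Class.forName(name));
    } catch (ClassNotFoundException e) {
      return false;
    }
  }
}

final class FactorySketch {
  private FactorySketch() {}

  /** Keeps metadata-aware serializations in order and moves wrapped legacy ones to the end. */
  public static List<MetadataSerialization> order(List<Object> configured) {
    List<MetadataSerialization> aware = new ArrayList<>();
    List<MetadataSerialization> wrapped = new ArrayList<>();
    for (Object s : configured) {
      if (s instanceof MetadataSerialization) {
        aware.add((MetadataSerialization) s);
      } else if (s instanceof ClassOnlySerialization) {
        wrapped.add(new WrappedLegacy((ClassOnlySerialization) s));
      }
    }
    aware.addAll(wrapped);
    return aware;
  }
}
```

A lookup would then walk the ordered list and take the first serialization whose accept() returns true, so the more discriminating metadata-aware implementations get the first chance to claim the data.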

2. In addition to the Map<String, String> metadata do we need to keep the class metadata? That is, do we need 

public abstract boolean accept(Class<?> c, Map<String, String> metadata);

or is the following sufficient?

public abstract boolean accept(Map<String, String> metadata); 

We could have a "class" entry in the map which stores this information, but we'd have to convert it to a Class object to do the isAssignableFrom check that some serializations need to do, e.g. Writable.class.isAssignableFrom(c). Perhaps this is OK.

3. Should we have a Metadata class to permit evolution beyond Map<String, String>? (E.g. to keep a Class property.)
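To make points 2 and 3 more concrete, here is a minimal sketch of what such a Metadata class might look like if it kept the string map but resolved a "class" entry on demand. The class name, the "class" key, and the method names are all assumptions made for illustration, and Number stands in for Writable so that the example needs nothing outside the JDK.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class MetadataSketch {
  static final String CLASS_KEY = "class"; // assumed key name

  private final Map<String, String> entries;

  MetadataSketch(Map<String, String> entries) {
    // Defensive copy so later changes by the caller do not leak in.
    this.entries = Collections.unmodifiableMap(new HashMap<>(entries));
  }

  String get(String key) {
    return entries.get(key);
  }

  /** Resolves the "class" entry, or returns null if it is absent or cannot be loaded. */
  Class<?> getTypeOrNull() {
    String name = entries.get(CLASS_KEY);
    if (name == null) {
      return null;
    }
    try {
      return Class.forName(name);
    } catch (ClassNotFoundException e) {
      return null;
    }
  }

  /** The kind of check a Writable-style serialization would make in accept(). */
  boolean isAssignableTo(Class<?> base) {
    Class<?> c = getTypeOrNull();
    return c != null && base.isAssignableFrom(c);
  }

  public static void main(String[] args) {
    Map<String, String> m = new HashMap<>();
    m.put(CLASS_KEY, "java.lang.Integer");
    MetadataSketch metadata = new MetadataSketch(m);
    System.out.println(metadata.isAssignableTo(Number.class));
    System.out.println(metadata.isAssignableTo(CharSequence.class));
  }
}
```

One consequence of storing only the class name is that the check depends on the class being loadable where accept runs, so a missing class has to be treated as "not accepted" rather than as an error.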

4. Where does the metadata come from? In the context of MapReduce, the answer depends on the stage of MapReduce. (None of these changes have been implemented in the patch.)

i. Map input

The metadata comes from the container. For example, in SequenceFiles the metadata comes from the key-value class types, and the SequenceFile metadata (a Map<Text, Text>, which is ideally suited for this scheme).

To support this, SequenceFile.Reader would pass its metadata to the deserializer. Similarly, SequenceFile.Writer would add metadata from the BaseSerializer to the SequenceFile's metadata.

ii. Map output/Reduce input

The metadata would have to be supplied by the MapReduce framework. Just like we have mapred.mapoutput.{key,value}.class, we could have properties to specify extra metadata. Metadata is a map, so something like mapred.mapoutput.{key,value}.metadata.K where K can be an arbitrary string.

For example, one might define mapred.mapoutput.key.metadata.avroSchema to be the Avro schema for map output key types. To get this to work we would need support from Configuration to get a Map from a property prefix. So conf.getMap("mapred.mapoutput.key.metadata") would return a Map<String, String> of all the properties under the mapred.mapoutput.key.metadata prefix.

iii. Reduce output

The metadata would have to be supplied by the MapReduce framework. Just like the map output we could have mapred.output.{key,value}.metadata.K properties.
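The prefix lookup suggested under "Map output/Reduce input" could behave roughly as sketched below. A plain Map stands in for Configuration so that the example is self-contained, and getMap here is a hypothetical helper, not an existing Configuration method. Stripping the prefix from the returned keys is also an assumption; the proposal only says that the properties under the prefix are returned.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class PrefixMapSketch {
  /**
   * Collects every property under the given prefix. Keys in the result have
   * the prefix and the following dot removed (an assumed convention).
   */
  static Map<String, String> getMap(Map<String, String> conf, String prefix) {
    String dotted = prefix + ".";
    Map<String, String> result = new TreeMap<>();
    for (Map.Entry<String, String> e : conf.entrySet()) {
      if (e.getKey().startsWith(dotted)) {
        result.put(e.getKey().substring(dotted.length()), e.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("mapred.mapoutput.key.class", "org.example.MyKey"); // not under the prefix
    conf.put("mapred.mapoutput.key.metadata.avroSchema", "\"string\"");
    conf.put("mapred.mapoutput.key.metadata.serialization",
        "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization");

    // Only the two metadata.* properties are returned, keyed by K.
    System.out.println(getMap(conf, "mapred.mapoutput.key.metadata"));
  }
}
```

Matching on the prefix plus a trailing dot keeps a property such as mapred.mapoutput.key.metadataFoo from being picked up by accident.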

5. Resolution process

To take an Avro example: AvroReflectSerialization's accept method would look for a "serialization" key of org.apache.hadoop.io.serializer.avro.AvroReflectSerialization. The nice thing about this is that we don't need a list of packages, or even a base type (AvroReflectSerializable). This would only work if we had the mechanisms in point 4 so that the metadata was passed around correctly.

Writables are an existing Serialization, so the implementation is different, since there is plenty of existing data with no extra metadata (in SequenceFiles for instance). So its accept method would check to see if the "serialization" key is set, and if it is, that it is "org.apache.hadoop.io.serializer.WritableSerialization". If not set, it would fall back to the existing check: Writable.class.isAssignableFrom(c).
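The two accept rules described in point 5 can be sketched as follows. This is only an illustration of the decision logic: the nested Writable interface and IntBox class are stand-ins so that the example compiles without Hadoop on the classpath, and the method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ResolutionSketch {
  static final String SERIALIZATION_KEY = "serialization";
  static final String AVRO_REFLECT =
      "org.apache.hadoop.io.serializer.avro.AvroReflectSerialization";
  static final String WRITABLE =
      "org.apache.hadoop.io.serializer.WritableSerialization";

  /** Stand-in for org.apache.hadoop.io.Writable. */
  interface Writable {}

  /** Hypothetical Writable type used only by this example. */
  static final class IntBox implements Writable {}

  /** New serialization: matches purely on the "serialization" key. */
  static boolean avroReflectAccepts(Map<String, String> metadata) {
    return AVRO_REFLECT.equals(metadata.get(SERIALIZATION_KEY));
  }

  /**
   * Existing serialization: honours the key when it is set, and otherwise
   * falls back to the class check so that old data without metadata still works.
   */
  static boolean writableAccepts(Class<?> c, Map<String, String> metadata) {
    String name = metadata.get(SERIALIZATION_KEY);
    if (name != null) {
      return WRITABLE.equals(name);
    }
    return c != null && Writable.class.isAssignableFrom(c);
  }

  public static void main(String[] args) {
    Map<String, String> none = new HashMap<>();
    Map<String, String> avro = new HashMap<>();
    avro.put(SERIALIZATION_KEY, AVRO_REFLECT);

    System.out.println(writableAccepts(IntBox.class, none)); // fallback path
    System.out.println(writableAccepts(IntBox.class, avro)); // key names another serialization
    System.out.println(avroReflectAccepts(avro));
  }
}
```

Note that an explicit "serialization" key takes precedence over the class check, so a Writable class whose metadata names another serialization is not claimed by the Writable rule.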


> Add metadata to Serializations
> ------------------------------
>
>                 Key: HADOOP-6165
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: contrib/serialization
>            Reporter: Tom White
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata. This assumes there is a one-to-one mapping between types and Serializations, which is overly restrictive. By permitting applications to pass arbitrary metadata to Serializations, they can get more control over which Serialization is used, and would also allow, for example, one to pass an Avro schema to an Avro Serialization.



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742981#action_12742981 ] 

Doug Cutting commented on HADOOP-6165:
--------------------------------------

> So there'd just be a single AvroSerialization?

Perhaps.  An alternative might be to have three: reflect, specific and generic.  Each could accept records if they have the right base class.  But if you read a file that was written with, e.g., specific and don't have that class, or data written by Python, that names no class, then you'd be unable to read that data.  Also, with Avro, you're not tied to records as the schema: values could be a union, a map, or an array.

If the data was written with reflect or specific, and you have the class used to write it loaded, then it's probably best to use that.  But in all other cases generic is probably your best bet.  I guess this could be implemented by placing generic last on the list, so that it accepts anything that has an Avro schema, with specific and reflect picking off things that have classes loaded.  Is that better?  I don't have a strong feeling.
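A minimal sketch of that ordering, under several assumptions: the "class" and "avroSchema" metadata keys are assumed names, and the marker interfaces are hypothetical stand-ins for whatever base types the specific and reflect serializations would actually test for.

```java
import java.util.HashMap;
import java.util.Map;

final class AvroOrderingSketch {
  static final String CLASS_KEY = "class";       // assumed key name
  static final String SCHEMA_KEY = "avroSchema"; // assumed key name

  /** Hypothetical stand-ins for the base types specific and reflect would check. */
  interface SpecificMarker {}
  interface ReflectMarker {}

  static final class Generated implements SpecificMarker {}
  static final class Pojo implements ReflectMarker {}

  /** Loads the class named in the metadata, or returns null if absent or unavailable. */
  private static Class<?> loadOrNull(Map<String, String> metadata) {
    String name = metadata.get(CLASS_KEY);
    if (name == null) {
      return null;
    }
    try {
      return Class.forName(name);
    } catch (ClassNotFoundException e) {
      return null;
    }
  }

  /** Tries specific and reflect first; generic is last and takes anything with a schema. */
  static String choose(Map<String, String> metadata) {
    Class<?> c = loadOrNull(metadata);
    if (c != null && SpecificMarker.class.isAssignableFrom(c)) {
      return "specific";
    }
    if (c != null && ReflectMarker.class.isAssignableFrom(c)) {
      return "reflect";
    }
    if (metadata.containsKey(SCHEMA_KEY)) {
      return "generic";
    }
    return null;
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new HashMap<>();
    metadata.put(SCHEMA_KEY, "\"string\"");
    metadata.put(CLASS_KEY, "no.such.GeneratedRecord"); // class not available to the reader
    System.out.println(choose(metadata));
  }
}
```

With this ordering, data whose writer class is not loaded, or that names no class at all, still resolves to generic as long as a schema is present.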

> Add metadata to Serializations
> ------------------------------
>
>                 Key: HADOOP-6165
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: contrib/serialization
>            Reporter: Tom White
>            Assignee: Tom White
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-6165-v2.patch, HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata. This assumes there is a one-to-one mapping between types and Serializations, which is overly restrictive. By permitting applications to pass arbitrary metadata to Serializations, they can get more control over which Serialization is used, and would also allow, for example, one to pass an Avro schema to an Avro Serialization.



[jira] Commented: (HADOOP-6165) Add metadata to Serializations

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745953#action_12745953 ] 

Hadoop QA commented on HADOOP-6165:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12417254/HADOOP-6165-v4.patch
  against trunk revision 806430.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 145 javac compiler warnings (more than the trunk's current 116 warnings).

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/619/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/619/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/619/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/619/console

This message is automatically generated.

> Add metadata to Serializations
> ------------------------------
>
>                 Key: HADOOP-6165
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6165
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: contrib/serialization
>            Reporter: Tom White
>            Assignee: Tom White
>            Priority: Blocker
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-6165-v2.patch, HADOOP-6165-v3.patch, HADOOP-6165-v4.patch, HADOOP-6165.patch
>
>
> The Serialization framework only allows a class to be passed as metadata. This assumes there is a one-to-one mapping between types and Serializations, which is overly restrictive. By permitting applications to pass arbitrary metadata to Serializations, they can get more control over which Serialization is used, and would also allow, for example, one to pass an Avro schema to an Avro Serialization.
