Posted to mapreduce-issues@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2009/10/20 19:15:59 UTC

[jira] Created: (MAPREDUCE-1126) shuffle should use serialization to get comparator

shuffle should use serialization to get comparator
--------------------------------------------------

                 Key: MAPREDUCE-1126
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
            Reporter: Doug Cutting


Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators, enabling efficient sorting of complex data types without having to write a RawComparator in Java.
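To make the motivation concrete, here is a sketch (not Hadoop code; all names are illustrative) of what a raw comparator does: it orders keys on their serialized bytes without deserializing them, which is why each new key type currently needs a hand-written Java RawComparator.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;

public class RawLongComparatorSketch {
    // Serialize a long as 8 big-endian bytes (the usual Writable-style encoding).
    static byte[] serialize(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // A "raw" comparator orders the serialized bytes directly, with no
    // deserialization. For big-endian two's-complement longs, flipping the
    // sign bit of the first byte makes unsigned lexicographic order match
    // signed numeric order.
    static final Comparator<byte[]> RAW = (a, b) -> {
        for (int i = 0; i < 8; i++) {
            int x = (a[i] & 0xFF) ^ (i == 0 ? 0x80 : 0);
            int y = (b[i] & 0xFF) ^ (i == 0 ? 0x80 : 0);
            if (x != y) return x - y;
        }
        return 0;
    };

    public static void main(String[] args) {
        byte[][] keys = { serialize(100L), serialize(-5L), serialize(3L) };
        Arrays.sort(keys, RAW);  // sorts without ever calling getLong()
        for (byte[] k : keys) {
            System.out.println(ByteBuffer.wrap(k).getLong()); // -5, 3, 100
        }
    }
}
```

A serialization such as Avro can derive a comparator like this from the schema, which is what the issue proposes to exploit.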

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806139#action_12806139 ] 

Chris Douglas commented on MAPREDUCE-1126:
------------------------------------------

{{s/supertype is Yak/subtype is Yak/}}

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators, enabling efficient sorting of complex data types without having to write a RawComparator in Java.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803474#action_12803474 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> the users aren't likely to change the serialization format from the default for the types they are using

There is no default for Long, ByteBuffer, Map<URL,byte[]>, unions, etc.  These are reasonable values for mapreduce processing.  Moreover, a given class might be serialized and compared in different ways by different jobs.
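The point that a given class might be serialized in different ways by different jobs can be made concrete with a long (a sketch; both encoders are hand-rolled for illustration): a fixed 8-byte Writable-style encoding and an Avro-style zig-zag varint produce different, incomparable byte streams for the same in-memory value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class TwoLongEncodings {
    // Fixed-width encoding: 8 big-endian bytes, Writable-style.
    static byte[] fixed(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // Variable-length zig-zag varint, in the style Avro uses: small
    // magnitudes (positive or negative) take fewer bytes.
    static byte[] zigZag(long v) {
        long n = (v << 1) ^ (v >> 63);           // zig-zag: fold sign into the low bit
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {              // emit 7 bits at a time
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(fixed(3L).length);   // 8
        System.out.println(zigZag(3L).length);  // 1
    }
}
```

A comparator written against one encoding is wrong for the other, so the sort order has to come from the serialization in use, not from the class alone.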

> we end up with 3 configuration knobs per class instead of the current 1. There are 6 classes in the MR pipeline, that means before this is done we have 18 methods to set the class/schema/serializer for the various types.

I don't follow your math here.  An application need call no more methods than it did before to configure a job.

> I thought that you were going to create a marker interface for AvroRecord that has a getSchema method.

That only works for the case when a single record is the top-level type.  It does not work for arrays, maps, enums, unions or primitives, all reasonable values for mapreduce.  Nor does it work where one might have a legacy Writable that one also sometimes wishes to process with Avro reflection.  In the general case, how you serialize something is independent of its in-memory representation.

We use the marker interface when we can, but we cannot always.  When a marker interface is appropriate, job configuration looks much as it did before: one can still set key and value classes.
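The "marker interface" under discussion could be sketched like this (hypothetical; the names SchemaCarrier, getSchemaJson, and WeatherRecord are invented for illustration, not real Hadoop or Avro API): a record type announces its own schema, so the framework can discover it from the class.

```java
// Hypothetical marker interface: a type carries its own schema.
interface SchemaCarrier {
    String getSchemaJson();
}

// A top-level record type can implement it...
class WeatherRecord implements SchemaCarrier {
    long stationId;
    double temperature;

    @Override
    public String getSchemaJson() {
        // Schema embedded as an Avro-style JSON string.
        return "{\"type\":\"record\",\"name\":\"WeatherRecord\","
             + "\"fields\":[{\"name\":\"stationId\",\"type\":\"long\"},"
             + "{\"name\":\"temperature\",\"type\":\"double\"}]}";
    }
}
```

...but, as the comment above notes, a bare Long, an array, a map, or a union has no user-defined class on which to hang such an interface, which is why this cannot be the only mechanism.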

> Notice that this adds another 5 types that we are going to want to serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per job. With the current proposal that means that we get 11 * 3 = 33 serialization methods.

I do not understand what you're counting here. Currently we specify job parameters with a configuration.  In the future, we might change this to instead use serialized instances of five interfaces.  If we do that, then we would declare a single mechanism by which these instances are serialized, just as we now declare the single mechanism by which a configuration is serialized.

> we need to use the global serialization/deserialization factory that we already have.

That's precisely what this patch does.  It updates the shuffle to no longer use the serialization factory in a now-deprecated manner.

> moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter.

This is an odd statement.  We're forever forbidden from making the MapReduce framework more abstract?

> Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.

That was established in HADOOP-6165.  If you'd like to revisit that representation for metadata, that should be done as a separate issue.  That API has not been released, and so would be easy to change if there is consensus to do so.  The current issue builds on the serialization metadata mechanism currently in trunk and should be evaluated independently of the serialization metadata representation.

> we need to check on the map side whether the type the map is outputting is correct.

That check is now done by the serializer.
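A sketch of what "the check is done by the serializer" can mean (illustrative only; this is not the actual Hadoop Serializer interface): the serializer is constructed for an expected class and rejects anything else, so the collector needs no separate type check.

```java
import java.io.IOException;

// Hypothetical type-checking serializer: the name TypedSerializer and the
// placeholder encoding are invented for illustration.
class TypedSerializer<T> {
    private final Class<T> expected;

    TypedSerializer(Class<T> expected) {
        this.expected = expected;
    }

    byte[] serialize(Object value) throws IOException {
        // The type check moves here, out of the collector.
        if (!expected.isInstance(value)) {
            throw new IOException("wrong key class: got "
                + value.getClass().getName() + ", expected " + expected.getName());
        }
        return value.toString().getBytes("UTF-8"); // placeholder encoding
    }
}
```

With this arrangement a map that emits the wrong type still fails fast, just one layer lower.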

> you need to support union types in Writable too.

I don't understand this suggestion.  A serialization system that implements unions should check unions, but a serialization system that does not support unions need not.

> I'm not wild about having Configuration.setMap.  Having a function in StringUtils that maps a Map<String,String> into and from a String seems more appropriate.

That was established in HADOOP-6420.  This API has not been released, and would be easy to change if there is consensus.  If you wish to revisit that API and have a better implementation idea, that should be done in a separate issue, considered separately from the current one.
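The StringUtils-style alternative being debated could look roughly like this (a sketch; MapStringCodec and its escaping scheme are invented for illustration): pack a Map<String,String> into one String and parse it back, escaping the delimiter characters so the encoding round-trips.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MapStringCodec {
    // Escape backslash, comma, and equals so they survive inside keys/values.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace(",", "\\,").replace("=", "\\=");
    }

    static String toString(Map<String, String> m) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(esc(e.getKey())).append('=').append(esc(e.getValue()));
        }
        return sb.toString();
    }

    static Map<String, String> fromString(String s) {
        Map<String, String> m = new LinkedHashMap<>();
        StringBuilder cur = new StringBuilder();
        String key = null;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                cur.append(s.charAt(++i));           // unescape next char
            } else if (c == '=' && key == null) {
                key = cur.toString(); cur.setLength(0);
            } else if (c == ',') {
                m.put(key, cur.toString()); key = null; cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        if (key != null) m.put(key, cur.toString()); // last entry
        return m;
    }
}
```

Such a helper would let a map-valued setting live in an ordinary Configuration string property, with no new Configuration.setMap method.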



[jira] Assigned: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball reassigned MAPREDUCE-1126:
----------------------------------------

    Assignee: Aaron Kimball



[jira] Issue Comment Edited: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802916#action_12802916 ] 

Owen O'Malley edited comment on MAPREDUCE-1126 at 1/20/10 6:19 PM:
-------------------------------------------------------------------

Not really. I see now what you are trying to accomplish, but I think it is the wrong model. While the FileInputFormat is similar in structure, the issues are:
  1. the analogy isn't precise because you aren't setting WritableJobData as a parameter somewhere.
  2. the users care about the types that come out of their map
  3. the users aren't likely to change the serialization format from the default for the types they are using
  4. we end up with 3 configuration knobs per class instead of the current 1. There are 6 classes in the MR pipeline, that means before this is done we have 18 methods to set the class/schema/serializer for the various types. That is *way* too complicated.

I thought that you were going to create a marker interface for AvroRecord that has a getSchema method. That would have enabled the AvroWriter to get the schema from the types rather than get the types from the schema.

I also think that removing the type checks from the collector and ifile code is a bad plan and will allow a lot of errors to reach much further into the system.

Let's consider the proposal that Arun has been discussing. Instead of doing:

{noformat}
   FileInputFormat.setInputPath(job, new Path("/foo"));
   job.setInputFormatClass(TextInputFormat.class);
{noformat}

you do:

{noformat}
  TextInputFormat input = new TextInputFormat();
  input.setInputPath(new Path("/foo"));
  job.setInputFormat(input);
{noformat}

Clearly the job needs to serialize the InputFormat object and reconstruct it on the other side. This is much, much easier for users to understand than the current model and can probably be done in a backwards-compatible manner. Notice that this adds another 5 types that we are going to want to serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per job. With the current proposal that means that we get 11 * 3 = 33 serialization methods.
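The "serialize the configured object and reconstruct it on the other side" step can be sketched with plain Java serialization (FakeInputFormat is a stand-in invented here; Hadoop's actual InputFormats are not configured or shipped this way today):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// A configured pipeline component, carrying its settings as instance state
// instead of as Configuration keys.
class FakeInputFormat implements Serializable {
    private static final long serialVersionUID = 1L;
    private String inputPath;

    void setInputPath(String p) { this.inputPath = p; }
    String getInputPath() { return inputPath; }
}

class RoundTrip {
    // Serialize the configured object on the client...
    static byte[] write(Serializable o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // ...and reconstruct it on the task side, settings intact.
    static Object read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }
}
```

The design question in this thread is which serialization mechanism performs that round trip, not whether such a round trip is possible.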

I think that:
  * we need to use the global serialization/deserialization factory that we already have.
  * moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter. As a general rule, if you need to modify all of the examples, we need to carefully discuss the issues.
  * the metadata should not be user visible and it would be better if it was just used to communicate within the serializer. Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.
  * we can debate whether the type restrictions on map outputs should be loosened, but certainly we need to check on the map side whether the type the map is outputting is correct. If you are going to loosen it, the class methods should become deprecated and vestigial and you need to support union types in Writable too.
  * I'm not wild about having Configuration.setMap. Having a function in StringUtils that maps a Map<String,String> into and from a String seems more appropriate.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797765#action_12797765 ] 

Hadoop QA commented on MAPREDUCE-1126:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12429576/MAPREDUCE-1126.4.patch
  against trunk revision 896781.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause tar ant target to fail.

    -1 findbugs.  The patch appears to cause Findbugs to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/252/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/252/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/252/console

This message is automatically generated.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Attachment: MAPREDUCE-1126.6.patch

Fixes javac warnings in examples and tools caused by deprecated JobConf methods (these now use the new jobdata API).



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805301#action_12805301 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

It is also "trivial" to wrap everything in Writable. Should we just do that?



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798313#action_12798313 ] 

Hadoop QA commented on MAPREDUCE-1126:
--------------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12429808/MAPREDUCE-1126.6.patch
  against trunk revision 897118.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 15 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/370/console

This message is automatically generated.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Jay Booth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805143#action_12805143 ] 

Jay Booth commented on MAPREDUCE-1126:
--------------------------------------

For what it's worth, I really like setOutputKeyClass.  While I understand the distinction that a given type might map to more than one serialization method, could we find a compromise that keeps the simplicity of a general setOutputKeyClass while providing the flexibility Doug's looking for?  This could have the added benefit of decoupling the class declaration from the serialization declaration, since they're currently coupled but conceptually separate.

Maybe something like:

{code}
// defaults to class-based introspection; if there are conflicts, Writable wins for legacy reasons
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
...

// explicitly use Avro
job.setOutputKeyClass(AvroText.class);
job.setOutputValueClass(AvroText.class);
job.setOutputKeySerialization(new AvroSerializationSpec(conf, schema));
job.setOutputValueSerialization(new AvroSerializationSpec(conf, Schema.create(Schema.Type.NULL)));
{code}

Under this setup, the explicit Avro situation will be slightly wordier, but all serialization formats could go in through a common object type (SerializationSpec or something) and we wouldn't have to have a different static setter for every different serialization type.  It also ensures that keyClass and valueClass are set the same way regardless of serialization type, which I'd think is worth having if possible.

Another option would be to enforce the 4-line version in all cases so that people don't get unpredictable behavior based on class introspection.  I could see going either way on that one: it's more robust, but would break legacy code and is more typing for a lot of common cases where there's no chance of confusion.

For what it's worth, I cringe a little bit when doing Class.staticMethod(Job); I'm always worried in the back of my mind about squashing already-resident values.  Yeah, I know, namespacing, but I generally feel like if something is core functionality then it should be accounted for in the base Job class.
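A minimal version of the SerializationSpec object hypothesized above might look like this (all names are illustrative; no such class exists in Hadoop):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical common carrier for "how is this key/value serialized":
// one object type for every format, so Job needs only one setter per slot.
class SerializationSpec {
    private final String serializationName;                 // e.g. "writable", "avro"
    private final Map<String, String> metadata = new LinkedHashMap<>();

    SerializationSpec(String serializationName) {
        this.serializationName = serializationName;
    }

    // Fluent attachment of format-specific settings, e.g. an Avro schema.
    SerializationSpec with(String key, String value) {
        metadata.put(key, value);
        return this;
    }

    String getSerializationName() { return serializationName; }
    String get(String key) { return metadata.get(key); }
}
```

Usage would then be uniform across formats, e.g. job.setOutputKeySerialization(new SerializationSpec("avro").with("schema", schemaJson)) — again, a hypothetical API, not one proposed verbatim in the patch.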



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830873#action_12830873 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

Tom's patch adds two calls to that exact method. Neither of them is in a deprecated method.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: m-1126-2.patch, m-1126-3.patch, MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators, enabling efficient sorting of complex data types without having to write a RawComparator in Java.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829222#action_12829222 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Arun> allowing map-in-key and map-in-value to have different serialization mechanisms seems like an over-kill. Do we have use-cases for such requirements?

I don't think this is a requirement.  Nor am I convinced, long-term, that separating keys and values is a requirement; but it is a legacy.  We already permit different class-specified serializations for keys and values.  This issue proposes to consistently alter how key and value serializations are specified: no longer restricted to class-only, but rather consistent with the serialization API.  Doing this without permitting keys and values to have different serializations would be harder than simply allowing it, as it would require introducing a new concept, the serialization context, while the current patch is simply the logical combination of the existing MapReduce API and the existing serialization API.

Perhaps one of the causes of this present tension is that Job.java is striving to be two things at once: a high-level API for folks directly building Java mapreduce applications and a low-level API for folks building higher-level abstractions based on mapreduce.  Perhaps we should instead add a new low-level API.  This might look something like:

{code}
final class Split {
  String[] locations;
  ByteBuffer data;
}
interface MapEmitter { // implemented by kernel
  void emit(ByteBuffer datum, int partition);
}
interface CombineEmitter { // implemented by kernel
  void emit(ByteBuffer datum);
}
interface Job {  // implemented by user code
  Split[] getSplits();
  void map(ByteBuffer splitData, MapEmitter out);
  int compare(ByteBuffer x, ByteBuffer y);
  boolean hasCombiner();
  void combine(Iterator<Iterator<ByteBuffer>> values, CombineEmitter out);
  void reduce(Iterator<Iterator<ByteBuffer>> values, String attemptName);
  void abandon(String[] attemptNames);
  void commit(String[] attemptNames);
}
{code}

This is of course over-simplified.  We'd need to pass a Context through to job methods, so that they can report progress, etc.  We'd also need to figure out how the Job implementation itself is specified, serialized, reconstituted.  Etc.

So perhaps a way around the current impasse is to devise a new, lower-level, mapreduce API that will then support multiple higher-level APIs, including the current API.  Since there is resistance to evolving the existing API, it might remain untouched as a legacy API.  Folks who, e.g., wish to use Avro would use a new higher-level MapReduce API that we'd develop on top of this shared low-level runtime.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805149#action_12805149 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Jay, there's been an effort to move job accessor methods away from Job.java unless they're required by the mapreduce framework.  For example, Job#setInputPath(Path) moved to FileInputFormat#setInputPath(Job, Path), since not all jobs have files for input, and the file-specific code is all user-replaceable, not built into the mapreduce system.  Similarly, after the current patch for this issue is applied, the framework no longer contains calls to any particular serialization system (except in deprecated back-compatibility methods).



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806472#action_12806472 ] 

Scott Carey commented on MAPREDUCE-1126:
----------------------------------------

bq.@Scott: the annotations for Input/OutputFormat seem to be misplaced. It seems desirable to be able to write a single Map function that does wordcount on Strings, regardless of whether those strings are stored in newline-delimited text, sequence files, avro data files, or whatever.

Philip, yes, they are not in the right place.  I just wanted to bring into the conversation that 'SomeObject.setSomeBinding()' is not the only way to do this sort of thing.  Annotations, unlike setter methods, can be moved around and adapted to work in various ways without breaking APIs.  For example, the Input/OutputFormat annotation could go on either a Map class or some other more specific annotation site, with defaults and priority (set on configuration > annotated on configuration > annotated on map > default) determining which applies.

After thinking about it a bit more, and doing some research into how other APIs do some tricky things with Annotations, there are a few things to consider.  
* It is possible in some situations to infer the generic types of a class at runtime by constructing an instance of an object with the same type arguments.  Example: http://wiki.fasterxml.com/JacksonInFiveMinutes#Data_Binding_with_Generics.
* Annotations on class A can be applied to class B "Mix-In Annotations"; http://wiki.fasterxml.com/JacksonMixInAnnotations
* Post-compile time checks via an annotation processor can validate code before run time in cases where the current M/R framework only breaks at run time.
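The first bullet's trick (often called a "super type token") can be sketched in isolation. This is an illustrative, self-contained rendition of the pattern behind the Jackson link above, not code from any patch on this issue:

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Constructing an anonymous subclass forces the compiler to record the
// generic supertype, so the type argument survives erasure.
abstract class TypeToken<T> {
  private final Type type;

  protected TypeToken() {
    ParameterizedType superclass =
        (ParameterizedType) getClass().getGenericSuperclass();
    this.type = superclass.getActualTypeArguments()[0];
  }

  public Type getType() { return type; }
}

public class TypeTokenDemo {
  public static void main(String[] args) {
    // The trailing {} creates the anonymous subclass that captures the type.
    Type t = new TypeToken<List<String>>() {}.getType();
    System.out.println(t);  // java.util.List<java.lang.String>
  }
}
```

A framework could use the same inspection to recover a parameterized type that a plain Class object cannot express.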

What I think is most important to this discussion is that some layers of configuration complexity can be hidden from users, and some of it deferred to the future.
The 'site' of the configuration can be moved around with Annotations, opening up ways to simplify the steps required to do declarative configuration.

With this in mind, some additional complexity to the procedural configuration methods is more acceptable if there are good defaults and a later (backwards compatible) API addition simplifies things.  Likewise, some elements of complexity can be skipped for now if it can be seen that those could be available through a configuration extension later.  Perhaps the procedural API would never allow configuring a key and value to use different serializers to avoid API complexity, but an annotation extension in the future allows that.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831032#action_12831032 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

My primary point is that this issue got well out of scope. I was trying to find the subset of the patch that addresses "shuffle should use serialization to get comparator" itself, leaving the much wider-ranging discussion of changing the MapReduce type and serialization system to the follow-on JIRA, MAPREDUCE-1462.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828686#action_12828686 ] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

Yesterday I spoke to Owen offline about his design for this JIRA. Briefly, it works as follows. (I apologize in advance for any errors due to misunderstandings on my part! Owen, please correct anything I've got wrong.)

The Serialization classes change as follows. SerializationFactory becomes RootSerializationFactory, Serialization becomes SerializationFactory, and each Serializer/Deserializer pair is combined into a single Serialization. Serialization has the ability to write itself to and read itself from a stream, using its own serialization format. There is a subclass of Serialization called TypedSerialization, which is subclassed by WritableSerialization. AvroGenericSerialization would not be a TypedSerialization. RootSerializationFactory can map types into Serializations in the usual manner (via io.serializations). Non-typed serializations, such as AvroGenericSerialization, would be set explicitly as described below.
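For concreteness, here is a toy rendition of that hierarchy. Only the Serialization/TypedSerialization names come from the description above; the method signatures and the LongSerialization example are invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// A Serialization combines the old Serializer/Deserializer pair and can
// additionally persist its own configuration in its own format.
abstract class Serialization<T> {
  abstract void serialize(T obj, DataOutputStream out) throws IOException;
  abstract T deserialize(DataInputStream in) throws IOException;
  abstract void writeSelf(DataOutputStream out) throws IOException;
  abstract void readSelf(DataInputStream in) throws IOException;
}

// Typed serializations are selected by Java class, as with Writables;
// non-typed ones (e.g. an Avro generic serialization) are set explicitly.
abstract class TypedSerialization<T> extends Serialization<T> {
  Class<T> type;
}

class LongSerialization extends TypedSerialization<Long> {
  LongSerialization() { type = Long.class; }
  void serialize(Long obj, DataOutputStream out) throws IOException {
    out.writeLong(obj);
  }
  Long deserialize(DataInputStream in) throws IOException {
    return in.readLong();
  }
  void writeSelf(DataOutputStream out) { }  // stateless: nothing to persist
  void readSelf(DataInputStream in) { }
}

public class SerializationDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    LongSerialization ser = new LongSerialization();
    ser.serialize(42L, new DataOutputStream(buf));
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
    System.out.println(ser.deserialize(in));  // 42
  }
}
```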

Instead of the metadata map, each job context (map input key, map output key, etc) has a serialized serialization that is deserialized for the Task (by the framework) and used to carry out the serialization for that context. The serialized serializations are not stored in the job configuration, but rather are stored in a new file in the job directory which has the format {{(context, serialization class, serialization bytes)+}}.

In terms of configuration for the user, the API looks like the one described by Arun and Chris. That is, {{Job#setSerialization(MapReduceSerializationContext, Serialization)}}.

Some comments:
* The changes to the serialization API are not backwards compatible, so a new package of serializer types would need creating.
* I'm not sure why we need to serialize serializations. The current patch avoids the need for this by using a simple string mechanism for configuration. Having an opaque binary format also makes it difficult to retrieve and use the serialization from other languages (e.g. C++ or other Pipes languages). The current patch is language-neutral in this regard.
* Adding a side file for the context-serializer mapping complicates the implementation. It's not clear what container file would be used for the side file (Avro container, custom). I understand that putting framework configuration in the job configuration may not be desirable, but it has been done in the past so I don't know why it is being ruled out here. I would rather have a separate effort (and discussion) to create a "private" job configuration (not accessible by user code) for such configuration.
* The user API is no shorter than the one in the current patch. Compare:

{code}
Schema keySchema = ...
AvroGenericSerialization serialization = new AvroGenericSerialization();
serialization.setSchema(keySchema);
job.set(MAP_OUTPUT_KEY, serialization);
{code}

with

{code}
Schema keySchema = ...
AvroGenericData.setMapOutputKeySchema(job, keySchema);
{code}

I was hoping it might help reach consensus if we could incorporate some of Owen's ideas with the existing patch. However it is not clear to me how to do this. In the meantime, I would appreciate it if someone would review the latest patch. Thanks!




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804934#action_12804934 ] 

Arun C Murthy commented on MAPREDUCE-1126:
------------------------------------------

{quote}
1. We should use the current global serializer factory for all contexts of a job.
4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).
{quote}

+1, strongly agree.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787812#action_12787812 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

I wonder if instead of adding more methods to JobContext we ought to add these to relevant serialization implementations.  For example, we might have WritableSerialization.setMapOutputKeyClass(Class) and AvroSerialization.setMapOutputKeySchema(Schema).  This makes the methods perhaps harder for folks to find, but it bakes less into JobContext.  The serialization system is entirely user code, so it seems reasonable that the kernel should not directly support it.  With this, JobContext would only have serialization agnostic methods like get/setMapOutputKeySerializationMetadata() and getMapOutputKeySerializer().
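A minimal sketch of what such a static setter might look like, with java.util.Properties standing in for Hadoop's Configuration; the class name and property key are hypothetical, not the actual Hadoop API:

```java
import java.util.Properties;

// Serialization-specific setters write into the job configuration, so
// JobContext itself needs no serialization-specific methods.
class WritableSerializationConfig {
  // Hypothetical property key, invented for this sketch.
  static final String MAP_OUTPUT_KEY_CLASS = "example.map.output.key.class";

  static void setMapOutputKeyClass(Properties conf, Class<?> keyClass) {
    conf.setProperty(MAP_OUTPUT_KEY_CLASS, keyClass.getName());
  }
}

public class SetterDemo {
  public static void main(String[] args) {
    Properties conf = new Properties();
    WritableSerializationConfig.setMapOutputKeyClass(conf, Long.class);
    System.out.println(
        conf.getProperty(WritableSerializationConfig.MAP_OUTPUT_KEY_CLASS));
    // prints java.lang.Long
  }
}
```

The serialization factory would later read the same property back to construct the serializer, keeping the kernel serialization-agnostic.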




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804118#action_12804118 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Owen, I would like to figure out what we need to do to address your concerns without losing functionality.

Some notes on functionality:

We need to support non-record top-level types.  A Python or Ruby program might generate data whose top-level type is a union, array, map, long, string, etc.  A serialization-specific class is not a natural Java representation for such data.  We should not create AvroSpecificLong, AvroReflectLong, etc. wrappers as we have been forced to do with Writable.  Rather applications should be able to use built-in Java types for such data.

A class alone is not sufficient to specify a serialization.  One must be able to specify both:
 - the binary format of the data
 - the mapping from binary to in-memory

These are distinct. In Avro a schema defines the binary format and a DatumReader usually defines the mapping from binary to in-memory.  Avro includes three different DatumReader implementations: generic, specific and reflect.  These map built-in types differently.  For example, ReflectDatumReader maps Avro strings to java.lang.String while SpecificDatumReader maps Avro strings to org.apache.avro.util.Utf8.

Moreover an application can use a lower-level event-based API to map its in-memory Java data structures to Avro.  For example, Pig data is a union of Bag, Tuple and some built-in types.  It's straightforward to define an Avro schema for this, but none of Avro's provided DatumReader implementations might be optimal.  Reflection is slow.  Specific and generic would require copying data to-and-from generated classes.  So Pig might best declare its schema and then read and write its own data structures directly as Avro data.  Avro includes tools to efficiently make this type-safe.  So Pig might define its own serialization for Long, null, etc.

In the current patch one would use AvroGenericData.setMapOutputKeySchema(conf, schema) to specify that Avro's generic in-memory representation should be used for data whose binary format corresponds to a given schema.  One would use WritableData.setMapOutputKeyClass(conf, class) to specify that a Writable class should be used to define both the binary format and its in-memory representation.  Both of these set configuration parameters used by the serialization factory to produce an appropriate serializer.  A third party can define new serializers for, e.g., Pig data, and configure their jobs to use it with something like PigData.setMapOutputFormat(conf).

We cannot easily get this from the InputFormat.  We could instead configure the input format with such information, and have it use the serialization factory.  Applications would still need to set the same number of parameters, just in a different place: the binary format still needs to be declared, as does the mapping from binary to in-memory; these cannot be inferred automatically.  That's a more substantial change that we did not wish to make in this patch.

Moreover, the InputFormat is irrelevant to the current patch, since the serialization used during the shuffle can be different from that used for input and output.  We don't force input and output serializations to be identical, nor should we force intermediate serialization to match one or the other.  The map input might be legacy data, and the map function might convert it to a different representation that requires a different serialization.  This patch concerns intermediate data.  MAPREDUCE-815 makes the corresponding changes for input and output data.

The tests for MAPREDUCE-815 include an end-to-end Avro job whose key uses Avro's generic data representation and whose value is simply null.  Have a look at the code that creates such a job: it looks much like our job creation today, the API doesn't appear fundamentally different to me.  It's not currently possible to specify null as a MapReduce value, since null has no class.  NullWritable is a crude, serialization-specific manner for doing this that, long-term, I hope we can deprecate in favor of simply declaring and passing null when that's appropriate.

I hope these observations can help us reach consensus without extensive delay.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated MAPREDUCE-1126:
-------------------------------------

    Attachment: m-1126-3.patch

I was a little too aggressive in my trimming. Here is a better patch.

Doug, neither this nor my previous patch uses any deprecated APIs.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805679#action_12805679 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Which Avro serializer? Avro includes three different mappings from in-memory to binary representations, and applications can add more. In generic, a java.lang.String represents an enum symbol, while in reflect it represents a string.

And do we really want to privilege Avro here? It should be possible to use Thrift too, and intermix the two within a single job. A Long in the input might be part of a Thrift union, and a Long in the output may use Avro.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805760#action_12805760 ] 

Arun C Murthy commented on MAPREDUCE-1126:
------------------------------------------

bq. Great points, Chris. Yahoo! has stated that a significant majority of their MapReduce jobs are written in Pig, and Facebook says the same of Hive. Among our many customers at Cloudera, it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice.

Uh, no. That is precisely the point - making it slightly harder on _framework_ authors is better than making it harder for the average user of the Map-Reduce API. Only the framework authors pay the cost...

---- 

Along similar sentiments I'd like to re-state:

{quote}
1. We should use the current global serializer factory for all contexts of a job.
4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).
{quote}

I'm not convinced we need to allow multiple serialization mechanisms for the same job. I'm also much less convinced that we need to allow a serializer per map-in-key, map-in-value, map-out-key, map-out-value, reduce-out-key, reduce-out-value etc.

I can see that we might have some phase of transition where people move from Writables to Avro as the preferred serialization mechanism. For example, they might have SequenceFiles with Writables as input-records and might produce SequenceFiles with Avro output-records. However, even with a single serializer factory for all contexts of a job, it is trivial to write wrappers, provide bridges in libraries or other frameworks, etc. to cross the chasm.

----

At a later point, *iff* we get to a world where we need to reconcile multiple serialization mechanisms for the same job on a regular basis, e.g. a world where we have a lot of data in Writables *and* Avro *and* Thrift, I'd like to propose a slightly less involved version of Chris's proposal.

The simplification is that we view 4 separate 'record contexts':
# INPUT (map-in-key, map-in-value)
# INTERMEDIATE (map-out-key, map-out-value)
# OUTPUT (reduce-out-key, reduce-out-value)
# JOB_DEFINITION (currently only InputSplit, possibly more in future via MAPREDUCE-1183)

Then we have Chris's proposal:

{noformat}
enum Context {
  INPUT,
  INTERMEDIATE,
  OUTPUT,
  JOB_DEFINITION
}

Job::setSerializationFactory(Context context, SerializationFactory...)
{noformat}

Thus we allow serializers to be specified for the 'records' flowing through the Map-Reduce framework... allowing map-in-key and map-in-value to have different serialization mechanisms seems like an over-kill. Do we have use-cases for such requirements? 



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802537#action_12802537 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

The problem is highlighted by this part of the patch:

{noformat}
-    job.setMapOutputKeyClass(LongWritable.class);
-    job.setMapOutputValueClass(BytesWritable.class);
+    WritableJobData.setMapOutputKeyClass(job.getConfiguration(),
+        LongWritable.class);
+    WritableJobData.setMapOutputValueClass(job.getConfiguration(),
+        BytesWritable.class);
{noformat}

That is a *really* ugly change to the user's application. If anything, I want the API to infer the map output types from the Mapper's type parameters by default.
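The inference Owen mentions is mechanically straightforward for concrete Mapper subclasses. A sketch using a simplified stand-in {{Mapper}} class (not the Hadoop one), under the assumption that the user's class binds the type parameters directly:

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Sketch: recover a Mapper's output key/value types from its generic
// type parameters via reflection, so setMapOutputKeyClass would be
// unnecessary by default. Mapper here is an illustrative stand-in.
public class TypeInference {

  // Stand-in for org.apache.hadoop.mapreduce.Mapper<KIN, VIN, KOUT, VOUT>.
  abstract static class Mapper<KIN, VIN, KOUT, VOUT> { }

  static class WordCountMapper extends Mapper<Long, String, String, Integer> { }

  /** Returns the i-th type argument a concrete class binds on Mapper. */
  static Class<?> mapperTypeArg(Class<? extends Mapper<?, ?, ?, ?>> c, int i) {
    Type superType = c.getGenericSuperclass();
    if (!(superType instanceof ParameterizedType)) {
      throw new IllegalArgumentException(c + " does not bind Mapper's type args");
    }
    return (Class<?>) ((ParameterizedType) superType).getActualTypeArguments()[i];
  }

  public static void main(String[] args) {
    // Indices 2 and 3 are KOUT and VOUT.
    System.out.println(mapperTypeArg(WordCountMapper.class, 2).getSimpleName());
    System.out.println(mapperTypeArg(WordCountMapper.class, 3).getSimpleName());
  }
}
```

This only works when the subclass is direct and concrete; anonymous or deeply parameterized mappers would still need explicit setters, which is presumably why the setters exist at all.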

We used to infer the serialization from the types. That was a clean model. Why are you trying to change it? The type should imply a schema almost always. Am I missing something? Why would someone want to set a schema separately from the type?

Additionally, you have the framework depending on a library. That is a problem. The right solution is to have an interface in mapreduce and have the library code implement it.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801007#action_12801007 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Owen, do you have specific concerns?



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805233#action_12805233 ] 

Ted Dunning commented on MAPREDUCE-1126:
----------------------------------------


Isn't there a middle ground available (at least from the user's point of view)?

My thought would be that if the user specifies types in the current style, they would be limited to Writables in the current fashion.  That could be marked as old-fashioned, but I wouldn't necessarily deprecate it.  It does leave Writable in a privileged position relative to other serialization frameworks, but it *is* in a privileged position since it existed first.

Alternately, the user could specify a serialization-framework-specific configuration much like Doug suggests. It should be true that if any non-standard serialization is used, then specifying a type is an error, and vice versa. This should be easy to check.
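The mutual-exclusion check Ted describes is indeed trivial. A hedged sketch, with hypothetical configuration values standing in for whatever the job config would actually store:

```java
// Sketch of Ted's check: a job may set either an output key class
// (old Writable style) or a serialization-specific schema (new style),
// but not both, and not neither. Parameter names are illustrative.
public class ConfigCheck {

  static void validate(String outputKeyClass, String outputKeySchema) {
    if (outputKeyClass != null && outputKeySchema != null) {
      throw new IllegalStateException(
          "Set either an output key class or a key schema, not both");
    }
    if (outputKeyClass == null && outputKeySchema == null) {
      throw new IllegalStateException(
          "One of output key class or key schema must be set");
    }
  }

  public static void main(String[] args) {
    validate("org.apache.hadoop.io.LongWritable", null); // old style: ok
    validate(null, "{\"type\": \"string\"}");            // new style: ok
    System.out.println("both configurations validate");
  }
}
```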

From the user's point of view, they could use old-style job configuration or the new style that Doug suggests.  I strongly prefer the new style, but I wouldn't be anxious to have to change all my old style programs.

Under the covers, almost anything could happen, but the important thing that would happen is that if any special serialization is invoked, the job config would need to know about it which might affect many other components like the shuffle.

Is there any technical reason why this cannot be made to work?

Is there really any philosophical reason that old programs must be broken?

If no and no, why is there a problem here?  I think that this middle ground would satisfy Owen's (and my own) needs for backwards compatibility as well as Doug's (and my own) desire for flexibility for serialization.





[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803975#action_12803975 ] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

This patch is making changes that make it possible to take advantage of the more general serialization mechanism introduced in HADOOP-6165 in MapReduce. Until now the serialization used for a key or value was driven by the type of the key or value. This is not sufficiently general for Avro, which is what motivated the work in HADOOP-6165. (Note that HADOOP-6165 did not have any effect on user APIs, since users don't typically interact with serialization classes directly.) However, it is true that many serialization frameworks *are* type driven - Writables, Thrift, Java Serialization, Avro Specific, to name a few - so I think there may be an argument to retain job.setMapOutputKeyClass() as it currently stands. The advantage is that existing Writable-based jobs do not have to be changed, which I think is at the heart of Owen's criticism.
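The type-driven dispatch Tom describes can be pictured as a first-match lookup over registered serializations. The sketch below mirrors the general shape of that mechanism with simplified stand-in types; it is not the real Hadoop {{SerializationFactory}} API:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Sketch of type-driven serialization lookup: the first registered
// serialization whose accept() matches the key/value class wins.
// All types here are simplified stand-ins for illustration.
public class TypeDrivenLookup {

  interface Serialization {
    boolean accept(Class<?> c);
    String name();
  }

  /** Stand-in for the Writable marker interface. */
  interface Writable { }

  static class WritableSerialization implements Serialization {
    public boolean accept(Class<?> c) { return Writable.class.isAssignableFrom(c); }
    public String name() { return "writable"; }
  }

  static class JavaSerialization implements Serialization {
    public boolean accept(Class<?> c) { return Serializable.class.isAssignableFrom(c); }
    public String name() { return "java"; }
  }

  private final List<Serialization> serializations = new ArrayList<>();

  public void register(Serialization s) { serializations.add(s); }

  /** Registration order decides ties; first accepting serialization wins. */
  public Serialization getSerialization(Class<?> c) {
    for (Serialization s : serializations) {
      if (s.accept(c)) return s;
    }
    throw new IllegalArgumentException("No serialization for " + c.getName());
  }

  static class LongWritable implements Writable { }

  public static void main(String[] args) {
    TypeDrivenLookup factory = new TypeDrivenLookup();
    factory.register(new WritableSerialization());
    factory.register(new JavaSerialization());
    System.out.println(factory.getSerialization(LongWritable.class).name());
    System.out.println(factory.getSerialization(String.class).name());
  }
}
```

The limitation for Avro Generic is visible here: a single Java class (e.g. a generic record type) can carry many different schemas, so the class alone cannot select the right serializer, which is what motivates schema-based configuration.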

For Avro Generic, or serializations where the schema for the types needs to be specified, we can use the AvroGenericJobData class in this patch. (BTW Aaron, why does SchemaBasedJobData exist? It seems to reference Avro internally, even though its name suggests it is general.) In this case, there would be no need to call job.setMapOutputKeyClass().

Would this address folks' concerns?




[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797305#action_12797305 ] 

Aaron Kimball commented on MAPREDUCE-1126:
------------------------------------------

As a followup: ReduceTask#SkippingReduceValuesIterator#getClassFromMetadata is necessary due to the dependency on WritableSerialization for supporting skipping. When record skipping works more generally, that method will disappear and the correct jobdata methods will be used.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787803#action_12787803 ] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

+1 This sounds good to me.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Status: Patch Available  (was: Open)

Marking patch-available now that dependencies have been checked in.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787816#action_12787816 ] 

Aaron Kimball commented on MAPREDUCE-1126:
------------------------------------------

Doug: That's in line with what we do in FileInputFormat / FileOutputFormat, etc.; it stands to reason that we should treat intermediate (k, v) pairs the same way we configure our inputs and outputs.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805166#action_12805166 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

{quote}
Well, ok, guess I'm in the minority on that instinct 
{quote}
No, that is pretty constant from the users. :)

My assertion is that leaving the type as the primary instrument of the user in defining the job is correct. I haven't talked to *any* users that care about using a non-default serializer for a given type.

I certainly do want Avro to go through the MapReduce pipeline. I guess I'll need to implement a prototype to make it clear how Avro-based MapReduce could work without a massively incompatible user-facing change.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805106#action_12805106 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

For folks' consideration, here's what creation of an Avro-based Job looks like after this and MAPREDUCE-815 are committed:

{code}
    // Create a MapReduce job to sort this input.
    Job job = new Job();
    job.setInputFormatClass(AvroInputFormat.class);
    job.setOutputFormatClass(AvroOutputFormat.class);
    job.setNumReduceTasks(1);

    Configuration conf = job.getConfiguration();
    SchemaBasedJobData.setOutputKeySchema(conf, schema);
    SchemaBasedJobData.setOutputValueSchema(conf,
        Schema.create(Schema.Type.NULL));

    Path outPath = new Path(workDir, "out");

    FileInputFormat.addInputPath(job, inPath);
    FileOutputFormat.setOutputPath(job, outPath);

    boolean result = job.waitForCompletion(true);
{code}

Is this API markedly more complex than what we have now?  It is markedly more powerful, since, e.g., nulls, unions, arrays, maps, and built-in types may be passed directly, without creating a wrapper class.  A custom input serialization (e.g., for log files or Pig data) can be easily and freely intermixed in a job with a library of standard serializations (Avro in various flavors, Writable, etc.).



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804734#action_12804734 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

I'm very disappointed that this jira went in with a title and description that completely misrepresented the content and scope of the patch. This patch *completely* revamps the type system and semantics of the map/reduce framework. Changing that without a large discussion is uncool.

I disagree with the fundamental approach taken here. The details are also problematic, but we need to find an acceptable model before any progress on this or any related patches can be made.

My concerns are:
  1. We should use the current global serializer factory for *all* contexts of a job. We have 7 serialized types already (map in key, map in value, map out key, map out value, reduce out key, reduce out value, input split). We will likely end up with more types later. Having a separate serializer and metadata for each type will be extremely confusing to the users.
  2. Defining the schema should be an Avro specific function and not part of the framework.
  3. I don't see any reason to support union types at the top level of the shuffle. There are already libraries that handle this without changing the framework. Furthermore, an Avro record on top of the schema is free in serialization size.
  4. Only the default comparator should come from the serializer. The user has to be able to override it in the framework (not change the serializer factory).

That said, I think that it is perfectly reasonable for the Avro serializer to accept all types. So if you have a Mapper<String,String,String,String> it will use Avro serialization.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829253#action_12829253 ] 

Aaron Kimball commented on MAPREDUCE-1126:
------------------------------------------

Jay,

Right now Mappers are configured directly by the user, and the InputSplits are allocated by the InputFormat. A system that combines the two is intriguing, but would represent a pretty big shift in API. I think this is something you might want to file as a separate JIRA; this thread has enough ideas floating around on it already, and something like that would require its own separate discussion.

- Aaron



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Attachment: MAPREDUCE-1126.2.patch

Attaching a new patch for this issue that incorporates the API changes initially proposed as HADOOP-6438.

This adds a package {{org.apache.hadoop.mapreduce.lib.jobdata}} which contains static classes to set class-based or schema-based metadata. Explicit subclasses exist for particular serialization frameworks (e.g., WritableSerialization) to allow users to denote which framework they are using; in general their behavior is the same as that of their superclasses.

This keeps the mapreduce-specific job configuration data in the MapReduce project, but does not clutter JobContext with per-serialization-framework setters for clients to use. 



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802571#action_12802571 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Would this change be more palatable if we used a shorter term than "JobData", instead just calling it "Data"?  Combined with changing the setters to accept a JobContext directly, this would look like:
{code}
-    job.setMapOutputKeyClass(LongWritable.class);
-    job.setMapOutputValueClass(BytesWritable.class);
+    WritableData.setMapOutputKeyClass(job, LongWritable.class);
+    WritableData.setMapOutputValueClass(job, BytesWritable.class);
{code}
Is that better?




[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802033#action_12802033 ] 

Jeff Hammerbacher commented on MAPREDUCE-1126:
----------------------------------------------

Hey Owen,

We'd like to address your concerns on this issue as soon as possible, as there are many other tasks which depend upon this particular issue. Could you please address Aaron and Doug's responses?

Thanks,
Jeff



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Jay Booth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829266#action_12829266 ] 

Jay Booth commented on MAPREDUCE-1126:
--------------------------------------

Yeah, I'm definitely not a stakeholder in this issue or in low-level MapReduce in general.  I was just saying that if we're throwing around the idea of an entirely new underlying API anyway, maybe now would be the time to consider this: it would be pretty simple to wrap the existing API around the concept if it were baked into the lower-level API, but harder to add the concept to the lower-level API once it has solidified.

Either way, I don't have the mojo to make a big push for it; I was just throwing the idea out there in case we decide to go forward with a different underlying API.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805735#action_12805735 ] 

Scott Carey commented on MAPREDUCE-1126:
----------------------------------------

I am neck deep in building stuff on Avro.  I've also got a custom Pig reader that reads only my Avro record types as a prototyping stopgap until there is a general solution. 

Ted's idea of a middle ground sounds useful.  Special-casing Writables is OK as long as they don't have to be used.  Making all the new stuff harder to use sounds like a bad idea.  Ideally, a schema system means that as a user I never have to write a Writable again.  Those are annoying enough.

bq. For union types, it is a trivial restriction to limit map/reduce users to wrap records around the unions 

Wrappers are trivial to deal with in a schema declaration.  They are not trivial on the other side.  First you have Foo and then Bar, and neither needs a wrapper.  But then they might, because sometimes you want to serialize Bar in columnar order and sometimes you don't.  Now you have Bar and ColumnarBar.  Then you realize you need a union, so you make FooBarUnion.  Then in another use case you need BarFooUnion (different order, same equals and hashCode -- fun with permutations when the union is large).
Then you have composite types: FooBarStringNullUnion and FooStringLongUnion.
Mapping serialization options to classes is not fun.  Wrappers are a useful design pattern for many purposes, but not for encapsulating one-to-many relationships.


In general, users are moving away from writing directly to the MapReduce API and toward various frameworks.  Making those frameworks high-performance, powerful, and expressive is more important, IMO, than preventing the low-level MapReduce API from getting a bit more complicated.
As for end users writing MapReduce directly, the current situation is not all that pretty anyway: four generic type arguments that must be perfectly aligned with several type-setting calls on the job configuration to avoid a runtime error?  The Map and Reduce classes have compile-time checking on a few types, and that's it.


Choosing a serialization is a declarative task, not a procedural one.  Annotations are what Java has right now for declarative metadata.  Unfortunately, very few people are experts at building annotation-based frameworks or at using tools like ASM to enrich their capabilities.  Have there been any proposals to let annotations handle this more cleanly than these declarative setter methods?

I haven't thought it through that far, but here's a quick annotation-based idea that could bolt onto the work above or onto the current API.  This is just a point of contrast for this discussion, not a proposal to change course -- perhaps it sparks some ideas for simplifying the user experience while also making the framework more powerful.  With the right tools, those can go hand in hand.

WordCount with configurable Input/Output formats via annotations:
{code}
public class WordCount {

  // If missing, each of these has a default, or can be set with the traditional setters.
  @Input(TextInputFormat.class)
  @Output(AvroOutputFormat.class)
  @InputWritables(key = LongWritable.class, val = Text.class)
  @SchemaConfig(SchemaBasedJobData.class)
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Integer> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Integer> output, Reporter reporter) throws IOException {
      . . .
    }
  }

  @Input(AvroInputFormat.class)
  @Output(AvroOutputFormat.class)
  @SchemaConfig(SchemaBasedJobData.class)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Integer, Text, Integer> {
    public void reduce(Text key, Iterator<Integer> values,
        OutputCollector<Text, Integer> output, Reporter reporter) throws IOException {
      . . .
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Infer inputs and outputs from the annotations on the Map, Reduce, and Combiner
    // classes; throw an error if these are not compatible.
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    // The "old way" still works -- setting individual input and output classes --
    // but it is unnecessary when annotated, and incompatible with schema-based systems.

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
{code}

There are some things missing above.
For frameworks, configuration by method call isn't as big a deal, but for hand-written classes it is good to keep the generic type declaration and the key/value class/type/schema declaration in the same place -- JobConf won't tell you at compile time that you have misaligned the types in your Map.  And generic types aren't available at runtime, which increases the number of times the same information must be set.

Annotations can only hold primitive types, Class, Enum, and String, as well as arrays of these, so configuring a Schema becomes more difficult.
Schemas can only be stored in annotations in String form, which may be too verbose -- I have one Schema that is over 4K in its Avro JSON form.

Anyhow, it's just an idea for contrast.  The debate is about how to handle declarative configuration in Java, so it seems relevant to bring up annotations.
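For concreteness, here is a minimal, self-contained sketch of the reflection machinery such a framework would need (the annotation and its elements are invented for illustration, not part of any real API): the job-submission path reads the declaration off the user's class instead of requiring setter calls, and a schema has to be smuggled in as a String because annotations admit nothing richer.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical declarative config.  Annotation elements may only be
// primitives, Class, Enum, String, or arrays of these.
@Retention(RetentionPolicy.RUNTIME)
@interface Output {
  Class<?> format();
  String schema() default "";  // a large Avro schema must be inlined as a String
}

class TextOutputFormat {}  // stand-in for the real output format class

@Output(format = TextOutputFormat.class,
        schema = "{\"type\":\"string\"}")
class MyReducer {}

public class AnnotationConfigSketch {
  public static void main(String[] args) {
    // What a framework's submission path would do instead of
    // conf.setOutputFormat(...): read the declaration off the class.
    Output out = MyReducer.class.getAnnotation(Output.class);
    System.out.println(out.format().getSimpleName());
    System.out.println(out.schema());
  }
}
```

This also makes the String-only limitation above tangible: the schema survives only as inline JSON text, which is workable for small schemas and painful for 4K ones.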



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806048#action_12806048 ] 

Allen Wittenauer commented on MAPREDUCE-1126:
---------------------------------------------

Perhaps a dumb question, and I recognize I'm totally out of my scope of knowledge (never stopped me before *smile*), but... what happens if my serialization code is not written in Java and I have to use JNI to get to it?



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Jay Booth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805155#action_12805155 ] 

Jay Booth commented on MAPREDUCE-1126:
--------------------------------------

Well, ok, guess I'm in the minority on that instinct :)  

Should we still consider separating the concerns of key/value class type and key/value serialization method?  They're sort of linked but sort of separate; for example, in a unit-testing scenario you might want the classes regardless of the serialization method.  Too much setter spam then?





[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Jay Booth (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829251#action_12829251 ] 

Jay Booth commented on MAPREDUCE-1126:
--------------------------------------

+1 for the general concept of a lower-level API, great idea

Any thoughts on explicitly setting a Mapper per split?  Joins between different formats are a pretty primary use case, and it's always awkward using MultipleInputs to shoehorn the different classes into a single conf.  As I understand it now, with MultipleInputs, the MapTask wakes up, looks at its input split, compares that to a magic configuration field mapping splits to mapper classes, and instantiates that mapper class.  This leads to trouble if you need to mix it with, say, CombineFileInputFormat or anything else that relies on configuration, since the various static setConfigValue(conf) methods set a single value assuming a single mapper class.

If we set a specific mapper class per split, and then a specific config per mapper class, I think it would be a lot more flexible for framework authors to shoehorn in different kinds of functionality.  If you're just a user, maybe you don't want the extra setup for simple jobs, but since this is a lower-level API, maybe it could be useful?  It would certainly be cleaner if a single-input job were just an N=1 multiple-inputs job, rather than the current situation, where a multiple-inputs job is a configuration-level hack on top of the single-input framework.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829365#action_12829365 ] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

I've created MAPREDUCE-1452 to discuss details about a new low-level API.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797280#action_12797280 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

This is looking great.  A few minor nits, and then I think we can commit it:
 - the patch is stale and does not apply cleanly
 - should ReduceTask#getClassFromMetadata be in ClassBasedJobData?
 - should we file another issue to fix the skipping feature to be serialization-savvy?




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801022#action_12801022 ] 

Aaron Kimball commented on MAPREDUCE-1126:
------------------------------------------

Do you think that it's okay to add methods like {{Job.setMapOutputKeySchema()}} then? In the limit, if another serialization framework (e.g., Hessian or Protocol Buffers) makes its way into common use in Hadoop, we would then need to add a mechanism to set its serialization-system-specific metadata on {{Job}} as well. We factored out InputFormat/OutputFormat-specific getters and setters (cf. {{FileInputFormat.addInputPath()}}) a while back and deprecated {{JobConf.addInputPath()}}; this seems like the logical next step.

Furthermore, what specific framework dependencies are you referring to? The jobdata package provides getters and setters that allow users to configure serialization system-specific metadata keys and values, but they are put into well-defined "system wide" metadata locations (e.g. {{JobContext.MAP_OUTPUT_KEY_METADATA}}) in the Configuration itself. The SerializerBase/DeserializerBase classes are instantiated in JobConf without touching the {{jobdata}} package at all (they rely only on the system-wide Configuration names).

The only dependency on {{jobdata}} classes in Job/JobContext/JobConf is to push down the now-deprecated getter/setter methods that users would call in legacy code; the framework no longer makes any calls to {{JobConf.getMapOutputKeyClass()}}. It instead calls {{JobConf.getMapOutputKeySerializer()}} and {{JobConf.getMapOutputKeyDeserializer()}} directly.
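The metadata-driven lookup described above can be sketched in miniature (a toy model -- the real classes live in org.apache.hadoop.io.serializer, and the metadata key names here are illustrative): each registered serialization inspects a Map<String,String> of well-known keys and accepts or declines, so class-based and schema-based serializations coexist behind one factory.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of metadata-driven serialization selection.
interface SerializationBase {
  boolean accept(Map<String, String> metadata);
}

class WritableSerialization implements SerializationBase {
  public boolean accept(Map<String, String> md) {
    return md.containsKey("serialization.class");  // class-based metadata
  }
}

class AvroSerialization implements SerializationBase {
  public boolean accept(Map<String, String> md) {
    return md.containsKey("avro.schema");  // schema-based, no Java class needed
  }
}

public class FactorySketch {
  public static void main(String[] args) {
    List<SerializationBase> registered =
        Arrays.asList(new WritableSerialization(), new AvroSerialization());

    // Roughly what a getter like getMapOutputKeySerializer() would assemble
    // from the well-known map-output-key metadata keys in the Configuration.
    Map<String, String> md = new HashMap<>();
    md.put("avro.schema", "{\"type\":\"long\"}");

    for (SerializationBase s : registered) {
      if (s.accept(md)) {
        System.out.println(s.getClass().getSimpleName());
        break;
      }
    }
  }
}
```

The point of the design is visible in the loop: nothing in the framework-side lookup touches a key *class*; the metadata map is the whole contract.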



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797851#action_12797851 ] 

Hadoop QA commented on MAPREDUCE-1126:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12429676/MAPREDUCE-1126.5.patch
  against trunk revision 896781.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 2342 javac compiler warnings (more than the trunk's current 2330 warnings).

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/253/console

This message is automatically generated.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830191#action_12830191 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Owen, your patch uses a deprecated means to create a serialization that does not work for all serializations.  As such, it is not a complete patch for this issue.  The intended way to use the serialization API is for the application to use Map<String,String> metadata.  This was established by HADOOP-6165 and was the rationale for HADOOP-6420.  The method SerializationBase.getMetadataFromClass should not be called by the framework, but is a utility for class-based serialization implementations.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805289#action_12805289 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

For union types, it is a trivial restriction to limit map/reduce users to wrapping records around the unions:

record(x: union(int, string))





[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805861#action_12805861 ] 

Chris Douglas commented on MAPREDUCE-1126:
------------------------------------------

bq. Why are proposals now focusing on allowing users to specify different serialization factories?

Yes, you're right; I mixed up the naming. {{s/SerializationFactory/SerializationBase/g}}. The setter formal with a vararg is still attractive, {{setSerialization(ctxt, SerializationBase...)}}, so the semantics are the same for configuring the {{SerializationFactory}} whether there are one or many. Setting it with one is equivalent to setting the base w/o a {{SerializationFactory}}, but it keeps the {{accept}} check. The purpose is to configure a context-specific {{SerializationFactory}}; I'm liking Arun's suggestion of limiting these hooks to record contexts, rather than all the user types.

bq. [...] it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice.

As Arun points out, framework authors are uncommon, even if their users are not. The possible performance improvements from open hooks in the framework are compelling (if most are theoretical at the moment), but the user-facing API for MapReduce shouldn't be the Job's serialization. A more abstract collector may be an interesting extension/interface when the use cases are better understood.
----
@Tom
The much-simplified patch is a considerable improvement. Losing the {{\*JobData}} classes effected a cleaner integration. Thanks for this.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated MAPREDUCE-1126:
-------------------------------------

    Attachment: m-1126-2.patch

Here is the 1.5k subset of Tom's patch that addresses the issue of this jira. I propose we commit some variant of it and move the discussion of generalizing serialization over to a new jira MAPREDUCE-1462.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Attachment: MAPREDUCE-1126.3.patch

Per concerns raised in HADOOP-6420, jobdata metadata map setters now set a flag indicating that they were used.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Attachment: MAPREDUCE-1126.5.patch

New patch, resynced with trunk.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831039#action_12831039 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> I was trying to find the subset of the patch that would address the issue of "shuffle should use serialization to get comparator"

The subject of the issue was meant to be interpreted as "any serialization" not just "some serializations".  If the key class were sufficient in all cases to determine the serialization then this issue would indeed be a lot simpler, but it's not.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805663#action_12805663 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

{quote}
For scalar types Pig uses Java String, Long, Integer, etc. But default Java serialization is slow
{quote}

I think the default configuration should use a WritableSerializer for Writables and an AvroSerializer for everything else. Java serialization was a great experiment, but it was never performant for serious use. So the question is not whether you want different serializers, but rather whether a job needs different serializers for the same class.
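A rough sketch of that default dispatch, for illustration only: the {{Writable}} marker interface is declared locally so the example is self-contained (in Hadoop it is {{org.apache.hadoop.io.Writable}}), and the serializer names returned here are just labels, not real class names.

```java
// Hypothetical sketch: choose a serialization by type, defaulting
// Writables to Writable serialization and everything else to Avro.
public class DefaultDispatchSketch {
    interface Writable {}  // local stand-in for org.apache.hadoop.io.Writable

    static String chooseSerialization(Class<?> c) {
        // Writable types keep Writable serialization; all other types
        // fall through to (reflect-based) Avro serialization.
        return Writable.class.isAssignableFrom(c) ? "writable" : "avro-reflect";
    }

    static class MyKey implements Writable {}

    public static void main(String[] args) {
        System.out.println(chooseSerialization(MyKey.class));   // writable
        System.out.println(chooseSerialization(String.class));  // avro-reflect
    }
}
```

Under this scheme the per-class question Owen raises remains: a type-keyed lookup cannot express two different serializers for the same class within one job.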



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787799#action_12787799 ] 

Aaron Kimball commented on MAPREDUCE-1126:
------------------------------------------

To allow Avro-based comparators during the shuffle, a few changes need to take place. Here's my survey of the scope of the problem and proposal for addressing it.

HADOOP-6323 allows {{SerializationBase}} to provide {{RawComparator}} instances. But for Avro to be germane here, the {{Serializer}} and {{Deserializer}} instances used by {{MapOutputBuffer}} and other shuffle code need to be updated as well. {{MapOutputBuffer}} and the other places where the {{Serializer}} and {{Deserializer}} interfaces are used should be changed to instead use {{SerializerBase}} and {{DeserializerBase}}. This should be done by changing the current calls to {{SerializationBase.getSerializer(keyClassName)}} to use the newer {{getSerializer(metadata)}} method.

Currently the only metadata we store is the {{Serialization-Class}} that identifies the intermediate key and value classes. This is implicitly set in, e.g., {{Job.setMapOutputKeyClass()}}. (It actually just stores a class name, and the {{getSerializer(className)}} method converts that to the appropriate metadata map for {{getSerializer(metadata)}}.) HADOOP-6165 allows an arbitrary map of metadata to be passed to the serialization framework. We should take advantage of this map throughout the shuffling process. The job's {{Configuration}} should be updated to embed a map a la the outline in HADOOP-6165 (see Tom's comment, part 4(ii)) to include a "family" of configuration parameters. Given the renaming of various parameters, I propose to embed the metadata maps in {{mapreduce.map.output.key.metadata}} and {{mapreduce.map.output.value.metadata}}.
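A minimal sketch of how such a metadata "family" could be flattened into configuration keys under a prefix. The prefix and both helper methods are illustrative, not a committed Hadoop API; a plain {{Map}} stands in for {{Configuration}} so the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of embedding a string->string metadata map as a
// "family" of configuration parameters, per the HADOOP-6165 outline.
public class MetadataMapSketch {
    // Store each metadata entry as <prefix>.<key> = <value>.
    static void setMetadata(Map<String, String> conf, String prefix,
                            Map<String, String> metadata) {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            conf.put(prefix + "." + e.getKey(), e.getValue());
        }
    }

    // Recover the metadata map by scanning for keys under the prefix.
    static Map<String, String> getMetadata(Map<String, String> conf,
                                           String prefix) {
        Map<String, String> metadata = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix + ".")) {
                metadata.put(e.getKey().substring(prefix.length() + 1),
                             e.getValue());
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        Map<String, String> md = new HashMap<>();
        md.put("Serialization-Class", "org.apache.hadoop.io.LongWritable");
        setMetadata(conf, "mapreduce.map.output.key.metadata", md);
        System.out.println(
            getMetadata(conf, "mapreduce.map.output.key.metadata")
                .get("Serialization-Class"));
    }
}
```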

The {{JobContext.setMapOutputKeyClass()}} method would, instead of setting {{mapreduce.map.output.key.class}}, set the appropriate metadata strings under {{mapreduce.map.output.key.metadata}} or {{mapreduce.map.output.value.metadata}} to use the {{WritableSerialization}} implementation. If this metadata is not present, metadata could be inferred from {{mapreduce.map.output.key/value.class}}, which should be marked as deprecated.

{{JobContext}} would also include new methods which could be used instead of {{setMapOutputKeyClass()}}, e.g. {{setMapOutputKeySchema()}} which would accept an Avro schema and configure the shuffle to use AvroGenericSerialization for this data. A generic {{setMapOutputKeySerializationMetadata()}} could be added as well, which would accept any {{string -> string}} map of metadata for use with other arbitrary {{SerializationBase}} implementations. (It might be used internally by {{setMapOutputKeyClass()}} and {{setMapOutputKeySchema()}}.)

On the internal side, {{JobContext}} already provides a method {{getSortComparator()}} which provides the appropriate {{RawComparator}} implementation. But it does not directly provide {{SerializerBase}} or {{DeserializerBase}} instances; it instead only returns the key and value class names and clients of this (e.g., {{MapOutputBuffer}}) are expected to instantiate the correct serialization objects themselves. I think that {{JobContext}} should provide {{getMapOutputKeySerializer()}} and {{getMapOutputKeyDeserializer()}} methods that correctly handle either the use of {{mapreduce.map.output.key.metadata}} or {{mapreduce.map.output.key.class}} to avoid duplication of this logic.

All {{SerializerBase}}, {{DeserializerBase}}, and {{RawComparator}} implementations instantiated during shuffle would then be created via uses of this new API.

Longer term, metadata maps are necessary for reducer output as well; e.g., {{mapreduce.reduce.output.key.metadata}}. This would allow full end-to-end Avro use for user data. Similar methods for JobContext would need to be added to set these parameters, and {{Job.setOutputKeyClass()}} et al. would need to be updated to use this new API as well. But I think that this may be more reasonably part of MAPREDUCE-815 than the current issue, which is concerned with just the shuffle process.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805618#action_12805618 ] 

Alan Gates commented on MAPREDUCE-1126:
---------------------------------------

bq. [From Owen] My assertion is that leaving the type as the primary instrument of the user in defining the job is correct. I haven't talked to any users that care about using a non-default serializer for a given type.

Pig would like to.  For scalar types Pig uses Java String, Long, Integer, etc.  But default Java serialization is slow.  So currently we convert these to and from Writables as we go across the Map and Reduce boundaries to get the faster Writable serialization.  If we could instead define an alternate serializer and avoid these conversions it would make our code simpler and should perform better.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805732#action_12805732 ] 

Jeff Hammerbacher commented on MAPREDUCE-1126:
----------------------------------------------

bq. Especially for frameworks written on top of MapReduce, less restrictive interfaces here would surely be fertile ground for performance improvements.

bq. Writing wrappers can be irritating, but for the MR API, I'd rather make it easier on common cases and users than on advanced uses and framework authors.

Great points, Chris. Yahoo! has stated that a significant majority of their MapReduce jobs are written in Pig, and Facebook says the same of Hive. Among our many customers at Cloudera, it's far more common to target the MapReduce execution engine with a higher level language rather than the Java API. What you propose as the common case, then, appears to be uncommon in practice. Perhaps we should adjust our design criteria to match the usage data reported by the users of the project?

Thanks,
Jeff



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805779#action_12805779 ] 

Aaron Kimball commented on MAPREDUCE-1126:
------------------------------------------

Why are proposals now focusing on allowing users to specify different serialization *factories?*

If allowing users to specify the use of a particular {{SerializationBase}} via a flexible metadata map is considered too obscure, then I feel like the notion of having separate {{SerializationFactory}} instances seems to be an unnecessary level of abstraction. The current {{SerializationFactory}} implemented in hadoop-common allows access to all {{SerializationBase}} instances. If the focus is on user-accessibility of API, asking users to define a SerializationFactory which will only produce a single {{SerializationBase}} (e.g., {{WritableSerialization}} or {{AvroGenericSerialization}}) requires needless one-off code, and clutters the class hierarchy.

Instead, I might understand adding an API such as {{Job.setSerializationBase(ctxt, serializationBase)}}, where users directly set the {{SerializationBase}} instance to use in a given context, and disregard the {{SerializationFactory}} entirely.

For what it's worth, the patches that I and Tom have produced all make use of the default {{SerializationFactory}} in Hadoop; this API then uses the metadata map as defined in HADOOP-6165 to acquire the user's desired {{SerializationBase}} instance as appropriate for each of map output key, value, etc.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830222#action_12830222 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> Doug, neither this or my previous patch use any deprecated APIs.

They use a method (SerializationBase.getMetadataFromClass) that's not intended as a primary, serialization-independent means for the framework to create a serialization.  That's a utility intended to be shared by serialization systems that use a class to create their metadata.  The only other calls to it are from deprecated, back-compatibility methods and contexts where serialization is class-based.  It *should* have been deprecated.  New framework code should not use this method as a primary means to create a serialization, as it does not support all serializations.




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805266#action_12805266 ] 

Tom White commented on MAPREDUCE-1126:
--------------------------------------

I'm working on a modified patch which hopefully will find a middle ground. It avoids any change to the existing API (so existing jobs will run without change or deprecation warnings), and it permits generic Avro types in the shuffle.

That said, I'm also interested in seeing your prototype Owen.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802564#action_12802564 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> That is a really ugly change to the user's application.

It is moving something from kernel to library code, using the encouraged pattern, to make the kernel data-representation agnostic.  Avro applications will be on equal footing with Writable applications.  They'll do something like AvroJobData.setMapInputSchema(job, mySchema).  With generic data, the base class isn't germane, but the schema is.  Should we instead add that method directly to Job, then do the same for Thrift, Protobuf, Hessian, Java Serialization, etc?

This is analogous to the change from job.addInputPath() to FileInputFormat.addInputPath().

These methods could be a bit more friendly, accepting the job directly rather than requiring folks to call job.getConfiguration().  Perhaps Aaron can provide a revised version that does this?

> We used to infer the serialization from the types. That was a clean model. Why are you trying to change it?

The model was changed a few months back, in HADOOP-6165.  A class does not uniquely identify a serialization.  A class might implement Writable, but one might wish to use Avro Reflection to serialize it.  Avro also works for built-in types like Long, null, etc. directly, as do other serialization systems.  Moreover, with union types (as, e.g., supported by Avro, Thrift and perhaps others) there is no single class that corresponds to every input or output datum.

There could, for example, be an Avro-based inputformat and outputformat that represents all data types as byte[], containing the Avro binary data.  Since Avro comparators operate efficiently on binary data, this would perhaps be an efficient way to use MapReduce as a sorting engine.
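For illustration, the kind of byte-level comparison underlying such sorting can be sketched as follows. The signature mirrors Hadoop's {{RawComparator.compare(byte[], int, int, byte[], int, int)}}, but the class is self-contained and the plain lexicographic ordering shown is only correct when the binary encoding is order-preserving; Avro's comparators instead walk the bytes schema-aware.

```java
// Sketch of comparing records by their serialized bytes without
// deserializing them, in the shape of Hadoop's RawComparator.
public class RawBytesCompareSketch {
    static int compareBytes(byte[] b1, int s1, int l1,
                            byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;  // compare as unsigned bytes
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;  // a strict prefix sorts first
    }

    public static void main(String[] args) {
        byte[] x = {0x01, 0x02};
        byte[] y = {0x01, 0x03};
        System.out.println(compareBytes(x, 0, 2, y, 0, 2) < 0);  // true
    }
}
```

Because no objects are materialized, a comparator of this shape lets the sort run directly over the serialized map output buffer.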

> Additionally, you have the framework depending on a library. That is a problem.

That's only for back-compatibility: what used to be the default is now in a library.  The only calls to library code are in deprecated Job setter methods, invoked by user code.  The framework never calls these setters.

If this is a problem, then we could duplicate the code from WritableJobData in these deprecated methods.  That's what was done when setInputPath() was deprecated.  Is that preferred?



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831013#action_12831013 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> Tom's patch adds two calls to that exact method.

One call is in a test method called getWritableSerializationMap, clearly serialization-specific code.  The other is in reading input splits, which has not been converted to use the new serialization API, and still uses the split's class alone to determine serialization.

A primary goal of this issue is to update the shuffle to use the new Map<String,String> means of naming serializations, so that any serialization may be used, not just those determinable from the class of the instance to be serialized.  Your patch uses the map output key class alone to determine the serialization, and hence does not meet this goal.




[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Attachment: MAPREDUCE-1126.patch

Attaching a patch that makes this work. This depends on HADOOP-6438 which adds the configuration getters/setters discussed earlier.

This patch adds the following new files:

{code}
svn add src/test/mapred/org/apache/hadoop/mapreduce/TestAvroSerialization.java
svn add src/test/mapred/org/apache/hadoop/mapreduce/avro/key.avsc
svn add src/test/mapred/org/apache/hadoop/mapreduce/avro/val.avsc
{code}

Tested by running the MapReduce unit tests; added new unit tests that use the Avro serialization systems (generic, specific, and reflect) for intermediate data types. Booted a pseudo-distributed cluster and ran some example jobs, which work.

Will mark as patch-available after the dependencies are all met.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801034#action_12801034 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> Also deprecating the {get,set}MapOutput{Key,Value}Class methods seems unnecessary.

The preferred style is to keep setters and getters that are not directly used by the framework to library code.  These methods are specific to certain serialization systems, mostly Writable.  The properties these methods set are no longer read by the framework itself, but only by library code.  Thus their setters also belong in library code, not in Job.





[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated MAPREDUCE-1126:
------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.22.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Aaron!



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806136#action_12806136 ] 

Chris Douglas commented on MAPREDUCE-1126:
------------------------------------------

bq. "1) throwing away all Java type hierarchies". Only sometimes, no? This is only in the case where you explicitly want to do unions (and Java's union support is either Object, type hierarchies, or wrappers). In the typical case, your map functions on SomeSpecificRecordType, outputs SomeSpecificMapOutputKey/ValueType, and so forth. You still get type safety in many of the recommended use cases.

Sure, but this doesn't cite the negative side. The patch changes the collection from a 1:1 class match- more restricted than the Java types- to a model unrelated to the declared types. So if a job accepts {{Short}}, {{Integer}}, and {{Long}} it may declare its type as {{Numeric}} but- again, depending on the serialization details- may reject (or fail to reject) {{Double}} and {{Float}}. So instead of being forced to declare a union type, whether this is reasonable is decided between the serialization and the user. This is a contrived example, but one can easily imagine other scenarios where an accepted subset of the supertypes is not only unenforced by the framework, but unenforceable. The even more interesting case is when one has a type hierarchy the serializer cares about that isn't expressed in Java types (e.g., valid keys contain a field named "dingo" whose subtype is Yak). The proposed model doesn't make it impossible to write type-checked code, but it does make it easier to write code that isn't (which, again: great for frameworks, arguably not as good for users). As I said earlier, it's a powerful, but dramatic shift from the current model that should be carefully considered.

bq. what happens if my serialization code is not written in Java and I have to use JNI to get to it?

I don't think any model yet proposed would make this harder than it is today.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802916#action_12802916 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

Not really. I see now what you are trying to accomplish, but I think it is the wrong model. While the FileInputFormat is similar in structure, the issues are:
  1. the analogy isn't precise because you aren't 
  2. the users care about the types that come out of their map
  3. the users aren't likely to change the serialization format from the default for the types they are using
  4. we end up with 3 configuration knobs per class instead of the current 1. There are 6 classes in the MR pipeline, which means before this is done we have 18 methods to set the class/schema/serializer for the various types. That is *way* too complicated.

I thought that you were going to create a marker interface for AvroRecord that has a getSchema method. That would have enabled the AvroWriter to get the schema from the types rather than get the types from the schema.

I also think that removing the type checks from the collector and ifile code is a bad plan and will allow a lot of errors to reach much further into the system.

Let's consider the proposal that Arun has been discussing. Instead of doing:

{noformat}
   FileInputFormat.setInputPath(job, new Path("/foo"));
   job.setInputFormatClass(TextInputFormat.class);
{noformat}

you do:

{noformat}
  TextInputFormat input = new TextInputFormat();
  input.setInputPath(new Path("/foo"));
  job.setInputFormat(input);
{noformat}

Clearly the job needs to serialize the InputFormat object and reconstruct it on the other side. This is much, much easier for users to understand than the current model and can probably be done in a backwards-compatible manner. Notice that this adds another 5 types that we are going to want to serialize (InputFormat, Mapper, Combiner, Reducer, OutputFormat) per job. With the current proposal that means that we get 11 * 3 = 33 serialization methods.

I think that:
  * we need to use the global serialization/deserialization factory that we already have.
  * moving the {set,get}MapOutput{Key,Value}Class methods is a non-starter. As a general rule, if you need to modify all of the examples, we need to carefully discuss the issues.
  * the metadata should not be user visible and it would be better if it was just used to communicate within the serializer. Why is the metadata a Map? I'd rather have it be an opaque blob that is serializable.
  * we can debate whether the type restrictions on map outputs should be loosened, but certainly we need to check on the map side whether the type the map is outputting is correct. If you are going to loosen it, the class methods should become deprecated and vestigial, and you need to support union types in Writable too.
  * I'm not wild about having Configuration.setMap. Having a function in StringUtils that converts a Map<String,String> to and from a String seems more appropriate.
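The StringUtils-style helper suggested in the last bullet could look like the sketch below. This is a hypothetical round-trip codec, not Hadoop's actual API: the class name and the percent-escaping scheme are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: encode a Map<String,String> as a single String and
// back, escaping the delimiter characters so arbitrary keys/values survive.
public class MapStringCodecSketch {
  private static String esc(String s) {
    return s.replace("%", "%25").replace("=", "%3D").replace(",", "%2C");
  }
  private static String unesc(String s) {
    // Undo the escapes in reverse order of esc().
    return s.replace("%2C", ",").replace("%3D", "=").replace("%25", "%");
  }

  public static String toString(Map<String, String> m) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : m.entrySet()) {
      if (sb.length() > 0) sb.append(',');
      sb.append(esc(e.getKey())).append('=').append(esc(e.getValue()));
    }
    return sb.toString();
  }

  public static Map<String, String> fromString(String s) {
    Map<String, String> m = new LinkedHashMap<>();
    if (s.isEmpty()) return m;
    for (String pair : s.split(",")) {
      int i = pair.indexOf('=');
      m.put(unesc(pair.substring(0, i)), unesc(pair.substring(i + 1)));
    }
    return m;
  }
}
```

A free function like this keeps Configuration's surface small while still letting library code persist structured metadata in a single config value.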




[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869425#action_12869425 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

If we elect to abandon MAPREDUCE-815 in favor of AVRO-493, and since all of the underpinnings of this issue have been reverted, perhaps we should now close this as "won't fix"?



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805709#action_12805709 ] 

Chris Douglas commented on MAPREDUCE-1126:
------------------------------------------

Replacing the type driven serialization with an explicitly specified, context-sensitive factory is 1) throwing away all Java type hierarchies, 2) asserting that the serialization defines the user types, and 3) implying that these types- and relationships between them- should remain opaque to the MapReduce framework.

It's making a tradeoff discussed in HADOOP-6323: all the type checks are removed from the framework, but enforced by the serializer. So {{WritableSerialization}}- appropriately- requires an exact match for the configured class, but other serializers may not. The MapReduce framework can't do any checks of its own- neither, notably, may Java- to verify properties of the types users supply; their semantics are _defined by_ the serialization. For example, a job using related {{Writable}} types may pass a compile-time type check, work with explicit Avro serialization in the intermediate data, but fail if it were run with implicit Writable serialization.
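The acceptance-check tradeoff described above can be shown in miniature. Both method names below are invented for this sketch; they contrast a WritableSerialization-style exact-class check with a looser, serialization-defined check that admits any Java subtype.

```java
// Toy illustration of the two acceptance policies discussed above.
public class AcceptanceSketch {
  // WritableSerialization-style: only the exact configured class is accepted.
  public static boolean exactMatchAccepts(Class<?> configured, Class<?> candidate) {
    return configured == candidate;
  }

  // A looser serializer might accept any subtype of the configured class,
  // or apply rules Java cannot express at all.
  public static boolean subtypeAccepts(Class<?> configured, Class<?> candidate) {
    return configured.isAssignableFrom(candidate);
  }
}
```

The framework cannot tell which policy (or some third, schema-driven one) a given serialization applies; that is exactly the shift in responsibility at issue.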

This is a *huge* shift. It means the generic, Java types for the Mapper, Reducer, collector etc. literally don't matter; they're effectively all {{Object}} (relying on autoboxing to collect primitive types). This means that every serialization has its own type semantics which need not look anything like what Java can enforce, inspect, or interpret. Given this, that the patch puts the serialization as the most prominent interface to MapReduce is not entirely surprising.

It's also powerful functionality. By allowing any user type to be serialized/deserialized per context, the long-term elimination of the key/value distinction doesn't change {{collect(K,V)}} to {{collect(Object)}} as proposed, but rather {{collect(Object...)}}: the serializer transforms the record into bytes, and the comparator works on that byte range, determining which bytes are relevant per the serialization contract. Especially for frameworks written on top of MapReduce, less restrictive interfaces here would surely be fertile ground for performance improvements.

That said: I hate this API for users. Someone writing a MapReduce job is writing a transform of data; how these data are encoded in different contexts is usually irrelevant to their task. Forcing the user to pick a serialization to declare their types to- rather than offering their types to MapReduce- is backwards for the vast majority of cases. Consider the Writable subtype example above: one is tying the correctness of the {{Mapper}} to the intermediate serialization declared in the submitter code, whose semantics are inscrutable. That's just odd.

If one's map is going to emit data without a common type, then doesn't it make sense to declare that instead of leaving the signature as {{Object}}? That is, particularly given MAPREDUCE-1411, wouldn't the equivalent of {{Mapper<Text,Text,Text,AvroRecord>}} be a more apt signature than {{Mapper<Text,Text,Text,Object>}} for an implementation emitting {{int}} and {{String}} as value types?

I much prefer the semantics of the global serializer, but wouldn't object to adding an inconspicuous knob in support of context-sensitive serialization. Would a {{Job::setSerializationFactory(CTXT, SerializationFactory...)}} method, such that {{CTXT}} is an enumerated type of framework-hooks (i.e. {{DEFAULT}}, {{MAP_OUTPUT_KEY}}, {{MAP_OUTPUT_VALUE}}, etc.) be satisfactory? This way, one can instruct the framework to use/prefer a particular serialization in one context without requiring most users to change their jobs. It also permits continued use of largely type-based serialization which- as Tom notes- is a very common case. Writing wrappers can be irritating, but for the MR API, I'd rather make it easier on common cases and users than on advanced uses and framework authors.
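The per-context lookup implied by a {{Job::setSerializationFactory(CTXT, ...)}} method could work roughly as sketched below. The enum values and the fallback-to-DEFAULT rule are assumptions for illustration; factories are represented as class-name strings to keep the sketch self-contained.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: per-context serialization factories with a DEFAULT
// fallback, so most jobs configure nothing beyond the global default.
public class ContextSerializationSketch {
  public enum Context { DEFAULT, MAP_OUTPUT_KEY, MAP_OUTPUT_VALUE }

  private final Map<Context, String> factories = new EnumMap<>(Context.class);

  public void setSerializationFactory(Context ctxt, String factoryClassName) {
    factories.put(ctxt, factoryClassName);
  }

  // A context falls back to DEFAULT unless explicitly overridden.
  public String getSerializationFactory(Context ctxt) {
    return factories.getOrDefault(ctxt, factories.get(Context.DEFAULT));
  }
}
```

Under this shape, a job that wants Avro only for its intermediate keys overrides one context and leaves everything else on the global default.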



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805283#action_12805283 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

> Should we still consider separating the concerns of key/value class type and key/value serialization method?

Jay, with union types there is no single class save Object that all keys and values in a job must share.  So the key/value classes for some jobs are not very useful or interesting.  Consider, e.g., Pig's data model, where things are either a bag, tuple, or a built-in type (int, long, String, float, double, etc.).  For tests, a job that uses Avro would be better off validating the data against the schema than checking whether a particular class is used.



[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated MAPREDUCE-1126:
-------------------------------------

    Attachment: MAPREDUCE-1126.4.patch

New patch synced with trunk.

bq. should ReduceTask#getClassFromMetadata be in ClassBasedJobData?

There is no such literally-named method. The output types are queried with {{Job.getOutputKeyClass()}} and {{getOutputValueClass()}}. Eventually, these should migrate to ClassBasedJobData (e.g., in MAPREDUCE-815). But the scope of this issue is map-output types only.

bq. should we file another issue to fix the skipping feature to be serialization-savvy?

Yes. This requires establishing a well-defined embedding of serialization metadata for key and value types/schemas in the SequenceFile metadata block.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805625#action_12805625 ] 

Ted Dunning commented on MAPREDUCE-1126:
----------------------------------------

{quote}
   >  [From Owen] My assertion is that leaving the type as the primary instrument of the user in defining the job is correct. 
   > I haven't talked to any users that care about using a non-default serializer for a given type.

Pig would like to. For scalar types Pig uses Java String, Long, Integer, etc. But default Java serialization is slow. So currently we convert these to and from Writables as we go across the Map and Reduce boundaries to get the faster Writable serialization. If we could instead define an alternate serializer and avoid these conversions it would make our code simpler and should perform better.
{quote}

I would like to.  I would like to start using Avro for greater expressive power as soon as possible.  I also can't change all of my legacy code right away so I will have some code that implements both Writable and Avro serialization.  I need to be able to use writable for old code and Avro for new code.





[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829680#action_12829680 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Since Arun closed MAPREDUCE-1452 as a duplicate of MAPREDUCE-326, I've moved discussion of a new low-level API to MAPREDUCE-326.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12806042#action_12806042 ] 

Philip Zeyliger commented on MAPREDUCE-1126:
--------------------------------------------

@Scott: the annotations for Input/OutputFormat seem to be misplaced.  It seems desirable to be able to write a single Map function that does wordcount on Strings, regardless of whether those strings are stored in newline-delimited text, sequence files, avro data files, or whatever.  

@Chris: "1) throwing away all Java type hierarchies".  Only sometimes, no?  This is only in the case where you explicitly want to do unions (and Java's union support is either Object, type hierarchies, or wrappers).  In the typical case, your map functions on SomeSpecificRecordType, outputs SomeSpecificMapOutputKey/ValueType, and so forth.  You still get type safety in many of the recommended use cases.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805095#action_12805095 ] 

Doug Cutting commented on MAPREDUCE-1126:
-----------------------------------------

Owen, I am sorry you did not realize earlier the magnitude of the
changes implied by this issue.  As Aaron started work on it it became
apparent that the changes required were substantial.  A number of
people were following the issue as it progressed, and I assumed that
it had good oversight.

> 1. We should use the current global serializer factory for *all* contexts of a job.

I don't see that as an acceptable solution.  Job inputs and outputs
can reasonably differ in their serialization methods.  Specifying
seven different serializations can be done in seven lines of code that
replace the seven lines of code we currently use to specify these as
classes.  There are now more possibilties for what those seven lines
can contain, but I don't see this as a huge increase in complexity for
end users.

WritableData.setMapOutputKeyClass(job, class) is not much more
complicated than the current job.setMapOutputKeyClass(class).  The
natural way for an Avro generic data job to specify its map output key
is AvroGenericData.setMapOutputKeySchema(job, schema).  In Avro, a
schema specifies classes, binary format, and sort order.  There is in
general no single class that represents all of these aspects of a
schema.  Different serialization systems can reasonably differ in what
kind of metadata they require.
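The point that different serialization systems need different metadata can be sketched as follows. Everything here is hypothetical: the config key names are invented, and a plain Map stands in for the job Configuration. A Writable-style helper records a class name; an Avro-style helper records a schema.

```java
import java.util.Map;

// Hypothetical sketch: each serialization's library helper stores whatever
// metadata it needs under its own configuration keys.
public class MetadataSketch {
  static final String KEY_CLASS  = "sketch.map.output.key.class";
  static final String KEY_SCHEMA = "sketch.map.output.key.schema";

  // Writable-style: a class fully determines the serialization.
  public static void setMapOutputKeyClass(Map<String, String> conf, Class<?> c) {
    conf.put(KEY_CLASS, c.getName());
  }

  // Avro-style: a schema (JSON text), for which no single class may exist.
  public static void setMapOutputKeySchema(Map<String, String> conf, String schemaJson) {
    conf.put(KEY_SCHEMA, schemaJson);
  }
}
```

Either way the framework stores an opaque string; only the matching serializer interprets it at task time.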

(A mapreduce simplification that might make sense long-term is to
eliminate the key/value distinction.  Each map input could be a single
object, each map output could be a single comparable object, and each
reduce output a single object.  This would eliminate three settings
per job, and I can think of no use cases where keys and values might
use different serialization systems.)

> 2. Defining the schema should be an Avro specific function.

It is Avro-specific in the current patch, so we agree on this point.

> 3. I don't see any reason to support union types at the top-level of
  the shuffle.

We should not force folks to define wrapper classes not required or
generated by their serialization system just to pass data through
mapreduce.  The Java class namespace is a poor mechanism to represent
mappings between all Java objects and all their possible serialized
forms.  Serialization is not completely determined by a class, and a
class does not completely determine a serialization.  Java Strings,
longs, and floats can be serialized in different ways.  A given job
might take String data from a file using one serialization system, map
it to a union type using another serialization system that provides
efficient, structured binary comparison, then write the final output
to a database as String using yet another serialization system.  Why
should we require folks to define wrapper classes to achieve this?

> 4. Only the default comparator should come from the serializer.

That would make sense if we only permit a single, global serializer.
If however the shuffle has its own serializer, then it could be done
in either place: the job could define a shuffle comparator, or it
could use the comparator from the shuffle's serializer.  In either
case, users should be able to override the comparator.  Since
comparators are a part of the serialization API, it seems better
modularization to use the comparator specified by the shuffle's
serializer, no?

> That said, I think it is perfectly acceptable for the Avro serializer to accept all types.

That would give the Avro serializer privileged status.  One could not also use another serializer (e.g., a Pig, Thrift, or Hessian serializer) that also accepts String.  Applications should be able to specify which serializations they intend.

A primary design goal of the Avro project is improving the flexibility
of serialization APIs.  MapReduce is a primary target application for
Avro.  We should not hobble Avro in MapReduce.  The Writable model,
where classes define their serialization, has served us well, but that
model is limited.  Avro permits flexible mappings between in-memory
representations and serializations.  We can easily support this in
MapReduce without either giving Avro privileged status or making the
MapReduce API overly complex.  I hope you will not block this effort.




[jira] Updated: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated MAPREDUCE-1126:
---------------------------------

    Attachment: MAPREDUCE-1126.patch

Here's a much-simplified patch. To show how it works with nested types, I've added an example mapper with signature {{Mapper<LongWritable, Text, Utf8, Map<Utf8, Long>>}} which uses the generic Avro serialization for the intermediate key and value. It is configured by calling:

{code}
Schema keySchema = Schema.create(Schema.Type.STRING);
Schema valSchema = Schema.parse("{\"type\":\"map\", \"values\":\"long\"}");
AvroGenericData.setMapOutputKeySchema(job, keySchema);
AvroGenericData.setMapOutputValueSchema(job, valSchema);
{code}

This replaces the calls to job.setMapOutputKeyClass() and job.setMapOutputValueClass().
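
For comparison, a hedged sketch of the Writable-style configuration these calls replace (the Text and MapWritable classes below are illustrative choices, not taken from the patch):

{code}
// Old approach: intermediate types are fixed by Java class, and the
// sort comparator is looked up from the key class.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(MapWritable.class);
{code}

With schemas configured instead, the serialization can also supply the sort comparator, so no hand-written Java RawComparator is needed for the intermediate data.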

I'm interested in hearing people's thoughts about this.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.



[jira] Reopened: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley reopened MAPREDUCE-1126:
--------------------------------------


-1 to this massive API change without much more dialog. The scope of the patch was much larger than the description.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829261#action_12829261 ] 

Arun C Murthy commented on MAPREDUCE-1126:
------------------------------------------

I agree with Aaron; let us please keep the discussion here focused on one issue.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.



[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801012#action_12801012 ] 

Owen O'Malley commented on MAPREDUCE-1126:
------------------------------------------

I reverted this until we can discuss the API changes. In particular, the framework can not depend on the stuff in mapreduce/lib. *sigh* Also deprecating the {get,set}MapOutput{Key,Value}Class methods seems unnecessary.

Clearly, getting Avro to flow through the shuffle is important, but the changes to the API were poorly considered.

> shuffle should use serialization to get comparator
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1126
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Doug Cutting
>            Assignee: Aaron Kimball
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch
>
>
> Currently the key comparator is defined as a Java class.  Instead we should use the Serialization API to create key comparators.  This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java.
