You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Alejandro Abdelnur (JIRA)" <ji...@apache.org> on 2008/07/07 09:43:31 UTC

[jira] Created: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
-------------------------------------------------------------------------

                 Key: HADOOP-3702
                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
         Environment: all
            Reporter: Alejandro Abdelnur
            Assignee: Alejandro Abdelnur
            Priority: Minor


On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.

If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 

Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

This could be done with ChainMapper and ChainReducer classes that would manage the chain of Maps and they would override the OutputCollector to implement the chaining.

The Maps and Reduce that are part of the Chain are unware they are executed in a Chain, they receive records via the {{map}} and {{reduce}} methods and do the output via the {{OutputCollector}}.

The API would look something like:

{code:java}

public class ChainMapper implements Mapper {

  public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
  ...
}

public class ChainReducer implements Reducer {

  public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);

  public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
  ...
}

{code}

The {{Properties}} configuration passed to the {{Mapper}} and {{Reducer}} when setting them into the chain are injected into a copy of the job's configuration. This allows maps to be configured as usual without being aware that they are in a chain.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3702:
--------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Alejandro!

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-3702:
----------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619844#action_12619844 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

*On #1*, I don't have a problem either way, I was just trying to be as less disruptive as possible. Somebody has to make the call here.

*On #2*, I've tried to generify {{ChainOutputCollector}} but hitting a wall there, I never have the types to instantiate it so I keep getting the compiler warnings. Thoughts?


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626839#action_12626839 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

failed test seems unrelated to this patch.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610905#action_12610905 ] 

tucu00 edited comment on HADOOP-3702 at 7/7/08 12:59 AM:
---------------------------------------------------------------------

This could be done with ChainMapper and ChainReducer classes that would manage the chain of Maps and they would override the OutputCollector to implement the chaining.

The Maps and Reduce that are part of the Chain are unware they are executed in a Chain, they receive records via the {{map}} and {{reduce}} methods and do the output via the {{OutputCollector}}.

The API would look something like:

{code:java}

public class ChainMapper implements Mapper {

  public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
  ...
}

public class ChainReducer implements Reducer {

  public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);

  public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
  ...
}

{code}

The {{Properties}} configuration passed to the {{Mapper}} and {{Reducer}} when setting them into the chain are injected into a copy of the job's configuration. This allows maps to be configured as usual without being aware that they are in a chain.

Example of creating and submitting a chain job:

{code:java}

JobConf conf = new JobConf();

// chaining maps in the Map phase

Properties mapAConf = new Properties();
mapAConf.setProperty("a", "A");
ChainMapper.addMapper(conf, AMap.class, mapAConf);

ChainMapper.addMapper(conf, BMap.class, null);

// setting the reducer

Properties reduceConf = new Properties();
ChainReducer.setReducer(conf, XReduce.class, reduceConf);

// chaining maps in the Reduce phase

ChainReducer.addMapper(conf, CMap.class, null);

ChainReducer.addMapper(conf, DMap.class, null);

...

FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);

JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);

{code}

      was (Author: tucu00):
    Example of creating and submitting a chain job:

{code:java}

JobConf conf = new JobConf();

// chaining maps in the Map phase

Properties mapAConf = new Properties();
mapAConf.setProperty("a", "A");
ChainMapper.addMapper(conf, AMap.class, mapAConf);

ChainMapper.addMapper(conf, BMap.class, null);

// setting the reducer

Properties reduceConf = new Properties();
ChainReducer.setReducer(conf, XReduce.class, reduceConf);

// chaining maps in the Reduce phase

ChainReducer.addMapper(conf, CMap.class, null);

ChainReducer.addMapper(conf, DMap.class, null);

...

FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);

JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);

{code}
  
> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626492#action_12626492 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

bq. Making ChainMapper<K1, V1, K2, V2> does not make sense in this case as a developer is never instantiating it. Thus there is not type checking being done at compilation here for K1, V1, K2, V2 .
It does not matter whether the classes are instantiated by the user or not, it is better to have them use generics properly as much as we can, so that we can get rid of the @SuppressWarnings("unchecked"). IndentityMapper is also never instantiated by the user. 

bq. Even if you would do create an instance of ChainMapper bound to concrete classes for K1, V1, K2, V2 it would work for the common case of mappers in the chain using different key/value classes.See, the contract between the maps in a chain is that the input key/values classes of the first mapper are the same as the input key/values classes of the job, the input key/values classes of the second mapper are the same as the output key/values classes of the first mapper, and so on, and the output key/value classes of the last mapper in the chain (for the ChainMapper) are the same as the input key/values classes of the reducer.
Yes I know that the whole dataflow would be 
{code}
(K1, V1)  -> (Ka, Va) -> ... -> (K2, V2) -> (k3, V3) 
{code}
where first mapper takes K1, V1, and last mapper outputs K2, V2, and the patch checks for that. 
For example Chain#mappers are Mapper<?,?,?,?> does not enforce the mappers to use the same input / output classes, and ChainOutputCollector uses <K,V,KO,VO> so that its input and output classes are enforced (but unfortunately we cannot instantiate ChainOutputCollectors with bound parameters. ). 
Chain#addMapper() is <i>static</i> so that its generic <K1,V1,K2,V2> parameters are not bound by the Chain<K1,...V3> parameters.  So you can again use 
{code}
Jobconf conf = ...
  ChainMapper.addMapper(conf, AMap.class, K1.class, V1.class, Ka.class, Va.class, null);
  ChainMapper.addMapper(conf, BMap.class, Ka.class, Va.class, Kb.class, Vb.class, null);
  ChainMapper.addMapper(conf, CMap.class, Kb.class, Vb.class, K2.class, V2.class, null);
{code}
My version of the patch uses a similar approach to JobConf, in that we try to use generics as much as possible, which is, I think, should be the preferred way. 



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Release Note: 
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.


  was:
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.

The Configuration write(OutputStream) method has been renamed to writeXml(OutputStream) to avoid ambiguity with the Writable write(DataOutput) method.


    Hadoop Flags:   (was: [Incompatible change])

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619840#action_12619840 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

bq. On #1, Yes but this introduces a backwards incompatibility as there is code out there (outside of core) that uses the write(OutputStream) with DataOuputStream instances and the change is breaking such usages (as the previous patch found).

Yes, at some level it will introduce backwards incompatibility. But thinking in abstract terms, having {{func(Interface1 i1)}}, and introducing {{func(Interface2 i2)}} is not a direct incompatibility. Only those calls where the object both implements Interface1 and Interface2 would be affected(in this case DataOutputStream), and there is a clear workaround for this. I think introducing a Serialization for Configuration is far more worse. I suggest we keep write(OutputStream), introduce write(DataOutput), fix all the cases where DataOutputStream is passed (including contrib), and mark this change as incompatible with clear documentation in release note. 


bq. On #2, I'm missing something here, under what circumstances would it make sense to use the Chain* classes with generics that would be checked at compile time? If it is just a way of avoiding the @SuppressWarnings annotations I'd prefer the anotations as, IMO, they are meant for this cases.
@SuppressWarnings annotations are just "hacks" for the compiler to stop complaining. There are valid reasons for the compiler to issue warnings, and instead of fixing them, we say the compiler to ignore these, which is not desired. 
The motivation to use generics is the same as the one for the use of generics in Mapper, Reducer, etc. I guess with some little extra effort, we could make this change, no? 

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627554#action_12627554 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389247/patch3702.txt
  against trunk revision 691055.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3152/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

fixing test and style errors.

still missing is pass by value semantics 

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613141#action_12613141 ] 

Runping Qi commented on HADOOP-3702:
------------------------------------


I don't understand why the map chaining is necessary.
Why not just compoite all the function into one and use the composited one as the mapper or reducer.



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617773#action_12617773 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

On the first bullet item, Finally got the {{Stringifier}} thing, Thanks.

On the second bullet item, the serialized chain element {{JobConf}} (A) is deserialized, a new {{JobConf}} (B) with the task's {{JobConf}} as parent is created (to ensure all Hadoop runtime values are avail), the values of A are applied to B to ensure the config values of the chain element have precedence when the chain mapper/reducer gets its conf, B.

On the third bullet item, Yes, you are right. 2 reasons, one is to keep code simpler, the second is that if a MultithreadedMapRunner is used a single pipeline will not work as it is not thread safe.

On the fourth bullet item, this will go away as I'm using {{Stringifier}} in the next version of patch.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616329#action_12616329 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12386720/patch3702.txt
  against trunk revision 679202.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to cause Findbugs to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2930/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2930/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2930/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Christophe Taton
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

TestPipes is failing due to a signature change in the {{Configuration}}, from {{write(OutputStream)}} to {{writeXml(OutputStream)}}.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616908#action_12616908 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12386850/patch3702.txt
  against trunk revision 679772.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 12 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2951/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619541#action_12619541 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

*On #1*, the ambiguity for the existing {write(OutputStream)}} and the {{write(DataOutput)}} introduced by the {{Writable}} interface arises when a {{DataOutputStream}} object is passed to the {{write(...)}} method, the compiler cannot resolve which one to use. (I've done this in one of the previous patches, my solution was changing the existing {{write()}} to {{writeXml()}} but it started breaking in different places in contrib as the method is used outside of core. 

*On #2*, it doesn't make much sense to generify the {{Chain*}} classes as they are not directly exposed to the M/R developer, they are artifacts used to enable chaining, the M/R developer doesn't code against them. If they would be generified it wold not add any value as at compile time nothing could be checked for them.

*On #3*, {{ChainOutputCollector}} is lightweight class, see my answer to Chris' regarding a similar question.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629418#action_12629418 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

+1 

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-3702:
----------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613565#action_12613565 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

Another thing missing in the current patch is specifying the key/value in/out classes for the maps and reduce in the chain and making sure they are compatible between the elements of the chain.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626652#action_12626652 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389067/patch3702.txt
  against trunk revision 689733.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3136/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Fix Version/s: 0.19.0
           Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

refactoring code to add pass by-value support in the chain.

everything is in place except the ser/deser implementation, the {{ChainOutputCollector.passByValue()}} method (coming soon).

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616681#action_12616681 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12386782/patch3702.txt
  against trunk revision 679506.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2941/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Description: 
On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.

If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 

Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.


  was:
On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.

If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 

Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

This could be done with ChainMapper and ChainReducer classes that would manage the chain of Maps and they would override the OutputCollector to implement the chaining.

The Maps and Reduce that are part of the Chain are unware they are executed in a Chain, they receive records via the {{map}} and {{reduce}} methods and do the output via the {{OutputCollector}}.

The API would look something like:

{code:java}

public class ChainMapper implements Mapper {

  public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
  ...
}

public class ChainReducer implements Reducer {

  public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);

  public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
  ...
}

{code}

The {{Properties}} configuration passed to the {{Mapper}} and {{Reducer}} when setting them into the chain are injected into a copy of the job's configuration. This allows maps to be configured as usual without being aware that they are in a chain.



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617900#action_12617900 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12387102/patch3702.txt
  against trunk revision 680577.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2972/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615523#action_12615523 ] 

Owen O'Malley commented on HADOOP-3702:
---------------------------------------

I don't like that each stage in the pipeline has its own configuration and that you serialize them all and put them into the outer configuration. What is the use case for that? If you are going to do that, wouldn't it be easier to just make Configuration Writable and use the writable stringifier?

You should use hadoop.io.Data{In,Out}putBuffer rather than defining your own DirectBufferByteArrayOutputStream, especially since the name sounds like a direct buffer.

You should probably convert the task id string to a TaskAttemptID and call the isMap method rather than parsing the taskid string.

The preferred style is Sun's:
{code}
if (...) {
} else {
}
{code}

Other than that, it seems good.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur reassigned HADOOP-3702:
------------------------------------------

    Assignee: Alejandro Abdelnur  (was: Christophe Taton)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629074#action_12629074 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389651/patch3702.txt
  against trunk revision 692700.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3202/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

Modified to use {{Stringifier}} as suggested, added {{Serialization}} impl for {{Configuration}} for that. 

This removes the 'incompatible change' introduced in the previous patch.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3702:
------------------------------------

    Release Note: 
Introduced ChainMapper and the ChainReducer classes to allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ REDUCE MAP*.


  was:
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12610905#action_12610905 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

Example of creating and submitting a chain job:

{code:java}

JobConf conf = new JobConf();

// chaining maps in the Map phase

Properties mapAConf = new Properties();
mapAConf.setProperty("a", "A");
ChainMapper.addMapper(conf, AMap.class, mapAConf);

ChainMapper.addMapper(conf, BMap.class, null);

// setting the reducer

Properties reduceConf = new Properties();
ChainReducer.setReducer(conf, XReduce.class, reduceConf);

// chaining maps in the Reduce phase

ChainReducer.addMapper(conf, CMap.class, null);

ChainReducer.addMapper(conf, DMap.class, null);

...

FileInputFormat.setInputPaths(conf, inDir);
FileOutputFormat.setOutputPath(conf, outDir);

JobClient jc = new JobClient(conf);
RunningJob job = jc.submitJob(conf);

{code}

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.
> This could be done with ChainMapper and ChainReducer classes that would manage the chain of Maps and they would override the OutputCollector to implement the chaining.
> The Maps and Reduce that are part of the Chain are unware they are executed in a Chain, they receive records via the {{map}} and {{reduce}} methods and do the output via the {{OutputCollector}}.
> The API would look something like:
> {code:java}
> public class ChainMapper implements Mapper {
>   public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
>   ...
> }
> public class ChainReducer implements Reducer {
>   public static void setReducer(JobConf job, Class<? extends Reducer> klass, Properties reducerConf);
>   public static void addMapper(JobConf job, Class<? extends Mapper> klass, Properties mapperConf);
>   ...
> }
> {code}
> The {{Properties}} configuration passed to the {{Mapper}} and {{Reducer}} when setting them into the chain are injected into a copy of the job's configuration. This allows maps to be configured as usual without being aware that they are in a chain.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628545#action_12628545 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

I'm confused, why option 3 would not work with your example. 

My reasoning for Option 3 is that as input the mapper/reducer can take process a less specific class of input key/value classes, but the output cannot be less (or more) specific because the concrete class of the output is chosen to get the serializer.

Classes: {{A, B extends A, C extends B}}

Input Key/Value classes: {{ C, B}}
Output Key/Value classes: {{ A, B}}

Valid mapper classes: {{Mapper<C, B, A, B>}} {{Mapper<C, A, A, B>}}  {{Mapper<B, B, A, B>}} {{Mapper<B, A, A, B>}} {{Mapper<A, B A, B>}}, {{Mapper<A, A, A, B>}}  

Invalid mapper classes: {{Mapper<*, *, B, B>}} (and any other mapper were the output is not {{A, B}} )




> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Fixing {{TestPipes}} test case.

By making {{Configuration}} implement {{Writable}} there is an ambiguity with the {{write(OutputStream)}} that already existed in {{Configuration}} and the {{write(DataOuptut)}} that is brought in by implementing {{Writable}}. The ambiguity arises when using a {{DataOutputStream}}, to resolve this I've renamed the original {{write(OutputStream)}} to {{writeXml(OutputStream)}}.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-3702:
----------------------------------

    Priority: Major  (was: Minor)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628622#action_12628622 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

With *option 3* the _//should fail_ line fails.

Your comment _However the output classes can indeed be more specific than declared. An example of this is a mapper which does not change the input key, and directly outputs it._ is correct, nasty but correct.

This bring then another question, currently when adding a mapper to the chain the {{addMapper}} method checks that the input key/value classes match (being equal, a ==) the output key/value classes of the previous mapper. Should this be changed to an {{isAssignable()}}? Or is it too much?




> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Fixing test and style errors.

Still not addressing by value semantics on the chain

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Integrating Owen's comments:

* Modified Configuration to be a Writable
* Modified WritableUtils to have methods to convert Writable to/from String
* Modified Chain to use above methods instead hack with Properties
* Using hadoop DataOutputBuffer
* Using if/else Sun's sytle

I could not figure out the comment on taskAttempId as the patch is not using taskId at all to find out if its a map or reduce task. The ChainMapper and ChainReducer create a Chain to delegate common logic and at creation time it indicates what is to be use for.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613242#action_12613242 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12385953/patch3702.txt
  against trunk revision 676069.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2854/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628600#action_12628600 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

Ok, I confused option 3 as a derivative of 1, which clearly is a derivative of 2.
However the output classes can indeed be more specific than declared. An example of this is a mapper which does not change the input key, and directly outputs it. 

The following code fragment, demonstrates that we should go with the option 2. 
{code}
  static class A  {
  }
  static class B extends A {
  }
  static class AMapper extends MapReduceBase implements Mapper<A, Text, A, Text> {
    @Override
    public void map(A key, Text value, OutputCollector<A, Text> output,
        Reporter reporter) throws IOException {
    }
  }
  static class BMapper extends MapReduceBase implements Mapper<B, Text, B, Text> {
    @Override
    public void map(B key, Text value, OutputCollector<B, Text> output,
        Reporter reporter) throws IOException {
    }
  }
  static {
    JobConf job = new JobConf();
    addMapper(true, job, AMapper.class, A.class, Text.class, A.class, Text.class, false, job);
    addMapper(true, job, AMapper.class, B.class, Text.class, B.class, Text.class, false, job);
    addMapper(true, job, BMapper.class, A.class, Text.class, A.class, Text.class, false, job);//should fail
    addMapper(true, job, BMapper.class, B.class, Text.class, B.class, Text.class, false, job);
  }
{code}
 



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Assignee: Christophe Taton  (was: Alejandro Abdelnur)
      Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Christophe Taton
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment:     (was: patch3702.txt)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HADOOP-3702:
----------------------------------

    Attachment: Hadoop-3702.patch

I am attaching a patch which is a modified version of the last one. 

I have added generic arguments where applicable. But not all the instances could be generified, since we cannot know the arguments. This is because we do not know the map input and output types after the first map and before the last map. At least ChainReducer and ChainMapper are now generified. 

Implementing a Serializer for a specific class does not fit into the design of Serializer. Serializer is intended to be defined once for each set of classes (implemeting Writable, Serializable, etc.). 
In the attached patch WritableJobConf extends JobConf and implements Writable. Chain uses this class to ser/deser the configuration. And this does not introduce an incompatibility, since Configuration/JobConf does not implement Writable.  



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment:     (was: patch3702.txt)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Inverted precedence of configuration.

The chain element JobConf configuration has precedence over the chain job JobConf.

This allows to set some defaults at chain job JobConf level and override them at chain element level.

Note that this does not break anything at runtime as ll the injected configuration for the task is used at the Chain{Mapper,Reducer}


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619814#action_12619814 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

*On #1*, Yes but this introduces a backwards incompatibility as there is code out there (outside of core) that uses the {{write(OutputStream)}} with {{DataOuputStream}} instances and the change is breaking such usages (as the previous patch found).

*On #2*, I'm missing something here, under what circumstances would it make sense to use the {{Chain*}} classes with generics that would be checked at compile time? If it is just a way of avoiding the @SuppressWarnings annotations I'd prefer the anotations as, IMO, they are meant for this cases.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628024#action_12628024 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

bq. Still, I don't agree on making the Chain* classes generic. Their generics would not be use for any type checking at compilation. And it is not possible to get rid of the @SuppressWarnings("unchecked") .
I do not wish to further delay the patch for this. I'm OK with the current patch, although the reasons for using generics remain.

bq. Shouldn't be appropriate to make them?...
Sigh. yes, since in generics, G<Foo> is not a subtype of G<Bar> even if Foo extends Bar. So please make that change. 



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612987#action_12612987 ] 

Chris Douglas commented on HADOOP-3702:
---------------------------------------

Very basic question: what are the semantics of a mapper in this chain calling collect(K,V)? Currently, it is guaranteed that neither the key nor the value will be modified, so the following must hold:
{code}
key.set(some_value);
value.set(some_other_value);
collect(key, value);
assert key.get().equals(some_value);
assert value.get().equals(some_other_value);
{code}

Chaining mappers can violate this property unless the following maps guarantee (by convention, presumably) that they will not modify either argument. It might make sense to require chained mappers (excluding the final mapper) to implement a different interface- even if that interface is empty- to promise to treat the record as const.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Release Note: 
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.


  was:
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*.
An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.


          Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Release Note: 
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.

The Configuration write(OutputStream) method has been renamed to writeXml(OutputStream) to avoid ambiguity with the Writable write(DataOutput) method.


  was:
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*. An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.


    Hadoop Flags: [Incompatible change]

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619529#action_12619529 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

# Having a distinct serializer for Configuration is not desired. Configuration is already marshaled/unmarshaled to/from string form as XML. So having Configuration implement Writable is the choice here. There is no ambiguity in the Configuration.write() method. We can keep current write(OutputStream out) method and add : 
{code}
public void write(final DataOutput out) throws IOException {
    write(new OutputStream() {
      @Override
      public void write(int b) throws IOException {
        out.writeByte(b);
      }
    });
  }
{code}
The readFields() can be implemented by factoring common functionality of loadProperties() method, instead of reading from a URL or file, the XML will be built using DataInput. 
# It would be better if ChainMapper, ChainReducer, ChainOutputCollector to use generics, similar to Mapper, Reducer, and OutputCollector classes. 
# creating a new ChainOutputCollector() at each collect() call might be suboptimal. Could this not be done once for each Map/Reduce task? 

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628240#action_12628240 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

Thanks Enis,

On the use of generics in the {{addMapper/setReducer}} which one of the 3 options is the most appropriate? I would be inclined to think that the third one. My reasoning is that on the input the mapper/reducer must handle at least a subclass of the key/value classes, and on the output the mapper/reducer define what the key/value classes are.

*Option 1*
{code}
public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<? extends K1,? extends  V1,? extends  K2,? extends  V2>> klass,
                           Class<K1> inputKeyClass, Class<V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
{code}

*Option 2*
{code}
public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<K1, V1, K2, V2>> klass,
                           Class<? extends K1> inputKeyClass, Class<? extends V1> inputValueClass,
                           Class<? extends K2> outputKeyClass, Class<? extends V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
{code}
 
*Option 3*
{code}
public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<K1, V1, K2, V2>> klass,
                           Class<? extends K1> inputKeyClass, Class<? extends V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
{code}



> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Changing signature of {{addMapper}} and {{setReducer}} methods to use {{? extends T}} for the input/output key/value classes.

Changing check of chained input/output key/value classes to use {{isAssignable}} instead of using {{equals}}.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614113#action_12614113 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12386208/patch3702.txt
  against trunk revision 677379.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2880/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626431#action_12626431 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

[apologies for the delay following up on this, I was off all last week]

*On using generics*

Enis, I don't think the use of generics in your proposed patch is correct. Let me try to explain.

_First reason:_

The intended normal use of {{ChainMapper}} and {{ChainReducer}} is via configuration, i.e.:

{code}
  Jobconf conf = ...
  ChainMapper.addMapper(conf, ...);
  ChainMapper.addMapper(conf, ...);
  ChainMapper.addMapper(conf, ...);
  ...
{code}

Making {{ChainMapper<K1, V1, K2, V2>}} does not make sense in this case as a developer is never instantiating it. Thus there is not type checking being done at compilation here for {{K1, V1, K2, V2}} .

_Second reason:_

Even if you would do create an instance of {{ChainMapper}} bound to concrete classes for {{K1, V1, K2, V2}} it would work for the common case of mappers in the chain using different key/value classes. 

See, the contract between the maps in a chain is that the input key/values classes of the first mapper are the same as the input key/values classes of the job, the input key/values classes of the second mapper are the same as the output key/values classes of the first mapper, and so on, and the output key/value classes of the last mapper in the chain (for the {{ChainMapper}}) are the same as the input key/values classes of the reducer.

For example: take a job that the map input/output classes are {{K1,V1,K2,V2}} you can have the following chain:

{code}
  Jobconf conf = ...
  ChainMapper.addMapper(conf, AMap.class, K1.class, V1.class, Ka.class, Va.class, null);
  ChainMapper.addMapper(conf, BMap.class, Ka.class, Va.class, Kb.class, Vb.class, null);
  ChainMapper.addMapper(conf, CMap.class, Kb.class, Vb.class, K2.class, V2.class, null);
{code}

*On using a {{Serializer}} for {{Configuration}}*

Note that the {{Serializer}} is for {{Configuration}} and subclasses, it is not bound to {{Configuration}}.

I'm OK with your proposed patch here.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

This patch uses Enis' {{WritableJobConf}} instead making a {{Serialization}} for {{Configuration}}.

But it does not generify the Chain classes (per previous comment).

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630182#action_12630182 ] 

Hudson commented on HADOOP-3702:
--------------------------------

Integrated in Hadoop-trunk #600 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/600/])

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622478#action_12622478 ] 

Devaraj Das commented on HADOOP-3702:
-------------------------------------

bq. I think introducing a Serialization for Configuration is far more worse.

Enis, could you please expand on this one? It is not clear to me how this is worse..

Regarding the generics signature for the chain* classes, I think it is okay to not have generics since this class's key/value types are driven by the first/last map/reduce's key/value type in the chain. 

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621368#action_12621368 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

Enis, 

I have not heard anything back from you on this.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

there is an ambiguity with {{Configuration.write}} due to making {{Configuration}} implement Writable.

The exiting {{Configuration.write(OutputStream)}} that writes out XML must be renamed to something like {{writeXml(OutputStream)}}

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Christophe Taton
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

renaming original {{Configure.write(OutputStream)}} to {{{{Configure.writeXml(OutputStream)}} to avoid ambiguity with the {{write(DataOutput)}} introduced by implementing {{Writable}}. The ambiguity occurs when when using a {{DataOutputStream}} (this happens in the {{JobHistory, JobClient, TaskTracker, TaskRunner}}).


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Added byValue support to preserve the semantics of the collector not modifying the key and values.

Still byReference can be used as an optimization if the mappers and reducer in the chain don't use the byValue semantics.

Improved the patch to allow different key/value classes to be used by different elements in the chain, only restriction is that the output of a mapper has to match the expected input of the next mapper.

Sample usage:

{code:java}

  JobConf conf = new JobConf();
  conf.setJobName("chain");
  
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  
  FileInputFormat.setInputPaths(conf, inDir);
  FileOutputFormat.setOutputPath(conf, outDir);

  JobConf mapAConf = new JobConf();
  ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapAConf);
 
  JobConf mapBConf = new JobConf();
  ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class, LongWritable.class, Text.class, false, mapBConf);
 
  JobConf reduceConf = new JobConf();
  ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class, Text.class, Text.class, true, reduceConf);
 
  ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class, LongWritable.class, Text.class, false, null);
 
  ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class, LongWritable.class, LongWritable.class, true, null);
   ...
 
  JobClient jc = new JobClient(conf);
  RunningJob job = jc.submitJob(conf);
{code}

Note: the previous to last {{boolean}} parameter indicates if the Mapper/Reducer added in to the chain wants a byValue (TRUE) or byReference (FALSE)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

I've got Enis point of making use of generics in the {{addMapper}} and {{setReducer}} methods in the {{Chain*}} classes. This enforces a compile time type check between the defined key/value in/out classes and the mapper/reducer classes when defining the chain. I've modified the patch to do so.


An additional question on this. Currently the {{addMapper}} and {{setReducer}} signatures are like:

{code}
  public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<K1, V1,  K2,  V2>> klass,
                           Class<K1> inputKeyClass, Class<V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
{code}

Shouldn't be appropriate to make them?

{code}
  public static <K1, V1, K2, V2> void addMapper(boolean isMap, JobConf jobConf,
                           Class<? extends Mapper<? extends K1,? extends  V1,? extends  K2,? extends  V2>> klass,
                           Class<K1> inputKeyClass, Class<V1> inputValueClass,
                           Class<K2> outputKeyClass, Class<V2> outputValueClass,
                           boolean byValue, JobConf mapperConf) {
{code}

-----
Still, I don't agree on making the {{Chain*}} classes generic. Their generics would not be use for any type checking at compilation. And it is not possible to get rid of the {{@SuppressWarnings("unchecked")}} .


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Release Note: 
The ChainMapper and the ChainReducer classes allow composing chains of Maps and Reduces in a single Map/Reduce job, something like MAP+ / REDUCE MAP*.
An immediate benefit of this pattern is reduction in disk IO as many Maps can be club together in a single job.

          Status: Patch Available  (was: Open)

Proposed implementation. Different from the initial comment, it uses a JobConf to configure the Maps in the Chain.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Open  (was: Patch Available)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613093#action_12613093 ] 

Hadoop QA commented on HADOOP-3702:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12385705/patch3702.txt
  against trunk revision 676069.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to cause Findbugs to fail.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2842/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2842/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2842/console

This message is automatically generated.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628360#action_12628360 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

Lets see, 
if we have 
{code}
class B extends A 
{code}
then {{Mapper<A, X, X, X>}} can accept {{Class<B>}} as inputkeyclass, but not the other way around, right? 
In this case Option 1 and 3 are not correct. 
Option2 is good, I guess. And no need to limit the output key/value types(the input key class can be the same as output key class).


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.19.0
>
>         Attachments: Hadoop-3702.patch, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613227#action_12613227 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

* On Chris' comment:

You are right, I've missed that point in my proposed implementation. 

To address that the Chain code should take care of cloning the key and value before passing them to the following Map in the chain.

Still, as optimization (to avoid serializing/deserializing keys and values) for every link in the chain a {{passByReference}} property could be set.

* On Runping's comment:

In our current implementation we are doing as you are suggesting, this means we have our own set of interfaces for processing, not Mappers and Reducers, and we have a ChainMapper and a ChainReducer that manage the lifecycle of our private interfaces.

Using Mapper/Reducer interfaces directly is cleaner, more consistent and simpler for developers.


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Attachment: patch3702.txt

Using JobConf(false) for the chain mappers and reducer jobconfs when created internally to reduce size in chain job jobconf.

Reusing the ByteArrayOutputStream used for byValue passing, using a threadlocal to work with the MultiThreadedMapRunner.

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617633#action_12617633 ] 

Chris Douglas commented on HADOOP-3702:
---------------------------------------

* Instead of adding WritableUtils::asString, it might make more sense to use the o.a.h.io.Stringifier interfaces, particularly since you create a new JobConf (permitting a user to pass in an object would be an excellent, but not strictly necessary, addition to Stringifier, too)
* I'm not sure I understand how the configuration deserialization works. In Chain::getChainElementConf, a new config is created, its fields cleared and populated from the stream, then each property defined in the deserialized conf is (re)defined on a clone of the JobConf passed in (presumably to permit final, etc. to be observed). Is that accurate?
* Is it true that each call to {{map}} creates a series of {{ChainOutputCollectors}}? These look like lightweight objects, but is there a reason the pipeline needs to be recreated each time?
* It looks like this broke the eclipse plugin:
{panel}
     [exec] compile:
     [exec]      [echo] contrib: eclipse-plugin
     [exec]     [javac] Compiling 30 source files to /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/build/contrib/eclipse-plugin/classes
     [exec]     [javac] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/contrib/eclipse-plugin/src/java/org/apache/hadoop/eclipse/server/HadoopServer.java:422: write(java.io.DataOutput) in org.apache.hadoop.conf.Configuration cannot be applied to (java.io.FileOutputStream)
     [exec]     [javac]     this.conf.write(fos);
     [exec]     [javac]              ^
     [exec]     [javac] /zonestorage/hudson/home/hudson/hudson/jobs/Hadoop-Patch/workspace/trunk/src/contrib/eclipse-plugin/src/java/org/apache/hadoop/eclipse/servers/RunOnHadoopWizard.java:166: write(java.io.DataOutput) in org.apache.hadoop.conf.Configuration cannot be applied to (java.io.FileOutputStream)
     [exec]     [javac]       conf.write(fos);
     [exec]     [javac]           ^
{panel}

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615633#action_12615633 ] 

Alejandro Abdelnur commented on HADOOP-3702:
--------------------------------------------

Thanks for the feedback.

*On "each stage in the pipeline having its own configuration .. put them into the outer configuration":*

The use case is that mappers/reducer in the chain (I'm not using 'pipe' to avoid confusion as it has other use in Hadoop) are not aware of each other, and that they are in a chain, and they may have collision in the configuration keys they use. 

*On "If you are going to do that, ... make Configuration Writable and use the writable stringifier?":*

That was my initial idea but it would have required changes in the Configuration and I was avoiding any change in the core.

I'll make this change.

*On "use {{hadoop.io.Data{In,Out}putBuffer}} ...":*

I'll look at it.

*On "You should probably convert the task id string to a TaskAttemptID ...":*

OK

*On "preferred sytle is Sun's":*

OK
 

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619803#action_12619803 ] 

Enis Soztutar commented on HADOOP-3702:
---------------------------------------

# obviously, casting the DataOutputStream to OutputStream would resolve the ambiguity. All these kind of references could be changed (if any) to first cast the reference. 
{code}
DataOutputStream dos = ...
...
conf.write((OutputStream)dos);
...
{code}
# Although the Chain classes are not exposed, it is always good practice to use the generics. We can get rid of  @SuppressWarnings({"unchecked"}) annotations. 

# If you are confident that this does not introduce extra overhead, I'm fine with this. 


> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3702) add support for chaining Maps in a single Map and after a Reduce [M*/RM*]

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-3702:
---------------------------------------

    Status: Patch Available  (was: Open)

> add support for chaining Maps in a single Map and after a Reduce [M*/RM*]
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-3702
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3702
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Minor
>         Attachments: patch3702.txt, patch3702.txt, patch3702.txt, patch3702.txt
>
>
> On the same input, we usually need to run multiple Maps one after the other without no Reduce. We also have to run multiple Maps after the Reduce.
> If all pre-Reduce Maps are chained together and run as a single Map a significant amount of Disk I/O will be avoided. 
> Similarly all post-Reduce Maps can be chained together and run in the Reduce phase after the Reduce.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.