
Posted to common-dev@hadoop.apache.org by "Rod Taylor (JIRA)" <ji...@apache.org> on 2006/03/30 18:36:26 UTC

[jira] Created: (HADOOP-113) Allow multiple Output Dirs to be specified for a job

Allow multiple Output Dirs to be specified for a job
----------------------------------------------------

         Key: HADOOP-113
         URL: http://issues.apache.org/jira/browse/HADOOP-113
     Project: Hadoop
        Type: New Feature
  Components: mapred  
    Versions: 0.1    
    Reporter: Rod Taylor
 Attachments: hadoop_multisegment.patch

Allow a single job to create multiple outputs. This requires only two additional simple functions.

This allows more complex branching of the process, either by running multiple steps of the same type or by taking different actions on each output directory as required.


For my specific use, it allows me to run multiple Generate Outputs instead of a single Generate Output, as submitted in NUTCH-171 (http://issues.apache.org/jira/browse/NUTCH-171)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (HADOOP-113) Allow multiple Output Dirs to be specified for a job

Posted by "paul sutter (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-113?page=comments#action_12373749 ] 

paul sutter commented on HADOOP-113:
------------------------------------


Is the intention to have one mapper fork into multiple reducers, saving the file I/O of doing independent map passes?

mapper1 -> output a -> reducer 1
        -> output b -> reducer 2

instead of

mapper1 -> output a -> reducer 1
mapper2 -> output b -> reducer 2

where, in the second example, the map input file is read twice instead of once?

That could be useful. I'm not sure how much it would really speed things up.
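
As a rough illustration of the single-pass reading described above, here is a minimal plain-Java sketch. It is not Hadoop code and not the attached patch: the class ForkedMapSketch, the method mapOnce, and the branch names are hypothetical. It only shows that each input record can be read once and handed to every branch, rather than re-reading the input per branch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration only; unrelated to the Hadoop mapred API. */
public class ForkedMapSketch {

    /**
     * Reads each input record exactly once and applies every branch to it,
     * collecting one output list per branch name.
     */
    public static Map<String, List<String>> mapOnce(
            List<String> input,
            Map<String, Function<String, String>> branches) {
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        for (String name : branches.keySet()) {
            outputs.put(name, new ArrayList<>());
        }
        for (String record : input) {                 // single pass over the input
            for (Map.Entry<String, Function<String, String>> branch
                    : branches.entrySet()) {
                outputs.get(branch.getKey()).add(branch.getValue().apply(record));
            }
        }
        return outputs;
    }
}
```

Whether this saves much in practice depends on how large the input read is relative to the rest of the job, which is the open question raised above.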


[jira] Updated: (HADOOP-113) Allow multiple Output Dirs to be specified for a job

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/HADOOP-113?page=all ]

Rod Taylor updated HADOOP-113:
------------------------------

    Attachment: hadoop_multisegment.patch


[jira] Resolved: (HADOOP-113) Allow multiple Output Dirs to be specified for a job

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HADOOP-113.
----------------------------------

    Resolution: Won't Fix
      Assignee:     (was: Owen O'Malley)

These days, Hadoop would handle this with a pair of static methods on the specific OutputFormat that supports multiple directories, so there would be no need to add them to JobConf.
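
A minimal sketch of that pattern, under stated assumptions: java.util.Properties stands in for the job configuration, and the class name MultiDirOutputFormatSketch, the property key, and both method names are hypothetical rather than taken from any Hadoop release. The point is only that the add/get pair lives on the format class, so the configuration class itself needs no new methods.

```java
import java.util.Properties;

/** Hypothetical output-format class; Properties stands in for the job configuration. */
public final class MultiDirOutputFormatSketch {

    // Hypothetical property key, private to this format.
    private static final String DIRS_KEY = "sketch.multidir.output.dirs";

    private MultiDirOutputFormatSketch() {}

    /** Appends dir to the comma-separated list of output directories. */
    public static void addOutputDir(Properties conf, String dir) {
        String old = conf.getProperty(DIRS_KEY, "");
        conf.setProperty(DIRS_KEY, old.isEmpty() ? dir : old + "," + dir);
    }

    /** Returns the configured output directories, or an empty array if none. */
    public static String[] getOutputDirs(Properties conf) {
        String value = conf.getProperty(DIRS_KEY, "");
        return value.isEmpty() ? new String[0] : value.split(",");
    }
}
```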


[jira] Commented: (HADOOP-113) Allow multiple Output Dirs to be specified for a job

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/HADOOP-113?page=comments#action_12372563 ] 

Doug Cutting commented on HADOOP-113:
-------------------------------------

We should probably instead add a Configuration.getFiles() method, used by this and by getInputDirs().  This should be implemented in terms of Configuration.getStrings().   And we should add a Configuration.addFile() method that's used by this and by addInputDir().  This should be implemented in terms of a Configuration.addString() method.  Otherwise we end up copying the same code around in too many places.
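
The layering suggested above might look roughly like the following. This is a sketch, not Hadoop's Configuration class: ConfSketch is a hypothetical stand-in backed by a map, the property keys are made up, and addOutputDir/getOutputDirs are assumed names for the two functions in the patch. Only getStrings, addString, getFiles, addFile, getInputDirs and addInputDir are names taken from the comment.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a configuration object; not Hadoop's Configuration. */
public class ConfSketch {
    private final Map<String, String> props = new HashMap<>();

    /** Returns the comma-separated values stored under name, or an empty array. */
    public String[] getStrings(String name) {
        String value = props.get(name);
        return (value == null || value.isEmpty()) ? new String[0] : value.split(",");
    }

    /** Appends value to the comma-separated list stored under name. */
    public void addString(String name, String value) {
        String old = props.get(name);
        props.put(name, (old == null || old.isEmpty()) ? value : old + "," + value);
    }

    /** File list built on getStrings, so the parsing logic lives in one place. */
    public File[] getFiles(String name) {
        String[] strings = getStrings(name);
        File[] files = new File[strings.length];
        for (int i = 0; i < strings.length; i++) {
            files[i] = new File(strings[i]);
        }
        return files;
    }

    /** File append built on addString. */
    public void addFile(String name, File file) {
        addString(name, file.getPath());
    }

    // Job-level helpers become one-liners. The property keys are hypothetical.
    public void addInputDir(File dir)  { addFile("sketch.input.dir", dir); }
    public File[] getInputDirs()       { return getFiles("sketch.input.dir"); }
    public void addOutputDir(File dir) { addFile("sketch.output.dir", dir); }
    public File[] getOutputDirs()      { return getFiles("sketch.output.dir"); }
}
```

With this shape, the input and output helpers share the same list handling instead of each carrying a copy of it.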

However, I'm not yet convinced that this feature is the best way to achieve your goal.  I've commented on NUTCH-171 that an alternate mechanism might serve it better.  If that or something like it makes sense (specifying, for a job, the maximum number of its maps & reduces that should be run on a single node at once, so that a job can use less than the entire cluster, permitting other jobs to pass it) then we should start a new Hadoop bug for that.
