
Posted to common-dev@hadoop.apache.org by "Philip Zeyliger (JIRA)" <ji...@apache.org> on 2009/06/15 00:47:07 UTC

[jira] Created: (HADOOP-6039) Computing Input Splits on the MR Cluster

Computing Input Splits on the MR Cluster
----------------------------------------

                 Key: HADOOP-6039
                 URL: https://issues.apache.org/jira/browse/HADOOP-6039
             Project: Hadoop Core
          Issue Type: New Feature
          Components: mapred
            Reporter: Philip Zeyliger


Instead of computing the input splits as part of job submission, Hadoop could have a separate "job task type" that computes the input splits, thereby allowing that computation to happen on the cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719389#action_12719389 ] 

Devaraj Das commented on HADOOP-6039:
-------------------------------------

Isn't it possible to do this as part of the JOB_SETUP task itself?


[jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719638#action_12719638 ] 

Owen O'Malley commented on HADOOP-6039:
---------------------------------------

This patch should reintroduce checkInputSplits into org.apache.hadoop.mapreduce.InputFormat. This method should be documented as *optional*. It will be invoked only when Java code performs the submission, so that errors in the user's job configuration, such as a missing or read-protected input directory, are detected before the job is submitted to the cluster.
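
As a rough illustration of the optional check described above, here is a minimal sketch. It does not use the Hadoop API: SketchInputFormat, FileSketchInputFormat, and submit are hypothetical names invented for this example, the signature of checkInputSplits is an assumption, and local java.nio paths stand in for DFS input directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class OptionalInputCheckSketch {

    /** Hypothetical stand-in for an input format; NOT org.apache.hadoop.mapreduce.InputFormat. */
    interface SketchInputFormat {
        /** Optional hook: the default does nothing, so implementations may ignore it. */
        default void checkInputSplits(List<Path> inputs) throws IOException {
        }
    }

    /** An implementation that opts in and rejects missing or unreadable inputs. */
    static class FileSketchInputFormat implements SketchInputFormat {
        @Override
        public void checkInputSplits(List<Path> inputs) throws IOException {
            for (Path p : inputs) {
                if (!Files.exists(p)) {
                    throw new IOException("Input path does not exist: " + p);
                }
                if (!Files.isReadable(p)) {
                    throw new IOException("Input path is not readable: " + p);
                }
            }
        }
    }

    /** Hypothetical Java-side submission: run the check first, then hand off to the cluster. */
    static String submit(SketchInputFormat format, List<Path> inputs) {
        try {
            format.checkInputSplits(inputs);
        } catch (IOException e) {
            return "rejected: " + e.getMessage();
        }
        // The real hand-off to the cluster is outside the scope of this sketch.
        return "submitted";
    }

    public static void main(String[] args) {
        List<Path> inputs = List.of(Paths.get("/no/such/dir/hadoop-6039-sketch"));
        System.out.println(submit(new FileSketchInputFormat(), inputs));
    }
}
```

Because the hook has a do-nothing default, a non-Java submitter can skip it entirely and the job would then fail on the cluster instead of at submission time.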



[jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster

Posted by "Hemanth Yamijala (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719384#action_12719384 ] 

Hemanth Yamijala commented on HADOOP-6039:
------------------------------------------

Before we do this, I think we should resolve HADOOP-4421, at least to the extent of agreeing on a design. Adding one more task while we are trying to fix problems with the existing ones might make things a tad more difficult to manage.


[jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster

Posted by "Philip Zeyliger (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719350#action_12719350 ] 

Philip Zeyliger commented on HADOOP-6039:
-----------------------------------------

The motivation behind computing the input splits on the cluster is at least two-fold:
 * It would be great to be able to submit jobs to a cluster using a simple (REST?) API, from many languages.  (Similar to HADOOP-5633.)  The fact that job submission does a bunch of mapreduce-internal work makes such submission very tricky.  We're already seeing how workflow systems (here I'm thinking of Oozie and Pig) run MR jobs simply to launch more MR jobs, while inheriting the scheduling and isolation work that the JobTracker already does.
 * Sometimes computing the input splits is, in and of itself, an operation that would do well to be run in parallel across several machines.  For example, splitting inputs may require going through many files on the DFS.  Moving input split calculations onto the cluster would pave the way for this to be possible.

Implementation-wise, we already have JOB_SETUP and JOB_CLEANUP tasks, so adding a JOB_SPLIT_CALCULATION task, which could be colocated with JOB_SETUP, makes some sense.
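
To make the second point concrete, here is a minimal sketch of why split calculation parallelizes: each file can be cut into ranges independently of the others. Everything in it is an assumption made for illustration. Split, splitFile, and computeSplits are hypothetical names, the fixed-size cutting rule is a simplification rather than Hadoop's actual split logic, and a parallel stream on one machine stands in for tasks spread across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SplitCalculationSketch {

    /** Hypothetical split descriptor: a byte range within one file. */
    static final class Split {
        final String file;
        final long offset;
        final long length;

        Split(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + "@" + offset + "+" + length;
        }
    }

    /** Cuts one file into consecutive ranges of at most splitSize bytes. */
    static List<Split> splitFile(String file, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new Split(file, offset, Math.min(splitSize, fileLength - offset)));
        }
        return splits;
    }

    /** Files are independent of each other, so the per-file work can run concurrently. */
    static List<Split> computeSplits(Map<String, Long> fileLengths, long splitSize) {
        return fileLengths.entrySet().parallelStream()
                .flatMap(e -> splitFile(e.getKey(), e.getValue(), splitSize).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> files = new TreeMap<>();
        files.put("part-a", 250L);
        files.put("part-b", 100L);
        System.out.println(computeSplits(files, 100L));
    }
}
```

In the proposal, the per-file work would presumably be done by a task such as JOB_SPLIT_CALCULATION on the cluster rather than by threads in the submitting client.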


[jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719392#action_12719392 ] 

Amareshwari Sriramadasu commented on HADOOP-6039:
-------------------------------------------------

> Isn't it possible to do this as part of the JOB_SETUP task itself?
This can be done. We should move out the creation of setup/cleanup tasks from JobInProgress.initTasks(). 




[jira] Commented: (HADOOP-6039) Computing Input Splits on the MR Cluster

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719393#action_12719393 ] 

Amareshwari Sriramadasu commented on HADOOP-6039:
-------------------------------------------------

> This can be done. We should move out the creation of setup/cleanup tasks from JobInProgress.initTasks().
Related JIRA issue: HADOOP-4472.
