Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2009/01/26 16:31:59 UTC

[jira] Created: (HADOOP-5123) Ant tasks for job submission

Ant tasks for job submission
----------------------------

                 Key: HADOOP-5123
                 URL: https://issues.apache.org/jira/browse/HADOOP-5123
             Project: Hadoop Core
          Issue Type: New Feature
    Affects Versions: 0.21.0
         Environment: Both platforms, Linux and Windows
            Reporter: Steve Loughran
            Assignee: Steve Loughran
            Priority: Minor


Ant tasks to make it easy to work with the Hadoop filesystem and to submit jobs.

<submit>: uploads the JAR and submits the job as a given user, with various settings.

Filesystem operations: mkdir, copyin, copyout, delete (a usage sketch follows below). We could perhaps use Ant 1.7 "resources" here, so that HDFS paths could serve as a source or destination in Ant's own tasks.

# Security. Need to specify a user; pick up user.name from the JVM as the default?
# Cluster binding: is a namenode/job tracker (hostname, port) pair or URL all that is needed?
# Job conf: how to configure the job that is submitted? Support a list of <property name="name" value="something"> children?
# Testing. AntUnit to generate <junitreport>-compatible XML files.
# Documentation. With an example using Ivy to fetch the JARs for the tasks and the Hadoop client.
# Polling: an Ant task to block until a job has finished?
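
For illustration, the filesystem operations might look something like this in a build file. Every task and attribute name here is a hypothetical sketch, not an implemented API; the user/namenode attributes echo the security and cluster-binding questions above.
{code}
<!-- all task and attribute names are hypothetical -->
<hadoop:mkdir namenode="hdfs://namenode:8020" user="alice"
              dir="/user/alice/in"/>
<hadoop:copyin namenode="hdfs://namenode:8020" user="alice"
               file="data/part-001.txt" todir="/user/alice/in"/>
<hadoop:copyout namenode="hdfs://namenode:8020" user="alice"
                src="/user/alice/out" todir="build/results"/>
<hadoop:delete namenode="hdfs://namenode:8020" user="alice"
               path="/user/alice/tmp"/>
{code}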



[jira] Commented: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667625#action_12667625 ] 

Steve Loughran commented on HADOOP-5123:
----------------------------------------

I'll take a look at the codebase in both of these. I'd initially expect to start with the minimal set of operations needed to get work into a cluster from a developer's desktop, and let it evolve from there. While I know less about Hadoop than the authors of the other contributions, I do know more about Ant and how to test build files under JUnit, so what's really going to be new here are the regression tests. I have some job-submission code of my own that I was going to start with, but HADOOP-2788 could be a good starting point.

What worries me is the whole configuration problem; I think the client settings are minimal enough now that the job tracker URL alone should suffice.

The other problem is versioning; I will handle that by requiring the tasks and the cluster to be in sync, at least for now.



[jira] Updated: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-5123:
-----------------------------------

    Attachment: JobSubmitTask.java

This is a first draft of a JobSubmit client. Two things are still missing:

1. no declaration/setup of the inputs and outputs
2. no way yet to set configuration values above the defaults

For #2 there's a choice:
(a) refer to an Ant resource (including, once it's in there, a resource in an HDFS filesystem)
(b) let you declare the various properties in Ant itself

(b) is more Ant-like, but less compatible with the rest of the Hadoop configuration design, and it may still need to support reading in XML files just to get the base configuration together. A mixed configuration, though, is the hardest to get right. Both options are sketched below.
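
Concretely, something like this (the {{conf}} attribute and the property shown are illustrative assumptions, not implemented code):
{code}
<!-- (a) point the task at a configuration file/resource -->
<hadoop:submit tracker="http://jobtracker:50030"
    conf="conf/cluster-site.xml"
    jar="dist/myapp.jar"/>

<!-- (b) declare the properties inline in Ant itself -->
<hadoop:submit tracker="http://jobtracker:50030" jar="dist/myapp.jar">
  <property name="mapred.reduce.tasks" value="4"/>
</hadoop:submit>
{code}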

Thoughts?







[jira] Commented: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712029#action_12712029 ] 

Steve Loughran commented on HADOOP-5123:
----------------------------------------

Looking at this and the configuration options, assuming everything is left to the XML files themselves, there are two ways to pick up the configuration:

# put it on the classpath with which you declare the task. Easiest to do, and what I would start with
# add a confdir attribute that points at a configuration directory

The second option is more flexible; both are sketched below.
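
In build-file terms, roughly (the antlib resource name and the {{confdir}} attribute are assumptions, not existing code):
{code}
<!-- option 1: the conf directory rides along on the taskdef classpath -->
<taskdef resource="org/apache/hadoop/ant/antlib.xml"
         uri="antlib:org.apache.hadoop.ant">
  <classpath>
    <pathelement location="lib/hadoop-core.jar"/>
    <pathelement location="conf/"/> <!-- hadoop-site.xml is found here -->
  </classpath>
</taskdef>

<!-- option 2: a confdir attribute on the task itself -->
<hadoop:submit confdir="/etc/hadoop/conf" jar="dist/myapp.jar"/>
{code}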

If I were to do this (and one of my colleagues is pestering me for it), I'd do it as a contrib: it depends on both core and mapred, so once they get split up, it should be downstream of them. It would be nicely self-contained, needing only a cluster for testing. Incidentally, if the MiniMR cluster classes were moved from test/mapred to mapred, I could add a <minimrcluster> task too.



[jira] Commented: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667406#action_12667406 ] 

Steve Loughran commented on HADOOP-5123:
----------------------------------------

The use case for the {{<submit>}} Ant task is to submit a job as part of a build: upload the JAR file, submit the job, and print out enough information for you to track its progress.
{code}
<hadoop:submit tracker="http://jobtracker:50030"
    in="hdfs://host:port/tmp/in/something"
    out="hdfs://host:port/tmp/out/something"
    jobProperty="myJob"
    jar="dist/myapp.jar">
  <property name="dfs.replication.factor" value="4" />
  <mapper classname="org.example.identity" />
  <reducer classname="org.example.count" />
</hadoop:submit>
{code}

# No attempt to block until the job completes; the task will print out the job ID.
# {{jobProperty}} names an Ant property to set to the job ID.
# List zero or more JAR files. No attempt to do sanity checks like loading classes; the far end can do that.
# No separate configuration files for the map/reduce/combine stages.
# Maybe a configuration file attribute, {{conf}}, that defines a conf file to use. If set, no other properties could be set (allowing both would force the Ant task to parse the XML, edit it, save it, etc.).
# The JAR file is optional, but if listed, it had better be there.

Tests without a cluster (sketched below)
* fail to submit if the JAR is missing
* fail to submit if there is no tracker
* error if the mapper or reducer is not defined

Tests with MiniMR up
* submit a job
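
As a rough sketch, the no-cluster failure tests could look like this in AntUnit; the hadoop: antlib URI is an assumption, and the task attributes are the proposed ones from the example above:
{code}
<project xmlns:au="antlib:org.apache.ant.antunit"
         xmlns:hadoop="antlib:org.apache.hadoop.ant">

  <!-- fail to submit if the JAR is missing -->
  <target name="testMissingJarFails">
    <au:expectfailure>
      <hadoop:submit tracker="http://localhost:50030"
                     jar="does/not/exist.jar"/>
    </au:expectfailure>
  </target>

  <!-- fail to submit if there is no tracker -->
  <target name="testNoTrackerFails">
    <au:expectfailure>
      <hadoop:submit jar="dist/myapp.jar"/>
    </au:expectfailure>
  </target>
</project>
{code}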





[jira] Commented: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667673#action_12667673 ] 

Steve Loughran commented on HADOOP-5123:
----------------------------------------

Looking at HADOOP-2788, it's actually more advanced than what I was thinking, as it
# tries to block the Ant run until the job is finished, and extracts counters afterwards
# does some classloader tricks to work out the JAR to include

I'm against the latter; it's more reliable to let the build-file author point at the right place.

Blocking on the job is also something I'm doubtful about, at least initially. Why? Because people will end up trying to use Ant as a long-lived workflow tool, and it isn't optimised for that, either in availability or even memory management. People do try this - GridAnt is a case in point [http://www.globus.org/cog/projects/gridant/], but we don't encourage it. Better to move the workflow into the cluster and have some HA scheduler manage the sequence.





[jira] Commented: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667266#action_12667266 ] 

Steve Loughran commented on HADOOP-5123:
----------------------------------------

AntUnit probably doesn't integrate well with tests that need to set up a mini cluster for the test run; use the "legacy" JUnit test case integration JARs instead.
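
A minimal sketch of that JUnit route, assuming the MiniDFSCluster/MiniMRCluster constructors of this era (signatures vary between Hadoop versions), with the actual Ant-task invocation left as a stub:
{code}
// Sketch only: assumes the mini-cluster test APIs of this era.
import junit.framework.TestCase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.MiniMRCluster;

public class TestSubmitTask extends TestCase {
  private MiniDFSCluster dfs;
  private MiniMRCluster mr;

  protected void setUp() throws Exception {
    Configuration conf = new Configuration();
    dfs = new MiniDFSCluster(conf, 2, true, null);
    mr = new MiniMRCluster(2, dfs.getFileSystem().getUri().toString(), 1);
  }

  protected void tearDown() throws Exception {
    if (mr != null) mr.shutdown();
    if (dfs != null) dfs.shutdown();
  }

  public void testSubmitJob() throws Exception {
    String tracker = "localhost:" + mr.getJobTrackerPort();
    // ...run the build-file target that exercises <hadoop:submit> against
    // 'tracker', then assert that a job ID was printed/set...
  }
}
{code}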



[jira] Commented: (HADOOP-5123) Ant tasks for job submission

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667425#action_12667425 ] 

Steve Loughran commented on HADOOP-5123:
----------------------------------------

File operations

* Touch, copy in, copy out. Not using distcp, so for small data only.
* Rename.
* A condition for a file existing, maybe with a minimum size.
* DfsMkDir: create a directory.

A first pass would use resources [http://ant.apache.org/manual/CoreTypes/resources.html#resource]: classes that extend the Resource base class [https://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/types/Resource.java?view=markup] and so plug into the existing {{<copy>}}, {{<touch>}} tasks, and the like. The resource would need to implement the getInputStream() and getOutputStream() operations and, ideally, {{Touchable}} for the touch() operation. A sketch follows.
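
Roughly, such a resource might look like this. HdfsResource and its uri attribute are hypothetical names, error handling is trimmed, and FileSystem.setTimes() is assumed to be available in the targeted Hadoop version:
{code}
// Sketch only: a hypothetical HdfsResource bridging an HDFS path into
// Ant 1.7's resource framework.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tools.ant.BuildException;
import org.apache.tools.ant.types.Resource;
import org.apache.tools.ant.types.resources.Touchable;

public class HdfsResource extends Resource implements Touchable {
  private URI uri;                      // e.g. hdfs://namenode:8020/tmp/in
  private final Configuration conf = new Configuration();

  public void setUri(String value) {
    uri = URI.create(value);
    setName(new Path(uri).getName());
  }

  private FileSystem fs() {
    try {
      return FileSystem.get(uri, conf);
    } catch (IOException e) {
      throw new BuildException(e);
    }
  }

  public InputStream getInputStream() throws IOException {
    return fs().open(new Path(uri));    // lets <copy> read from HDFS
  }

  public OutputStream getOutputStream() throws IOException {
    return fs().create(new Path(uri));  // lets <copy> write to HDFS
  }

  public boolean isExists() {
    try {
      return fs().exists(new Path(uri));
    } catch (IOException e) {
      throw new BuildException(e);
    }
  }

  public long getSize() {
    try {
      return fs().getFileStatus(new Path(uri)).getLen();
    } catch (IOException e) {
      throw new BuildException(e);
    }
  }

  // Touchable: lets Ant's <touch> update the modification time.
  public void touch(long modTime) {
    try {
      fs().setTimes(new Path(uri), modTime, -1);
    } catch (IOException e) {
      throw new BuildException(e);
    }
  }
}
{code}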

Tests without a cluster
* Some meaningful failure if the hdfs:// URLs don't work

Tests with a cluster
* Copy in, copy out, copy inside the filesystem
* touch
* delete
* test for a resource existing
* some of the resource-selection operations

Tests against other filesystems
* s3:// URLs? Test that they work, but then assume that they stay working.
* Test that s3 URLs fail gracefully if the target is missing or access is forbidden
