Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2009/01/26 16:31:59 UTC
[jira] Created: (HADOOP-5123) Ant tasks for job submission
Ant tasks for job submission
----------------------------
Key: HADOOP-5123
URL: https://issues.apache.org/jira/browse/HADOOP-5123
Project: Hadoop Core
Issue Type: New Feature
Affects Versions: 0.21.0
Environment: Both platforms, Linux and Windows
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Minor
Ant tasks to make it easy to work with the Hadoop filesystem and to submit jobs.
<submit>: uploads the JAR, submits the job as a given user, with various settings
filesystem operations: mkdir, copyin, copyout, delete
We could maybe use Ant 1.7 "resources" here, and so use HDFS as a source or destination in Ant's own tasks.
# security. Need to specify a user; pick up user.name from the JVM as the default?
# cluster binding: are namenode/job tracker (hostname, port) pairs or URLs all that is needed?
# job conf: how to configure the job that is submitted? Support a list of <property name="name" value="something"> children?
# testing. AntUnit to generate <junitreport>-compatible XML files.
# documentation. With an example using Ivy to fetch the JARs for the tasks and the Hadoop client.
# polling: an Ant task to block until a job has finished?
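On the security item, defaulting to the JVM's user.name could be as simple as the sketch below. This is plain Java, not the real task code; the "user" attribute name is an assumption for illustration.

```java
// Sketch: resolve the submitting user for the task.
// If the build file does not set a (hypothetical) "user" attribute,
// fall back to the JVM's own user.name system property.
class UserResolver {
    static String resolveUser(String userAttribute) {
        if (userAttribute != null && !userAttribute.isEmpty()) {
            return userAttribute;
        }
        return System.getProperty("user.name");
    }
}
```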
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667625#action_12667625 ]
Steve Loughran commented on HADOOP-5123:
----------------------------------------
I'll take a look at the codebase in both of these. I'd initially expect to start with the minimal set of operations needed to get work into a cluster from a developer's desktop, and let it evolve from there. While I know less about Hadoop than the authors of the other contributions do, I know more about Ant and how to test build files under JUnit, so what will really be new here are the regression tests. I have some job-submission code of my own that I was going to start with, but HADOOP-2788 could be a good starting point.
What worries me is the whole configuration problem; I think the client settings are now minimal enough that the job tracker URL should suffice.
The other problem is versioning; I will handle that by requiring tasks and cluster to be in sync, at least for now.
> Original Estimate: 168h
> Remaining Estimate: 168h
[jira] Updated: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran updated HADOOP-5123:
-----------------------------------
Attachment: JobSubmitTask.java
This is a first draft of a JobSubmit client. Two things are still missing:
1. no declaration/setup of the inputs and outputs
2. no setup, yet, of configuration values above the defaults
For #2 there's a choice:
(a) refer to an Ant resource (including, once it's in there, a resource in an HDFS filesystem)
(b) let you declare the various properties in Ant itself
(b) is more Ant-like, but less compatible with the rest of the Hadoop configuration design, and it may still need to support reading in XML files just to get the base configuration together. A mixed configuration, though, is the hardest to get right.
Thoughts?
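To make option (b) concrete, here is a plain-Java sketch of the layering it implies: inline <property> children collected by the task are overlaid on whatever base configuration was loaded from XML, with the inline values winning. This is a stand-in, not the real Ant nested-element or Hadoop JobConf API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of option (b): properties declared as nested <property>
// children override the base configuration read from XML files.
class ConfOverlay {
    static Map<String, String> overlay(Map<String, String> base,
                                       Map<String, String> inline) {
        Map<String, String> merged = new LinkedHashMap<>(base);
        merged.putAll(inline); // inline <property> children win
        return merged;
    }
}
```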
[jira] Commented: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712029#action_12712029 ]
Steve Loughran commented on HADOOP-5123:
----------------------------------------
Looking at this and the configuration options, assuming everything is left to the XML files themselves:
# put the configuration on the same classpath that declares the task. Easiest to do, and what I would start with.
# add a confdir attribute that points at a configuration directory.
The second option is more flexible.
If I were to do this (and one of my colleagues is pestering me for it), I'd do it as a contrib: it depends on both core and mapred, so once they get split up, it should sit downstream of them. It's nicely self-contained; it just needs a cluster for testing. Incidentally, if the MiniMR cluster classes were moved from test/mapred to mapred, I could add a <minimrcluster> task too.
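The two lookup orders above could be combined: prefer an explicit confdir attribute, fall back to the classpath the task was declared on. A sketch, with hypothetical names (the real task would feed the stream into a Hadoop Configuration):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

// Sketch: resolve a configuration file either from an explicit
// confdir attribute or, failing that, from the classpath.
class ConfLocator {
    static InputStream openConf(String confDir, String name) {
        if (confDir != null) {
            File f = new File(confDir, name);
            if (f.isFile()) {
                try {
                    return new FileInputStream(f);
                } catch (java.io.FileNotFoundException e) {
                    // fall through to the classpath lookup
                }
            }
        }
        // second choice: the classpath the task was declared on
        return ConfLocator.class.getClassLoader().getResourceAsStream(name);
    }
}
```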
[jira] Commented: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667406#action_12667406 ]
Steve Loughran commented on HADOOP-5123:
----------------------------------------
The use case for the {{<submit>}} Ant task is to submit a job as part of a build, uploading the JAR file and printing out enough information for you to track the job's progress.
{code}
<hadoop:submit tracker="http://jobtracker:50030"
               in="hdfs://host:port/tmp/in/something"
               out="hdfs://host:port/tmp/out/something"
               jobProperty="myJob"
               jar="dist/myapp.jar">
  <property name="dfs.replication.factor" value="4" />
  <mapper classname="org.example.identity" />
  <reducer classname="org.example.count" />
</hadoop:submit>
{code}
# No attempt to block on the job after submission. The task will print out the job ID.
# jobProperty names an Ant property to set to the job ID.
# List zero or more JAR files. No attempt at sanity checks like loading classes; the far end can do that.
# No separate configuration files for the map/reduce/combine.
# Maybe a configuration-file attribute, {{conf}}, that defines a conf file to use. If set, no other properties could be set (that would force the Ant task to parse the XML, edit it, save it, etc.).
# The JAR file is optional, but if listed, it had better be there.
Tests without a cluster
* fail to submit if the JAR is missing
* fail to submit if there is no tracker
* error if the mapper or reducer is not defined
Tests with MiniMR up
* submit a job
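The no-cluster failure cases above amount to a validation step run before any RPC is made. A sketch of those checks, with hypothetical names; the real Ant task would throw org.apache.tools.ant.BuildException, for which IllegalStateException stands in here to keep the sketch self-contained:

```java
import java.io.File;

// Sketch of the pre-submit checks the cluster-free tests would exercise:
// a tracker and a mapper/reducer are required; the JAR is optional,
// but if listed it had better be there.
class SubmitChecks {
    static void validate(String tracker, String jar,
                         String mapper, String reducer) {
        if (tracker == null || tracker.isEmpty()) {
            throw new IllegalStateException("no tracker defined");
        }
        if (mapper == null || reducer == null) {
            throw new IllegalStateException("mapper and reducer must be defined");
        }
        if (jar != null && !new File(jar).isFile()) {
            throw new IllegalStateException("JAR not found: " + jar);
        }
    }
}
```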
[jira] Commented: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667673#action_12667673 ]
Steve Loughran commented on HADOOP-5123:
----------------------------------------
Looking at HADOOP-2788, it's actually more advanced than what I was thinking, as it
# tries to block the Ant run until the job is finished, and extracts counters afterwards
# does some classloader tricks to work out which JAR to include
I'm against the latter; it is more reliable to let the build-file author point to the right place.
The blocking-on-the-job idea is also something I'm doubtful about, at least initially. Why? Because people will end up trying to use Ant as a long-lived workflow tool, and it isn't optimised for that, either in availability or even memory management. People do try this (GridAnt is a case in point: [http://www.globus.org/cog/projects/gridant/]), but we shouldn't encourage it. Better to move the workflow into the cluster and have some HA scheduler manage the sequence.
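If blocking were added despite these doubts, it should at least poll with a hard deadline rather than hang the build forever. A sketch against a hypothetical probe; the BooleanSupplier stands in for a real job-completion check such as RunningJob.isComplete():

```java
import java.util.function.BooleanSupplier;

// Sketch: poll a job-finished probe until it reports true or the
// deadline passes, so a build can never hang indefinitely.
class JobPoller {
    static boolean waitForCompletion(BooleanSupplier finished,
                                     long timeoutMillis,
                                     long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (finished.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat interruption as a failed wait
            }
        }
        return finished.getAsBoolean(); // one last check at the deadline
    }
}
```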
[jira] Commented: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667266#action_12667266 ]
Steve Loughran commented on HADOOP-5123:
----------------------------------------
AntUnit probably doesn't integrate well with tests that need to set up a mini cluster for the test run; use the "legacy" JUnit test-case integration JARs instead.
[jira] Commented: (HADOOP-5123) Ant tasks for job submission
Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667425#action_12667425 ]
Steve Loughran commented on HADOOP-5123:
----------------------------------------
File operations
* touch, copy in, copy out. Not using distcp, so for small data only.
* rename
* a condition for a file existing, maybe with a minimum size
* DfsMkDir: create a directory
A first pass would use resources ([http://ant.apache.org/manual/CoreTypes/resources.html#resource]), which work with existing Ant tasks; they extend the Resource class
([https://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/types/Resource.java?view=markup])
and can be used in the existing {{<copy>}}, {{<touch>}} tasks, and the like.
The resource would need to implement the getInputStream() and getOutputStream() operations and, ideally, {{Touchable}}, for the touch() operation.
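A plain-Java stand-in for that contract is sketched below. The real class would extend org.apache.tools.ant.types.Resource and open streams against a Hadoop FileSystem; here an in-memory buffer keeps the sketch self-contained, and all names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of the contract an HDFS-backed resource would implement:
// getInputStream()/getOutputStream(), plus touch() for Touchable.
class FakeDfsResource {
    private byte[] contents = new byte[0];
    private long lastModified = 0L;

    InputStream getInputStream() {
        return new ByteArrayInputStream(contents);
    }

    OutputStream getOutputStream() {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws java.io.IOException {
                super.close();
                contents = toByteArray(); // commit on close, like a real write
            }
        };
    }

    void touch(long millis) { // the Touchable operation
        lastModified = millis;
    }

    long getLastModified() {
        return lastModified;
    }
}
```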
Tests without a cluster
* some meaningful failure if the hdfs:// URLs don't work
Tests with a cluster
* copy in, copy out, copy within the filesystem
* touch
* delete
* test for a resource existing
* some of the resource selection operations
Tests against other filesystems
* s3:// URLs? Test that they work, then assume they stay working.
* test that s3:// URLs fail gracefully if the resource is missing or forbidden