Posted to common-dev@hadoop.apache.org by "Alejandro Abdelnur (JIRA)" <ji...@apache.org> on 2009/03/02 07:07:15 UTC

[jira] Commented: (HADOOP-5303) Hadoop Workflow System (HWS)

    [ https://issues.apache.org/jira/browse/HADOOP-5303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12677898#action_12677898 ] 

Alejandro Abdelnur commented on HADOOP-5303:
--------------------------------------------

h4. Regarding the use of XSD

We want to use XSD because it allows us to do XML schema validation at deployment time, which makes the parsing code much slimmer. The only programmatic validation we have to do at deployment time is checking that the DAG does not have loose ends and does not have cycles.
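
As an illustration only (a minimal sketch, not HWS code; the class name and method signatures are made up for this comment), deployment-time validation could look like this: the structural checks are delegated entirely to the standard {{javax.xml.validation}} API, and the only hand-written check walks the transitions of the parsed DAG looking for loose ends and cycles.

{code:java}
import java.io.File;
import java.util.*;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class DeploymentValidator {

    // All structural/type checks come for free from the XSD(s).
    static void validateAgainstSchemas(File workflowXml, File... xsdFiles) throws Exception {
        StreamSource[] xsds = new StreamSource[xsdFiles.length];
        for (int i = 0; i < xsdFiles.length; i++) {
            xsds[i] = new StreamSource(xsdFiles[i]);
        }
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsds);
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(workflowXml));
    }

    // The only hand-written check: every transition target is a defined node (no loose ends)
    // and no node can reach itself (no cycles). 'transitions' maps node name -> target node names.
    static void validateDag(Map<String, List<String>> transitions) {
        for (Map.Entry<String, List<String>> node : transitions.entrySet()) {
            for (String target : node.getValue()) {
                if (!transitions.containsKey(target)) {
                    throw new IllegalStateException("Loose end: " + node.getKey() + " -> " + target);
                }
            }
        }
        for (String start : transitions.keySet()) {
            assertNoPathBackTo(start, start, transitions, new HashSet<String>());
        }
    }

    private static void assertNoPathBackTo(String start, String current,
            Map<String, List<String>> transitions, Set<String> visited) {
        for (String next : transitions.get(current)) {
            if (next.equals(start)) {
                throw new IllegalStateException("Cycle through node: " + start);
            }
            if (visited.add(next)) {
                assertNoPathBackTo(start, next, transitions, visited);
            }
        }
    }
}
{code}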

h4. Regarding the use of multiple XSDs

We could provide a single XSD, but that would complicate how new action types get validated at deployment time: adding a new action type would require creating a new XSD, and a new XSD would have to have a different URI.

That is one of the reasons we took the approach of separate XSDs for actions.

Another reason is that by using different XSDs you could eventually support a new version of the Hadoop action while still supporting the old one for all already-deployed applications.

*Option 1:* Current option, one XSD for control nodes and one XSD per action node type.

*Option 2:* One XSD for control nodes and one XSD for all the (out of the box) action node types.

*Option 3:* Integrate the control nodes and all the (out of the box) action nodes into a single XSD, leaving an extension point for custom action nodes.

Thoughts?
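
To make the trade-off concrete, here is a sketch of how Option 1 plays out at deployment time (the XSD file names are made up for this comment): the control-node XSD and the per-action XSDs are combined into a single {{Schema}}, and a newer version of the Hadoop action XSD, with its own namespace URI, can simply be added next to the old one.

{code:java}
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class WorkflowSchemaRegistry {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

        // One XSD for control nodes plus one XSD per action type, each with its own namespace URI.
        // A newer hadoop action XSD can live next to the old one, so already-deployed applications
        // keep validating against the version they were written for.
        Schema schema = factory.newSchema(new StreamSource[] {
            new StreamSource(new File("hws-control-0.1.xsd")),
            new StreamSource(new File("hws-action-hadoop-0.1.xsd")),
            new StreamSource(new File("hws-action-hadoop-0.2.xsd")),
            new StreamSource(new File("hws-action-pig-0.1.xsd")),
            new StreamSource(new File("hws-action-ssh-0.1.xsd"))
        });

        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File(args[0]))); // the workflow.xml being deployed
    }
}
{code}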

h4. Regarding input/output datasets for a high level workload scheduler

[IMO, this is a different topic from the XSD issue]

I understand the motivation for this, but I see it as belonging at the workload-scheduling level, above the workflow system.

IMO, the workflow nodes should stick to a direct mapping of the Hadoop/Pig configuration knobs (config props for Hadoop, params and config props for Pig). This keeps the workflow model intuitive for Hadoop/Pig developers.

It should be the higher-level system (in your case, the workload scheduler) that maps the input/output datasets to the Hadoop/Pig configuration knobs.
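
As a sketch of that division of labor (the property names and dataset paths are made up, and the submission call is only hinted at since the HWS client API is not the topic here): the workload scheduler resolves its logical datasets into concrete paths and hands them over as plain job properties, which the workflow definition then consumes as {{${VAR}}} formal parameters.

{code:java}
import java.util.Properties;

public class SchedulerToWorkflowMapping {
    public static void main(String[] args) {
        // The higher-level workload scheduler resolves its logical input/output datasets
        // into concrete locations (the paths below are made up for the example)...
        String inputDataset  = "hdfs://namenode/data/searchlogs/2009/02/22";
        String outputDataset = "hdfs://namenode/data/sessions/2009/02/22";

        // ...and maps them onto plain configuration knobs. The workflow definition only sees
        // ${inputDir} / ${outputDir} formal parameters, nothing dataset-specific.
        Properties jobProperties = new Properties();
        jobProperties.setProperty("inputDir", inputDataset);
        jobProperties.setProperty("outputDir", outputDataset);

        // Submitting the workflow job with these properties; the client class below is
        // hypothetical, the actual HWS Java API is out of scope for this comment.
        // new HwsClient("http://hws-host:8080/hws").submitJob("search-sessions-wf", jobProperties);
        System.out.println(jobProperties);
    }
}
{code}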


> Hadoop Workflow System (HWS)
> ----------------------------
>
>                 Key: HADOOP-5303
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5303
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>         Attachments: hws-preso-v1_0_2009FEB22.pdf, hws-v1_0_2009FEB22.pdf
>
>
> This is a proposal for a system specialized in running Hadoop/Pig jobs in a control dependency DAG (Directed Acyclic Graph), a Hadoop workflow application.
> Attached there is a complete specification and a high level overview presentation.
> ----
> *Highlights* 
> A Workflow application is a DAG that coordinates the following types of actions: Hadoop, Pig, Ssh, Http, Email and sub-workflows.
> Flow control operations within the workflow applications can be done using decision, fork and join nodes. Cycles in workflows are not supported.
> Actions and decisions can be parameterized with job properties, action outputs (e.g. Hadoop counters, Ssh key/value pairs output) and file information (file exists, file size, etc). Formal parameters are expressed in the workflow definition as {{${VAR}}} variables.
> A Workflow application is a ZIP file that contains the workflow definition (an XML file) and all the necessary files to run all the actions: JAR files for Map/Reduce jobs, shell scripts for streaming Map/Reduce jobs, native libraries, Pig scripts, and other resource files.
> Before running a workflow job, the corresponding workflow application must be deployed in HWS.
> Deploying a workflow application and running workflow jobs can be done via command line tools, a WS API and a Java API.
> Monitoring the system and workflow jobs can be done via a web console, command line tools, a WS API and a Java API.
> When submitting a workflow job, a set of properties resolving all the formal parameters in the workflow definitions must be provided. This set of properties is a Hadoop configuration.
> Possible states for a workflow job are: {{CREATED}}, {{RUNNING}}, {{SUSPENDED}}, {{SUCCEEDED}}, {{KILLED}} and {{FAILED}}.
> In the case of an action failure in a workflow job, depending on the type of failure, HWS will attempt automatic retries, request a manual retry, or fail the workflow job.
> HWS can make HTTP callback notifications on action start/end/failure events and workflow end/failure events.
> In the case of a workflow job failure, the workflow job can be resubmitted, skipping previously completed actions. Before doing a resubmission, the workflow application can be updated with a patch to fix a problem in the workflow application code.
> ----

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.