Posted to dev@oozie.apache.org by "Purshotam Shah (JIRA)" <ji...@apache.org> on 2015/09/18 02:19:06 UTC

[jira] [Commented] (OOZIE-1976) Specifying coordinator input datasets in more logical ways

    [ https://issues.apache.org/jira/browse/OOZIE-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804733#comment-14804733 ] 

Purshotam Shah commented on OOZIE-1976:
---------------------------------------

Uploaded the patch to RB. It contains the core logic; some refactoring and naming changes are still pending.

There are three components in this patch:

1. User interface
A new tag, input-check, is added to coordinator.xml.
Example:
<input-check>
    <or name="test">
        <and>
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </and>
        <and>
            <data-in dataset="C"/>
            <data-in dataset="D"/>
        </and>
    </or>
</input-check>


input-check can contain nested and/or/combine operations. min and wait can be set either on an operator or on a data-in.
If the input-check tag is missing, the old approach applies, where all data dependencies are required.
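
To make the "min" idea concrete, here is a minimal Python sketch (not the patch's code) of a "minimum N out of K instances" check for a single dataset. The function name and the min_required parameter are hypothetical; the exact semantics of min and wait in the patch are not spelled out in this comment, so this only illustrates the N-of-K idea from the issue description.

```python
def dataset_satisfied(instance_available, min_required=None):
    """Return True when enough instances of one dataset are present.

    instance_available: list of booleans, one per expected instance.
    min_required: hypothetical stand-in for the "min" setting; None means
    every instance is required, which matches the old all-or-nothing rule.
    """
    needed = len(instance_available) if min_required is None else min_required
    return sum(instance_available) >= needed


print(dataset_satisfied([True, False, True]))                  # False
print(dataset_satisfied([True, False, True], min_required=2))  # True
```

With no min, the check falls back to requiring every instance, which is how a coordinator without input-check behaves.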

2. Processing
input-check is converted into a logical expression, e.g.
	(A && B) || (C && D)
We use JEXL to parse the logical expression.
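
The conversion can be sketched roughly as follows. This is an illustration in Python, not the patch itself: the patch hands the expression to JEXL, whereas this sketch evaluates the tree with a plain recursive function as a stand-in. The names to_expression and evaluate are hypothetical, and the combine operator is not handled.

```python
import xml.etree.ElementTree as ET

INPUT_CHECK = """
<input-check>
    <or name="test">
        <and>
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </and>
        <and>
            <data-in dataset="C"/>
            <data-in dataset="D"/>
        </and>
    </or>
</input-check>
"""

OPERATORS = {"and": " && ", "or": " || "}


def to_expression(node):
    """Render an <and>/<or>/<data-in> node as a logical expression string."""
    if node.tag == "data-in":
        return node.attrib["dataset"]
    joined = OPERATORS[node.tag].join(to_expression(child) for child in node)
    return "(" + joined + ")"


def evaluate(node, available):
    """Evaluate the same tree against a set of available dataset names."""
    if node.tag == "data-in":
        return node.attrib["dataset"] in available
    results = [evaluate(child, available) for child in node]
    return all(results) if node.tag == "and" else any(results)


root = ET.fromstring(INPUT_CHECK)[0]  # the single operator under <input-check>
print(to_expression(root))            # ((A && B) || (C && D))
print(evaluate(root, {"C", "D"}))     # True
print(evaluate(root, {"A", "C"}))     # False
```

Here the workflow could start as soon as either A and B, or C and D, are available.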

There are three phases in parsing.
phase 1 : Only resolved datasets are parsed (only current).
phase 2 : Once all current instances are resolved, future/latest are parsed.
phase 3 : Doesn't do any file check; it just returns what was parsed by phase 1 and phase 2. It is used for EL functions.
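
One possible reading of the three phases, as a small Python sketch. Everything here is an assumption made for illustration: the run_phase function, the (name, kind, path) tuples, and the example paths are hypothetical, and "all current are resolved" is read as "every current dependency has been found", which may not match the patch when or-logic is involved.

```python
CURRENT = "current"


def run_phase(phase, deps, found, exists):
    """Return the set of dependency names located so far.

    deps:   list of (name, kind, path) tuples; kind is "current",
            "latest" or "future".
    found:  names already located by earlier runs.
    exists: callable(path) -> bool, the file check.
    """
    if phase == 3:
        # No file check: report only what phases 1 and 2 recorded.
        return set(found)
    current = [d for d in deps if d[1] == CURRENT]
    if phase == 1:
        targets = current
    else:
        # Phase 2 waits until every current dependency has been found.
        if any(name not in found for name, _, _ in current):
            return set(found)
        targets = [d for d in deps if d[1] != CURRENT]
    return set(found) | {name for name, _, path in targets if exists(path)}


deps = [
    ("A", "current", "/data/a/2015-09-18"),
    ("B", "latest", "/data/b/2015-09-17"),
]
on_disk = {"/data/a/2015-09-18"}

found = run_phase(1, deps, set(), on_disk.__contains__)
print(sorted(found))  # ['A']
found = run_phase(2, deps, found, on_disk.__contains__)
print(sorted(found))  # ['A']  (B is not there yet)
on_disk.add("/data/b/2015-09-17")
found = run_phase(2, deps, found, on_disk.__contains__)
print(sorted(found))  # ['A', 'B']
```

Phase 3 never calls exists, so it can be used from EL functions without touching the file system.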


3. Storage
If input-check is enabled, push_missing_dependencies and missing_dependencies are serialized and stored in the DB.
If not, the old approach is used, where they are stored in plain text. This is backward compatible.

> Specifying coordinator input datasets in more logical ways
> ----------------------------------------------------------
>
>                 Key: OOZIE-1976
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1976
>             Project: Oozie
>          Issue Type: New Feature
>          Components: coordinator
>    Affects Versions: trunk
>            Reporter: Mona Chitnis
>            Assignee: Purshotam Shah
>             Fix For: trunk
>
>         Attachments: Input-check.docx, OOZIE-1976-WIP.patch, OOZIE-1976-rough-design-2.pdf, OOZIE-1976-rough-design.pdf
>
>
> All dataset instances specified as input to coordinator, currently work on AND logic i.e. ALL of them should be available for workflow to start. We should enhance this to include more logical ways of specifying availability criteria e.g.
>  * OR between instances
>  * minimum N out of K instances
>  * delta datasets (process data incrementally)
> Use-cases for this:
>  * Different datasets are BCP, and workflow can run with either, whichever arrives earlier.
>  * Data is not guaranteed, and while $coord:latest allows skipping to available ones, workflow will never trigger unless mentioned number of instances are found.
>  * Workflow is like a ‘refining’ algorithm which should run after minimum required datasets are ready, and should only process the delta for efficiency.
> This JIRA is to discuss the design and then review the implementation for some or all of the above features.


