Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2018/11/20 14:02:00 UTC

[jira] [Updated] (OOZIE-3387) Optimize coordinator data input dependency search

     [ https://issues.apache.org/jira/browse/OOZIE-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Piros updated OOZIE-3387:
--------------------------------
    Affects Version/s: 5.1.0

> Optimize coordinator data input dependency search
> -------------------------------------------------
>
>                 Key: OOZIE-3387
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3387
>             Project: Oozie
>          Issue Type: Improvement
>    Affects Versions: 5.1.0
>            Reporter: Andras Salamon
>            Priority: Major
>
> During the data input dependency check, Oozie evaluates EL functions like {{coord:latest}} in a non-optimal way, which may result in more HDFS URI checks than necessary.
> 1. If the {{dataset}} frequency is finer than the granularity of the {{uri-template}}, Oozie checks the same HDFS URI multiple times. For instance, with the following definition:
> {noformat}
> <dataset name="dataset1" frequency="${coord:minutes(1)}" initial-instance="2017-01-01T08:15Z" timezone="UTC">
>     <uri-template>${nameNode}/${rootDir}/${YEAR}-${MONTH}-${DAY}</uri-template>
>     <done-flag>_SUCCESS</done-flag>
> </dataset>
> ...
> <data-in name="coordInput" dataset="dataset1">
>     <instance>${coord:latest(0)}</instance>
> </data-in>
> {noformat}
> Oozie checks the same {{.../2018-11-20/_SUCCESS}} file 24*60=1440 times. It would be enough to check the file only once and skip the other 1439 checks.
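
The first case can be illustrated with a minimal Python sketch. It is not Oozie code: `resolve`, `find_latest` and the `exists` callback are hypothetical names, `exists` stands in for the real HDFS check (including the done-flag), and only the {{coord:latest(0)}} case is modelled. The idea is simply to remember which resolved URIs were already probed while walking the instances backwards.

```python
from datetime import datetime, timedelta

def resolve(template, t):
    """Fill the date placeholders of a uri-template for instance time t."""
    return (template.replace("${YEAR}", f"{t.year:04d}")
                    .replace("${MONTH}", f"{t.month:02d}")
                    .replace("${DAY}", f"{t.day:02d}")
                    .replace("${HOUR}", f"{t.hour:02d}")
                    .replace("${MINUTE}", f"{t.minute:02d}"))

def find_latest(template, start, initial, step, exists):
    """Walk instances backwards from start down to initial.

    Each distinct resolved URI is passed to exists() at most once.
    Returns (uri or None, number of distinct URIs probed).
    """
    checked = set()  # URIs already probed (hypothetical cache)
    t = start
    while t >= initial:
        uri = resolve(template, t)
        if uri not in checked:
            checked.add(uri)
            if exists(uri):
                return uri, len(checked)
        t -= step
    return None, len(checked)

if __name__ == "__main__":
    # 1440 one-minute instances of a single day, none of them available.
    uri, probes = find_latest(
        "/data/${YEAR}-${MONTH}-${DAY}",
        start=datetime(2018, 11, 20, 23, 59),
        initial=datetime(2018, 11, 20, 0, 0),
        step=timedelta(minutes=1),
        exists=lambda u: False,
    )
    print(uri, probes)
```

With a per-minute frequency and a per-day template, the 1440 instances of one day resolve to the same URI, so the sketch calls `exists` once for that day instead of 1440 times.
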
> 2. If the frequency is 1 day and the {{uri-template}} is defined in the following way:
> {noformat}
> <uri-template>${nameNode}/${rootDir}/${YEAR}/${MONTH}/${DAY}</uri-template>
> {noformat}
> Oozie will check the following directories one by one even if some of the parent directories are missing:
> {noformat}
> 2018/11/20
> 2018/11/19
> 2018/11/18
> ...
> {noformat}
> If there is no {{2018/11}} directory, there is no need to check any of the {{2018/11/xx}} directories. Checking the parent directory first would reduce the number of HDFS URI checks.
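
The second case could be sketched as follows, again in Python and again only as an illustration under assumptions: `find_latest_day` and `exists` are hypothetical names, the paths are relative to whatever `${nameNode}/${rootDir}` resolves to, and a fixed `YEAR/MONTH/DAY` layout with a daily frequency is assumed. When a parent directory is missing, the search jumps straight to the last day of the previous month or year.

```python
from datetime import date, timedelta

def find_latest_day(start, initial, exists):
    """Search backwards for the latest existing YEAR/MONTH/DAY directory.

    A missing month or year directory skips all days below it.
    Returns (path or None, number of distinct paths probed).
    """
    known = {}  # path -> bool, so each path is probed at most once

    def probe(path):
        if path not in known:
            known[path] = exists(path)
        return known[path]

    d = start
    while d >= initial:
        year = f"{d.year:04d}"
        month = f"{year}/{d.month:02d}"
        day = f"{month}/{d.day:02d}"
        if not probe(year):
            # No year directory: continue from Dec 31 of the previous year.
            d = date(d.year, 1, 1) - timedelta(days=1)
        elif not probe(month):
            # No month directory: continue from the last day of the previous month.
            d = d.replace(day=1) - timedelta(days=1)
        elif probe(day):
            return day, len(known)
        else:
            d -= timedelta(days=1)
    return None, len(known)

if __name__ == "__main__":
    present = {"2018", "2018/10", "2018/10/31"}
    print(find_latest_day(date(2018, 11, 20), date(2018, 1, 1), present.__contains__))
```

In this example the missing `2018/11` directory is detected with one probe, so the 20 day-level checks for November are skipped. The trade-off is that when the parent directories do exist, the sketch spends up to two extra probes per month on them.
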



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)