Posted to commits@oozie.apache.org by "Roman Shaposhnik (JIRA)" <ji...@apache.org> on 2011/09/10 05:27:09 UTC

[jira] [Created] (OOZIE-548) OOZIE-131: Support WF action level retry

OOZIE-131: Support WF action level retry
----------------------------------------

                 Key: OOZIE-548
                 URL: https://issues.apache.org/jira/browse/OOZIE-548
             Project: Oozie
          Issue Type: New Feature
            Reporter: Mohammad Kamrul Islam
            Assignee: Roman Shaposhnik


Hadoop already retries at the task level and Oozie retries at the system level for transient errors, but it is also desirable to let the user configure retry at the WF action level.

In this proposed task, the following sub-tasks need to be considered:

1. Enable the user to specify the retry count and the retry interval (the time between two successive tries).
2. The retry interval will be in minutes, with a default value of 10 minutes. The default value should be a system-level configuration.
3. The default retry count is 0 (no retry), to preserve backward compatibility.
4. A new state called "RETRY" will be added for WF actions. An action will be in the RETRY state if the job failed and needs to be retried.
5. Three fields need to be added to the WF action table: retry_count, max_retry, retry_interval.
6. Some service, such as the Recovery service, will periodically run the following SQL: "select action_id from WF_ACTIONS where status = 'RETRY' and (last_modified_time + retry_interval) < current_time and max_retry > retry_count" and queue a RETRY_COMMAND for each result. The last filter of the SQL might not be required.
7. RETRY_COMMAND will update the status from RETRY to PREP and push an ActionStartXCommand.
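
The state handling in items 3 to 7 can be sketched roughly as follows. This is only an illustration of the proposal, not Oozie code: ActionRecord, RetrySketch, ActionStatus and the FAILED terminal state are hypothetical names, and the points at which retry_count is incremented and last_modified_time is updated are assumptions that the proposal leaves open.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical types for illustration only; none of these are Oozie classes.
enum ActionStatus { PREP, RUNNING, RETRY, FAILED }

final class ActionRecord {
    ActionStatus status = ActionStatus.RUNNING;
    int retryCount = 0;              // proposed retry_count column
    final int maxRetry;              // proposed max_retry column; 0 means no retry
    final int retryIntervalMinutes;  // proposed retry_interval column, in minutes
    Instant lastModifiedTime;

    ActionRecord(int maxRetry, int retryIntervalMinutes, Instant lastModifiedTime) {
        this.maxRetry = maxRetry;
        this.retryIntervalMinutes = retryIntervalMinutes;
        this.lastModifiedTime = lastModifiedTime;
    }
}

final class RetrySketch {
    // On failure, park the action in RETRY if attempts remain.
    // Assumption: FAILED stands in for whatever terminal state applies otherwise.
    static void onFailure(ActionRecord a, Instant now) {
        a.status = (a.retryCount < a.maxRetry) ? ActionStatus.RETRY : ActionStatus.FAILED;
        a.lastModifiedTime = now;
    }

    // Mirrors the WHERE clause of the proposed query in item 6.
    static boolean isDueForRetry(ActionRecord a, Instant now) {
        Instant due = a.lastModifiedTime.plus(Duration.ofMinutes(a.retryIntervalMinutes));
        return a.status == ActionStatus.RETRY
                && due.isBefore(now)
                && a.maxRetry > a.retryCount;
    }

    // What RETRY_COMMAND would do: move RETRY to PREP.
    // Assumption: the attempt is counted here; a real command would then
    // queue an ActionStartXCommand.
    static void retry(ActionRecord a, Instant now) {
        if (!isDueForRetry(a, now)) {
            return;
        }
        a.retryCount++;
        a.status = ActionStatus.PREP;
        a.lastModifiedTime = now;
    }
}
```

In this sketch an action only enters RETRY while retry_count is below max_retry, so the last filter of the query is redundant, which matches the remark in item 6.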

Open Questions:
a) Who will remove the temporary directories/files (such as ACTION_DIR) created by Oozie? Should this happen when the job moves to the RETRY state, or could RETRY_COMMAND do it?
b) Do we need to keep historical information, such as why the previous retries failed? Historical information includes the error code, error message, etc.
c) Anything else?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira