Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2017/06/27 13:12:00 UTC

[jira] [Updated] (OOZIE-2854) Oozie should handle transient database problems

     [ https://issues.apache.org/jira/browse/OOZIE-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Piros updated OOZIE-2854:
--------------------------------
    Summary: Oozie should handle transient database problems  (was: Oozie should handle transient DB problems)

> Oozie should handle transient database problems
> -----------------------------------------------
>
>                 Key: OOZIE-2854
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2854
>             Project: Oozie
>          Issue Type: Improvement
>          Components: core
>            Reporter: Peter Bacsko
>            Assignee: Andras Piros
>         Attachments: OOZIE-2854-001.patch, OOZIE-2854-002.patch, OOZIE-2854-003.patch, OOZIE-2854-004.patch, OOZIE-2854-005.patch, OOZIE-2854-POC-001.patch
>
>
> There can be problems when Oozie cannot update the database properly. Recently, we have experienced erratic behavior with two setups:
> * MySQL with the Galera cluster manager. Galera uses cluster-wide optimistic locking, which might cause a transaction to roll back if two or more transactions run in parallel and one of them cannot complete because of a conflict.
> * MySQL with Percona XtraDB Cluster. If one of the MySQL instances is killed, Oozie might get a "Communications link failure" exception during the failover.
> The problem is that failed DB transactions can later cause workflows (which are started or restarted by RecoveryService) to get stuck. It's not clear to us how this happens, but it is related to certain DB updates not being executed.
> The proposed solution is to retry a failed DB update with exponential backoff. We could start with a 100ms wait time that is doubled at every retry. The operation can be considered a failure if it still fails after 10 attempts. These values could be configurable. We should discuss initial values in the scope of this JIRA.
> Note that this solution is meant to handle only *transient* failures. If the DB is down for a longer period of time, we have to accept that the internal state of Oozie is corrupted.
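
For illustration only, the retry scheme described above (100ms initial wait, doubled at every retry, giving up after 10 attempts) could be sketched roughly as follows. This is not taken from any of the attached patches: the class BackoffRetry, its methods, and the isTransient predicate are hypothetical names, and how to decide that a given exception is transient is left as an assumption for the caller.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper (not part of Oozie): retries an operation with exponential backoff. */
final class BackoffRetry {
    private BackoffRetry() {}

    /** Wait before the given retry (1-based): initialWaitMs doubled (retry - 1) times. */
    static long waitBeforeRetryMs(long initialWaitMs, int retry) {
        return initialWaitMs << (retry - 1);
    }

    /**
     * Runs the operation up to maxAttempts times. A failure is retried only if
     * isTransient accepts it (assumption: the caller knows how to classify errors).
     * The last exception is rethrown once the attempts are used up.
     */
    static <T> T call(Callable<T> operation, int maxAttempts, long initialWaitMs,
                      Predicate<Exception> isTransient) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                if (!isTransient.test(e)) {
                    break; // non-transient failure: retrying would not help
                }
                if (attempt < maxAttempts) {
                    // Sleep 100ms, 200ms, 400ms, ... when initialWaitMs is 100
                    Thread.sleep(waitBeforeRetryMs(initialWaitMs, attempt));
                }
            }
        }
        throw last;
    }
}
```

With the values suggested in the description (10 attempts, 100ms initial wait), there are 9 waits between attempts, the longest being 25.6 seconds, and the total time spent waiting before giving up is about 51 seconds. That gives a rough sense of what "transient" would mean under these defaults.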



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)