You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2018/05/31 10:20:00 UTC

[jira] [Updated] (OOZIE-3260) [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map

     [ https://issues.apache.org/jira/browse/OOZIE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andras Piros updated OOZIE-3260:
--------------------------------
    Summary: [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map  (was: [sla] Remove stale item on JPA related errors from in-memory SLA map)

> [sla] Remove stale item above max retries on JPA related errors from in-memory SLA map
> --------------------------------------------------------------------------------------
>
>                 Key: OOZIE-3260
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3260
>             Project: Oozie
>          Issue Type: Bug
>          Components: coordinator, core, workflow
>    Affects Versions: 5.0.0
>            Reporter: Andras Piros
>            Assignee: Andras Piros
>            Priority: Major
>
> Despite having implemented OOZIE-3134, there are still cases where {{SLACalculatorMemory#slaMap}} and database contents still get out of sync. Some possibilities including but not limited to:
>  * database contents of {{SLA_SUMMARY}} table have been purged manually from DB
>  * no corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries exist anymore in DB
>  * the {{WF_JOBS}} or {{COORD_JOBS}} instance that is being tracked by the {{SLACalcStatus}} instances inside {{SLACalculatorMemory#slaMap}} is not yet persisted to database when the SLA entry is already processed by {{SLACalculatorMemory.HistoryPurgeWorker}}. Depending on e.g. how many coordinator actions are being materialized, it can very well happen that {{SLACalcStatus}} entries inserted to the in-memory map will be processed before their corresponding {{CoordActionBean}} entries are yet to be persisted to database
> In those rare cases, we see {{JPAExecutorException}} instances like:
> {noformat}
> 2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] <t 1527981517, conn 1584126245> [0 ms] spent
> 2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [0000438-170916014916144-oozie-oozi-C@556]
> org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
>         at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
>         at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
>         at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
> {noformat}
> Solution here is to track the number of times the {{SLACalcStatus}} entry has not been processed successfully, and when a preconfigured {{oozie.sla.service.SLAService.maximum.retry.count}} is reached, remove any {{SLACalculatorMemory#slaMap}} entries that are causing those {{JPAExecutorException}} instances, to not cause huge logfiles. The items to be logged don't exist, anyways.
> It's still possible that multiple {{CoordActionBean}} instances being inserted won't have {{SLACalcStatus}} entries inside {{SLACalculatorMemory#slaMap}} by the time written to database, and thus, no SLA will be tracked. In those rare cases, preconfigured maximum retry count can be extended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)