Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2018/05/25 19:03:00 UTC

[jira] [Created] (OOZIE-3260) Remove item after first try from in-memory SLA map

Andras Piros created OOZIE-3260:
-----------------------------------

             Summary: Remove item after first try from in-memory SLA map
                 Key: OOZIE-3260
                 URL: https://issues.apache.org/jira/browse/OOZIE-3260
             Project: Oozie
          Issue Type: Bug
          Components: coordinator, core, workflow
    Affects Versions: 5.0.0
            Reporter: Andras Piros
            Assignee: Andras Piros


Although OOZIE-3134 has been implemented, there are still cases where {{SLACalculatorMemory#slaMap}} and the database contents get out of sync, e.g. when the contents of the {{SLA_SUMMARY}} table have been purged manually from the database, or when the corresponding {{WF_JOBS}} or {{COORD_JOBS}} entries no longer exist in the database.

In those rare cases, we see {{JPAExecutorException}} instances like:
{noformat}
2017-10-09 17:00:18,185 DEBUG openjpa.jdbc.SQL: SERVER[HOST] <t 1527981517, conn 1584126245> [0 ms] spent
2017-10-09 17:00:18,185 ERROR org.apache.oozie.sla.SLACalculatorMemory: SERVER[tplhc01c001.iuser.iroot.adidom.com] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000438-170916014916144-oozie-oozi-C@556] ACTION[-] Exception in SLA processing for job [0000438-170916014916144-oozie-oozi-C@556]
org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.eventProcessed from SLASummaryBean w where w.jobId = :id]
        at org.apache.oozie.executor.jpa.SLASummaryQueryExecutor.getSingleValue(SLASummaryQueryExecutor.java:161)
        at org.apache.oozie.sla.SLACalculatorMemory.updateJobSla(SLACalculatorMemory.java:480)
        at org.apache.oozie.sla.SLACalculatorMemory.updateAllSlaStatus(SLACalculatorMemory.java:601)
{noformat}
or
{noformat}
2017-10-09 17:00:53,085 WARN org.apache.oozie.service.CallableQueueService$CompositeCallable: SERVER[HOST] USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000011-170813033731256-oozie-oozi-W] ACTION[0000011-170813033731256-oozie-oozi-W@sqoop_full_tbl_unload] exception callable [action.check], E0604: Job does not exist [select w.statusStr from WorkflowJobBean w where w.id = :id]
org.apache.oozie.command.CommandException: E0604: Job does not exist [select w.statusStr from WorkflowJobBean w where w.id = :id]
        at org.apache.oozie.command.wf.ActionCheckXCommand.eagerLoadState(ActionCheckXCommand.java:97)
        at org.apache.oozie.command.XCommand.call(XCommand.java:256)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:332)
        at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:261)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:179)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.oozie.executor.jpa.JPAExecutorException: E0604: Job does not exist [select w.statusStr from WorkflowJobBean w where w.id = :id]
        at org.apache.oozie.executor.jpa.WorkflowJobQueryExecutor.get(WorkflowJobQueryExecutor.java:345)
        at org.apache.oozie.executor.jpa.WorkflowJobQueryExecutor.get(WorkflowJobQueryExecutor.java:38)
        at org.apache.oozie.command.wf.ActionCheckXCommand.eagerLoadState(ActionCheckXCommand.java:90)
{noformat}

The solution is to remove any {{SLACalculatorMemory#slaMap}} entry that causes such a {{JPAExecutorException}} after the first unsuccessful run, so that the same error does not keep being logged and produce huge log files. The items those entries refer to no longer exist anyway.
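
As a rough illustration of the idea only, and not the actual patch, the following self-contained sketch evicts an in-memory entry the first time its lookup fails with a "job does not exist" error. All names in it ({{SlaMapEvictionSketch}}, {{SlaStore}}, {{JobNotFoundException}}) are hypothetical stand-ins for {{SLACalculatorMemory}}, the JPA query executors and {{JPAExecutorException}} with code E0604; the real Oozie classes have different signatures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: NOT the real SLACalculatorMemory.
class SlaMapEvictionSketch {

    /** Hypothetical stand-in for a JPAExecutorException carrying E0604. */
    static class JobNotFoundException extends Exception {
        JobNotFoundException(String jobId) {
            super("E0604: Job does not exist [" + jobId + "]");
        }
    }

    /** Hypothetical stand-in for the database lookups done during SLA processing. */
    interface SlaStore {
        boolean isEventProcessed(String jobId) throws JobNotFoundException;
    }

    // In-memory view of the jobs being tracked, keyed by job id.
    private final Map<String, String> slaMap = new ConcurrentHashMap<>();
    private final SlaStore store;

    SlaMapEvictionSketch(SlaStore store) {
        this.store = store;
    }

    void track(String jobId) {
        slaMap.put(jobId, "tracked");
    }

    boolean isTracked(String jobId) {
        return slaMap.containsKey(jobId);
    }

    /**
     * One pass over all tracked jobs. An entry whose lookup fails because the
     * job is gone is logged once and removed, so later passes skip it.
     * Returns the number of entries evicted in this pass.
     */
    int updateAll() {
        int evicted = 0;
        // ConcurrentHashMap iterators are weakly consistent, so removing
        // entries while iterating does not throw.
        for (String jobId : slaMap.keySet()) {
            try {
                store.isEventProcessed(jobId);
            } catch (JobNotFoundException e) {
                System.err.println("Removing stale SLA entry: " + e.getMessage());
                slaMap.remove(jobId);
                evicted++;
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        // The store pretends that every coordinator action ("-C@...") is gone.
        SlaMapEvictionSketch sketch = new SlaMapEvictionSketch(jobId -> {
            if (jobId.contains("-C@")) {
                throw new JobNotFoundException(jobId);
            }
            return false;
        });
        sketch.track("0000438-170916014916144-oozie-oozi-C@556");
        sketch.track("0000011-170813033731256-oozie-oozi-W");
        System.out.println("evicted in first pass: " + sketch.updateAll());
        System.out.println("evicted in second pass: " + sketch.updateAll());
    }
}
```

In this sketch the stale entry produces a single log line in the first pass; subsequent passes no longer see it, which is the behaviour proposed above.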



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)