Posted to dev@oozie.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/05/16 16:53:13 UTC
[jira] [Commented] (OOZIE-2509) SLA job status can stuck in running state
[ https://issues.apache.org/jira/browse/OOZIE-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284821#comment-15284821 ]
Rohini Palaniswamy commented on OOZIE-2509:
-------------------------------------------
+1
> SLA job status can stuck in running state
> -----------------------------------------
>
> Key: OOZIE-2509
> URL: https://issues.apache.org/jira/browse/OOZIE-2509
> Project: Oozie
> Issue Type: Bug
> Reporter: Purshotam Shah
> Assignee: Purshotam Shah
> Attachments: OOZIE-2509-V1.patch, OOZIE-2509-V2.patch, OOZIE-2509-V3.patch, OOZIE-2509-V4.patch, OOZIE-2509-V5.patch, OOZIE-2509-V6.patch, OOZIE-2509-V7.patch, OOZIE-2509-V8.patch
>
>
> There are a few places where the job status is not updated properly:
> 1. Events can be received out of order.
> For example, suppose oozie.service.EventHandlerService.batch.size is set to 50 and
> oozie.service.EventHandlerService.worker.threads is set to 15. This means that 15 threads process events in batches of 50.
> It can happen that the 51st event gets processed before the 49th event.
> If the 49th is a job-started event and the 51st is a job-completed event, then the job status will be overwritten to running.
> 2.
> {code}
> case COORDINATOR_ACTION:
>     CoordinatorActionBean ca = jpaService.execute(new CoordActionGetForSLAJPAExecutor(slaCalc.getId()));
>     if (ca.isTerminalWithFailure()) {
>         isEndMiss = ended = true;
>         slaCalc.setActualEnd(ca.getLastModifiedTime());
>     }
>     if (ca.getExternalId() != null) {
>         wf = jpaService.execute(new WorkflowJobGetForSLAJPAExecutor(ca.getExternalId()));
>         if (wf.getEndTime() != null) {
>             ended = true;
>             if (wf.getEndTime().getTime() > slaCalc.getExpectedEnd().getTime()) {
>                 isEndMiss = true;
>             }
>         }
>         slaCalc.setActualEnd(wf.getEndTime());
>         slaCalc.setActualStart(wf.getStartTime());
>     }
> {code}
> Oozie checks the wf status and updates the SLA status with the coord job status.
> We might have a case where the coord is still running, but the wf has ended.
> 3. HistoryPurgeWorker updates the end time but doesn't update the status.
> 4. There are a few other locking issues.
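
To illustrate the out-of-order case in item 1, here is a minimal sketch of a guard that refuses to replace a terminal status with a non-terminal one. The class SlaStatusGuard, its Status enum, and the apply method are hypothetical names made up for this example; they are not Oozie classes, and this is not necessarily what the attached patches do.

```java
import java.util.EnumSet;
import java.util.Set;

public class SlaStatusGuard {

    // Hypothetical status values for this sketch; not Oozie's actual enum.
    public enum Status { PENDING, RUNNING, SUCCEEDED, FAILED, KILLED }

    private static final Set<Status> TERMINAL =
            EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.KILLED);

    /**
     * Returns the status to store after an event arrives.
     * A terminal status is never replaced, so a late "started" event
     * cannot move a finished job back to RUNNING.
     */
    public static Status apply(Status current, Status incoming) {
        if (TERMINAL.contains(current)) {
            return current;
        }
        return incoming;
    }

    public static void main(String[] args) {
        Status status = Status.PENDING;
        status = apply(status, Status.SUCCEEDED); // the "51st" event is processed first
        status = apply(status, Status.RUNNING);   // the "49th" event arrives late
        System.out.println(status);               // prints SUCCEEDED
    }
}
```

With several worker threads, the check and the update would also have to happen atomically (for example, while holding a per-job lock), otherwise two threads could still interleave between reading the current status and writing the new one.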