You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oozie.apache.org by "Purshotam Shah (JIRA)" <ji...@apache.org> on 2014/07/17 20:59:05 UTC
[jira] [Commented] (OOZIE-1940) StatusTransitService has race condition

    [ https://issues.apache.org/jira/browse/OOZIE-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065372#comment-14065372 ] 

Purshotam Shah commented on OOZIE-1940:
---------------------------------------

One solution could be to create  CoordStatusTransitXCommand and BundleStatusTransitXCommand. StatusTransitService will queue CoordStatusTransitXCommand and BundleStatusTransitXCommand.
Each StatusTransitXCommand command need to acquire job lock.

> StatusTransitService has race condition
> ---------------------------------------
>
>                 Key: OOZIE-1940
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1940
>             Project: Oozie
>          Issue Type: Bug
>            Reporter: Purshotam Shah
>
> StatusTransitService doesn't acquire lock while updating DB. 
> We noticed one such issue while doing HA testing, thanks to [~mchiang]
> We issue a change command to change pause time, which got executed on one server. While change command was running on one server, other server started executing StatusTransitService.
> Server 1 log
> {code}
> 2014-07-16 17:28:05,268  INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Acquired lock for [org.apache.oozie.service.StatusTransitService]
> 2014-07-16 17:28:09,694  INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Set coordinator job [0011385-140716042555-oozie-oozi-C] status to 'SUCCEEDED' from 'RUNNING' 
> 2014-07-16 17:28:15,416  INFO StatusTransitService$StatusTransitRunnable:539 [pool-1-thread-13] - USER[-] GROUP[-] Released lock for [org.apache.oozie.service.StatusTransitService]
> {code}
> Server 2 log
> {code}
> 2014-07-16 17:28:06,499 DEBUG CoordChangeXCommand:545 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] New pause/end date is : Wed Jul 16 17:30:00 UTC 2014 and last action number is : 3
> 2014-07-16 17:28:06,508  INFO CoordChangeXCommand:539 [http-0.0.0.0-4443-5] - USER[hadoopqa] GROUP[users] TOKEN[] APP[coordB180] JOB[0011385-140716042555-oozie-oozi-C] ACTION[-] ENDED CoordChangeXCommand for jobId=0011385-140716042555-oozie-oozi-C
> {code}
> CoordMaterializeTransitionXCommand has created all actions( few were in waiting and few were in running state) and set doneMaterialization to true.
> Change command deletes all waiting coords, except 3 running/SUCCEEDED action and reset doneMaterialization.
> StatusTransitService first loads a set of pending jobs and for each job it make DB calls to check coord action status. Coord jobs are loaded only once in beginning.
> This is what happened.
> 1.StatusTransitService loads the coord job which doneMaterialization is set to true at 17:28:05,268 (server 1)
> 2.Change command deletes waiting cation and reset  doneMaterialization at  17:28:06,508 (server 2)
> 3.StatusTransitService load actions for job, only 3 and in SUCCEEDED status. It never reload the doneMaterialization at 17:28:09,694 (server 1)
> StatusTransitService overrides set job status to SUCCEEDED, bcz it's doneMaterialization and all action are SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.2#6252)