Posted to dev@storm.apache.org by "Rick Kellogg (JIRA)" <ji...@apache.org> on 2015/10/05 03:04:27 UTC

[jira] [Updated] (STORM-682) Supervisor local worker state corrupted and failing to start.

     [ https://issues.apache.org/jira/browse/STORM-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-682:
-------------------------------
    Component/s: storm-core

> Supervisor local worker state corrupted and failing to start.
> -------------------------------------------------------------
>
>                 Key: STORM-682
>                 URL: https://issues.apache.org/jira/browse/STORM-682
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: Parth Brahmbhatt
>            Assignee: Parth Brahmbhatt
>             Fix For: 0.10.0, 0.9.4
>
>
> If the supervisor's cleanup of a worker fails to delete some heartbeat files, the supervisor's local state gets corrupted. The only way to recover the supervisor from this state is to delete the local state folder where the supervisor stores all worker information. This fix can get very cumbersome if it happens on multiple worker nodes.
> The root cause of the issue is the order in which the worker heartbeat versioned store files are created versus the order in which they are deleted. LocalState.put first creates a data file X and then marks success by creating a file X.version. During get, it first lists all *.version files, finds the largest value of X, and then issues a read against X. See the pseudocode below:
> {code:java}
> start_supervisor() {
>     workerIds = `ls local-state/workers`
>     for each workerId in workerIds {
>         versions = `ls local-state/workers/workerId/heartbeats/*.version`
>         latest_version = max(versions)
>         // Note: the data file has no .version extension
>         read local-state/workers/workerId/heartbeats/latest_version
>     }
> }
> {code}
> During cleanup, it first tries to delete file X and then X.version. If X gets deleted but X.version fails to delete, the marker still points at a data file that no longer exists, and the supervisor fails to start with a FileNotFoundException in the code above.
> We propose to change the deletion order so that the .version files get deleted before the data file, and to catch any IOException when reading worker heartbeats to avoid supervisor failure.
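
The proposal in the description can be illustrated with a minimal Java sketch. This is not Storm's actual LocalState or versioned store code: the class HeartbeatStoreSketch and the methods put, latestVersion, readLatest and deleteVersion are hypothetical names chosen for illustration, and the sketch assumes each version X is a plain numeric file name with an empty X.version marker next to it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical sketch of a versioned heartbeat directory; not Storm's real implementation. */
final class HeartbeatStoreSketch {
    private static final String MARKER_SUFFIX = ".version";

    /** Writes data file X first, then creates X.version to mark success (the order the issue describes). */
    static void put(Path dir, long version, String data) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(Long.toString(version)), data.getBytes(StandardCharsets.UTF_8));
        Files.createFile(dir.resolve(version + MARKER_SUFFIX));
    }

    /** Returns the largest X for which X.version exists, or -1 if there is no marker. */
    static long latestVersion(Path dir) throws IOException {
        long latest = -1;
        try (DirectoryStream<Path> markers = Files.newDirectoryStream(dir, "*" + MARKER_SUFFIX)) {
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                // Assumption: marker names are "<number>.version".
                String number = name.substring(0, name.length() - MARKER_SUFFIX.length());
                latest = Math.max(latest, Long.parseLong(number));
            }
        }
        return latest;
    }

    /** Reads the latest heartbeat; any IOException is treated as "no heartbeat" instead of aborting startup. */
    static Optional<String> readLatest(Path dir) {
        try {
            long latest = latestVersion(dir);
            if (latest < 0) {
                return Optional.empty();
            }
            byte[] bytes = Files.readAllBytes(dir.resolve(Long.toString(latest)));
            return Optional.of(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Covers a stale marker whose data file is gone (NoSuchFileException is an IOException).
            return Optional.empty();
        }
    }

    /** Proposed deletion order: remove the marker first, then the data file. */
    static void deleteVersion(Path dir, long version) throws IOException {
        Files.deleteIfExists(dir.resolve(version + MARKER_SUFFIX));
        Files.deleteIfExists(dir.resolve(Long.toString(version)));
    }
}
```

With this ordering, a cleanup that fails halfway leaves at most an orphaned data file with no marker, which a reader that only trusts *.version files never looks at. The tolerant read is a second line of defense for state left behind by the old ordering: a stale marker yields an empty result rather than an exception at supervisor startup.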



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)