Posted to dev@storm.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2015/09/11 17:02:46 UTC

[jira] [Closed] (STORM-1043) Concurrent access to state on local FS by multiple supervisors

     [ https://issues.apache.org/jira/browse/STORM-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans closed STORM-1043.
--------------------------------------
    Resolution: Invalid

> Concurrent access to state on local FS by multiple supervisors
> --------------------------------------------------------------
>
>                 Key: STORM-1043
>                 URL: https://issues.apache.org/jira/browse/STORM-1043
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.5
>            Reporter: Ernestas Vaiciukevičius
>              Labels: mesosphere
>
> Hi,
> we are running a storm-mesos cluster, and occasionally workers die or are "lost" in Mesos. When this happens, it often coincides with errors in the logs related to the supervisors' local state.
> Looking at the Storm code, it seems this might be caused by the way multiple supervisor processes access the local state in the same directory via VersionedStore.
> For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
> Here every supervisor does this concurrently:
> 1. reads latest state from FS
> 2. possibly updates the state
> 3. writes the new version of the state
> Some updates could be lost if there are two or more supervisors and they execute the above steps concurrently: only the updates from the last supervisor to write would remain in the latest state version on disk.
> We observed that the local state changes quite often (on the order of seconds), so the likelihood of this concurrency issue occurring is high.
> Some examples of exceptions:
> ------------------------------------------
> java.lang.RuntimeException: Version already exists or data already exists
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> ---------------------------------------
> java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
> at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
> at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
> at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
> at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
> at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
> at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
> at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
> at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> -----------------------------------------
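
The read/update/write sequence described in the report can be illustrated with a small, deterministic sketch. This is not Storm code: `ToyVersionedStore`, `interleaved()`, `serialized()` and the "supervisor-A"/"supervisor-B" keys are hypothetical names made up for illustration, and the store is an in-memory stand-in rather than the real `VersionedStore` or `LocalState`. It only shows how two writers that both read before either writes can lose an update.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class LostUpdateSketch {

    // Hypothetical in-memory stand-in for a versioned store:
    // each write adds a new version holding a full snapshot of the state.
    static final class ToyVersionedStore {
        private final TreeMap<Long, Map<String, String>> versions = new TreeMap<>();
        private long nextVersion = 1;

        // Step 1: read the latest state (a copy, as if deserialized from disk).
        Map<String, String> readLatest() {
            if (versions.isEmpty()) {
                return new HashMap<>();
            }
            return new HashMap<>(versions.lastEntry().getValue());
        }

        // Step 3: write the given state as a new version.
        void writeNewVersion(Map<String, String> state) {
            versions.put(nextVersion++, new HashMap<>(state));
        }
    }

    // Both writers read before either writes, so the second write
    // is based on a stale snapshot and overwrites the first update.
    static Map<String, String> interleaved() {
        ToyVersionedStore store = new ToyVersionedStore();
        Map<String, String> a = store.readLatest();   // A: step 1
        Map<String, String> b = store.readLatest();   // B: step 1
        a.put("supervisor-A", "worker-1");            // A: step 2
        b.put("supervisor-B", "worker-2");            // B: step 2
        store.writeNewVersion(a);                     // A: step 3
        store.writeNewVersion(b);                     // B: step 3
        return store.readLatest();
    }

    // Each writer completes all three steps before the next one starts,
    // so the second writer sees the first writer's update.
    static Map<String, String> serialized() {
        ToyVersionedStore store = new ToyVersionedStore();
        Map<String, String> a = store.readLatest();
        a.put("supervisor-A", "worker-1");
        store.writeNewVersion(a);
        Map<String, String> b = store.readLatest();
        b.put("supervisor-B", "worker-2");
        store.writeNewVersion(b);
        return store.readLatest();
    }

    public static void main(String[] args) {
        System.out.println("interleaved keys: " + interleaved().keySet());
        System.out.println("serialized keys:  " + serialized().keySet());
    }
}
```

In the interleaved case the latest version contains only the second writer's entry; in the serialized case it contains both. The sketch does not attempt to reproduce the exceptions quoted above.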



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)