You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by "Ernestas Vaiciukevičius (JIRA)" <ji...@apache.org> on 2015/09/11 16:53:45 UTC

[jira] [Created] (STORM-1043) Concurrent access to state on local FS by multiple supervisors

Ernestas Vaiciukevičius created STORM-1043:
----------------------------------------------

             Summary: Concurrent access to state on local FS by multiple supervisors
                 Key: STORM-1043
                 URL: https://issues.apache.org/jira/browse/STORM-1043
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 0.9.5
            Reporter: Ernestas Vaiciukevičius


Hi,

we are running storm-mesos cluster and occassionaly workers die or are "lost" in mesos. When this happens it often coincides with errors in logs related to supervisors local state.

By looking at the storm code it seems this might be caused by the way how multiple supervisor processes access the local state in the same directory via VersionedStore.

For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434

Here every supervisor does this concurrently:
1. reads latest state from FS
2. possibly updates the state
3. writes the new version of the state

Some updates could be lost if there are 2+ supervisors and they execute above steps concurrently - then only the updates from last supervisor would remain on the last state version on the disk.

We observed local state changes quite often (seconds), so the likelihood of this concurrency issue occurring is high.

Some examples of exeptions:
------------------------------------------
java.lang.RuntimeException: Version already exists or data already exists
at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

---------------------------------------
java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
-----------------------------------------



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)