Posted to dev@storm.apache.org by "Ernestas Vaiciukevičius (JIRA)" <ji...@apache.org> on 2015/09/11 16:53:45 UTC
[jira] [Created] (STORM-1043) Concurrent access to state on local
FS by multiple supervisors
Ernestas Vaiciukevičius created STORM-1043:
----------------------------------------------
Summary: Concurrent access to state on local FS by multiple supervisors
Key: STORM-1043
URL: https://issues.apache.org/jira/browse/STORM-1043
Project: Apache Storm
Issue Type: Bug
Affects Versions: 0.9.5
Reporter: Ernestas Vaiciukevičius
Hi,
we are running a storm-mesos cluster, and occasionally workers die or are "lost" in Mesos. When this happens, it often coincides with errors in the logs related to the supervisor's local state.
Looking at the Storm code, it seems this might be caused by the way multiple supervisor processes access the local state in the same directory via VersionedStore.
For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434
Here every supervisor does this concurrently:
1. reads latest state from FS
2. possibly updates the state
3. writes the new version of the state
Some updates can be lost if two or more supervisors execute the above steps concurrently: only the updates from the last supervisor to write remain in the latest state version on disk.
We observed that the local state changes quite often (on the order of seconds), so the likelihood of this concurrency issue occurring is high.
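The read-update-write sequence above can be illustrated with a minimal, single-threaded sketch. It is not Storm code: the class LostUpdateSketch, its in-memory version map, and the keys "supervisor-A" and "supervisor-B" are all hypothetical stand-ins, and the interleaving is forced by hand rather than produced by real concurrent processes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this does not use Storm's VersionedStore or LocalState.
class LostUpdateSketch {
    // Stand-in for a versioned store: version number -> full snapshot of the state.
    static final TreeMap<Long, Map<String, String>> versions = new TreeMap<>();

    // Step 1: read the latest state (returns a private copy, like deserializing from disk).
    static Map<String, String> readLatest() {
        if (versions.isEmpty()) {
            return new HashMap<>();
        }
        return new HashMap<>(versions.lastEntry().getValue());
    }

    // Step 3: write the whole state as a new version.
    static void writeNewVersion(Map<String, String> state) {
        long next = versions.isEmpty() ? 1L : versions.lastKey() + 1L;
        versions.put(next, state);
    }

    // Forces the problematic interleaving: both supervisors read before either writes.
    static Map<String, String> runInterleaved() {
        versions.clear();
        Map<String, String> seenByA = readLatest();   // supervisor A, step 1
        Map<String, String> seenByB = readLatest();   // supervisor B, step 1
        seenByA.put("supervisor-A", "worker-1");      // supervisor A, step 2
        seenByB.put("supervisor-B", "worker-2");      // supervisor B, step 2
        writeNewVersion(seenByA);                     // supervisor A, step 3
        writeNewVersion(seenByB);                     // supervisor B, step 3 (A's update is not included)
        return readLatest();
    }

    public static void main(String[] args) {
        // prints {supervisor-B=worker-2}
        System.out.println(runInterleaved());
    }
}
```

In this sketch the latest version contains only supervisor B's entry, because B wrote a snapshot that was based on state read before A's write.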
Some examples of exceptions:
------------------------------------------
java.lang.RuntimeException: Version already exists or data already exists
at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this__7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.event$event_manager$fn__2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
---------------------------------------
java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
at clojure.core$partial$fn__4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
at backtype.storm.event$event_manager$fn__2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
-----------------------------------------