Posted to dev@storm.apache.org by "Rick Kellogg (JIRA)" <ji...@apache.org> on 2015/10/09 03:21:28 UTC

[jira] [Reopened] (STORM-134) FileNotFoundException during supervisor cleanup due to non-existent file

     [ https://issues.apache.org/jira/browse/STORM-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg reopened STORM-134:
--------------------------------

> FileNotFoundException during supervisor cleanup due to non-existent file
> ------------------------------------------------------------------------
>
>                 Key: STORM-134
>                 URL: https://issues.apache.org/jira/browse/STORM-134
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>            Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/356
> During development runs with local clusters I've occasionally encountered a FileNotFoundException thrown from FileUtils.forceDelete during cleanup.
> Without knowing the internals of Storm, it seems strange that cleanup...
> a) ...fails when trying to delete a non-existent file/folder
> b) ...tries to delete a non-existent file in the first place
> Would suppressing FileNotFoundException from FileUtils.forceDelete, or switching to FileUtils.deleteQuietly, be reasonable?
> Sample stack trace:
> ERROR [2012-10-09 22:08:51,255] org.apache.zookeeper.server.NIOServerCnxn: Thread Thread[main,5,main] died
> ! java.io.FileNotFoundException: File does not exist: /var/folders/30/4f0z00f56sqfh512yr1l61ym0000gn/T/5844d75c-3801-4757-b7ea-f20ca76e4e52/workers/08a06e4f-4cbc-4e10-b013-ff4fcb4cbf51/heartbeats/1349813330619.version
> ! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1386)
> ! at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1044)
> ! at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:977)
> ! at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:1381)
> ! at backtype.storm.util$rmr.invoke(util.clj:413)
> ! at backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
> ! at backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
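> As a sketch, the suggested change could be as small as a wrapper like the one below (a hypothetical helper, not Storm's actual rmr; deleteQuietly goes further and swallows all exceptions, not just this one):
>     (defn force-delete-quietly
>       "Recursively delete path, treating an already-deleted path as success."
>       [path]
>       (try
>         (org.apache.commons.io.FileUtils/forceDelete (java.io.File. path))
>         (catch java.io.FileNotFoundException _
>           ;; someone else already removed it -- nothing left to clean up
>           nil)))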
> ----------
> MichaelBlume: Getting the same error while trying to test with complete-topology
> Note: I'm using Storm 0.8.1. I've reproduced this on my Mac and my Ubuntu netbook. Can try with the development version if desired.
> Reproduced with the development version. It's a really weird bug, because the code actually checks for the existence of the path before attempting to delete it.
> OK, I think I partially understand. I put some logging into rmr to log out a listing of the tree it's attempting to remove. I get this:
> [clojure-test] 3501 [Thread-10] INFO backtype.storm.util - ["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats" "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844" "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990601844.version" "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801" "/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats/1349990602801.version"]
> [clojure-test] 3518 [main] INFO backtype.storm.util - ["/private/var/folders/qz/qs47vb4938bbk3djt74vrxwr0000gn/T/dadf34c4-ad02-4728-9d91-8b3300bcc926/workers/94008c8d-593e-463d-8665-d58fbb682f0e/heartbeats"]
> Looks like backtype.storm.util.rmr is being called on the same dir from two threads. The second call happens when the directory is nearly empty but not yet gone; by the time rmr calls forceDelete, the directory has been removed entirely and an exception is raised. I've put a band-aid on this in my fork, but the proper fix would be to figure out why two threads are deleting the same directory.
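> The exists-check and the delete aren't atomic, so reduced to a sketch (not Storm's exact code) the race window looks like this:
>     (let [f (java.io.File. heartbeats-dir)] ; heartbeats-dir: the worker's heartbeat dir path (hypothetical name)
>       (when (.exists f)
>         ;; <- the other thread can finish removing f right here
>         (org.apache.commons.io.FileUtils/forceDelete f))) ; and then this throws FileNotFoundException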
> ----------
> nathanmarz: Please attach a reproducible test case that I can run so I can look into what's going on. I run local clusters all the time (Storm's unit tests bring clusters up and down dozens of times) and have never seen this.
> ----------
> chids: I've not been able to create a reliable test case, but the condition does occur "regularly". Right now I'm running the testBasicTopology code and got bitten by this a couple of times in a row; then it stopped happening.
> @MichaelBlume: have you tried to identify what Thread-10 is?
> ----------
> travis: Yeah, I'm seeing this regularly as well; not quite sure what the pattern is yet.
> I got stack traces of the two threads trying to rmr:
> 11073 [main] ERROR backtype.storm.util  - Rmr path /var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
> java.lang.Exception
>     at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
>     at backtype.storm.util$rmr.invoke(util.clj:411)
>     at backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
>     at backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
>     at backtype.storm.daemon.supervisor$fn__4733$exec_fn__1207__auto__$reify__4742.shutdown_all_workers(supervisor.clj:368)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
>     at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:298)
>     at backtype.storm.testing$kill_local_storm_cluster.invoke(testing.clj:154)
> ... (application-specific stack trace - let me know if it would be useful)
> 11073 [Thread-19] ERROR backtype.storm.util  - Rmr path /var/folders/ll/0csh8smd025d9mgh2nf842400000gn/T//88915881-f92a-44be-93c1-434d9be6ac28/workers/bf2411b5-830a-4549-85e8-76637dcf352b/heartbeats
> java.lang.Exception
>     at backtype.storm.util$rmr$fn__4117.invoke(util.clj:411)
>     at backtype.storm.util$rmr.invoke(util.clj:411)
>     at backtype.storm.daemon.supervisor$try_cleanup_worker.invoke(supervisor.clj:146)
>     at backtype.storm.daemon.supervisor$shutdown_worker.invoke(supervisor.clj:165)
>     at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:223)
>     at clojure.lang.AFn.applyToHelper(AFn.java:161)
>     at clojure.lang.AFn.applyTo(AFn.java:151)
>     at clojure.core$apply.invoke(core.clj:603)
>     at clojure.core$partial$fn__4070.doInvoke(core.clj:2343)
>     at clojure.lang.RestFn.invoke(RestFn.java:397)
>     at backtype.storm.event$event_manager$fn__2484.invoke(event.clj:24)
>     at clojure.lang.AFn.run(AFn.java:24)
>     at java.lang.Thread.run(Thread.java:680)
> Still not quite sure what this all means, but getting closer. This happens pretty much every time I run tests on my mac and on our CI box.
> Still no test case, but I did find a workaround. I was running the topology like:
> (with-local-cluster [cluster]
>     (complete-topology cluster
>                        (my-topology drpc)
>                        :mock-sources {"actions" (map vector actions)
>                                       "drpc-requests" requests}))
> Sleeping for 5 seconds before shutting down the cluster seems to fix it (shorter periods just made the error sporadic rather than constant):
> (with-local-cluster [cluster]
>   (let [results
>         (complete-topology cluster
>                            (my-topology drpc)
>                            :mock-sources {"actions" (map vector actions)
>                                           "drpc-requests" requests})]
>     (Thread/sleep 5000)
>     results))
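> A fixed sleep is timing-dependent, though; polling until the condition actually holds might be more robust. A minimal sketch of such a helper (hypothetical, not part of Storm's testing API):
>     (defn wait-until
>       "Polls pred every 100 ms until it returns truthy or timeout-ms elapses."
>       [pred timeout-ms]
>       (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
>         (loop []
>           (cond
>             (pred) true
>             (>= (System/currentTimeMillis) deadline) false
>             :else (do (Thread/sleep 100) (recur))))))
> e.g. (wait-until #(not (.exists heartbeats-file)) 5000), where heartbeats-file is a java.io.File pointing at the worker's heartbeat directory (name hypothetical).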


