Posted to dev@storm.apache.org by "Sriharsha Chintalapani (JIRA)" <ji...@apache.org> on 2015/09/04 04:24:45 UTC

[jira] [Commented] (STORM-130) [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does not exist

    [ https://issues.apache.org/jira/browse/STORM-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730206#comment-14730206 ] 

Sriharsha Chintalapani commented on STORM-130:
----------------------------------------------

[~cywjackson] Thanks for the log, but this looks like a different issue, not one caused by the topology download problem.

> [Storm 0.8.2]: java.io.FileNotFoundException: File '../stormconf.ser' does not exist
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-130
>                 URL: https://issues.apache.org/jira/browse/STORM-130
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating, 0.9.3, 0.9.4
>            Reporter: James Xu
>            Assignee: Sriharsha Chintalapani
>            Priority: Minor
>             Fix For: 0.10.0, 0.9.5
>
>         Attachments: README.txt, nimbus.log.gz, supervisor.log.2015-08-26, supervisor_logs.tar.gz, worker_logs.tar.gz, worker_logs_of_kafka_traffic.tar.gz, worker_logs_of_zookeeper_traffic_2015-04-11.tar.gz, worker_logs_of_zookeeper_traffic_2015-04-12.tar.gz, worker_logs_of_zookeeper_traffic_2015-04-13.tar.gz, workers_with_stormconf.ser.gz
>
>
> https://github.com/nathanmarz/storm/issues/438
> Hi developers,
> We met critical issue with deploying storm topology to our prod cluster.
> After deploying topology we got trace on workers (Storm 0.8.2/zookeeper-3.3.6) :
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181 sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@254ba9a2
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on server zookeeper1.company.com/10.72.209.112:2181, sessionid = 0x13b3e4b5c780239, negotiated timeout = 20000
> 2013-01-14 10:57:39 zookeeper [INFO] Zookeeper state update: :connected:none
> 2013-01-14 10:57:39 ZooKeeper [INFO] Session: 0x13b3e4b5c780239 closed
> 2013-01-14 10:57:39 ClientCnxn [INFO] EventThread shut down
> 2013-01-14 10:57:39 CuratorFrameworkImpl [INFO] Starting
> 2013-01-14 10:57:39 ZooKeeper [INFO] Initiating client connection, connectString=zookeeper1.company.com:2181,zookeeper2.company.com:2181,zookeeper3.company.com:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@33a998c7
> 2013-01-14 10:57:39 ClientCnxn [INFO] Opening socket connection to server zookeeper1.company.com/10.72.209.112:2181
> 2013-01-14 10:57:39 ClientCnxn [INFO] Socket connection established to zookeeper1.company.com/10.72.209.112:2181, initiating session
> 2013-01-14 10:57:39 ClientCnxn [INFO] Session establishment complete on server zookeeper1.company.com/10.72.209.112:2181, sessionid = 0x13b3e4b5c78023a, negotiated timeout = 20000
> 2013-01-14 10:57:39 worker [ERROR] Error on initialization of server mk-worker
> java.io.FileNotFoundException: File '/tmp/storm/supervisor/stormdist/normalization-prod-1-1358161053/stormconf.ser' does not exist
> at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:137)
> at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1135)
> at backtype.storm.config$read_supervisor_storm_conf.invoke(config.clj:138)
> at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:146)
> at backtype.storm.daemon.worker$fn__4348$exec_fn__1228__auto____4349.invoke(worker.clj:332)
> at clojure.lang.AFn.applyToHelper(AFn.java:185)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at clojure.core$apply.invoke(core.clj:601)
> at backtype.storm.daemon.worker$fn__4348$mk_worker__4404.doInvoke(worker.clj:323)
> at clojure.lang.RestFn.invoke(RestFn.java:512)
> at backtype.storm.daemon.worker$_main.invoke(worker.clj:433)
> at clojure.lang.AFn.applyToHelper(AFn.java:172)
> at clojure.lang.AFn.applyTo(AFn.java:151)
> at backtype.storm.daemon.worker.main(Unknown Source)
> 2013-01-14 10:57:39 util [INFO] Halting process: ("Error on initialization")
> Supervisor trace:
> 2013-01-14 10:59:01 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> 2013-01-14 10:59:02 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> .......
> 2013-01-14 10:59:34 supervisor [INFO] d6735377-f0d6-4247-9f35-c8620e2b0e26 still hasn't started
> 2013-01-14 10:59:35 supervisor [INFO] Worker d6735377-f0d6-4247-9f35-c8620e2b0e26 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Worker 234264c6-d9d6-4e8a-ab0a-8926bdd6b536 failed to start
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id 234264c6-d9d6-4e8a-ab0a-8926bdd6b536. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4819. Process is probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:234264c6-d9d6-4e8a-ab0a-8926bdd6b536
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down and clearing state for id d6735377-f0d6-4247-9f35-c8620e2b0e26. Current supervisor time: 1358161175. State: :disallowed, Heartbeat: nil
> 2013-01-14 10:59:35 supervisor [INFO] Shutting down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> 2013-01-14 10:59:35 util [INFO] Error when trying to kill 4809. Process is probably already dead.
> 2013-01-14 10:59:35 supervisor [INFO] Shut down d5c3235f-5880-4be8-a759-5654b3df6a27:d6735377-f0d6-4247-9f35-c8620e2b0e26
> Thanks!
> Oleg M.
> ----------
> xiaokang: We also encountered this problem, and the DEBUG log showed that storm-code-map is null while new-assignment is not null. It may be that new-assignment is read from zk after storm-code-map. So we moved the read of storm-code-map to after all-assignment and the problem disappeared.
> (defn mk-synchronize-supervisor [supervisor sync-processes event-manager processes-event-manager]
>   (fn this []
>     (let [conf (:conf supervisor)
>           storm-cluster-state (:storm-cluster-state supervisor)
>           ^ISupervisor isupervisor (:isupervisor supervisor)
>           ^LocalState local-state (:local-state supervisor)
>           sync-callback (fn [& ignored] (.add event-manager this))
>           storm-code-map (read-storm-code-locations storm-cluster-state sync-callback)
>           downloaded-storm-ids (set (read-downloaded-storm-ids conf))
>           all-assignment (read-assignments
>                            storm-cluster-state
>                            (:supervisor-id supervisor)
>                            sync-callback)
>           new-assignment (->> all-assignment
>                               (filter-key #(.confirmAssigned isupervisor %)))
>           assigned-storm-ids (assigned-storm-ids-from-port-assignments new-assignment)
>           existing-assignment (.get local-state LS-LOCAL-ASSIGNMENTS)]
>       (log-debug "Synchronizing supervisor")
>       (log-debug "Storm code map: " storm-code-map)
>       (log-debug "Downloaded storm ids: " downloaded-storm-ids)
>       (log-debug "All assignment: " all-assignment)
>       (log-debug "New assignment: " new-assignment)
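To make the window concrete, here is a small Python sketch of why the read order matters, assuming (as the comment above suggests) that nimbus writes a topology's code location before its assignment. The names `zk`, `nimbus_assigns`, and `missing_code` are illustrative, not Storm's actual API:

```python
# Sketch of the stale-read ordering described above. Nimbus writes the code
# location first and the assignment second, so reading the code map *before*
# the assignments leaves a window where an assignment is visible but its
# code location is not. All names here are made up for illustration.

def nimbus_assigns(zk, topo):
    zk["code"][topo] = "/stormdist/" + topo   # written first
    zk["assignments"][topo] = ["worker-1"]    # written second

def missing_code(code_map, assignments):
    # Topologies that are assigned but whose code location we never saw.
    return [t for t in assignments if t not in code_map]

# Buggy order: code map snapshotted before nimbus's write lands.
zk = {"code": {}, "assignments": {}}
code_map = dict(zk["code"])                   # snapshot 1: empty
nimbus_assigns(zk, "topo-1")                  # nimbus races in between
assignments = dict(zk["assignments"])         # snapshot 2: sees topo-1
print(missing_code(code_map, assignments))    # prints ['topo-1']

# Fixed order (the change described above): assignments first, code map
# second. Any assignment we see must have had its code written before it.
zk = {"code": {}, "assignments": {}}
assignments = dict(zk["assignments"])
nimbus_assigns(zk, "topo-1")
code_map = dict(zk["code"])
print(missing_code(code_map, assignments))    # prints []
```

With the reversed read order, a race can only make the code map *newer* than the assignment snapshot, which is harmless for this check.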
> ----------
> NJtwentyone: I ran into a similar problem once, though I haven't hit it lately. I was going to investigate more before posting, but... There are two schedule-recurring calls of interest in the function (mk-supervisor):
> (schedule-recurring (:timer supervisor) 0 10 (fn ))
> (schedule-recurring (:timer supervisor)
>                     0
>                     (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)
>                     (fn )))
> So function-1 (synchronize-supervisor) will eventually remove that dir [../stormconf.ser],
> and function-2 (sync-processes) will eventually hang wanting to create workers, saying repeatedly:
> 2013-01-13 22:13:13 b.s.d.supervisor [INFO] 6f73facd-8722-4b83-959c-a7b396c61224 still hasn't started
> I'm using the distributed setup (Storm 0.8.2/zookeeper-3.4.5) on my mac and an hp, with storm.starter.WordCountTopology.
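The cadence of those two recurring tasks can be simulated outside Storm. This is a hypothetical timeline only, assuming the quoted periods (10 s for synchronize-supervisor, and 3 s as the assumed default of SUPERVISOR-MONITOR-FREQUENCY-SECS); it is not Storm's actual timer implementation:

```python
# Simulated timeline of the two recurring supervisor tasks quoted above.
# Periods are taken from the snippet (10 s) and an assumed 3 s default for
# SUPERVISOR-MONITOR-FREQUENCY-SECS; this is illustration, not Storm code.
SYNC_SUPERVISOR_PERIOD = 10   # schedule-recurring args: 0, 10
MONITOR_FREQUENCY = 3         # assumed (conf SUPERVISOR-MONITOR-FREQUENCY-SECS)

events = []
for t in range(12):
    if t % SYNC_SUPERVISOR_PERIOD == 0:
        events.append((t, "synchronize-supervisor"))  # may delete topo dirs
    if t % MONITOR_FREQUENCY == 0:
        events.append((t, "sync-processes"))          # may launch workers

for t, name in events:
    print(t, name)
```

Both tasks fire at t=0 and overlap again near every multiple of 10 s, so a worker launch by sync-processes can interleave with synchronize-supervisor deleting the very stormconf.ser it needs.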
> ----------
> miggi: Well, this problem was resolved only after installing a fresh instance of ZK (3.3.6). We had other issues with deploying topologies on 0.8.1, so it was decided to use a new ZK instance.
> Btw: removing the storm data folders didn't help.
> ----------
> nathanmarz: For those of you hitting the problem, it would be really useful if you could turn on DEBUG logging on the supervisor and show those logs when the problem is hit.
> ----------
> miggi: After a few months of stable work, we ran into the same issue again.
> Logs ("topology.debug: true"):
> Supervisor logs: http://cl.ly/0p2W3N0A2V3S
> Worker logs: http://cl.ly/2w1j1r0a3X46
> Cleaning/restarting the ZK/Nimbus/Worker components didn't help.
> Right now we're down: we haven't tried the solution proposed by xiaokang, but that will probably be our next step.
> Any fresh ideas?
> ----------
> devoncrouse: +1 - ran into this twice now. We've "reset" things by clearing out Storm stuff in Zk, wiping out Storm data directory, and restarting the topology, but it happens again in time.
> Storm 0.9.0-wip16
> Zk 3.3.6
> ----------
> d2r: +1 seen this happen many times. same work-around
> Most recently Storm 0.8.2-wip22
> I'll try to collect debug logs from the supervisor and have a look.
> ----------
> miggi: Hey guys, 
> try restarting your cluster and ensure that all previous workers (java processes) were killed after this operation.
> Also, we've updated the cluster to storm 0.8.3-wip1, which includes a possible fix.
> Hope it helps 
> ----------
> Oleg M.
> ----------
> devoncrouse: A simple restart doesn't cut it; you have to totally clear out all Storm state in Zk and the data directory (otherwise the cluster quickly becomes unbalanced with the same errors in the logs) - and again, we're on 0.9.0-wip16.
> ----------
> xiaokang: We encountered the same problem in our production cluster. The logs showed that a worker died and the supervisor and nimbus both detected it at the same time. Then nimbus assigned the worker to another host, and the supervisor's sync-supervisor deleted the topology's code while sync-processes tried to restart the dead, timed-out worker.
> Our solution was to change the supervisor's worker timeout from 30 to 5 seconds while nimbus remains at 30. It works.
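For reference, this work-around would look something like the following in storm.yaml. This is a sketch only, assuming the standard timeout keys `supervisor.worker.timeout.secs` and `nimbus.task.timeout.secs` (both of which default to 30 seconds) are the ones xiaokang means:

```yaml
# Sketch of xiaokang's work-around, assuming the standard timeout keys.
# On the supervisor nodes: consider a worker dead after 5 s instead of 30 s,
# so the supervisor restarts it before nimbus reassigns the topology away.
supervisor.worker.timeout.secs: 5

# On nimbus: leave the task timeout at its default, so nimbus reacts later
# than the supervisor does.
nimbus.task.timeout.secs: 30
```

The idea is to give the supervisor a head start: if it restarts the worker locally before nimbus notices the timeout, nimbus never reassigns the topology and never triggers the code cleanup that races with the worker launch.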
> ----------
> d2r: We have tried the same work-around from @xiaokang, but we still see the issue.
> ----------
> nathanmarz: @d2r This is fixed in the 0.8.3 wip, please try that out and let me know if it fixes the issue for you.
> ----------
> d2r: @nathanmarz Yes, pulling in the fix to supervisor.clj from 0.8.3 did resolve the issue for us. Thanks!
> ----------
> devoncrouse: Is this fix going to be applied to the newer builds (e.g. 0.9.0) as well?
> d5207b5
> ---------
> viceyang: We encountered the same problem; our version is 0.8.3. I read the supervisor code, and it seems not to be a bug. The key reason is that "synchronize-supervisor", the thread responsible for downloading and removing files, and "sync-processes", the thread responsible for starting worker processes, run asynchronously.
> Consider this case: synchronize-supervisor reads assignment information from zk, downloads the necessary files from nimbus, and writes local state. In another thread, the sync-processes function reads local state to launch the worker process. If, before the worker process has started, synchronize-supervisor runs again and the topology's assignment has changed (caused by a rebalance, a worker timeout, etc.) so that the worker assigned to this supervisor has moved to another supervisor, synchronize-supervisor removes the now-unnecessary files (the jar file, ser file, etc.). After that, when sync-processes launches the worker, the ser file does not exist, and this issue occurs.
> Though an exception occurs in this case, the logic seems right. The exception can't be completely eliminated unless the asynchronous threading design is changed.
> There are some ways to reduce how often the exception occurs. @xiaokang mentioned "change supervisor's worker timeout from 30 to 5 seconds and nimbus remain 30"; this reduces how often nimbus reassigns the topology. Another way is to change the "synchronize-supervisor" thread's loop period to something longer than the 10-second default, such as 30 seconds.
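The sequence viceyang describes can be reproduced outside Storm with plain files. This is a minimal illustrative sketch, not Storm code; the directory layout and function names are made up:

```python
# Minimal sketch (not Storm code) of the race described above:
# synchronize-supervisor removes a reassigned topology's files, while
# sync-processes, acting on stale local state, still tries to launch a
# worker from them. All names here are illustrative.
import os
import tempfile

base = tempfile.mkdtemp()
topo_dir = os.path.join(base, "topo-1")
conf_path = os.path.join(topo_dir, "stormconf.ser")

os.makedirs(topo_dir)
open(conf_path, "wb").close()          # "downloaded" serialized topology conf

def synchronize_supervisor(assigned):
    # Removes local files for topologies no longer assigned to this node.
    if "topo-1" not in assigned and os.path.exists(conf_path):
        os.remove(conf_path)
        os.rmdir(topo_dir)

def sync_processes():
    # Launches a worker, which starts by reading the serialized conf.
    with open(conf_path, "rb") as f:
        return f.read()

# The topology is rebalanced away before the worker has started:
synchronize_supervisor(assigned=set())
try:
    sync_processes()                   # stale local state still says "launch"
except FileNotFoundError:
    print("worker fails: stormconf.ser does not exist")
```

Lengthening the synchronize-supervisor period, as suggested above, narrows how often the deletion step can land in the gap between "local state says launch" and "worker actually reads the conf", but it cannot close the gap entirely.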



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)