Posted to dev@storm.apache.org by "Abhishek Agarwal (JIRA)" <ji...@apache.org> on 2015/09/15 09:45:45 UTC

[jira] [Commented] (STORM-1041) Topology with kafka spout stops processing

    [ https://issues.apache.org/jira/browse/STORM-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745013#comment-14745013 ] 

Abhishek Agarwal commented on STORM-1041:
-----------------------------------------

Can you take a stack trace (thread dump) of the workers after the topology gets stuck, and attach it here?
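Usually running jstack against the stuck worker's pid is enough. If it is easier to capture the dump from inside the JVM, here is a minimal sketch using the standard ThreadMXBean API (the class name is only illustrative):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative helper: prints a full thread dump of the current JVM,
// similar in spirit to running jstack against the worker process.
public class ThreadDumpUtil {
    public static void dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
{code}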

> Topology with kafka spout stops processing
> ------------------------------------------
>
>                 Key: STORM-1041
>                 URL: https://issues.apache.org/jira/browse/STORM-1041
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.5
>            Reporter: Scott Bessler
>            Priority: Critical
>
> Topology:
>  KafkaSpout (1 task/executor) -> bolt that does grouping (1 task/executor) -> bolt that does processing (176 tasks/executors)
>  8 workers
>  Using Netty
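> For illustration, the topology described above is roughly equivalent to the wiring below. The spout config values, bolt classes, field names, and groupings are placeholders, not our actual code:
> {code}
> import backtype.storm.Config;
> import backtype.storm.StormSubmitter;
> import backtype.storm.topology.BasicOutputCollector;
> import backtype.storm.topology.OutputFieldsDeclarer;
> import backtype.storm.topology.TopologyBuilder;
> import backtype.storm.topology.base.BaseBasicBolt;
> import backtype.storm.tuple.Fields;
> import backtype.storm.tuple.Tuple;
> import storm.kafka.KafkaSpout;
> import storm.kafka.SpoutConfig;
> import storm.kafka.ZkHosts;
>
> public class NoticeProcessorTopologySketch {
>
>     // Minimal stand-ins so the sketch compiles; the real bolts do the actual work.
>     public static class GroupingBolt extends BaseBasicBolt {
>         public void execute(Tuple tuple, BasicOutputCollector collector) { /* group and re-emit */ }
>         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>             declarer.declare(new Fields("group-key"));  // placeholder field name
>         }
>     }
>
>     public static class ProcessingBolt extends BaseBasicBolt {
>         public void execute(Tuple tuple, BasicOutputCollector collector) { /* process */ }
>         public void declareOutputFields(OutputFieldsDeclarer declarer) { }
>     }
>
>     public static void main(String[] args) throws Exception {
>         // Placeholder ZooKeeper connect string, topic, and spout id.
>         SpoutConfig spoutConfig = new SpoutConfig(
>                 new ZkHosts("zk1:2181"), "notices", "/kafka-spout", "notice-spout");
>
>         TopologyBuilder builder = new TopologyBuilder();
>         builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);    // 1 task/executor
>         builder.setBolt("grouping-bolt", new GroupingBolt(), 1)             // 1 task/executor
>                .shuffleGrouping("kafka-spout");
>         builder.setBolt("processing-bolt", new ProcessingBolt(), 176)       // 176 tasks/executors
>                .fieldsGrouping("grouping-bolt", new Fields("group-key"));
>
>         Config conf = new Config();
>         conf.setNumWorkers(8);  // 8 workers; the topology uses the Netty transport
>         StormSubmitter.submitTopology("NoticeProcessorTopology", conf, builder.createTopology());
>     }
> }
> {code}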
> Sometimes when a worker dies (we've seen it happen due to an OOM or due to load from a co-located worker), it will try to restart on the same node, then 20s later shut down and start on another node.
> {code}
> 2015-09-10 08:05:41,131 -0700 INFO        backtype.storm.daemon.supervisor:0 - Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] [225 225])} for this supervisor 8a845b9b-adaa-4943-b6a6-68fdadcc5146 on port 6701 with id 42a499b2-2c5c-43c2-be8a-a5b3f4f8a99e
> 2015-09-10 08:05:39,953 -0700 INFO        backtype.storm.daemon.supervisor:0 - Shutting down and clearing state for id 39c28ee2-abf9-4834-8b1f-0bd6933412e8. Current supervisor time: 1441897539. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897539, :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] [225 225]}, :port 6700}
> 2015-09-10 08:05:22,693 -0700 INFO        backtype.storm.daemon.supervisor:0 - Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "NoticeProcessorTopology-368-1441856754", :executors ([9 9] [41 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] [217 217] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] [225 225])} for this supervisor f26e1fae-03bd-4fa8-9868-6a54993f3c5d on port 6700 with id 39c28ee2-abf9-4834-8b1f-0bd6933412e8
> 2015-09-10 08:05:21,588 -0700 INFO        backtype.storm.daemon.supervisor:0 - Shutting down and clearing state for id 4f0e4c22-6ccc-4d78-a20f-88bffb8def1d. Current supervisor time: 1441897521. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1441897490, :storm-id "NoticeProcessorTopology-368-1441856754", :executors #{[9 9] [41 41] [73 73] [105 105] [137 137] [169 169] [201 201] [17 17] [49 49] [81 81] [113 113] [145 145] [177 177] [209 209] [25 25] [57 57] [89 89] [121 121] [153 153] [185 185] [217 217] [-1 -1] [1 1] [33 33] [65 65] [97 97] [129 129] [161 161] [193 193] [225 225]}, :port 6700}
> {code}
> While the worker was dead and then being shut down by the supervisor, other workers had Netty drop messages destined for it. In theory these messages should time out and be replayed; our message timeout is 30s.
> {code}
> 2015-09-10 08:05:50,914 -0700 ERROR       b.storm.messaging.netty.Client:453 - dropping 1 message(s) destined for Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:44,904 -0700 ERROR       b.storm.messaging.netty.Client:453 - dropping 1 message(s) destined for Netty-Client-usw2b-grunt-drone33-prod.amz.relateiq.com/10.30.101.36:6701
> 2015-09-10 08:05:43,902 -0700 ERROR       b.storm.messaging.netty.Client:453 - dropping 1 message(s) destined for Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 - dropping 1 message(s) destined for Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> 2015-09-10 08:05:27,873 -0700 ERROR       b.storm.messaging.netty.Client:453 - dropping 1 message(s) destined for Netty-Client-usw2b-grunt-drone39-prod.amz.relateiq.com/10.30.101.5:6700
> {code}
> However, these messages never time out, and the max spout pending limit (topology.max.spout.pending) has been reached, so no more tuples are emitted or processed.
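> For reference, these are the two settings in play; a minimal sketch of how they are set (the max-spout-pending value below is a placeholder, not our actual setting):
> {code}
> import backtype.storm.Config;
>
> public class TimeoutSettingsSketch {
>     // Tuples not acked within topology.message.timeout.secs are supposed to be
>     // failed and replayed by the spout; topology.max.spout.pending caps the
>     // number of in-flight tuples per spout task. Once the cap is hit and
>     // nothing times out, the spout stops emitting -- which is what we observe.
>     public static Config conf() {
>         Config conf = new Config();
>         conf.setMessageTimeoutSecs(30);  // matches our 30s message timeout
>         conf.setMaxSpoutPending(1000);   // placeholder value, not our real setting
>         return conf;
>     }
> }
> {code}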



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)