You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@storm.apache.org by "Zhuo Liu (JIRA)" <ji...@apache.org> on 2016/08/17 05:15:21 UTC

[jira] [Comment Edited] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology

    [ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423899#comment-15423899 ] 

Zhuo Liu edited comment on STORM-1949 at 8/17/16 5:14 AM:
----------------------------------------------------------

Hi [~roshan_naik] [~revans2] [~knusbaum] [~kabhwan],

I tested with the word count example sent by Roshan, the results is: Backpressure works fine for 1.x-branch while in master (2.x), we have a NullPointerExecution in SpoutExecutor.

(1) in 1.x-branch, back pressure works well (see the attached screenshot). The back pressure flag gets set and unset intermittently depends on the load in disruptor queues. [~roshan_naik], I think your guess of reason (racing condition) is incorrect since we have cyclic check of flags every 100 ms, as mentioned by Bobby.


(2) in test for master, I see the topology gets stalled quickly, less than a minute. But the reason is not because of the bug of back pressure, it is due to the NullPointerException in Executor.
https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/executor/spout/SpoutExecutor.java#L205
[~Cody] Could you help with this NPE since you worked on Storm-1277? 
Thanks a lot!


 80 2016-08-16 21:19:15.102 o.a.s.m.EventLoggerBolt [INFO] EventLoggerBolt prepare called
 81 2016-08-16 21:19:15.107 o.a.s.m.FileBasedEventLogger [INFO] logFilePath /Users/zliu6/git/joe/storm/storm-dist/binary/target/storm/logs/workers-artifacts/FileReadWordCountTopo-1-1471407540/6700    /events.log
 82 2016-08-16 21:19:15.110 o.a.s.e.b.BoltExecutor [INFO] Prepared bolt __eventlogger:[2]
 83 2016-08-16 21:20:17.242 o.a.s.u.Utils [ERROR] Async loop died!
 84 java.lang.RuntimeException: java.lang.NullPointerException
 85     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:468) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 86     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 87     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 88     at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 89     at org.apache.storm.utils.Utils$7.run(Utils.java:2227) [storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 90     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
 91 Caused by: java.lang.NullPointerException
 92     at org.apache.storm.executor.spout.SpoutExecutor.tupleActionFn(SpoutExecutor.java:205) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 93     at org.apache.storm.executor.Executor.onEvent(Executor.java:257) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 94     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:455) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 95     ... 5 more
 96 2016-08-16 21:20:17.860 o.a.s.e.e.ReportError [ERROR] Error
 97 java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
 98     at org.apache.storm.utils.Utils$7.run(Utils.java:2237) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 99     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
100 Caused by: java.lang.RuntimeException: java.lang.NullPointerException
101     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:468) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
102     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
103     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
104     at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
105     at org.apache.storm.utils.Utils$7.run(Utils.java:2227) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
106     ... 1 more
107 Caused by: java.lang.NullPointerException
108     at org.apache.storm.executor.spout.SpoutExecutor.tupleActionFn(SpoutExecutor.java:205) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
109     at org.apache.storm.executor.Executor.onEvent(Executor.java:257) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
110     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:455) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
111     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
112     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
113     at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
114     at org.apache.storm.utils.Utils$7.run(Utils.java:2227) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
115     ... 1 more

 
 


was (Author: zhuoliu):
Hi [~roshan_naik] [~revans2] [~knusbaum] [~kabhwan],

I tested with the word count example sent by Roshan, the results is: Backpressure works fine for 1.x-branch while in master (2.x), we have a NullPointerExecution in SpoutExecutor.

(1) in 1.x-branch, back pressure works well (see the attached screenshot). The back pressure flag gets set and unset intermittently depends on the load in disruptor queues. [~roshan_naik], I think you guess of reason (racing condition) is incorrect since we have cyclic check of flags every 100 ms, as mentioned by Bobby.


(2) in test for master, I see the topology gets stalled quickly, less than a minute. But the reason is not because of the bug of back pressure, it is due to the NullPointerException in Executor.
https://github.com/apache/storm/blob/master/storm-core/src/jvm/org/apache/storm/executor/spout/SpoutExecutor.java#L205
[~Cody] Could you help with this NPE since you worked on Storm-1277? 
Thanks a lot!


 80 2016-08-16 21:19:15.102 o.a.s.m.EventLoggerBolt [INFO] EventLoggerBolt prepare called
 81 2016-08-16 21:19:15.107 o.a.s.m.FileBasedEventLogger [INFO] logFilePath /Users/zliu6/git/joe/storm/storm-dist/binary/target/storm/logs/workers-artifacts/FileReadWordCountTopo-1-1471407540/6700    /events.log
 82 2016-08-16 21:19:15.110 o.a.s.e.b.BoltExecutor [INFO] Prepared bolt __eventlogger:[2]
 83 2016-08-16 21:20:17.242 o.a.s.u.Utils [ERROR] Async loop died!
 84 java.lang.RuntimeException: java.lang.NullPointerException
 85     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:468) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 86     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 87     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 88     at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 89     at org.apache.storm.utils.Utils$7.run(Utils.java:2227) [storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 90     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
 91 Caused by: java.lang.NullPointerException
 92     at org.apache.storm.executor.spout.SpoutExecutor.tupleActionFn(SpoutExecutor.java:205) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 93     at org.apache.storm.executor.Executor.onEvent(Executor.java:257) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 94     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:455) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 95     ... 5 more
 96 2016-08-16 21:20:17.860 o.a.s.e.e.ReportError [ERROR] Error
 97 java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NullPointerException
 98     at org.apache.storm.utils.Utils$7.run(Utils.java:2237) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
 99     at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
100 Caused by: java.lang.RuntimeException: java.lang.NullPointerException
101     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:468) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
102     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
103     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
104     at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
105     at org.apache.storm.utils.Utils$7.run(Utils.java:2227) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
106     ... 1 more
107 Caused by: java.lang.NullPointerException
108     at org.apache.storm.executor.spout.SpoutExecutor.tupleActionFn(SpoutExecutor.java:205) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
109     at org.apache.storm.executor.Executor.onEvent(Executor.java:257) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
110     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:455) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
111     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:434) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
112     at org.apache.storm.utils.DisruptorQueue.consumeBatch(DisruptorQueue.java:424) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
113     at org.apache.storm.executor.spout.SpoutExecutor$2.call(SpoutExecutor.java:137) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
114     at org.apache.storm.utils.Utils$7.run(Utils.java:2227) ~[storm-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
115     ... 1 more

 
 

> Backpressure can cause spout to stop emitting and stall topology
> ----------------------------------------------------------------
>
>                 Key: STORM-1949
>                 URL: https://issues.apache.org/jira/browse/STORM-1949
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Roshan Naik
>         Attachments: 1.x-branch-works-perfect.png
>
>
> Problem can be reproduced by this [Word count topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
> within a IDE.
> I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt instances.
> The problem is more easily reproduced with WC topology as it causes an explosion of tuples due to splitting a sentence tuple into word tuples. As the bolts have to process more tuples than the  spout is producing, spout needs to operate slower.
> The amount of time it takes for the topology to stall can vary.. but typically under 10 mins. 
> *My theory:*  I suspect there is a race condition in the way ZK is being utilized to enable/disable back pressure. When congested (i.e pressure exceeds high water mark), the bolt's worker records this congested situation in ZK by creating a node. Once the congestion is reduced below the low water mark, it deletes this node. 
> The spout's worker has setup a watch on the parent node, expecting a callback whenever there is change in the child nodes. On receiving the callback the spout's worker lists the parent node to check if there are 0 or more child nodes.... it is essentially trying to figure out the nature of state change in ZK to determine whether to throttle or not. Subsequently  it setsup another watch in ZK to keep an eye on future changes.
> When there are multiple bolts, there can be rapid creation/deletion of these ZK nodes. Between the time the worker receives a callback and sets up the next watch.. many changes may have undergone in ZK which will go unnoticed by the spout. 
> The condition that the bolts are no longer congested may not get noticed as a result of this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)