Posted to dev@storm.apache.org by arunmahadevan <gi...@git.apache.org> on 2018/09/12 22:39:19 UTC

[GitHub] storm pull request #2829: STORM-3222: Fix KafkaSpout internals to use Linked...

GitHub user arunmahadevan opened a pull request:

    https://github.com/apache/storm/pull/2829

    STORM-3222: Fix KafkaSpout internals to use LinkedList instead of ArrayList

    KafkaSpout internally maintains a waitingToEmit list per topic partition and removes the first item to emit during each nextTuple. The implementation uses an ArrayList, which results in an unnecessary traversal and copy for each tuple.
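    To make the cost difference concrete, here is a minimal, self-contained sketch (not Storm code) of removing the head element from both list types. Class and variable names are illustrative only.

    ```java
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class HeadRemovalDemo {
        public static void main(String[] args) {
            // ArrayList.remove(0) shifts every remaining element left by one
            // slot via an internal arraycopy: O(n) per head removal.
            List<Integer> arrayBacked = new ArrayList<>(List.of(1, 2, 3, 4));
            int first = arrayBacked.remove(0);

            // LinkedList unlinks the head node and re-points the head
            // reference: O(1) per head removal.
            LinkedList<Integer> linkedBacked = new LinkedList<>(List.of(1, 2, 3, 4));
            int head = linkedBacked.pollFirst();

            // Both return the same element (1); only the cost differs.
            System.out.println(first + " " + head);
        }
    }
    ```

    Removing one tuple per nextTuple call is exactly the head-removal pattern above, which is why the copy overhead accumulates per emitted tuple with an ArrayList.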

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arunmahadevan/storm STORM-3222

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2829.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2829
    
----
commit a5384aa845496a7e584bd947cace968a18b7ffdf
Author: Arun Mahadevan <ar...@...>
Date:   2018-09-12T22:36:24Z

    STORM-3222: Fix KafkaSpout internals to use LinkedList instead of ArrayList
    
    KafkaSpout internally maintains a waitingToEmit list per topic partition and removes the first item to emit during each nextTuple. The implementation uses an ArrayList, which results in an unnecessary traversal and copy for each tuple.
    
    Also, I am not sure why nextTuple emits only a single tuple, whereas ideally it should emit whatever it can in a single nextTuple call, which is more efficient. However, the logic appears too complicated to refactor.

----


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @revans2 if I am not mistaken, I recall that when you reviewed the initial PR you suggested that a call to nextTuple should send only one tuple/record. Can you please refresh my mind on the reasons why? If I am confusing this with a different patch, please disregard this message. Thanks.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @arunmahadevan I think the reason for only emitting one tuple at a time was to let Storm enforce the max.spout.pending limit. Maybe it would be better to accept some overflow in order to get the spout running faster, since people can still control how large the overflow can get by setting the consumer's max.poll.records (which I don't believe existed at the time the spout was originally written). I wanted to point to the PR where this was discussed, but I can't find it.
    
    If you want to try emitting as many tuples as possible, it should be possible by removing this break: https://github.com/apache/storm/pull/2829/files#diff-7d7cbc8f5444fa7ada7962033fc31c5eR425
    
    +1 for the current changes.


---

[GitHub] storm pull request #2829: STORM-3222: Fix KafkaSpout internals to use Linked...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/storm/pull/2829


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    Sorry my bad, there is an overflow.
    
    https://github.com/apache/storm/blob/4605ae0a34858309171e726e1924b9a37695c977/storm-client/src/jvm/org/apache/storm/executor/ExecutorTransfer.java#L94-L111
    
    The pause happens outside of the spout, waiting for the queue to no longer be full, just as it did before with the spouts.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by roshannaik <gi...@git.apache.org>.
Github user roshannaik commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @HeartSaVioR  
    
    1) pendingEmitsQ prevents nextTuple() from blocking when the downstream queue is full. It holds the overflow emits (one or more) that occurred within a **single** nextTuple() invocation. If the spout executor notices this queue is not empty, it will try to process any ACKs before entering the wait strategy. The purpose of this queue is to prevent deadlock under backpressure.
    
    2) **topology.max.spout.pending** looked like a candidate for elimination given STORM-2306's backpressure. Although it is not absolutely necessary to use it in 2.0, during benchmarking I noticed that it could sometimes have a sizable impact on performance. Why it does remains a mystery. It is now a performance-related tunable (in ACK mode) and not really a backpressure mechanism.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by arunmahadevan <gi...@git.apache.org>.
Github user arunmahadevan commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    >what's the motivation to change this to LinkedList?
    
    It's mentioned in the description. Here's the relevant code for more details - https://github.com/apache/storm/blob/master/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpout.java#L422
    
    >nextTuple emits only a single tuple because that's the contract of the method nextTuple, which must be honored. This was thoroughly discussed in the patch with the initial code implementation.
    
    It can emit one or more tuples, and ideally should emit whatever it has to emit at that point.
    https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/spout/ISpout.java#L72
    



---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @srdo is right. The reason is to enforce the max.spout.pending limit, but that concept was introduced when backpressure was not in place. IMHO, it is something we should drop once backpressure is stabilized.
    
    While I doubt backpressure would work correctly without a critical performance hit on 1.x (we disabled it by default, AFAIK), STORM-2306 renewed the backpressure mechanism, so we may be OK relying on it. If my memory is right, providing "max.spout.pending" was still a way to optimize even with STORM-2306, but even without it things worked correctly with sane performance.
    
    At least for Storm 2.0 we could try relying only on backpressure, and drop the concept if we are happy with the result.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @arunmahadevan yes, it makes sense to use a LinkedList here. Thanks for the explanation.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by HeartSaVioR <gi...@git.apache.org>.
Github user HeartSaVioR commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    If my understanding is right, there's pendingEmits (unbounded), which comes into play when the spout can't push a tuple immediately, so emit should not block, and nextTuple should not block either. If emit could block in a spout it would be a pretty big risk (ack/fail would not be handled either, tuples would fail into timeouts, and that may trigger backpressure again).
    
    @roshannaik Could you confirm?


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by arunmahadevan <gi...@git.apache.org>.
Github user arunmahadevan commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    Also, I am not sure why nextTuple emits only a single tuple, whereas ideally it should emit whatever it can in a single nextTuple call, which is more efficient. However, the logic appears too complicated to refactor.
    
    @srdo @HeartSaVioR 


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @arunmahadevan what's the motivation to change this to LinkedList?
    
    nextTuple emits only a single tuple because that's the contract of the method nextTuple, which must be honored. This was thoroughly discussed in the patch with the initial code implementation.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by roshannaik <gi...@git.apache.org>.
Github user roshannaik commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @arunmahadevan .. we need to be careful that we are not doing too many emits in a single nextTuple()... it can cause an OOM situation by flooding the pendingQ.
    
    Worth checking whether there is any measurable gain in combining many emits into a single nextTuple() for the KafkaSpout.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by srdo <gi...@git.apache.org>.
Github user srdo commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    Even if we change the spout to emit all the fetched records in one `nextTuple` call, the number and size of records returned in a fetch is limited by the `max.poll.records` KafkaConsumer setting (500 by default), as well as the `fetch.max.bytes` setting (50MB by default). 
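    As a hedged illustration of the point above, a KafkaConsumer's per-poll volume is bounded by those two settings; the values below are the defaults cited here (500 records; 50 MB = 52428800 bytes). The class name is hypothetical, and the keys are shown as plain property strings.
    
    ```java
    import java.util.Properties;
    
    public class FetchLimitsSketch {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            // Maximum number of records a single poll() may return.
            consumerProps.setProperty("max.poll.records", "500");
            // Maximum bytes the broker returns for a single fetch request (50 MB).
            consumerProps.setProperty("fetch.max.bytes", "52428800");
            System.out.println(consumerProps.getProperty("max.poll.records")
                    + " records / "
                    + consumerProps.getProperty("fetch.max.bytes") + " bytes");
        }
    }
    ```
    
    So even an "emit everything fetched" nextTuple would be emitting at most one poll's worth of records, not an unbounded amount.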


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by revans2 <gi...@git.apache.org>.
Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    The reason we want to send a single tuple is because of how we do flow control in the spouts. If you send more it can be more efficient, but you risk going over the max.spout.pending amount set by the user, and in 2.x, where an emit can block, you risk having the spout block as well.


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by hmcl <gi...@git.apache.org>.
Github user hmcl commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    +1


---

[GitHub] storm issue #2829: STORM-3222: Fix KafkaSpout internals to use LinkedList in...

Posted by arunmahadevan <gi...@git.apache.org>.
Github user arunmahadevan commented on the issue:

    https://github.com/apache/storm/pull/2829
  
    @revans2 thanks for merging. Raised https://github.com/apache/storm/pull/2835 for 1.x-branch.
    
    >If you want to send more it can be more efficient, but you risk going over the max.spout.pending amount that is set by the user, and in 2.x where an emit can block, you risk having the spout block as well.
    
    If the emit can block in 2.x, it might be an issue to address, because it's not in the spout contract to emit only one record. I thought it could overflow, and that max.spout.pending is not that relevant with the backpressure changes.


---