Posted to issues@storm.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/09/19 13:37:00 UTC

[jira] [Updated] (STORM-3510) WorkerState.transferLocalBatch backpressure resend logic fix

     [ https://issues.apache.org/jira/browse/STORM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated STORM-3510:
----------------------------------
    Labels: pull-request-available  (was: )

> WorkerState.transferLocalBatch backpressure resend logic fix
> ------------------------------------------------------------
>
>                 Key: STORM-3510
>                 URL: https://issues.apache.org/jira/browse/STORM-3510
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-client
>    Affects Versions: 2.2.0
>            Reporter: Christopher Johnson
>            Priority: Minor
>              Labels: pull-request-available
>
> WorkerState.transferLocalBatch uses an int lastOverflowCount to track the size of the overflow queue, and periodically resends the backpressure status to remote workers if the queue continues to grow.
>  
> The current implementation has two problems:
>  * The single variable tracks the receive queue of every executor in the worker, meaning it will be overwritten as tuples are sent to different executors.
>  * The variable is locally scoped, and so is not carried over between mini-batches.
>  
> This only comes into effect when the overflow queue grows beyond 10000, which shouldn't happen unless a backpressure signal fails to reach an upstream worker; but if it does happen, a backpressure signal will be sent for every mini-batch processed.  I do not know whether this is the intended behaviour, but the way the code is written suggests it isn't.
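> The pattern described above can be illustrated with a simplified sketch (the names and structure here are illustrative, not Storm's actual code): because lastOverflowCount is local to each call, it is reset for every mini-batch, and because a single variable is shared by all tasks, each task's queue size overwrites the previous one's.

```java
import java.util.List;
import java.util.Map;

// Simplified, hypothetical sketch of the flawed pattern: a counter scoped to
// one call of the transfer method cannot remember growth between mini-batches
// and cannot distinguish the overflow queues of different tasks.
class TransferSketch {
    static int resends = 0;

    static void transferLocalBatch(List<Integer> tasksInBatch,
                                   Map<Integer, Integer> overflowByTask) {
        int lastOverflowCount = 0; // reset on every call: no memory across batches
        for (int taskId : tasksInBatch) {
            int overflow = overflowByTask.get(taskId);
            if (overflow > 10000 && overflow > lastOverflowCount) {
                resends++;                    // backpressure status resent
                lastOverflowCount = overflow; // overwritten by the next task
            }
        }
    }
}
```

> Running this over the same batch twice resends the signal both times, which matches the "resend every mini-batch once over the threshold" behaviour described above.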
>  
> I have thought of two redesigns to fix these problems and make the behaviour align with how one would interpret the code:
>  
>  #  *Change the lastOverflowCount variable to a map of taskId to overflow count* - This will retain the behaviour of resending the backpressure update every mini-batch once over the threshold, if that behaviour is intended.  However, it will increase garbage by creating a new map every time WorkerState.transferLocalBatch is called by the NettyWorker thread.
>  # *Change the lastOverflowCount variable to a map of taskId to overflow count* *and move it to the BackPressureTracker class* - This will retain the counter between batches, and so only resend backpressure status every 10000 received tuples per task.
>  
> My preference is for the second option; if the intended behaviour is to resend every mini-batch, the code should be rewritten so that intent is explicit.
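> The second option could be sketched roughly as follows (class and method names here, such as PerTaskOverflowTracker and shouldResend, are hypothetical, not the actual BackPressureTracker API): a per-task counter kept in a long-lived map, so counts survive across mini-batches and the backpressure status is only resent once the overflow for a given task has grown by another 10000 tuples.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of option 2: per-task overflow counters held in a
// worker-lifetime tracker rather than a variable local to each mini-batch.
class PerTaskOverflowTracker {
    private static final int RESEND_THRESHOLD = 10000;

    // taskId -> overflow count observed at the last backpressure resend
    private final Map<Integer, Integer> lastOverflowCount = new ConcurrentHashMap<>();

    /**
     * Returns true if the backpressure status should be resent for this task,
     * i.e. the task's overflow queue has grown by at least RESEND_THRESHOLD
     * tuples since the last resend for that task.
     */
    boolean shouldResend(int taskId, int currentOverflow) {
        int last = lastOverflowCount.getOrDefault(taskId, 0);
        if (currentOverflow >= last + RESEND_THRESHOLD) {
            lastOverflowCount.put(taskId, currentOverflow);
            return true;
        }
        return false;
    }
}
```

> A ConcurrentHashMap is used here defensively; if the tracker is only ever touched by a single NettyWorker thread, as discussed below, a plain HashMap would do.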
>  
> It is also possible that doing it the second way could run into concurrency issues I haven't thought of, but as far as I can tell the topology.worker.receiver.thread.count config option isn't used at all.  If that's the case, and there is only one NettyWorker thread per worker, it should be fine.
>  
> I have implemented both methods and attempted to benchmark them with [https://github.com/yahoo/storm-perf-test], but as I am running all workers on one machine I couldn't get it to the point where the relevant code was ever called.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)