Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/08/14 12:35:00 UTC

[jira] [Commented] (FLINK-10141) Reduce lock contention introduced with 1.5

    [ https://issues.apache.org/jira/browse/FLINK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579739#comment-16579739 ] 

ASF GitHub Bot commented on FLINK-10141:
----------------------------------------

NicoK opened a new pull request #6553: [FLINK-10141][network] optimisations reducing lock contention
URL: https://github.com/apache/flink/pull/6553
 
 
   ## What is the purpose of the change
   
   The introduction of credit-based flow control and the low-latency changes unfortunately also introduced lock contention on `RemoteInputChannel#bufferQueue` and `RemoteInputChannel#receivedBuffers`. Additionally, we were asking for queue sizes where the only thing we need to know is whether the queue is empty.
   
   As a result, we saw high CPU load during "idle" stream processing jobs with no events in the stream but only watermarks (every 500ms) and many slots on a single machine.
   
   ## Brief change log
   
   - move `notifyCreditAvailable()` out of the lock around `RemoteInputChannel#bufferQueue`
   - move `notifyChannelNonEmpty()` out of the lock around `RemoteInputChannel#receivedBuffers`
   - replace `RemoteInputChannel#receivedBuffers.size()` with `receivedBuffers.isEmpty()` when this is the only thing needed
   - replace `SingleInputGate#inputChannelsWithData.size()` with `inputChannelsWithData.isEmpty()` when this is the only thing needed
   - minor code style improvement in `CreditBasedPartitionRequestClientHandler` to improve readability
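
A minimal sketch of the first four items, assuming a simplified stand-in class: `ChannelSketch`, `onBuffer`, `poll` and the `Runnable` listener are hypothetical names for illustration and are not the actual `RemoteInputChannel` code. It records the queue state while holding the lock, uses `isEmpty()` instead of comparing `size()`, and calls the listener only after the lock has been released.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an input channel; NOT Flink's RemoteInputChannel. */
class ChannelSketch {
    // Plays the role of a queue guarded by its own monitor.
    private final ArrayDeque<String> receivedBuffers = new ArrayDeque<>();
    // Hypothetical callback, loosely analogous to a "channel non-empty" notification.
    private final Runnable nonEmptyListener;

    ChannelSketch(Runnable nonEmptyListener) {
        this.nonEmptyListener = nonEmptyListener;
    }

    void onBuffer(String buffer) {
        boolean wasEmpty;
        synchronized (receivedBuffers) {
            // Only emptiness matters here, so ask isEmpty() rather than size() == 0.
            wasEmpty = receivedBuffers.isEmpty();
            receivedBuffers.add(buffer);
        }
        // Notify after leaving the synchronized block, so the listener
        // never runs while the queue lock is held.
        if (wasEmpty) {
            nonEmptyListener.run();
        }
    }

    String poll() {
        synchronized (receivedBuffers) {
            return receivedBuffers.poll();
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        ChannelSketch channel = new ChannelSketch(() -> events.add("non-empty"));
        channel.onBuffer("a"); // queue was empty: listener is called
        channel.onBuffer("b"); // queue already had data: no call
        System.out.println("notifications: " + events.size());
    }
}
```

The trade-off in this sketch is that the listener may observe a queue that another thread has already drained, so a listener written this way has to tolerate a spurious wake-up.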
   
   ## Verifying this change
   
   - This change is already covered by existing tests, such as `RemoteInputChannelTest`, `SingleInputGateTest`,..., and everything using the network stack.
   - Manually verified that CPU usage drops when running `SocketWindowWordCount` with some added shuffles, no input elements, and only watermarks every 500ms, using 4 TMs with 10 slots each on a single (laptop) machine.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no** (per buffer)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? **JavaDocs**
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Reduce lock contention introduced with 1.5
> ------------------------------------------
>
>                 Key: FLINK-10141
>                 URL: https://issues.apache.org/jira/browse/FLINK-10141
>             Project: Flink
>          Issue Type: Bug
>          Components: Network
>    Affects Versions: 1.5.2, 1.6.0, 1.7.0
>            Reporter: Nico Kruber
>            Assignee: Nico Kruber
>            Priority: Major
>              Labels: pull-request-available
>
> With the changes around introducing credit-based flow control as well as the low latency changes, unfortunately, we also introduced some lock contention on {{RemoteInputChannel#bufferQueue}} and {{RemoteInputChannel#receivedBuffers}}. We also ask for queue sizes when the only thing we need to know is whether the queue is empty or not.
> This was observed as a high idle CPU load with no events in the stream but only watermarks (every 500ms) and many slots on a single machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)