Posted to issues@ignite.apache.org by "Vladimir Ozerov (JIRA)" <ji...@apache.org> on 2017/07/10 08:00:03 UTC
[jira] [Updated] (IGNITE-5056) Implement communication backpressure control

     [ https://issues.apache.org/jira/browse/IGNITE-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Ozerov updated IGNITE-5056:
------------------------------------
    Fix Version/s:     (was: 2.1)
                   2.2

> Implement communication backpressure control
> --------------------------------------------
>
>                 Key: IGNITE-5056
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5056
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Yakov Zhdanov
>            Assignee: Yakov Zhdanov
>            Priority: Critical
>             Fix For: 2.2
>
>
> Problem
> Currently, backpressure control relies on a semaphore on the sending side, which ensures that the send queue cannot overflow, and on a special counter on the receiving side, which stops reading from the socket when the number of unprocessed messages exceeds a configured limit.
> In some scenarios this may lead to a distributed deadlock. For example, we send many async jobs to remote nodes, which in turn perform sync cache operations. If the task master node is a primary or backup for some of the cache updates and has already scheduled too many job requests for sending, it will not be able to respond to cache requests, so the remote jobs will never complete.
> Solution
> Reading from the socket should never stop.
> Design notes
> * add IgniteConfiguration.maxAsyncRequests and propagate it via node attributes to all nodes of the cluster. Nodes may have different values (however, this is unlikely).
> * add a flag GridIoMessage.async. If the flag is false, the sender node is assumed to wait synchronously for the response; otherwise it does not wait.
> * all sent async messages should be tracked on the sender node on a per-receiver basis.
> * all received async messages should be tracked on receiver nodes.
> * nodes should add a flag to communication acks indicating whether they can accept more async messages or not.
> * the sender should never exceed IgniteConfiguration.maxAsyncRequests async requests per node.
> * if IgniteConfiguration.maxAsyncRequests is exceeded or the receiving node sets the flag in a communication ack, then all async messages become sync.
> * above means:
> ** the next compute job from the task is sent to the node only after a response for some previous job arrives
> ** the next dht update request (for primary sync or full async) is sent, but the node doesn't send a response to the near node until it has received a response from the remote backup, either for a former operation or for this operation
> ** the next cache operation becomes sync: we force user code to wait on the operation future.
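
The sender-side bookkeeping from the design notes above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation planned for this ticket: AsyncBackpressureTracker and all of its method names are hypothetical and do not exist in Ignite. The sketch counts in-flight async messages per receiver, remembers the last "can accept more async messages" flag seen in an ack, and tells the caller when a message has to be sent as sync instead.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper (not an Ignite class): per-receiver tracking of async messages on the sender. */
final class AsyncBackpressureTracker {
    /** Per-receiver state; every access is guarded by the state's own monitor. */
    private static final class ReceiverState {
        int inFlight;              // async messages sent and not yet responded to
        boolean receiverSaturated; // last ack said the receiver cannot accept more async messages
    }

    /** Stand-in for the proposed IgniteConfiguration.maxAsyncRequests value. */
    private final int maxAsyncRequests;

    private final ConcurrentMap<UUID, ReceiverState> states = new ConcurrentHashMap<>();

    AsyncBackpressureTracker(int maxAsyncRequests) {
        if (maxAsyncRequests <= 0)
            throw new IllegalArgumentException("maxAsyncRequests must be positive");
        this.maxAsyncRequests = maxAsyncRequests;
    }

    /**
     * Reserves an async slot for the given receiver.
     * Returns true if the message may be sent async, false if the caller should send it as sync.
     */
    boolean tryAcquireAsync(UUID nodeId) {
        ReceiverState s = states.computeIfAbsent(nodeId, id -> new ReceiverState());
        synchronized (s) {
            if (s.receiverSaturated || s.inFlight >= maxAsyncRequests)
                return false;
            s.inFlight++;
            return true;
        }
    }

    /** Releases one async slot when a response for an earlier async message arrives. */
    void onResponse(UUID nodeId) {
        ReceiverState s = states.get(nodeId);
        if (s == null)
            return;
        synchronized (s) {
            if (s.inFlight > 0)
                s.inFlight--;
        }
    }

    /** Records the flag carried by a communication ack from the receiver. */
    void onAck(UUID nodeId, boolean canAcceptMoreAsync) {
        ReceiverState s = states.computeIfAbsent(nodeId, id -> new ReceiverState());
        synchronized (s) {
            s.receiverSaturated = !canAcceptMoreAsync;
        }
    }

    /** Current number of in-flight async messages for the receiver (for diagnostics). */
    int inFlight(UUID nodeId) {
        ReceiverState s = states.get(nodeId);
        if (s == null)
            return 0;
        synchronized (s) {
            return s.inFlight;
        }
    }
}
```

In this sketch a send path would call tryAcquireAsync(nodeId) before writing a message: on true it sends with the async flag set, on false it falls back to a sync send and waits for the response. Reading from the socket is never blocked by the tracker; only the decision between async and sync on the sending side changes.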



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)