Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2015/02/24 10:51:04 UTC
[jira] [Updated] (FLINK-1604) Livelock in PartitionRequestClientFactory

     [ https://issues.apache.org/jira/browse/FLINK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-1604:
---------------------------------
    Description: 
In case of a job restart, we observed a livelock in {{PartitionRequestClientFactory.createPartitionRequestClient}}. We suspect that this might have the following reason:

To obtain a new {{PartitionRequestClient}}, a new {{ConnectingChannel}} is created. This channel acts as a future for the client. The channel is inserted into a {{ConcurrentHashMap}} so that other threads trying to create a client for the same address wait on the future. Once the initially requesting thread has obtained the client, it inserts the client into the {{ConcurrentHashMap}} in place of the {{ConnectingChannel}}. When the client is disposed, it is removed from the {{ConcurrentHashMap}}, but only if the client itself is still stored in the map.
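
The protocol described above can be sketched roughly as follows. This is a minimal illustration, not Flink's actual code: {{ClientMapSketch}}, its nested {{Client}} class, and the methods {{getOrCreate}} and {{dispose}} are hypothetical names, a {{CompletableFuture}} stands in for the {{ConnectingChannel}}, and a plain {{String}} stands in for the {{RemoteAddress}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the map protocol; not the Flink implementation.
class ClientMapSketch {
    // Stand-in for PartitionRequestClient.
    static final class Client {
        final String address;
        Client(String address) { this.address = address; }
    }

    // A value is either a CompletableFuture (stand-in for ConnectingChannel) or a Client.
    final ConcurrentMap<String, Object> clients = new ConcurrentHashMap<>();

    Client getOrCreate(String address) {
        CompletableFuture<Client> connecting = new CompletableFuture<>();
        Object old = clients.putIfAbsent(address, connecting);
        if (old == null) {
            // This thread won the race: "connect", publish the result to waiters,
            // then swap the future for the finished client.
            Client client = new Client(address);
            connecting.complete(client);
            clients.replace(address, connecting, client);
            return client;
        } else if (old instanceof CompletableFuture) {
            // Another thread is connecting: wait on its future.
            return (Client) ((CompletableFuture<?>) old).join();
        } else {
            return (Client) old;
        }
    }

    // Conditional removal: succeeds only if the client itself is the stored value.
    boolean dispose(Client client) {
        return clients.remove(client.address, client);
    }

    public static void main(String[] args) {
        ClientMapSketch factory = new ClientMapSketch();
        Client first = factory.getOrCreate("host:1");
        Client second = factory.getOrCreate("host:1");
        System.out.println(first == second);        // the same client is reused
        System.out.println(factory.dispose(first)); // removal works while the client is stored
    }
}
```

In this sketch the step that matters is the {{replace}} call: the conditional removal in {{dispose}} only works if that swap actually happened.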

And here is where things can go wrong. If the requesting thread is interrupted after it has created the {{ConnectingChannel}} and inserted it into the {{ConcurrentHashMap}}, but before it has inserted the {{PartitionRequestClient}} into the same map, then the map entry for the given {{RemoteAddress}} remains the {{ConnectingChannel}}. Assume now that another thread waited on this channel and eventually obtained the client from this future. In the wake of cancelling the job, the client would be disposed by the corresponding {{RemoteInputChannel}}. Once the job has been restarted, new threads want to connect to the {{RemoteAddress}}, and in the hash map they find the {{ConnectingChannel}} whose future result is the disposed {{PartitionRequestClient}}. They retrieve the client from the channel and see that it has already been disposed. Now they try to delete the client from the {{ConcurrentHashMap}} to make room for a new one. However, this deletion fails, because the map still contains the {{ConnectingChannel}} rather than the client.
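
The suspected sequence can be replayed in a single thread to show why the removal never succeeds. This is a sketch under the same assumptions as the theory itself, not Flink's code: {{StaleEntrySketch}}, {{Client}}, and {{entryAfterOneRetry}} are hypothetical names, a {{CompletableFuture}} stands in for the {{ConnectingChannel}}, and a {{String}} stands in for the {{RemoteAddress}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical single-threaded replay of the suspected interleaving.
class StaleEntrySketch {
    // Stand-in for PartitionRequestClient.
    static final class Client {
        volatile boolean disposed;
    }

    // Returns whatever is stored for the address after dispose and one retry.
    static Object entryAfterOneRetry() {
        ConcurrentMap<String, Object> clients = new ConcurrentHashMap<>();
        String address = "host:1";

        // Requesting thread: registers the future and connects ...
        CompletableFuture<Client> connecting = new CompletableFuture<>();
        clients.putIfAbsent(address, connecting);
        Client client = new Client();
        connecting.complete(client);
        // ... but is interrupted here, so the future is never swapped for the client.

        // Job cancellation: a waiter that got the client from the future disposes it.
        client.disposed = true;
        clients.remove(address, client); // no effect: the stored value is the future

        // After the restart: a new thread finds the future and its disposed client.
        Object entry = clients.get(address);
        Client found = entry instanceof CompletableFuture
                ? (Client) ((CompletableFuture<?>) entry).join()
                : (Client) entry;
        if (found.disposed) {
            clients.remove(address, found); // fails again for the same reason
        }
        return clients.get(address);
    }

    public static void main(String[] args) {
        // The stale future is still in the map, so every further retry sees the same state.
        System.out.println(entryAfterOneRetry() instanceof CompletableFuture);
    }
}
```

Because the two-argument {{remove}} only deletes an entry whose value equals the given client, each retry in this sketch leaves the map unchanged, which matches the livelock described above.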

To make a long story short, we believe that cancelling a job leaves the network state invalid.

That is currently our best theory for the livelock we observed on Travis.

  was:
In case of a job restart, we observed a livelock in {{PartitionRequestClientFactory.createPartitionRequestClient}}. We suspect that this might have the following reason:

To obtain a new {{PartitionRequestClient}}, a new {{ConnectingChannel}} is created. This channel acts as a future for the client. The channel is inserted into a {{ConcurrentHashMap}} so that other threads trying to create a client for the same address wait on the future. Once the initially requesting thread has obtained the client, it inserts the client into the {{ConcurrentHashMap}} in place of the {{ConnectingChannel}}. When the client is disposed, it is removed from the {{ConcurrentHashMap}}, but only if the client itself is still stored in the map.

And here is where things can go wrong. If the requesting thread is interrupted after it has created the {{ConnectingChannel}} and inserted it into the {{ConcurrentHashMap}}, but before it has inserted the {{PartitionRequestClient}} into the same map, then the map entry for the given {{RemoteAddress}} remains the {{ConnectingChannel}}. Assume now that another thread waited on this channel and eventually obtained the client from this future. In the wake of cancelling the job, the client would be disposed by the corresponding {{RemoteInputChannel}}. Once the job has been restarted, new threads want to connect to the {{RemoteAddress}}, and in the hash map they find the {{ConnectingChannel}} whose future result is the disposed {{PartitionRequestClient}}. They retrieve the client from the channel and see that it has already been disposed. Now they try to delete the client from the {{ConcurrentHashMap}} to make room for a new one. However, this deletion fails, because the map still contains the {{ConnectingChannel}} rather than the client.

That is currently our best theory for the livelock we observed on Travis.


> Livelock in PartitionRequestClientFactory
> -----------------------------------------
>
>                 Key: FLINK-1604
>                 URL: https://issues.apache.org/jira/browse/FLINK-1604
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann


