Posted to issues@geode.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/12/19 19:30:00 UTC

[jira] [Commented] (GEODE-4096) Race Condition between ConcurrentSerialGatewaySenderEventProcessor stopper thread and the _dispatchBatch method for the connection global variable.

    [ https://issues.apache.org/jira/browse/GEODE-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297294#comment-16297294 ] 

ASF GitHub Bot commented on GEODE-4096:
---------------------------------------

nabarunnag opened a new pull request #1186: GEODE-4096: Fixed race condition for connection global variable
URL: https://github.com/apache/geode/pull/1186
 
 
   	* Information on how the race condition occurs is provided in the GEODE-4096 ticket.
   	* Before returning null and clearing the global variable connection, getConnection now calls stop on the dispatcher.
   	* This makes sure that the dispatcher's AckReaderThread is shut down, so that no lingering thread keeps holding the connection life cycle lock.
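
The ordering described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: `SketchDispatcher`, `GetConnectionSketch`, the `processorStopped` flag and the `ackReaderRunning` field are hypothetical stand-ins, and the real dispatcher would close a socket and stop a thread rather than flip a boolean.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the dispatcher; not Geode's actual class.
class SketchDispatcher {
    // Stands in for the event processor's isStopped flag.
    private final AtomicBoolean processorStopped;
    // Stands in for the dispatcher's "global" connection variable.
    private volatile Object connection = new Object();
    // Stands in for a live AckReaderThread.
    private volatile boolean ackReaderRunning = true;

    SketchDispatcher(AtomicBoolean processorStopped) {
        this.processorStopped = processorStopped;
    }

    // Shuts down the ack reader while the connection reference still exists.
    synchronized void stop() {
        if (connection != null) {
            // Real code would close the connection and stop the reader thread here.
            ackReaderRunning = false;
        }
    }

    // Returns the connection, or null once the processor has been stopped.
    synchronized Object getConnection() {
        if (processorStopped.get()) {
            stop();            // stop the ack reader first...
            connection = null; // ...and only then clear the shared reference
            return null;
        }
        return connection;
    }

    boolean isAckReaderRunning() {
        return ackReaderRunning;
    }
}

class GetConnectionSketch {
    public static void main(String[] args) {
        AtomicBoolean stopped = new AtomicBoolean(false);
        SketchDispatcher dispatcher = new SketchDispatcher(stopped);
        System.out.println("connection while running: " + (dispatcher.getConnection() != null));
        stopped.set(true);
        System.out.println("connection after stop: " + dispatcher.getConnection());
        System.out.println("ack reader still running: " + dispatcher.isAckReaderRunning());
    }
}
```

The point of the sketch is only the order inside `getConnection`: if the reference were cleared before `stop()` ran, the shutdown step would have nothing to act on and the reader would be left running.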
   
   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Please ensure that once the PR is submitted, you check travis-ci for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Race Condition between ConcurrentSerialGatewaySenderEventProcessor stopper thread and the _dispatchBatch method for the connection global variable.
> ---------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-4096
>                 URL: https://issues.apache.org/jira/browse/GEODE-4096
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>            Reporter: nabarun
>            Assignee: nabarun
>
> *+Order of execution for this race condition to occur+*.
> # _dispatchBatch tries to dispatch a batch of events but is, for some reason, unsuccessful.
> # It silently decides that the remote server may not be ready, so it wants to retry.
> # At the same time, we decide to stop the SerialGatewaySenderEventProcessor, so we call the stopper thread.
> # Before the threads are started on all the senders / dispatchers, it sets the isStopped flag for the SerialGatewaySenderEventProcessor to true.
> # The _dispatchBatch method, which is in retry mode, then calls getConnection to get the connection. This method checks the SerialGatewaySenderEventProcessor's isStopped flag, sees that it is set, and returns null.
> # This null is stored in the dispatcher's global variable connection.
> # Now that the _dispatchBatch method sees that the connection is null, it should raise an exception and call destroyConnection.
> # Meanwhile, an AckReaderThread is running and the stopper thread for the event processor wants to stop it, but the connection global variable has already been set to null by the getConnection call made by _dispatchBatch.
> # Hence shutDownAckReaderThreadConnection is executed on null, and the AckReaderThread keeps running, stuck on socketRead0.
> # The problem is that the AckReaderThread acquires connectionLifeCycleLock.readLock to call readAcknowledgement, while the destroyConnection calls from the stopper thread and from _dispatchBatch's exception handling code need connectionLifeCycleLock.writeLock, which they cannot acquire because the readLock is held by the AckReaderThread. This causes a deadlock.
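
The last step of the sequence above, a writer that cannot proceed while a blocked reader holds the read lock, can be reproduced in isolation. The sketch below assumes the life cycle lock behaves like a `java.util.concurrent.locks.ReentrantReadWriteLock`; the class name `LifeCycleLockSketch`, the latch that stands in for the blocking socket read, and the 200 ms timeout are all illustrative and not taken from Geode.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LifeCycleLockSketch {
    // Returns {write lock acquired while reader is blocked, write lock acquired after reader exits}.
    static boolean[] run() {
        ReentrantReadWriteLock lifeCycleLock = new ReentrantReadWriteLock();
        CountDownLatch readerHoldsLock = new CountDownLatch(1);
        CountDownLatch socketData = new CountDownLatch(1); // stands in for a blocking socket read

        // Plays the role of the ack reader: takes the read lock, then blocks.
        Thread ackReader = new Thread(() -> {
            lifeCycleLock.readLock().lock();
            try {
                readerHoldsLock.countDown();
                socketData.await(); // blocked, like a thread stuck in a socket read
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lifeCycleLock.readLock().unlock();
            }
        });

        try {
            ackReader.start();
            readerHoldsLock.await();

            // Plays the role of a destroyConnection caller: needs the write lock.
            // A timed tryLock is used so the sketch terminates instead of hanging.
            boolean whileHeld = lifeCycleLock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (whileHeld) {
                lifeCycleLock.writeLock().unlock();
            }

            socketData.countDown(); // let the reader finish and release the read lock
            ackReader.join();

            boolean afterRelease = lifeCycleLock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (afterRelease) {
                lifeCycleLock.writeLock().unlock();
            }
            return new boolean[] {whileHeld, afterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("write lock while reader blocked: " + result[0]);
        System.out.println("write lock after reader exits:   " + result[1]);
    }
}
```

With an untimed `lock()` call in place of `tryLock`, the writer would wait for as long as the reader stays blocked, which matches the hang described in the final step.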



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)