Posted to dev@geode.apache.org by Bruce Schuchardt <bs...@pivotal.io> on 2015/10/01 02:00:01 UTC

Review Request 38912: GEODE-77 fixes for network failure handling & miscellaneous unit test failures

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38912/
-----------------------------------------------------------

Review request for geode, anilkumar gingade, Jason Huynh, Jianxia Chen, and Lynn Gallinat.


Repository: geode


Description
-------

Network failure handling was not properly shutting down TCPConduit, leaving threads hanging trying to send messages.  The shutdown code was calling Services.emergencyClose too soon, and the recursion back into GMSMembershipManager shutdown code caused some problems, too.

GMSHealthMonitor was continually switching between two members to watch even though it had already sent suspect messages about them and had received no response.  I added a collection of IDs that are in this state and modified setNextNeighbor to avoid reusing them.
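
A minimal sketch of that idea, assuming the view is an ordered list of member IDs and the monitor watches the next member in ring order.  The class, field and method names below (NeighborPicker, suspectedMembers, nextNeighbor) are hypothetical and are not taken from the patch; only the approach of skipping already-suspected members comes from the description above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the neighbor selection; not the GMSHealthMonitor code.
class NeighborPicker {
    private final String self;
    // Members we have already sent suspect messages about and that have not responded.
    private final Set<String> suspectedMembers = new HashSet<>();

    NeighborPicker(String self) {
        this.self = self;
    }

    void markSuspected(String member) {
        suspectedMembers.add(member);
    }

    void clearSuspected(String member) {
        suspectedMembers.remove(member);
    }

    // Walks the view in ring order, starting after this member, and returns the
    // first member that is not already suspected.  Returns null if this member
    // is not in the view or no candidate remains.
    String nextNeighbor(List<String> view) {
        int start = view.indexOf(self);
        if (start < 0) {
            return null;
        }
        for (int i = 1; i < view.size(); i++) {
            String candidate = view.get((start + i) % view.size());
            if (!suspectedMembers.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

With a view of a, b, c, d and "a" as the local member, the picker returns b first, then c once b has been marked suspected, so it no longer flips back and forth between two unresponsive members.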

GMSHealthMonitor was sending removeMember messages to the locators and a random member, but for some reason this wasn't resolving a network partition fast enough.  I've disabled that behavior for now and send the messages to all members instead.  This needs to be revisited because sending the message to all members is not scalable.

GMSHealthMonitor had some issues with initiating removals when it was in the process of shutting down.  I added some isStopping checks to fix this.

MembershipJUnitTest and StatRecorderJUnitTest were failing in Gradle runs but passing under Eclipse, because my Eclipse launch configuration wasn't set to enable assertions.  After enabling them I found a number of problems with these tests and fixed them.
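
For anyone hitting the same mismatch: the JVM ignores assert statements unless it is started with -ea, so a test that relies on them can pass in one launcher and fail in another.  The snippet below is a small illustration of the usual idiom for detecting this at runtime; the class name AssertionCheck is made up for the example and is not part of the patch.

```java
// Hypothetical helper, not part of the patch.
class AssertionCheck {
    // Returns true only when this class is running with assertions enabled (-ea).
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment is a side effect of the assert expression, so it is
        // executed only when assertions are enabled for this class.
        assert enabled = true;
        return enabled;
    }
}
```

A test setup method could call AssertionCheck.assertionsEnabled() and fail fast with a clear message when a launch configuration has assertions turned off.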

Multicast tests are now implemented in GMSMembershipManager and JGroupsMessenger.  This leverages the ping/pong messaging added for the quorum checker.

GMSJoinLeave was too slow in sending out new views when there were process failures.  I added code to inform the reply processor if there are queued leave/remove requests so it wouldn't wait for these, and also added similar checks in the removeHealthyMembers method (which performs checks on members using the HealthMonitor).
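
Roughly, the reply-processor change amounts to the following sketch.  It assumes the processor tracks the set of members it is still waiting on; ViewReplyTracker and its methods are hypothetical names used for illustration, not the GMSJoinLeave implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a view reply processor; not the GMSJoinLeave code.
class ViewReplyTracker {
    // Members that have been sent the new view and have not yet been accounted for.
    private final Set<String> pending = new HashSet<>();

    ViewReplyTracker(Set<String> recipients) {
        pending.addAll(recipients);
    }

    // A reply arrived from the member.
    synchronized void replyReceived(String member) {
        pending.remove(member);
        notifyAll();
    }

    // A leave/remove request for the member has been queued, so no reply is
    // expected and there is no point in waiting for one.
    synchronized void memberLeavingOrRemoved(String member) {
        pending.remove(member);
        notifyAll();
    }

    // Waits until every recipient is accounted for or the timeout expires.
    // Returns true if nobody is left to wait for.
    synchronized boolean awaitReplies(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!pending.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The point of the design is that a crashed member stops holding up the new view as soon as its removal is queued, instead of only after the full reply timeout.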

When there is a network partition, GMSJoinLeave will now send a NetworkPartitionMessage to other members to prod them along in figuring out that they should shut down.

During a forced-disconnect there can be a lot of warning/fatal log messages.  If there are alert listeners in the system this can create a lot of network traffic and extra work figuring out whether the receiver is even there or not.  GMSMembershipManager now throws away outbound alerts when a forced-disconnect is in progress.
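
The alert handling can be pictured as a simple guard on the send path.  This is only a sketch under that assumption: AlertSender, beginForcedDisconnect and sendAlert are hypothetical names, and the list stands in for whatever transport actually carries the alert.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the outbound alert path; not the GMSMembershipManager code.
class AlertSender {
    // Set once a forced-disconnect has started; read by any thread that logs an alert.
    private volatile boolean forcedDisconnectInProgress;
    // Stand-in for the real transport: records the alerts that were actually sent.
    private final List<String> sent = new ArrayList<>();

    void beginForcedDisconnect() {
        forcedDisconnectInProgress = true;
    }

    // Returns true if the alert was handed to the transport, false if it was dropped.
    synchronized boolean sendAlert(String alert) {
        if (forcedDisconnectInProgress) {
            // The listener may be unreachable; drop the alert rather than
            // generating traffic and waiting on a member that is not there.
            return false;
        }
        sent.add(alert);
        return true;
    }

    synchronized List<String> sentAlerts() {
        return new ArrayList<>(sent);
    }
}
```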

Some of the forced-disconnect shutdown processing has been moved out of the membership manager's DisconnectThread, which was introduced with the quorum checker, so that the shutdown cause, etc., is set as quickly as possible.

I noticed a lot of TXState log messages at debug level with a Throwable stack trace.  There was no comment saying why this was being done so I commented it out.

JGroups logging level is now set to FATAL by default.  The default log level was a problem during network partitions because each message send was causing a dire warning to be logged.

I observed a number of threads being left behind when a locator failed to start during auto-reconnect testing.  I added a unit test to LocatorJUnitTest for this and fixed the leaks.


Diffs
-----

  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalDistributedSystem.java c3929c007ea69b15759b5b8480a32e3294cd6d73 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalLocator.java 6ea54e2a124410fedb8156a3757b79ea3de52174 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/NetView.java 65fe913b8200e18249334d1e55acf7a67455c247 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/Services.java acd2bedfa9583a37446712d08ef04671f291378a 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/fd/GMSHealthMonitor.java f12628aeaa9a5874da8a09db846b4dc653978f99 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/interfaces/Messenger.java b154403ce12ff87576c0f7ca01732b1377f9712b 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/GMSJoinLeave.java 7b6b97df54148985ed6154823eefcf7d3ca82c23 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messages/NetworkPartitionMessage.java PRE-CREATION 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messages/SuspectMembersMessage.java 117f440325ceab7131c4f5e153f32105a55b7b09 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messenger/JGroupsMessenger.java c1acb87cc184447dbd1879d2c4a569c7a8093dda 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messenger/StatRecorder.java 1fef0daec35ab999829f58fc44da03851a852b7f 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/mgr/GMSMembershipManager.java 64dd1cd5de028b296f0fd6bf33e02ffbf672cf6e 
  gemfire-core/src/main/java/com/gemstone/gemfire/internal/DSFIDFactory.java a743c8a9f2d227143f04081b11c4a42d9dcb61c2 
  gemfire-core/src/main/java/com/gemstone/gemfire/internal/DataSerializableFixedID.java 39fdeef81856d5ff128ed6ea050d4afbc3a612f7 
  gemfire-core/src/main/java/com/gemstone/gemfire/internal/cache/TXState.java 2672323cc89c8266df943de4dc444984c66ca3af 
  gemfire-core/src/main/resources/com/gemstone/gemfire/internal/logging/log4j/log4j2-default.xml 8b1331ffda0ff7a3a1878ac491f9e394821f8ec1 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/LocatorDUnitTest.java afb4687d8d75b6f36f2c6900352c4d51b13b28c0 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/LocatorJUnitTest.java 5a09b5589c63a8ac9e9b4883925ef3627e2066a9 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/MembershipJUnitTest.java f7683f9d0c4a1ca1bfd451fd9d0b7fcdc37c10ad 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java 0af47a7904a85bd5c3efa98f1a398a43486d425f 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/StatRecorderJUnitTest.java fb502908b7c1bc7a32dfb367d1cdad56997305bb 
  gemfire-core/src/test/java/com/gemstone/gemfire/internal/cache/partitioned/Bug43684DUnitTest.java 9722311b4a13f90c94dc63d9eef3091c77d81ad8 

Diff: https://reviews.apache.org/r/38912/diff/


Testing
-------

precheckin, 3-host network partition testing


Thanks,

Bruce Schuchardt


Re: Review Request 38912: GEODE-77 fixes for network failure handling & miscellaneous unit test failures

Posted by Bruce Schuchardt <bs...@pivotal.io>.

> On Oct. 1, 2015, 1:30 a.m., anilkumar gingade wrote:
> > gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalDistributedSystem.java, line 1327
> > <https://reviews.apache.org/r/38912/diff/1/?file=1088150#file1088150line1327>
> >
> >     we are passing the same value for both the arguments...Is this expected?

I will change this to (this.forcedDisconnect, preparingForReconnect, false)


- Bruce


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38912/#review101184
-----------------------------------------------------------


On Sept. 30, 2015, 11:59 p.m., Bruce Schuchardt wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38912/
> -----------------------------------------------------------
> 
> (Updated Sept. 30, 2015, 11:59 p.m.)


Re: Review Request 38912: GEODE-77 fixes for network failure handling & miscellaneous unit test failures

Posted by anilkumar gingade <ag...@pivotal.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38912/#review101184
-----------------------------------------------------------



gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalDistributedSystem.java (line 1327)
<https://reviews.apache.org/r/38912/#comment158532>

    we are passing the same value for both the arguments...Is this expected?


- anilkumar gingade


On Sept. 30, 2015, 11:59 p.m., Bruce Schuchardt wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38912/
> -----------------------------------------------------------
> 
> (Updated Sept. 30, 2015, 11:59 p.m.)