Posted to dev@geode.apache.org by Kirk Lund <kl...@apache.org> on 2018/03/23 02:40:21 UTC

Membership log message questions

I'm looking through logs from a DUnit test that takes longer to run than I
think it should. It looks to me like membership is chewing up much of that
time, and I have questions about some membership log messages.

Specifically, I'm worried about:
1) double-logging by vm1 when shutdown message is received from vm0

2) double-logging by GMSHealthMonitor in vm0 AFTER it disconnected

3) vm1 KEEPS suspecting vm0 well after a) vm1 received shutdown message
from vm0, b) processed a view with vm0 departed and c) even logged that vm0
gracefully left. (vm1 seems a bit schizophrenic about vm0)

Below are the log messages with my comments interspersed. Please help me
understand what's actually going on here and whether we're seeing undesired
behavior (i.e., minor bugs)...

KIRK: vm0 is shutting down gracefully

[vm0] [info 2018/03/22 19:16:27.010 PDT <Pooled High Priority Message
Processor 2> tid=87] Shutting down DistributionManager
192.168.1.18(34057)<ec><v1>:32771.

[vm1] [info 2018/03/22 19:16:27.014 PDT <Pooled High Priority Message
Processor 2> tid=66] received leave request from
192.168.1.18(34057)<ec><v1>:32771 for 192.168.1.18(34057)<ec><v1>:32771

KIRK: vm1 states that vm0 gracefully left, and logs it twice, which seems
like a minor bug

[vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message
Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771 gracefully
left the distributed cache: shutdown message received

[vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message
Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771 gracefully
left the distributed cache: shutdown message received

[vm1] [info 2018/03/22 19:16:27.332 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] received new view:
View[192.168.1.18(34059:locator)<ec><v0>:32770|3] members:
[192.168.1.18(34059:locator)<ec><v0>:32770,
192.168.1.18(34058)<ec><v2>:32772{lead}]  shutdown:
[192.168.1.18(34057)<ec><v1>:32771]
[vm1] old view is: View[192.168.1.18(34059:locator)<ec><v0>:32770|2]
members: [192.168.1.18(34059:locator)<ec><v0>:32770,
192.168.1.18(34057)<ec><v1>:32771{lead}, 192.168.1.18(34058)<ec><v2>:32772]

KIRK: even though vm1 just received a new view, it initiates suspect
processing for vm0, which also seems like a minor bug

[vm1] [info 2018/03/22 19:16:28.186 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:28.683 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:29.180 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:29.681 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:30.183 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771

KIRK: vm0 surprisingly continues logging -- why is it suspecting other
members? Why does it still have GMSHealthMonitor running? And why does it
log the message twice? Again, logging it twice looks like a minor bug to
me, unless all of this double logging means membership is doing twice the
work, which would be horrible (please put my mind to rest about this)

[vm0] [info 2018/03/22 19:06:40.319 PDT <Geode Failure Detection Scheduler>
tid=41] All other members are suspect at this point

[vm0] [info 2018/03/22 19:06:40.320 PDT <Geode Failure Detection Scheduler>
tid=41] All other members are suspect at this point

KIRK: after the above two messages, vm1 continues logging about suspects
(why? vm0 gracefully left!)

[vm1] [info 2018/03/22 19:16:30.682 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:31.181 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:31.680 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771
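For what it's worth, the "ignoring suspect request ... from non-member" lines
are consistent with a simple view check on the receiving side. Below is a
hypothetical sketch of that check (class, method, and member-ID strings are
mine for illustration, not Geode's actual code):

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the check implied by the "ignoring suspect
// request ... from non-member" log lines: a suspect request is only
// processed if its sender is still in the current membership view.
public class SuspectFilterSketch {

  static String handleSuspectRequest(Set<String> currentView, String sender,
      List<String> suspects) {
    if (!currentView.contains(sender)) {
      // The sender (here vm0) already left, so its in-flight suspect
      // messages are dropped rather than acted on.
      return "ignoring suspect request from non-member " + sender;
    }
    return "processing suspects " + suspects;
  }

  public static void main(String[] args) {
    // After the view change, vm0 is gone but its queued suspect
    // messages are still arriving at vm1.
    Set<String> view = Set.of("locator:32770", "vm1:32772");
    System.out.println(handleSuspectRequest(view, "vm0:32771",
        List.of("locator:32770")));
  }
}
```

If the check works this way, the repeated log lines would just be vm1
discarding vm0's stale messages, not vm1 itself suspecting anyone.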

Re: Membership log message questions

Posted by Kirk Lund <kl...@apache.org>.
Thanks Bruce! I'll change the test to use crashDistributedSystem.


Re: Membership log message questions

Posted by Bruce Schuchardt <bs...@pivotal.io>.
The problem is that the test is shutting down the cache in a
high-priority thread pool. When the ClusterDistributionManager tries to
close its thread pools, it waits a while for their threads to exit, so
the thread ends up waiting for itself to exit, and that's never going to
happen. During that time its membership services are still active. I
don't see the duplicate "gracefully left" messages you noted, but I do
see the other behavior. This is normal and ensures that a member that
gets into a strange state during shutdown like this is ejected from the
distributed system.

Changing the tests to use 
DistributedTestUtils.crashDistributedSystem(dm.getSystem()) will clear 
up this problem and will more correctly match the intent of the tests.
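The self-wait described above can be reproduced with plain
java.util.concurrent. This is a minimal sketch, not Geode code (class and
method names are mine), showing that a task can never successfully await
termination of the pool it is running in:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the self-wait: a task running inside a pool shuts
// that same pool down and then waits for its threads to exit -- but one
// of those threads is the one executing the task itself, so the wait
// always times out.
public class SelfShutdownDemo {

  static boolean shutDownFromInside() throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    final boolean[] terminated = {true};
    pool.submit(() -> {
      pool.shutdown();
      try {
        // Cannot succeed: the pool's only thread is busy running this
        // very task, so awaitTermination just burns the full timeout.
        terminated[0] = pool.awaitTermination(500, TimeUnit.MILLISECONDS);
      } catch (InterruptedException ignored) {
      }
    }).get();
    return terminated[0]; // false: the pool never terminated in time
  }

  public static void main(String[] args) throws Exception {
    System.out.println("terminated=" + shutDownFromInside());
  }
}
```

Crashing the system via DistributedTestUtils.crashDistributedSystem(dm.getSystem()),
as suggested above, sidesteps this graceful-close path entirely.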



Re: Membership log message questions

Posted by Bruce Schuchardt <bs...@pivotal.io>.
This is weird behavior, Kirk.  It looks like vm0 was behaving very badly
- I'll check into it.




Re: Membership log message questions

Posted by Kirk Lund <kl...@apache.org>.
If anyone really wants to look at and run the test directly, it's
*Bug41733DUnitTest* on develop.

Alternatively, you could pull the branch GEODE-1279-rename-regressionTests
from my fork (see *https://github.com/apache/geode/pull/1672
<https://github.com/apache/geode/pull/1672>*) where I've renamed the test
as *BucketCreationRequesterCrashHARegressionTest*, cleaned it up and
temporarily added a locator to play with membership config (to see what
effect that has on the membership logging and execution time).
