Posted to dev@geode.apache.org by Aravind Musigumpula <Ar...@amdocs.com> on 2017/11/03 09:52:38 UTC

RE: Monitor the neighbour JVM using neighbour's member-timeout

Thanks, Bruce, for the suggestions. I will move the new variables from InternalDistributedMember to NetView and make the changes needed for backward compatibility.

Now I understand that there is another way a member can be removed from the view: if a member sends a message and waits for ack-wait-threshold without a response from the target, the sender does a final check and removes the target from the view if there is still no response.
But I don't understand how deprecating member-timeout, ack-wait-threshold and ack-severe-alert-threshold and merging them into one setting will solve the problem. The main problem is that we want a member to survive in the view for a longer time than others.

If we merge the settings into one and pass it to the monitoring member (say A), then A will use the timeout of the target member (say B, which we want to survive in the view for a longer time) for health monitoring, and ack-wait-threshold to wait for a response to any message before doing the final check.
But what if some other member (say C), which is monitoring another member (say D), has smaller values for member-timeout and ack-wait-threshold? If C sends a message to B, C uses the smaller ack-wait-threshold (the one that applies to member D) while waiting for a response, and then does the final check based on the smaller member-timeout. So member B can still be kicked out of the view in a short amount of time.

I think this can be solved simply by using the member-timeout of the suspected member in the final check, where we establish a TCP connection. We don't need to combine those three settings either. We can set the member-timeout of a particular member to a higher value; the member that monitors it uses its own member-timeout as it does now, but during the final check it uses the suspected member's member-timeout (which is the greater value). The final check is common to both the no-heartbeat scenario and the no-response-to-a-message scenario.
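
A minimal sketch of this proposal in Python, not Geode code. The names (Member, final_check_timeout_ms, time_to_removal_ms) are hypothetical, all values are treated as milliseconds for simplicity, and the timing model (one initial wait on the checker's settings, then one final check) is a simplification of the two paths described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    member_timeout_ms: int       # this member's own member-timeout
    ack_wait_threshold_ms: int   # how long it waits for a reply to a message


def final_check_timeout_ms(checker: Member, suspect: Member,
                           use_suspect_timeout: bool) -> int:
    """Timeout applied in the final check (the TCP connection attempt)."""
    # Current behaviour: the checker's own member-timeout.
    # Proposed behaviour: the suspected member's member-timeout.
    if use_suspect_timeout:
        return suspect.member_timeout_ms
    return checker.member_timeout_ms


def time_to_removal_ms(checker: Member, suspect: Member, path: str,
                       use_suspect_timeout: bool) -> int:
    """Simplified model: initial wait on the checker's settings, then one final check."""
    if path == "heartbeat":
        initial_wait = checker.member_timeout_ms
    elif path == "ack":
        initial_wait = checker.ack_wait_threshold_ms
    else:
        raise ValueError(f"unknown path: {path}")
    return initial_wait + final_check_timeout_ms(checker, suspect, use_suspect_timeout)


# B should survive longer; C is configured with a small member-timeout.
b = Member("B", member_timeout_ms=30_000, ack_wait_threshold_ms=15_000)
c = Member("C", member_timeout_ms=5_000, ack_wait_threshold_ms=15_000)

for path in ("heartbeat", "ack"):
    current = time_to_removal_ms(c, b, path, use_suspect_timeout=False)
    proposed = time_to_removal_ms(c, b, path, use_suspect_timeout=True)
    print(f"{path}: current={current} ms, proposed={proposed} ms")
```

In this model the final check is the only step both paths share, which is why changing the timeout there affects the no-heartbeat case and the no-ack case alike.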

Are there any concerns about this new proposal?

Thanks,
Aravind Musigumpula 

-----Original Message-----
From: Bruce Schuchardt [mailto:bschuchardt@pivotal.io]
Sent: Thursday, September 07, 2017 10:42 PM
To: dev@geode.apache.org
Subject: Re: Monitor the neighbour JVM using neighbour's member-timeout

I think this might be an acceptable change though I doubt many people would find it useful.

It's already possible to set different member-timeouts on each node of the distributed system but the meaning of the setting is the inverse of what's proposed here, so having the current setting be different in each node is pretty useless.

I think the initiation of suspect processing ought to be addressed if we make this change.  The ack-wait-threshold and ack-severe-alert-threshold aren't based on the member-timeout but ought to be.  This would make it possible to initiate suspect processing with different timing for different nodes.  It would still leave the question of slow backup operations hanging:  If you're waiting for one node that's blocked waiting for a response from another node (say a node holding a backup bucket) you are going to initiate suspect processing on the node you're waiting on & not those other (backup) nodes.

Rolling upgrade will also be a problem since old members aren't going to cough up their member-timeout settings.  What should be used as a membership timeout for the old members during an upgrade?

If we proceed with this idea I'd prefer that we deprecate member-timeout, ack-wait-threshold and ack-severe-alert-threshold and have new settings with the "ack" settings being multiples of the new membership timeout setting.
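
One way to read this suggestion, sketched in Python under assumptions: the setting name membership_timeout_ms and the multiplier values are hypothetical, since the thread does not name the new settings or their factors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MembershipTimings:
    membership_timeout_ms: int            # hypothetical replacement for member-timeout
    ack_wait_multiplier: int = 3          # assumed factor, not from the thread
    ack_severe_alert_multiplier: int = 6  # assumed factor, not from the thread

    def __post_init__(self) -> None:
        if self.membership_timeout_ms <= 0:
            raise ValueError("membership_timeout_ms must be positive")
        if self.ack_wait_multiplier <= 0 or self.ack_severe_alert_multiplier <= 0:
            raise ValueError("multipliers must be positive")

    @property
    def ack_wait_ms(self) -> int:
        # The "ack" wait is derived from the membership timeout, not set separately.
        return self.membership_timeout_ms * self.ack_wait_multiplier

    @property
    def ack_severe_alert_ms(self) -> int:
        return self.membership_timeout_ms * self.ack_severe_alert_multiplier


fast = MembershipTimings(membership_timeout_ms=5_000)
slow = MembershipTimings(membership_timeout_ms=30_000)
print(fast.ack_wait_ms, fast.ack_severe_alert_ms)
print(slow.ack_wait_ms, slow.ack_severe_alert_ms)
```

With derived values, raising one node's membership timeout also stretches the point at which suspect processing starts for that node, which is the coupling asked for above.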

Concerning the PR, it isn't acceptable in its current form.  InternalDistributedMember identifiers are often transmitted in messages and increasing their size affects performance.  Any new member attributes need to be added to NetView instead of InternalDistributedMember.
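
A rough Python model of the last two concerns, with hypothetical names throughout (MemberId, View, DEFAULT_MEMBER_TIMEOUT_MS); it is not Geode's NetView. It keeps the identifier small by storing per-member timeouts once in the view, and shows one possible answer to the rolling-upgrade question: fall back to a default for members that did not advertise a timeout. The 5000 ms fallback is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

DEFAULT_MEMBER_TIMEOUT_MS = 5_000  # assumed fallback, for illustration only


@dataclass(frozen=True)
class MemberId:
    """Identifier sent in many messages, so it carries no extra attributes."""
    host: str
    port: int


@dataclass
class View:
    """View-level state: per-member attributes are stored here, once per view."""
    members: List[MemberId] = field(default_factory=list)
    member_timeouts_ms: Dict[MemberId, int] = field(default_factory=dict)

    def add(self, member: MemberId, timeout_ms: Optional[int] = None) -> None:
        self.members.append(member)
        if timeout_ms is not None:
            self.member_timeouts_ms[member] = timeout_ms

    def timeout_for(self, member: MemberId) -> int:
        # Old members never advertised a timeout, so use the fallback.
        return self.member_timeouts_ms.get(member, DEFAULT_MEMBER_TIMEOUT_MS)


view = View()
new_member = MemberId("host-a", 41000)
old_member = MemberId("host-b", 41001)
view.add(new_member, timeout_ms=30_000)
view.add(old_member)  # pre-upgrade member: no timeout advertised

print(view.timeout_for(new_member))
print(view.timeout_for(old_member))
```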

On 8/22/17 12:35 AM, Aravind Musigumpula wrote:
> Hi Team,
>
> We have a requirement to configure different member-timeouts for different members, because we need some members to survive in the view longer than others before being kicked out when they aren't responding.
>
> 1. With the current monitoring system it is not possible to determine when a member will be kicked out of the view if we configure different member-timeouts for some members.
>
> 2. If a member does not respond to heartbeat requests, the member monitoring it initiates a check-member request.
>
> 3. In this check-member request, the monitoring member pings the non-responding member and waits for a response for the monitoring member's member-timeout.
>
> 4. If there is still no response, it sends a final suspect request to the coordinator, and the coordinator does the final check, waiting for the coordinator's member-timeout.
>
> 5. If the coordinator does not get a response, it removes the non-responding member from the view and publishes the new view.
>
> 6. So the time taken to remove a member depends on the timeouts of its monitoring member and of the coordinator. But which member does the monitoring depends on the view, and that may change from time to time.
>
> So, if the monitoring member waits for the non-responding member's member-timeout instead of its own when doing the check, the time at which the non-responding member is removed from the view depends on its own member-timeout and the coordinator's member-timeout.
> Hence we can configure different member-timeouts for the required members.
>
> I created a pull request based on the above scenario: 
> https://github.com/apache/geode/pull/717
>
> Is the above approach correct? Do we have any concerns around this area?
> Please give your insights on this issue.
>
> Thanks,
> Aravind Musigumpula
>
> This message and the information contained herein is proprietary and 
> confidential and subject to the Amdocs policy statement,
>
> you may review at https://www.amdocs.com/about/email-disclaimer
> <https://www.amdocs.com/about/email-disclaimer>
>
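
The timing described in steps 3 to 6 of the quoted message can be sketched in Python. This is a simplified model, not Geode's implementation: it assumes each member is monitored by the member listed just before it in the view, that the first member in the view is the coordinator, and that removal takes one check-member wait plus one coordinator wait. All names are hypothetical.

```python
from typing import Dict, List


def monitor_of(view: List[str], suspect: str) -> str:
    """Assumed rule: the member just before the suspect in the view monitors it."""
    i = view.index(suspect)
    return view[i - 1]  # index -1 wraps around to the last member


def removal_time_ms(view: List[str], timeouts: Dict[str, int], suspect: str,
                    use_suspect_timeout: bool = False) -> int:
    monitor = monitor_of(view, suspect)
    coordinator = view[0]  # assumed: the first member coordinates
    # Step 3: the check-member wait (monitor's timeout today, suspect's if proposed).
    check_wait = timeouts[suspect] if use_suspect_timeout else timeouts[monitor]
    # Step 4: the coordinator's final check uses the coordinator's timeout.
    return check_wait + timeouts[coordinator]


timeouts = {"coord": 5_000, "A": 5_000, "B": 30_000, "C": 10_000}

view_1 = ["coord", "A", "B", "C"]  # A monitors B
view_2 = ["coord", "C", "B", "A"]  # after a view change, C monitors B

for view in (view_1, view_2):
    current = removal_time_ms(view, timeouts, "B")
    proposed = removal_time_ms(view, timeouts, "B", use_suspect_timeout=True)
    print(monitor_of(view, "B"), current, proposed)
```

Under these assumptions the current removal time for B changes when the view changes, while the proposed one depends only on B's own timeout and the coordinator's.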

RE: Monitor the neighbour JVM using neighbour's member-timeout

Posted by Aravind Musigumpula <Ar...@amdocs.com>.

Hi Community,

Any comments on the message below?


Thanks,
Aravind Musigumpula 


-----Original Message-----
From: Bruce Schuchardt [mailto:bschuchardt@pivotal.io] 
Sent: Thursday, January 18, 2018 11:58 PM
To: dev@geode.apache.org
Subject: Re: Monitor the neighbour JVM using neighbour's member-timeout

We don't use JGroups for membership anymore.  We rewrote all of it and now only use JGroups for UDP messaging.  We have complete control over the use of the member-timeout setting.

Aravind's idea is relevant to this group.

On 1/17/18 3:39 PM, Michael Stolz wrote:
> Pardon my ignorance, but is this something that should be brought up 
> on the JGroups community?
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Lead
> Mobile: +1-631-835-4771
> Download the new GemFire book here.
> <https://content.pivotal.io/ebooks/scaling-data-services-with-pivotal-gemfire>
>
> On Wed, Jan 17, 2018 at 2:37 AM, Aravind Musigumpula < 
> Aravind.Musigumpula@amdocs.com> wrote:
>
>> Hi Everyone,
>>
>> Consider a Geode cluster in which some nodes hold data that is
>> critical to the business and host a large amount of it. Other nodes
>> may host data that is not critical to the business and hold less data
>> than the first type of node.
>>
>> If both types of node are going through some operation that makes
>> them unresponsive, the former may take a couple of seconds longer
>> than the latter to respond.
>>
>> In this scenario, is it fair to give the same member-timeout to all
>> the members?
>> What if we want to wait a little longer for such nodes?
>>
>> In the present Geode configuration we cannot wait a little longer for
>> some nodes than for others, although we can configure a different
>> member-timeout on each node. But I think no one will ever configure
>> different timeouts for each node, because those member-timeouts are
>> used to monitor their neighbors.
>>
>> In this solution, all we do is wait for the suspected member's
>> member-timeout instead of the checking member's own timeout during
>> the final check.
>> It also has no backward-compatibility implications: anyone who wants
>> the existing behavior can continue to use the same member-timeout for
>> all the nodes, so the behavior of the system is preserved.
>>
>> If you have any concerns about this solution, please let me know.
>>
>>
>> Thanks,
>> Aravind Musigumpula
>>
>>
>> -----Original Message-----
>> From: Aravind Musigumpula
>> Sent: Monday, December 18, 2017 6:55 PM
>> To: dev@geode.apache.org
>> Subject: RE: Monitor the neighbour JVM using neighbour's member-timeout
>>
>> Hi Community,
>>
>> Can you please give your suggestions on the solution below?
>>
>> I have raised a pull request for it:
>> https://github.com/apache/geode/pull/1075
>>
>>
>> Thanks,
>> Aravind Musigumpula
>>

Re: Monitor the neighbour JVM using neighbour's member-timeout

Posted by Bruce Schuchardt <bs...@pivotal.io>.

We don't use JGroups for membership anymore.  We rewrote all of it and 
now only use JGroups for UDP messaging.  We have complete control over 
the use of the member-timeout setting.

Aravind's idea is relevant to this group.

On 1/17/18 3:39 PM, Michael Stolz wrote:
> Pardon my ignorance, but is this something that should be brought up on the
> JGroups community?

Re: Monitor the neighbour JVM using neighbour's member-timeout

Posted by Michael Stolz <ms...@pivotal.io>.

Pardon my ignorance, but is this something that should be brought up on the
JGroups community?

--
Mike Stolz
Principal Engineer, GemFire Product Lead
Mobile: +1-631-835-4771
Download the new GemFire book here.
<https://content.pivotal.io/ebooks/scaling-data-services-with-pivotal-gemfire>

On Wed, Jan 17, 2018 at 2:37 AM, Aravind Musigumpula <
Aravind.Musigumpula@amdocs.com> wrote:

> Hi Everyone,
>
> Consider a Geode cluster in which some nodes contain a particular type of
> data which is critical to the business and hosts a large amount of data.
> Some nodes may host data which is not critical to the business and hosts
> less amount of data compared to the previous type of nodes.
>
> If both the type of nodes are going through some operation which is making
> them unresponsive, the former type of node may take a couple of seconds
> extra than the later to respond.
>
> In this scenario is it fair to give the same member-timeout to all the
> members?
> What if we want to wait for a little longer time for such nodes.
>
> In the present configuration in geode, we cannot wait a little longer for
> some nodes when compared to do this although we can configure different
> member-timeout for all the nodes. But i think no one will ever configure
> different timeouts for each node because those member-timeouts will be used
> to monitor their neighbors.
>
> In this solution, we all do is wait for the suspected member-timeout
> instead of its own timeout during final check.
> It has no backward implications also, if somebody wants to use the
> existing behavior they will continue to use the same member-timeouts for
> all the nodes. So the behavior of the system is preserved.
>
> If you have any concerns in this solution, please let me know.
>
>
> Thanks,
> Aravind Musigumpula
>
>
> -----Original Message-----
> From: Aravind Musigumpula
> Sent: Monday, December 18, 2017 6:55 PM
> To: dev@geode.apache.org
> Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout
>
> Hi Community,
>
> Can you please give your suggestions on the below solution.
>
> I have raised a pull request for the same : https://github.com/apache/
> geode/pull/1075 .
>
>
> Thanks,
> Aravind Musigumpula
>
> -----Original Message-----
> From: Aravind Musigumpula
> Sent: Friday, November 03, 2017 3:23 PM
> To: dev@geode.apache.org
> Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout
>
> Thanks Bruce for suggestions, I will change the new variables from
> InternalDistributedMember to NetView and do changes related to backward
> compatibility.
>
> Now I know that there is another way that member can be removed from the
> view i.e if any member is sending a message and waits for
> ack-wait-threshold, if there is no response from the target the sender will
> do final check and remove it from the view if there is still no response.
> But I don't understand how deprecating the settings member-timeout,
> ack-wait-threshold, ack-severe-alert-threshold into one will solve the
> problem. The main problem is that we want a member to survive in the view
> for longer time than others.
>
> If we deprecate the settings into one setting and pass the setting to
> monitoring member(say A), then it will use the target member(say B which we
> want to survive in view for longer time) timeout for health monitoring and
> ack-wait-threshold to wait for the response for any message before doing
> final check.
> But what if some other member(say C) which is monitoring any other
> member(say D) have the member-timeout and ack-wait-threshold some smaller
> values. So if member C messages to B, C uses the smaller value of
> ack-wait-threshold(which is of member D) to get a response and does the
> final check again on basis of smaller member-timeout. So still member B can
> be kicked out of the view in small amount of time.
>
> I think this can be solved simply if we use the member-timeout of
> suspected member in the final check where we establish TCP connection. We
> don't need to club those three settings as well. We can set the
> member-timeout of a particular member to a higher value and the member
> which monitors it uses its own member-timeout as it is now, but during the
> final check it uses the suspected member-timeout(which is a greater value).
> The final check is common place in both the no heartbeat scenario and no
> response for a message scenario.
>
> Are there any concerns around this new proposal ?
>
>
> Thanks,
> Aravind Musigumpula
>

RE: Monitor the neighbour JVM using neihbour's member-timeout

Posted by Aravind Musigumpula <Ar...@amdocs.com>.

Hi Everyone,

Consider a Geode cluster in which some nodes host a large amount of data that is critical to the business, while other nodes host a smaller amount of data that is not critical.

If both types of node go through some operation that makes them unresponsive, the former type may take a couple of seconds longer than the latter to respond.

In this scenario, is it fair to give the same member-timeout to all the members?
What if we want to wait a little longer for such nodes?

With the present configuration in Geode, we cannot wait a little longer for some nodes than for others, even though we can configure a different member-timeout for each node. I think no one will ever configure different timeouts for each node, because a node's member-timeout is used to monitor its neighbours, not the node itself.
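
To make the configuration point concrete, here is a small sketch of giving two nodes different member-timeout values. It uses plain java.util.Properties so that it runs without Geode on the classpath; the node roles and the 15000 ms value are assumptions for illustration, and 5000 ms is used for the ordinary node because that is, as far as I know, the default.

```java
import java.util.Properties;

public class PerNodeTimeoutConfig {
  // Builds the properties one member would be started with.
  // Only the "member-timeout" key (milliseconds) is set here.
  public static Properties forNode(int memberTimeoutMs) {
    Properties props = new Properties();
    props.setProperty("member-timeout", Integer.toString(memberTimeoutMs));
    return props;
  }

  public static void main(String[] args) {
    Properties criticalNode = forNode(15000); // hypothetical large, critical node
    Properties ordinaryNode = forNode(5000);  // hypothetical small node
    System.out.println("critical member-timeout: " + criticalNode.getProperty("member-timeout"));
    System.out.println("ordinary member-timeout: " + ordinaryNode.getProperty("member-timeout"));
  }
}
```

With the current behavior, the larger value on the critical node only changes how long that node waits for the member it monitors; it does not change how long other members wait for the critical node.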

In this solution, all we do is wait for the suspected member's member-timeout, instead of the checking member's own timeout, during the final check.
It also has no backward-compatibility implications: anybody who wants the existing behavior can continue to use the same member-timeout for all the nodes, so the behavior of the system is preserved.
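
A minimal sketch of the rule described above, not the code in the pull request. The class, the method names and the map of advertised timeouts are all hypothetical; the map stands in for whatever per-member attribute would be distributed with the membership view. Falling back to the checker's own timeout when a suspect has not advertised one is also an assumption, added only to show why equal or unknown timeouts give the existing behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class FinalCheckTimeoutSketch {
  // The checking member's own member-timeout, in milliseconds.
  private final int localMemberTimeoutMs;
  // Hypothetical: member-timeouts that other members have advertised.
  private final Map<String, Integer> advertisedTimeoutsMs = new HashMap<>();

  public FinalCheckTimeoutSketch(int localMemberTimeoutMs) {
    this.localMemberTimeoutMs = localMemberTimeoutMs;
  }

  public void advertise(String memberId, int memberTimeoutMs) {
    advertisedTimeoutsMs.put(memberId, memberTimeoutMs);
  }

  // Timeout to use in the final check: the suspect's own member-timeout
  // if it is known, otherwise the checker's timeout (assumed fallback).
  public int finalCheckTimeoutMs(String suspectId) {
    return advertisedTimeoutsMs.getOrDefault(suspectId, localMemberTimeoutMs);
  }

  public static void main(String[] args) {
    FinalCheckTimeoutSketch checker = new FinalCheckTimeoutSketch(5000);
    checker.advertise("B", 15000); // B should survive longer
    System.out.println("final check on B waits " + checker.finalCheckTimeoutMs("B") + " ms");
    System.out.println("final check on D waits " + checker.finalCheckTimeoutMs("D") + " ms");
  }
}
```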

If you have any concerns about this solution, please let me know.

Thanks,
Aravind Musigumpula 

-----Original Message-----
From: Aravind Musigumpula 
Sent: Monday, December 18, 2017 6:55 PM
To: dev@geode.apache.org
Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout

Hi Community,

Could you please give your suggestions on the solution below?

I have raised a pull request for it: https://github.com/apache/geode/pull/1075

Thanks,
Aravind Musigumpula 

-----Original Message-----
From: Aravind Musigumpula
Sent: Friday, November 03, 2017 3:23 PM
To: dev@geode.apache.org
Subject: RE: Monitor the neighbour JVM using neihbour's member-timeout

Thanks, Bruce, for the suggestions. I will move the new variables from InternalDistributedMember to NetView and make the changes needed for backward compatibility.

I now understand that there is another way a member can be removed from the view: if a member sends a message and waits for ack-wait-threshold without getting a response from the target, the sender does a final check and removes the target from the view if there is still no response.
But I don't understand how deprecating member-timeout, ack-wait-threshold and ack-severe-alert-threshold and combining them into one setting will solve the problem. The main problem is that we want a member to survive in the view for a longer time than the others.

If we combine the settings into one and pass it to the monitoring member (say A), then A will use the timeout of the target member (say B, which we want to survive in the view for a longer time) for health monitoring, and ack-wait-threshold to wait for the response to any message before doing the final check.
But what if some other member (say C), which is monitoring another member (say D), has smaller values for member-timeout and ack-wait-threshold? If C sends a message to B, C uses the smaller ack-wait-threshold (that of member D) while waiting for a response, and then does the final check on the basis of the smaller member-timeout. So member B can still be kicked out of the view in a short amount of time.

I think this can be solved simply by using the member-timeout of the suspected member in the final check, where we establish a TCP connection. We don't need to combine those three settings either. We can set the member-timeout of a particular member to a higher value; the member that monitors it uses its own member-timeout as it does now, but during the final check it uses the suspected member's member-timeout (which is the greater value). The final check is common to both the no-heartbeat scenario and the no-response-to-a-message scenario.
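
The C-messages-B case above can be put into rough numbers. This is a deliberately simplified model, not Geode's actual timing: it assumes the time before removal is just the sender's ack-wait-threshold (seconds) plus the final-check wait (milliseconds), and ignores network delay and everything else. The class, enum and method names are hypothetical, and the values are made up for the example.

```java
public class RemovalTimeSketch {
  // Which member-timeout the final check waits for.
  public enum FinalCheckPolicy { CHECKER_TIMEOUT, SUSPECT_TIMEOUT }

  // Simplified estimate: ack wait of the sender, then the final check.
  public static long msUntilRemoval(FinalCheckPolicy policy,
                                    int senderAckWaitSeconds,
                                    int senderMemberTimeoutMs,
                                    int suspectMemberTimeoutMs) {
    long ackWaitMs = senderAckWaitSeconds * 1000L;
    int finalCheckMs = (policy == FinalCheckPolicy.CHECKER_TIMEOUT)
        ? senderMemberTimeoutMs
        : suspectMemberTimeoutMs;
    return ackWaitMs + finalCheckMs;
  }

  public static void main(String[] args) {
    int cAckWaitSeconds = 15;     // sender C
    int cMemberTimeoutMs = 5000;  // sender C
    int bMemberTimeoutMs = 20000; // suspect B, meant to survive longer

    System.out.println("current behavior: "
        + msUntilRemoval(FinalCheckPolicy.CHECKER_TIMEOUT, cAckWaitSeconds, cMemberTimeoutMs, bMemberTimeoutMs)
        + " ms");
    System.out.println("proposed behavior: "
        + msUntilRemoval(FinalCheckPolicy.SUSPECT_TIMEOUT, cAckWaitSeconds, cMemberTimeoutMs, bMemberTimeoutMs)
        + " ms");
  }
}
```

In this model the two policies give the same result whenever all members share one member-timeout, and only the proposed policy lets B's larger value take effect regardless of which member does the check.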

Are there any concerns about this new proposal?

Thanks,
Aravind Musigumpula 

-----Original Message-----
From: Bruce Schuchardt [mailto:bschuchardt@pivotal.io]
Sent: Thursday, September 07, 2017 10:42 PM
To: dev@geode.apache.org
Subject: Re: Monitor the neighbour JVM using neihbour's member-timeout

I think this might be an acceptable change though I doubt many people would find it useful.

It's already possible to set different member-timeouts on each node of the distributed system but the meaning of the setting is the inverse of what's proposed here, so having the current setting be different in each node is pretty useless.

I think the initiation of suspect processing ought to be addressed if we make this change.  The ack-wait-threshold and ack-severe-alert-threshold aren't based on the member-timeout but ought to be.  This would make it possible to initiate suspect processing with different timing for different nodes.  It would still leave the question of slow backup operations hanging:  If you're waiting for one node that's blocked waiting for a response from another node (say a node holding a backup bucket) you are going to initiate suspect processing on the node you're waiting on & not those other (backup) nodes.

Rolling upgrade will also be a problem since old members aren't going to cough up their member-timeout settings.  What should be used as a membership timeout for the old members during an upgrade?

If we proceed with this idea I'd prefer that we deprecate member-timeout, ack-wait-threshold and ack-severe-alert-threshold and have new settings with the "ack" settings being multiples of the new membership timeout setting.
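
As a sketch of what "ack settings as multiples of the membership timeout" could look like: the class, field and method names and the multiplier values below are all hypothetical, since no concrete names or values for the new settings have been proposed in this thread.

```java
public class DerivedAckSettings {
  private final long membershipTimeoutMs;   // hypothetical new per-member setting
  private final int ackWaitMultiple;        // hypothetical multiplier
  private final int ackSevereAlertMultiple; // hypothetical multiplier

  public DerivedAckSettings(long membershipTimeoutMs, int ackWaitMultiple, int ackSevereAlertMultiple) {
    if (membershipTimeoutMs <= 0 || ackWaitMultiple <= 0 || ackSevereAlertMultiple <= 0) {
      throw new IllegalArgumentException("timeout and multiples must be positive");
    }
    this.membershipTimeoutMs = membershipTimeoutMs;
    this.ackWaitMultiple = ackWaitMultiple;
    this.ackSevereAlertMultiple = ackSevereAlertMultiple;
  }

  // Wait for an ack before starting suspect processing on the target.
  public long ackWaitMs() {
    return membershipTimeoutMs * ackWaitMultiple;
  }

  // Further wait before raising a severe alert.
  public long ackSevereAlertMs() {
    return membershipTimeoutMs * ackSevereAlertMultiple;
  }

  public static void main(String[] args) {
    DerivedAckSettings ordinary = new DerivedAckSettings(5000, 3, 6);
    DerivedAckSettings critical = new DerivedAckSettings(15000, 3, 6);
    System.out.println("ordinary ack wait: " + ordinary.ackWaitMs() + " ms");
    System.out.println("critical ack wait: " + critical.ackWaitMs() + " ms");
  }
}
```

Because both thresholds scale with the target's membership timeout, a node given a longer timeout would also be given proportionally more time before suspect processing starts on it.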

Concerning the PR, it isn't acceptable in its current form. 
InternalDistributedMember identifiers are often transmitted in messages and increasing their size affects performance.  Any new member attributes need to be added to NetView instead of InternalDistributedMember.

On 8/22/17 12:35 AM, Aravind Musigumpula wrote:
> Hi Team,
>
> We have a requirement to configure different member-timeouts for different members, as we need some members to survive in the view longer than the others before being kicked out of the view when they aren't responding.
>
> 1. With the current monitoring system, it is not possible to determine when a member will be kicked out of the view if we configure different member-timeouts for some members.
>
> 2. This is because, if a member is not responding to heartbeat requests, the member that is monitoring it initiates a check-member request.
>
> 3. In this check-member request, the monitoring member pings the non-responding member and waits for the monitoring member's member-timeout for a response.
>
> 4. If there is still no response, it initiates a final suspect request to the coordinator, and the coordinator does the final check, waiting for the coordinator's member-timeout.
>
> 5. If the coordinator does not get a response, it removes the non-responding member from the view and publishes the new view.
>
> 6. So the time taken to remove a member depends on the timeouts of its monitoring member and of the coordinator. But which member does the monitoring depends on the view, which may change from time to time.
>
> So, if the monitoring member, when doing the check on a member, waits for the non-responding member's timeout instead of its own member-timeout, then the time at which the non-responding member is removed from the view depends on its own member-timeout and the coordinator's member-timeout.
> Hence we can configure a different member-timeout for the members that need it.
>
> I created a pull request based on the above scenario: 
> https://github.com/apache/geode/pull/717
>
> Is the above approach correct? Do we have any concerns around this area?
> Please give your insights on this issue.
>
> Thanks,
> Aravind Musigumpula
>
> This message and the information contained herein is proprietary and 
> confidential and subject to the Amdocs policy statement,
>
> you may review at https://www.amdocs.com/about/email-disclaimer
> <https://www.amdocs.com/about/email-disclaimer>
>


Re: Monitor the neighbour JVM using neihbour's member-timeout

Posted by Bruce Schuchardt <bs...@pivotal.io>.

Hi Aravind,

Your arguments about the final-check seem sound to me.  I put some 
further comments on your PR.

Regards,

Bruce S.


On 12/18/17 5:25 AM, Aravind Musigumpula wrote:
> Hi Community,
>
> Could you please give your suggestions on the solution below?
>
> I have raised a pull request for it: https://github.com/apache/geode/pull/1075
>
>
> Thanks,
> Aravind Musigumpula

RE: Monitor the neighbour JVM using neihbour's member-timeout

Posted by Aravind Musigumpula <Ar...@amdocs.com>.

Hi Community,

Could you please give your suggestions on the solution below?

I have raised a pull request for it: https://github.com/apache/geode/pull/1075

Thanks,
Aravind Musigumpula 
