Posted to dev@geode.apache.org by Mario Salazar de Torres <ma...@est.tech> on 2020/11/21 10:16:47 UTC

Requests taking too long if one member of the cluster fails

Hi,

I've been looking into the following issue:
"Whenever performing a stress test on a Geode cluster and forcefully killing one of the members, all the threads in the application get stuck".

To give more context, these are the conditions under which the test is performed:

  *   A cluster is deployed with:
     *   2 locators.
     *   3 servers.
  *   2 partitioned regions are created and collocated with a third one (from now on called the "anchor").
     *   Also, regions have a single redundant copy configured.
     *   Whether or not persistence is enabled on these regions does not affect the test outcome.
     *   Note that we've configured a PartitionResolver for both of these regions (a sketch of such a resolver follows the traffic example below).
  *   A geode-native test application is spun up with 20 threads, each sending one put request to each of the partitioned regions (except for the "anchor"), all within a single transaction. See the example below for the kind of traffic sent:
void thread() {
  while (true) {
    auto common_prefix = std::to_string(std::time(nullptr));
    tx_manager->begin();
    for (const auto& region_name : {"region_a", "region_b"}) {
      auto key = "key-" + common_prefix + "|" + std::to_string(std::rand());
      auto value = std::to_string(std::rand());
      cache->getRegion(region_name)->put(key, value);
    }
    tx_manager->commit();
  }
}
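
Since both regions are configured with a PartitionResolver, here is a minimal sketch of what such a resolver could look like, assuming the PartitionResolver interface exposed by the geode-native C++ library (getName() / getRoutingObject()). The class name and the routing scheme (routing on the part of the key before the '|', i.e. the shared time prefix) are made up for illustration and are not the resolver we actually use:

#include <memory>
#include <string>
#include <geode/CacheableString.hpp>
#include <geode/EntryEvent.hpp>
#include <geode/PartitionResolver.hpp>

using namespace apache::geode::client;

// Hypothetical resolver: routes on the "key-<timestamp>" prefix so that the
// entries written inside one transaction land in the same bucket.
class PrefixPartitionResolver : public PartitionResolver {
 public:
  const std::string& getName() override {
    static const std::string name = "PrefixPartitionResolver";
    return name;
  }

  std::shared_ptr<CacheableKey> getRoutingObject(const EntryEvent& event) override {
    auto key = event.getKey()->toString();
    auto pos = key.find('|');
    // Everything before '|' (the common time prefix) determines the bucket.
    return CacheableString::create(pos == std::string::npos ? key : key.substr(0, pos));
  }
};

Routing both entries of a transaction onto the same value keeps them in the same bucket, which matters because a transaction touching partitioned regions has to be hosted on a single member.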

The test consists of:

  *   Spinning up the cluster.
  *   Running the application.
  *   One of the servers (from now on called "server-0") is forcefully restarted by killing it with kill -KILL <PID> and then starting it up again with gfsh.

The expectation of this test is that, given that the data has a redundant copy and 2 servers are up and running at all times, writing data should be handled smoothly.
However, what actually happens is that all application threads end up being stuck.

So, in the process of troubleshooting, we noticed that there were several deadlocks in the geode-native client, which resulted in the following PRs:

  *   https://github.com/apache/geode-native/pull/660
  *   https://github.com/apache/geode-native/pull/676
  *   https://github.com/apache/geode-native/pull/699

After solving all the deadlocks on the client side, we were still noticing the same outcome in the test.
So, after more digging, here is what we noticed:

  *   Once the server is killed, geode-native removes the server endpoint from the ClientMetadataService.
  *   But given that put requests can only be executed on the server holding the primary copy, these requests ended up being proxied towards the server that was just killed.
  *   As it takes some time for the cluster members to notice that other members are down, requests proxied through "healthy" servers take longer than expected. Something between 5-30 seconds.
  *   So, in the end, all the threads are stuck for this interval of time because the servers they are contacting are, in turn, contacting "server-0".

For the sake of clarity I've attached a diagram demonstrating the test scenario. Let me know any additional clarifications you might need to understand the test itself.

And now, my questions here are:

  *   Have you encountered this behavior before? And if so, how did you solve it?
  *   Is this expected behavior? And if so, what's the point of having a cluster of several members with partitioned redundant data?

Sorry for the long read and thanks for any help you can throw in.

BR,
Mario.

Re: Requests taking too long if one member of the cluster fails

Posted by Jacob Barrett <ja...@vmware.com>.
On the native side of things I would suggest trying the same test with a Java client and compare. It is very possible the C++ client is lacking in its ability to respond to failures as timely as the more heavily used Java client.

-Jake

Re: Requests taking too long if one member of the cluster fails

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi @John Blum,

I am grateful for your explanation. Really, thanks! It has been an instructive read.
I finally understood why, if you allow writes on all the replicas, you end up risking consistency.
Consequently, I understood that, as in life itself, with distributed databases you can't have everything (C, A and P).

So yes, we'll have to tune the parameters in our cluster setup so that the time during which requests fail (the ones falling into the buckets for which the sick server is the primary owner) is reduced.
And we are tuning the client parameters so that requests that are going to fail do so quickly, allowing the ones that are not supposed to fail to enter the processing queue straight away.
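
For illustration, this is roughly the kind of client-side tuning meant here, assuming the geode-native C++ CacheFactory/PoolFactory API; the locator address, timeout and retry values are placeholders rather than our real settings:

#include <chrono>
#include <geode/CacheFactory.hpp>
#include <geode/PoolManager.hpp>
#include <geode/RegionShortcut.hpp>

using namespace apache::geode::client;

int main() {
  auto cache = CacheFactory().create();

  // Fail fast instead of hanging on a dead primary: short read timeout and a
  // single retry, so the error surfaces to the application thread quickly.
  cache.getPoolManager()
      .createFactory()
      .addLocator("locator-0", 10334)
      .setReadTimeout(std::chrono::milliseconds(500))
      .setRetryAttempts(1)
      .create("default");

  auto region = cache.createRegionFactory(RegionShortcut::PROXY)
                    .setPoolName("default")
                    .create("region_a");
  region->put("key", "value");
}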

I'll follow up on how it goes once all the tests are executed πŸ™‚

BR,
Mario.

Re: Requests taking too long if one member of the cluster fails

Posted by John Blum <jb...@vmware.com>.
Hi Mario-


1) Regarding why only write to the primary (bucket) of a PR (?)... again, it has to do with consistency.

Fundamentally, a distributed system is constrained by CAP.  The system can either be consistent or available in the face of network partitions.  You cannot have your cake and eat it too, πŸ˜‰.

By design, Apache Geode favors consistency over availability.  However, it doesn't mean Geode becomes unavailable when a node or nodes, or the network, fails. With the ability to configure redundant copies, it is more like "limited availability" when a member or portion of the cluster is severed from the rest of the system, until the member(s) or network recovers.

But, to guarantee consistency, a single node (i.e. the "primary") hosting the PR must be "the source of truth".  If writes are allowed to go to secondaries, then you need a sophisticated consensus algorithm (e.g. Paxos, Raft) to resolve conflicts when 2 or more writes in different copies change the same logical object but differ in value.  Therefore, writes go to the primary and are then distributed to the secondaries (which require an ACK) while holding a lock.
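
To illustrate the write path just described (the primary applies the write and then distributes it to the secondaries, each of which must ACK, all while a lock is held), here is a toy model; it is illustrative only, not Geode's implementation, and every name in it is invented:

#include <map>
#include <mutex>
#include <string>
#include <vector>

// Toy model of a primary-copy write. Writers are serialized on the primary,
// and the write only completes once every secondary has acknowledged it.
struct Copy {
  std::map<std::string, std::string> entries;
  bool replicate(const std::string& key, const std::string& value) {
    entries[key] = value;  // a secondary applies the update...
    return true;           // ...and acknowledges it
  }
};

struct Bucket {
  std::mutex entry_lock;            // serializes writers on the primary
  Copy primary;
  std::vector<Copy*> secondaries;

  void put(const std::string& key, const std::string& value) {
    std::lock_guard<std::mutex> guard(entry_lock);  // hold the lock...
    primary.entries[key] = value;                   // ...apply on the primary...
    for (auto* secondary : secondaries) {
      // ...and distribute to every secondary, requiring each ACK before the
      // lock is released and the new value becomes visible to anyone else.
      while (!secondary->replicate(key, value)) {
        // retry until acknowledged
      }
    }
  }
};

Because no secondary is ever written directly, there is nothing to reconcile afterwards: the primary acts as the single control plane for that bucket.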

If you think about this in object-oriented terms, the safest object in a highly concurrent system is an immutable one.  However, if an object can be modified by multiple Threads, then it is safer if all Threads access the object through the same control plane to uphold the invariants of the object.

NOTE: For an object, serializing access through synchronization does increase contention.  However, keep in mind that a PR does not just have 1 primary.  Each bucket of the PR (defaults to 113; is tunable) has a primary, thereby reducing contention on writes.
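
To make the "one primary per bucket" point concrete, here is a back-of-the-envelope sketch of how a routing object is typically mapped onto one of the buckets; the function name is made up and std::hash merely stands in for the key's real hash code:

#include <cstddef>
#include <functional>
#include <string>

// A routing object is hashed and taken modulo the number of buckets
// (113 by default, tunable), and each resulting bucket has its own primary,
// so writes to different buckets do not contend with each other.
std::size_t bucketFor(const std::string& routing_object,
                      std::size_t total_num_buckets = 113) {
  return std::hash<std::string>{}(routing_object) % total_num_buckets;
}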

Finally, Geode's consistency guarantees are much more sophisticated than what I described above. You can read more about Geode's consistency here<https://geode.apache.org/docs/guide/113/developing/distributed_regions/region_entry_versions.html> [1] (an entire chapter has been dedicated to this very important topic).



2) Regarding member-timeout...

Can this setting be too low?  Yes, of course; you must be careful.

Setting too low of a member-timeout could result in the system thrashing between the member being kicked out and the member rejoining the system.

This is costly because, after a member is kicked out, the system must "restore redundancy".  When the member rejoins, a "fence & merge" process occurs, then the system may need to "rebalance" the data.

Why would a node bounce between being a member, and part of the system, and getting kicked out?

Well, it depends on your infrastructure, for one.  If you have an unreliable network (more applicable in cloud environments in certain cases), then minor but frequent network blips that sever 1 or more members could cause the member(s) to bounce between being kicked out and rejoining.  If enough members are severed from the system, then the system might need to decide on a quorum.

If a member is sick (e.g. running low on memory) thereby making the member seemingly unresponsive when in fact the member is just overloaded, this can cause issues.

There are many factors to consider when configuring Geode.  Don't simply set a property thinking it just solved your immediate problem when in fact it might have shifted the problem somewhere else.

The setting for member-timeout may very well be what you need, or you may need to consider other factors (e.g. the size of your system, both number of nodes as well as the size of the data, level of redundancy, you mention collocated data (this also is a factor), the environment, etc, etc).

This is the trickiest part of using any system like Geode.  You typically must tune it properly to your UC and requirements over several iterations to meet your SLAs.

This chapter<https://geode.apache.org/docs/guide/113/managing/monitor_tune/chapter_overview.html> [2] in the User Guide will be your friend.

I will let others chime in with their expertise/experience now.  Hopefully, this has given you some thoughts and things to consider.  Just remember, always test and measure, πŸ™‚

Cheers,
John


[1] https://geode.apache.org/docs/guide/113/developing/distributed_regions/region_entry_versions.html
[2] https://geode.apache.org/docs/guide/113/managing/monitor_tune/chapter_overview.html









________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Saturday, November 21, 2020 1:40 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Cc: miguel.g.garcia@ericsson.com <mi...@ericsson.com>
Subject: Re: Requests taking too long if one member of the cluster fails

Thanks @John Blum for your detailed explanation! It helped me better understand how redundancy works.

The thing is that all our use cases require a really low response time when performing operations.
Under normal conditions a "put" takes a few milliseconds, but in the described scenario, when a cluster member goes down it might take up to 30 seconds, sometimes even more.
One thing we've considered is setting a timeout on the client side, but still, upon retries it will face the same issue.

What I've noticed is that requests being proxied won't stop being sent to the failed server until:

  1.  One locator (I'd say the coordinator, please correct me if I am wrong here) does a final health check towards the member and member-timeout elapses without a successful ping.
  2.  Then one of the members holding a secondary copy of the buckets volunteers to become the primary owner.
  3.  That server becomes the primary owner.

So, what I've tried here is to set a really low member-timeout, which results in the server holding the secondary copy becoming the primary owner in around <600 ms. That's quite a huge improvement,
but I wanted to ask you whether setting this member-timeout too low might carry unforeseen consequences.

As for long-term solutions to remove or significantly reduce any impact upon a server failure, I've been thinking of the following:

  1.  At the last ApacheCon someone asked why a "put" is only done on the server holding the primary copy, and why not remove this constraint. My question here, out of ignorance: is this such a crazy idea?
I've seen that Redis has something called partial resync to solve the split-brain scenarios that would be caused by writing into the secondary while the primary is down.
  2.  Another alternative I've been thinking of is that, upon a connection failure (either a read-ack IO exception or connection refused), there could be an option for the
server owning a secondary copy that, if enabled, would make it volunteer to become the primary owner straight away.

NOTE. I am more familiar with the native client, so please feel free to correct me if I got something wrong, or if any of what I've written is a "bunch of gibberish" that makes no sense at all πŸ™‚
BTW. I am not sure that the test diagram was attached to the mail, so I've also uploaded it here: https://i.ibb.co/G7n6T0M/Geode-Server-Kill.jpg

Thanks again.
BR,
Mario
________________________________
From: John Blum <jb...@vmware.com>
Sent: Saturday, November 21, 2020 9:41 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Cc: miguel.g.garcia@ericsson.com <mi...@ericsson.com>
Subject: Re: Requests taking too long if one member of the cluster fails

DISCLAIMER: I am not knowledgeable about the Native Client (implementation) nor am I commenting specifically on the perf you are seeing, which can have many factors. However, in general...

Given you are performing "put" operations on a PR, then for consistency reasons, Geode is always going to "write" to the primary, on whichever member in the cluster hosts the primary for that particular PR (bucket).  So, if the member containing the primary for the PR goes down, then I would expect it to take more time than a normal "put" when no member goes down. Essentially, the cluster is going to shuffle things around and possibly rebalance the cluster in order to restore redundancy. When rebalancing, having collocated Regions could even further impact timing.

When performing a "put" operation, having redundancy is not going to sustain or improve performance, if that was what you were expecting. In fact, it could even potentially negatively impact performance when a node goes down, depending on the number of nodes and redundancy level.

Finally, if you were testing "gets" vs "puts", then I'd expect very little if any noticeable impact on performance, since you are using redundant copies, which should fail over in the case of a node failure.

Refer to the following sections in the User Guide for specifics:

1) Rebalancing PR Data: https://geode.apache.org/docs/guide/113/developing/partitioned_regions/rebalancing_pr_data.html (specifically, look at the section on 'How PR Rebalancing Works', which also talks about collocation).

2) Restoring Redundancy in PRs: https://geode.apache.org/docs/guide/113/developing/partitioned_regions/restoring_region_redundancy.html

3) Review your settings for 'member-timeout'. Search for this Geode property here:
https://geode.apache.org/docs/guide/113/reference/topics/gemfire_properties.html

4) Also, be mindful of the PR's 'recovery delay':
https://geode.apache.org/releases/latest/javadoc/org/apache/geode/cache/PartitionAttributes.html#getRecoveryDelay--


There may be other server-side (cluster-wide) settings you can configure for node failures as well that I am not recalling off the top of my head.

Hope this helps,

-j


Re: Requests taking too long if one member of the cluster fails

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi everyone,

Regarding what @Jacob Barrett<ma...@vmware.com> mentioned about the geode-native timeout handling, yes, I am aware of that; we are working on identifying any problems and will open PRs for them.
But my feeling is that this PR https://github.com/apache/geode-native/pull/695 will dramatically improve things there 🙂

Regarding parametrization, we've been testing several parametrizations and things look really promising; there are some minor things to tweak, but we barely notice the impact.
Regarding configuring Geode as a non-primary/secondary (a.k.a. multi-master) distributed system, the thing is I've been reading about it, just out of curiosity, and it turns out it is feasible, e.g. Google Spanner<https://static.googleusercontent.com/media/research.google.com/es//pubs/archive/45855.pdf>
Whether it is something that could be easily implemented in Geode, or whether it is even something we'd want, is a different matter. Still, I think that's a conversation for another forum.

So really, thanks to everyone who helped 🙂
BR,
Mario.


Re: Requests taking too long if one member of the cluster fails

Posted by Anthony Baker <ba...@vmware.com>.
Yes, lowering the member timeout is one approach I've seen taken for applications that demand ultra-low latency.  These workloads need to provide not just low "average" or even p99 latency, but put a hard limit on the max value.

When you do this you need to ensure coherency across all aspects of timeouts (e.g. client read timeouts and retries).  You need to ensure that GC pauses don't cause instability in the cluster.  For example, if a GC pause is greater than the member timeout, you should go back and re-tune your heap settings to drive down GC.  If you are running in a container or a VM you need to ensure sufficient resources so that the GemFire process is never paused.

All this presupposes a stable and performant network infrastructure.
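
For illustration only, here is a minimal geode-native C++ sketch of what "coherent" client-side settings could look like. The locator endpoint, pool name and concrete values are assumptions, not taken from this thread; it is a sketch of the idea, not a recommended configuration:

#include <chrono>

#include <geode/Cache.hpp>
#include <geode/CacheFactory.hpp>
#include <geode/PoolFactory.hpp>
#include <geode/PoolManager.hpp>

using namespace apache::geode::client;

int main() {
  // Hypothetical values: tune these together with the server-side
  // member-timeout so that client read timeouts and retries stay
  // coherent with how quickly the cluster itself detects a dead member.
  auto cache = CacheFactory().create();
  auto pool = cache.getPoolManager()
                  .createFactory()
                  .addLocator("locator-host", 10334)        // assumed locator endpoint
                  .setReadTimeout(std::chrono::seconds(5))  // client read timeout per operation
                  .setRetryAttempts(2)                       // bounded number of retries on failure
                  .create("default-pool");
  return 0;
}

The point is that the pool's read timeout and retry attempts are chosen together with the cluster's member-timeout and GC budget, not independently.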

Anthony


On Nov 21, 2020, at 1:40 PM, Mario Salazar de Torres <ma...@est.tech>> wrote:

So, what I've tried here is to set a really low member-timeout, which results in the server holding the secondary copy becoming the primary owner in under 600 ms. That's quite a huge improvement,
but I wanted to ask you if setting this member-timeout too low might carry unforeseen consequences.


Re: Requests taking too long if one member of the cluster fails

Posted by Mario Salazar de Torres <ma...@est.tech>.
Thanks @John Blum<ma...@vmware.com> for your detailed explanation! It helped me to better understand how redundancy works.

The thing is that all our use cases require a really low response time when performing operations.
Under normal conditions a "put" takes a few milliseconds, but in the described scenario, when a cluster member goes down, it might take up to 30 seconds, sometimes even more.
One thing we've considered is setting a timeout on the client side, but still, upon retries it will face the same issue.

What I've noticed is that requests being proxied won't stop being sent to the failed server until:

  1.  One locator (I'd say the coordinator; please correct me if I am wrong here) does a final health check towards the member and the member-timeout elapses without a successful ping.
  2.  And one of the members holding a secondary copy of the buckets volunteers to become primary owner.
  3.  Server becomes the primary owner.

So, what I've tried here is to set a really low member-timeout, which results in the server holding the secondary copy becoming the primary owner in under 600 ms. That's quite a huge improvement,
but I wanted to ask you if setting this member-timeout too low might carry unforeseen consequences.
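
For reference, a minimal sketch of that configuration, assuming member-timeout is set through gemfire.properties on every server and locator (the 600 ms value below is only illustrative and mirrors the experiment above; the Geode default is 5000 ms):

# gemfire.properties on each server and locator (value in milliseconds)
member-timeout=600

Note that the value has to stay comfortably above worst-case GC pauses; otherwise healthy members may be suspected and removed from the cluster.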

As for long-term solutions to remove or significantly reduce the impact of a server failure, I've been thinking of the following:

  1.  At the last ApacheCon someone asked why "put" is only executed on servers holding a primary copy and why not remove this constraint. My question here, out of ignorance: is this such a crazy idea?
I've seen Redis has something called partial re-sync to solve the split-brain scenarios that would be caused by writing into the secondary while the primary is down.
  2.  Another alternative I've been thinking of is that, upon a connection failure (either a read-ack IO exception or a connection refused), there could be an option for the
server owning a secondary copy that, if enabled, would make it volunteer to become primary owner straight away.

NOTE. I am more familiar with the native client, so please feel free to correct me if I got something wrong, or if any of what I've written is a "bunch of gibberish" with no sense at all 🙂
BTW. I am not sure that the test diagram was attached to the mail, so I've also uploaded it here: https://i.ibb.co/G7n6T0M/Geode-Server-Kill.jpg

Thanks again.
BR,
Mario

Re: Requests taking too long if one member of the cluster fails

Posted by John Blum <jb...@vmware.com>.
DISCLAIMER: I am not knowledgeable about the Native Client (implementation) nor am I commenting specifically on the perf you are seeing, which can have many factors. However, in general...

Given you are performing "put" operations on a PR, then for consistency reasons, Geode is always going to "write" to the primary, on whichever member in the cluster hosts the primary for that particular PR (bucket).  So, if the member containing the primary for the PR goes down, then I would expect it to take more time than a normal "put" when no member goes down. Essentially, the cluster is going to shuffle things around and possibly rebalance the cluster in order to restore redundancy. When rebalancing, having collocated Regions could even further impact timing.

When performing a "put" operation, having redundancy is not going to sustain or improve performance, if that was what you were expecting. In fact, it could even potentially negatively impact performance when a node goes down, depending on the number of nodes and the redundancy level.

Finally, if you were testing "gets" vs "puts", then I'd expect very little if any noticeable impact on performance, since you are using redundant copies, which should fail over in the case of a node failure.

Refer to the following sections in the User Guide for specifics:

1) Rebalancing PR Data: https://geode.apache.org/docs/guide/113/developing/partitioned_regions/rebalancing_pr_data.html (specifically, look at the section on 'How PR Rebalancing Works', which also talks about collocation).

2) Restoring Redundancy in PRs: https://geode.apache.org/docs/guide/113/developing/partitioned_regions/restoring_region_redundancy.html

3) Review your settings for 'member-timeout'. Search for this Geode property here:
https://geode.apache.org/docs/guide/113/reference/topics/gemfire_properties.html


4) Also, be mindful of the PR's 'recovery delay':
https://geode.apache.org/releases/latest/javadoc/org/apache/geode/cache/PartitionAttributes.html#getRecoveryDelay--
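
As a purely illustrative sketch of how item 4, together with the redundancy and collocation settings it interacts with, is typically expressed at region creation time (assuming gfsh, and hypothetical region names and values not taken from this thread):

gfsh> create region --name=region_a --type=PARTITION --redundant-copies=1 --recovery-delay=0 --startup-recovery-delay=0 --colocated-with=anchor

Here recovery-delay controls how long surviving members wait before restoring redundancy after a member departs (the default of -1 means redundancy is not recovered on departure), while startup-recovery-delay applies when a new member joins.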


There may be other server-side (cluster-wide) settings you can configure for node failures as well that I am not recalling off the top of my head.

Hope this helps,

-j
