Posted to dev@geode.apache.org by Mario Salazar de Torres <ma...@est.tech> on 2021/02/26 12:48:30 UTC

Cached regions are not synchronized after restore

Hi,
Over the past few months I have been tackling a series of issues related to inconsistencies in the geode-native client after a cluster restore.
The latest one has to do with subscription notifications and cached regions. The scenario is as follows:

  1.  Start the cluster and the clients. For clarity, let's say we have a native client (NC) and a Java client, the NC being the local client and the Java client the external one.
Also note that the NC has subscription notifications enabled for its pool and that the region has caching enabled (a minimal Java sketch of this setup follows the list).
  2.  Register interest for all the region entries.
  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
  4.  Take a backup of the disk-stores.
  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
  6.  Restore the previous backup.
  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some are not.
Note that none of the entries whose notifications were ignored existed at step 4 of the scenario.
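
For reference, here is a minimal sketch of the subscriber setup in steps 1-2, written against the Java client API since the thread compares both clients (the geode-native client exposes equivalent calls); the locator endpoint and region name are assumptions:

import org.apache.geode.cache.InterestResultPolicy;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;

public class SubscriberSetup {
  public static void main(String[] args) {
    // Pool with subscription enabled, so server-side events are pushed to this client.
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("localhost", 10334)   // assumed locator endpoint
        .setPoolSubscriptionEnabled(true)
        .create();

    // CACHING_PROXY keeps a local copy of every entry the client hears about (caching enabled).
    Region<String, String> region = cache
        .<String, String>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY)
        .create("Region");                    // assumed region name

    // Step 2: register interest in every key (string keys here) so writes from other
    // clients are delivered to this cache as notifications.
    region.registerInterestRegex(".*", InterestResultPolicy.KEYS_VALUES);
  }
}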

The reason the notifications mentioned in step 7 are ignored can be seen in the following log:
"Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"

So, first off, I wanted to ask:

  *   Have any of you encountered this issue before? How did you tackle it?
  *   Is there any mechanism in the Java client to avoid this kind of cache de-synchronization? Note that I did not find any.
  *   Maybe we should add an option to clear locally cached regions after the connection to the cluster is lost, in the same way as is done with the PdxTypeRegistry?
  *   Maybe any other solution having to do with cluster versioning?

BR,
Mario.

Re: Cached regions are not synchronized after restore

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi again,

While looking into how the Java client ensures cache consistency after a cluster restart, I've seen that this is achieved by recovering previously registered interest. However, I noticed the following scenario, whose outcome does not seem correct to me:

  1.  Start a cluster with a REPLICATED region named "Region", with no entries. Note this region does not have persistence enabled.
  2.  Start the Java client A and create the region with the CACHING_PROXY RegionShortcut. Note this region is associated with a pool with subscription enabled (a sketch of this two-client setup follows the list).
  3.  Register interest for KEY_VALUES for the following regex: "entry\-.*"
  4.  Use client A to put an entry whose key is "another-one".
  5.  Start the Java client B and create the region with PROXY RegionShortcut.
  6.  With client B, write entries whose keys are entry-1 .. entry-100.
  7.  Restart the cluster.
  8.  After the cluster is restarted, client A's region cache still contains the entry whose key is "another-one", but it is not on the server, hence causing an inconsistency in the cache.
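
Here is a condensed Java sketch of that two-client scenario; in practice clients A and B run as separate processes, and the locator endpoint and the values written are assumptions:

import org.apache.geode.cache.InterestResultPolicy;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;

public class TwoClientScenario {
  // Run once with argument "A" and once with argument "B" (separate processes).
  public static void main(String[] args) {
    if (args[0].equals("A")) {
      runClientA();
    } else {
      runClientB();
    }
  }

  // Client A: subscription-enabled pool, CACHING_PROXY region, interest in "entry-*" keys only.
  static void runClientA() {
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("localhost", 10334)   // assumed locator endpoint
        .setPoolSubscriptionEnabled(true)
        .create();
    Region<String, String> region = cache
        .<String, String>createClientRegionFactory(ClientRegionShortcut.CACHING_PROXY)
        .create("Region");
    region.registerInterestRegex("entry\\-.*", InterestResultPolicy.KEYS_VALUES); // step 3
    region.put("another-one", "some-value");  // step 4: key not covered by the interest regex
  }

  // Client B: plain PROXY region (no local caching) writing the keys client A listens to.
  static void runClientB() {
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("localhost", 10334)
        .create();
    Region<String, String> region = cache
        .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
        .create("Region");
    for (int i = 1; i <= 100; i++) {
      region.put("entry-" + i, "value-" + i); // step 6
    }
  }
}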

So, what do you think? Why is this mechanism implemented this way in the Java client, instead of like this?

  1.  If the connection to the cluster is lost, "localClear" is called.
  2.  Whenever subscription redundancy is recovered, the keys for which interest is registered are fetched again (a rough sketch of this idea follows).
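
A rough sketch of that proposal in Java-client terms; Region.localClear and registerInterestRegex are real calls, but the two on* hooks are hypothetical placeholders for wherever the client detects the full disconnect and the redundancy recovery:

import org.apache.geode.cache.InterestResultPolicy;
import org.apache.geode.cache.Region;

public class ProposedRecoveryBehavior {
  private final Region<Object, Object> region;

  public ProposedRecoveryBehavior(Region<Object, Object> region) {
    this.region = region;
  }

  // Hypothetical hook: invoked when connectivity to the whole cluster is lost.
  public void onAllConnectionsLost() {
    // Step 1 of the proposal: drop the local copy so stale entries cannot survive a restore.
    region.localClear();
  }

  // Hypothetical hook: invoked once subscription redundancy has been re-established.
  public void onRedundancyRecovered() {
    // Step 2 of the proposal: re-register interest so the interested keys are fetched again.
    region.registerInterestRegex("entry\\-.*", InterestResultPolicy.KEYS_VALUES);
  }
}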

Any clarification is welcome 🙂

Thanks,
Mario
________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Thursday, March 4, 2021 10:34 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi,

Well, the thing is, Mike, you raised an interesting topic here, because I wasn't seeing the behavior of GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION as incorrect, but rather the fact that entries were not cleaned up after subscription redundancy is lost.
But interestingly, I've been asked why, after restarting the cluster, some entry events are allowed and some others are not. I thought it was worth looking into, and now that you mentioned GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION, I had the following idea:

To ease debugging of this issue I've decoded and serialized the notification messages into a JSON-like style.

Event for key "key" in region "/Region" before the cluster restart:
{
  "region": "/Region",
  "key": "key",
  "delta": false,
  "value": "value",
  "version": {
    "flags": "HAS_MEMBER_ID | VERSION_TWO_BYTES",
    "bits": "",
    "dsid": -1,
    "entry_version": 1,
    "region_version": 1,
    "timestamp": 1614444651497,
    "member_id": "745dac94-74cf-4d18-9f7b-aab3877f99b9",
    "prev_member_id": null
  },
  "interest_list": true,
  "cqs": false,
  "event": {
    "member": {
      "id": "Native_bgddbbgfab1",
      "name": "default_GeodeDS",
      "type": "MemberType.LONER_DM_TYPE",
      "flags": "PARTIAL_ID_BIT",
      "address": "::",
      "port": 2,
      "uuid": "00000000-0000-0000-0000-000000000000",
      "weight": 0
    },
    "thread_id": 1,
    "sequence_id": 0,
    "bucket_id": -1,
    "breadcrumb": 0
  }
}

Event for the same key after the cluster restart:
{
  "region": "/Region",
  "key": "key",
  "delta": false,
  "value": "value",
  "version": {
    "flags": "HAS_MEMBER_ID | VERSION_TWO_BYTES",
    "bits": "BITS_RECORDED | BITS_IS_REMOTE_TAG",
    "dsid": -1,
    "entry_version": 1,
    "region_version": 1,
    "timestamp": 1614444953534,
    "member_id": "558a65d2-3299-4404-946b-a081e7591d4e",
    "prev_member_id": null
  },
  "interest_list": true,
  "cqs": false,
  "event": {
    "member": {
      "id": "Native_dgfccgfbad1",
      "name": "default_GeodeDS",
      "type": "MemberType.LONER_DM_TYPE",
      "flags": "PARTIAL_ID_BIT",
      "address": "::",
      "port": 2,
      "uuid": "00000000-0000-0000-0000-000000000000",
      "weight": 0
    },
    "thread_id": 1,
    "sequence_id": 3,
    "bucket_id": -1,
    "breadcrumb": 0
  }
}


GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION is supposed to be raised when 2 events for the same key are received at the same time. But in this case, they are received at two distant points in time (before the restart and after the restart).
Why is this happening? As you can see, both versions have the same "entry_version" (1), so in this case, for both the Java and native clients, what determines whether the exception is thrown is the result of the member ID comparison.
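
To make that concrete, here is a rough Java paraphrase of the check as described here (simplified placeholder types, not the actual Geode source):

// Simplified placeholders, not the real Geode classes.
record VersionTag(int entryVersion, long timestamp, String memberId) {}

final class VersionCheck {
  // Returns true when the incoming event should be applied; throws when the cached
  // tag "wins". With equal entry_versions, the outcome hinges purely on the member
  // ID comparison, which is effectively arbitrary across two cluster incarnations.
  static boolean shouldApply(VersionTag cached, VersionTag incoming) {
    if (incoming.entryVersion() != cached.entryVersion()) {
      return incoming.entryVersion() > cached.entryVersion();
    }
    if (incoming.memberId().compareTo(cached.memberId()) > 0) {
      return true;
    }
    throw new RuntimeException(
        "ConcurrentCacheModification: cache already contains an entry with higher version");
  }
}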

So, this explains why some entries are randomly allowed and some are not under this scenario: depending on which IDs the two members involved in the two events have, the comparison yields a different result. Now here are my questions:

  *   Wouldn't it be better in this case to use the timestamp to determine which one is older/newer whenever both have the same entry_version?
  *   Wouldn't it be inaccurate in this scenario to throw ConcurrentCacheModificationException when the check fails, given that there is no actual concurrent modification?

Thanks for all,
Mario.

________________________________
From: Mike Martell <ma...@vmware.com>
Sent: Wednesday, March 3, 2021 8:10 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi Mario,

Thanks for discovering this difference in behavior between the geode-native clients (C++ and .NET) and the geode Java client regarding the syncing of local caches after a restore operation.

It looks like geode-native has no tests for the exception type you're seeing: GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION

We definitely should write a test around this functionality using our new testing framework and then fix the new failing test.

If you're not familiar with our new test framework, I'd be happy to walk you through it. It's much nicer!

Thanks,
Mike.
________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Monday, March 1, 2021 6:38 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi everyone,

After replicating the test with both clients being the Java implementation, I could verify that what @Dan Smith<ma...@vmware.com> pointed out about the documentation is happening. Every time subscription redundancy is lost, cached entries are erased. This is clearly not happening in geode-native.
So, I will further investigate to bring this functionality into the native client.

Thanks for the help 🙂
BR/
Mario
________________________________
From: Dan Smith <da...@vmware.com>
Sent: Monday, March 1, 2021 5:27 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

The Java client at least should automatically drop its cache when it loses and restores connectivity to all the servers. See https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

-Dan

________________________________
From: Jacob Barrett <ja...@vmware.com>
Sent: Friday, February 26, 2021 9:29 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

For clarification, does the Java client show the same behavior of missing the events after restore, or is this something you are saying is unique to the native clients?


Re: Cached regions are not synchronized after restore

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi,

Thanks for pointing out that part of the documentation.
OK, I understand now that if cache consistency for certain keys is a concern, the user should register interest in those keys.
With that premise my scenario does not happen, so I guess I'll replicate the Java client's behavior in geode-native.
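
For anyone following along, registering interest in only the specific keys whose consistency matters would look something like this in the Java client ("important-key" is just an example name):

import org.apache.geode.cache.InterestResultPolicy;
import org.apache.geode.cache.Region;

class InterestForCriticalKeys {
  // "region" is assumed to be a CACHING_PROXY region on a subscription-enabled pool.
  static void registerCriticalKeys(Region<String, String> region) {
    // Per the doc quoted later in the thread, only entries in the interest list are
    // cleared and reloaded on failover, so these keys stay consistent with the servers.
    region.registerInterest("important-key", InterestResultPolicy.KEYS_VALUES);
  }
}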

Thanks,
Mario.


Re: Cached regions are not synchronized after restore

Posted by Mike Martell <ma...@vmware.com>.
Here's the doc I was referring to (the one Dan mentioned): https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

Specifically, see the section under Server Failover: "...For notify by subscription, it clears and reloads only the entries in the region interest lists."

The confusion for me is whether this also applies to a restore operation, so I'm testing that.

Mike




Re: Cached regions are not synchronized after restore

Posted by Mike Martell <ma...@vmware.com>.
Actually, I believe that is correct. I.e., clients should only drop stuff they've registered interest in. I'll see if I can find the docs for this.

Consider using a local cache to store read-only data --- maybe a product catalog that only gets updated on the first of the month. There would be no need to invalidate that data even if the cluster had to be restored.


Re: Cached regions are not synchronized after restore

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi Mike,

Thanks for looking into it. However, the thing is, not all entries are cleared in the case of the Java client, just the keys for which interest is registered. I think that might not be correct :S

BR,
Mario

Get Outlook for iOS<https://aka.ms/o0ukef>


Re: Cached regions are not synchronized after restore

Posted by Mike Martell <ma...@vmware.com>.
It seems like this is an NC bug. I'll write a test for your scenario, but it seems like the NC is not dropping its local cache (i.e., clearing all local entries) on a server restore. As Dan has indicated, the Java client will clear its local cache when it loses connectivity to all servers, which is implied by a restore operation. I'll report back on my findings.

Mike

________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Thursday, March 4, 2021 1:34 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi,

Well, thing is Mike you raised an interesting topic here, because I wasn't seeing the behavior of GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION as incorrect, rather the fact that entries where not cleaned up after subscription redundancy is lost.
But interestingly, I've been asked that why after restarting the cluster, some entries events are allowed, and some others are not. I thought it was worth looking into it, and now that you mentioned about GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION, I had the following idea:

To ease debugging of this issue I've decoded and serialized the notification messages into a JSON-like style.

Event for key "key" in region "/Region" before the cluster restart:
{
  "region": "/Region",
  "key": "key",
  "delta": false,
  "value": "value",
  "version": {
    "flags": "HAS_MEMBER_ID | VERSION_TWO_BYTES",
    "bits": "",
    "dsid": -1,
    "entry_version": 1,
    "region_version": 1,
    "timestamp": 1614444651497,
    "member_id": "745dac94-74cf-4d18-9f7b-aab3877f99b9",
    "prev_member_id": null
  },
  "interest_list": true,
  "cqs": false,
  "event": {
    "member": {
      "id": "Native_bgddbbgfab1",
      "name": "default_GeodeDS",
      "type": "MemberType.LONER_DM_TYPE",
      "flags": "PARTIAL_ID_BIT",
      "address": "::",
      "port": 2,
      "uuid": "00000000-0000-0000-0000-000000000000",
      "weight": 0
    },
    "thread_id": 1,
    "sequence_id": 0,
    "bucket_id": -1,
    "breadcrumb": 0
  }
}

Event for the same key after the cluster restart:
{
  "region": "/Region",
  "key": "key",
  "delta": false,
  "value": "value",
  "version": {
    "flags": "HAS_MEMBER_ID | VERSION_TWO_BYTES",
    "bits": "BITS_RECORDED | BITS_IS_REMOTE_TAG",
    "dsid": -1,
    "entry_version": 1,
    "region_version": 1,
    "timestamp": 1614444953534,
    "member_id": "558a65d2-3299-4404-946b-a081e7591d4e",
    "prev_member_id": null
  },
  "interest_list": true,
  "cqs": false,
  "event": {
    "member": {
      "id": "Native_dgfccgfbad1",
      "name": "default_GeodeDS",
      "type": "MemberType.LONER_DM_TYPE",
      "flags": "PARTIAL_ID_BIT",
      "address": "::",
      "port": 2,
      "uuid": "00000000-0000-0000-0000-000000000000",
      "weight": 0
    },
    "thread_id": 1,
    "sequence_id": 3,
    "bucket_id": -1,
    "breadcrumb": 0
  }
}


GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION is supposed to be raised if 2 events for the same key are received at the same time. But in this case, they are received at two distant points in time (before the restart and after the restart).
Why is this happening? As you can see both versions have the same "entry_version" (1), so in this case for both Java and native clients what determines whether the exception is thrown is the result of the memberIDs comparison.

So, this explains why, under this scenario, some entries are seemingly at random allowed and some are not: depending on which IDs the two members involved in the events happen to have, the comparison yields a different result. Now here are my questions:

  *   Wouldn't it be better in this case to use the timestamp to determine which event is older/newer whenever both have the same entry_version?
  *   Wouldn't it be inaccurate in this scenario to throw ConcurrentCacheModificationException if the check fails? I mean, given that there is no concurrent modification.

Thanks for all,
Mario.

________________________________
From: Mike Martell <ma...@vmware.com>
Sent: Wednesday, March 3, 2021 8:10 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi Mario,

Thanks for discovering this difference in behavior between the geode-native clients (C++ and .NET) and geode Java client regarding the sync'ing of local caches after a restore operation.

It looks like geode-native has no tests for the exception type you're seeing: GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION

We definitely should write a test around this functionality using our new testing framework and then fix the new failing test.

If you're not familiar with our new test framework, I'd be happy to walk you through it. It's much nicer!

Thanks,
Mike.
________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Monday, March 1, 2021 6:38 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi everyone,

After replicating the test with both clients being the Java implementation, I could verify that what @Dan Smith<ma...@vmware.com> pointed out in the documentation is happening: every time subscription redundancy is lost, cached entries are erased. This is clearly not happening in geode-native.
So, I will further investigate to bring this functionality into the native client.

Thanks for the help 🙂
BR/
Mario
________________________________
From: Dan Smith <da...@vmware.com>
Sent: Monday, March 1, 2021 5:27 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

The java client at least should automatically drop its cache when it loses and restores connectivity to all the servers. See https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

-Dan

________________________________
From: Jacob Barrett <ja...@vmware.com>
Sent: Friday, February 26, 2021 9:29 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

For clarification, does the Java client show the same behavior in missing the events after restore, or is this something you are saying is unique to the native clients?

> On Feb 26, 2021, at 4:48 AM, Mario Salazar de Torres <ma...@est.tech> wrote:
>
> Hi,
> These months I have been tackling a series of issues having to do with several inconsistencies in the geode-native client after a cluster restore.
> The later one has to do with subscription notification and cached regions. The scenario is as follows:
>
>  1.  Start the cluster and the clients. For clarification purposes let's say we have a NC and a Java client. Being the NC the local client, and Java client the external one.
> Also note that NC client has subscription notification enabled for its pool and the region has caching enabled.
>  2.  Register interest for all the region entries.
>  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
>  4.  Take a backup of the disk-stores.
>  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
>  6.  Restore the previous backup.
>  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some others not.
> Note that all the entries which notifications were ignored did not exist in the step 4 of the scenario.
>
> The reason why notifications mentioned in step 7 are ignored is due to the following log:
> "Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"
>
> So, first of, I wanted to ask:
>
>  *   Any of you have encountered this issue before? How did you tackle it?
>  *   Is there any mechanism in the Java client to avoid this kind of issues with caching de-sync? Note that I did not found any
>  *   Maybe we should add an option to clear local cached regions after connection is lost towards the cluster in the same way is done with PdxTypeRegistry?
>  *   Maybe any other solution having to do with cluster versioning?
>
> BR,
> Mario.


Re: Cached regions are not synchronized after restore

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi,

Well, the thing is, Mike, you raised an interesting topic here, because I wasn't seeing the behavior of GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION as incorrect, but rather the fact that entries were not cleaned up after subscription redundancy is lost.
But interestingly, I've been asked why, after restarting the cluster, some entry events are allowed and some others are not. I thought it was worth looking into, and now that you mentioned GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION, I had the following idea:

To ease debugging of this issue I've decoded and serialized the notification messages into a JSON-like style.

Event for key "key" in region "/Region" before the cluster restart:
{
  "region": "/Region",
  "key": "key",
  "delta": false,
  "value": "value",
  "version": {
    "flags": "HAS_MEMBER_ID | VERSION_TWO_BYTES",
    "bits": "",
    "dsid": -1,
    "entry_version": 1,
    "region_version": 1,
    "timestamp": 1614444651497,
    "member_id": "745dac94-74cf-4d18-9f7b-aab3877f99b9",
    "prev_member_id": null
  },
  "interest_list": true,
  "cqs": false,
  "event": {
    "member": {
      "id": "Native_bgddbbgfab1",
      "name": "default_GeodeDS",
      "type": "MemberType.LONER_DM_TYPE",
      "flags": "PARTIAL_ID_BIT",
      "address": "::",
      "port": 2,
      "uuid": "00000000-0000-0000-0000-000000000000",
      "weight": 0
    },
    "thread_id": 1,
    "sequence_id": 0,
    "bucket_id": -1,
    "breadcrumb": 0
  }
}

Event for the same key after the cluster restart:
{
  "region": "/Region",
  "key": "key",
  "delta": false,
  "value": "value",
  "version": {
    "flags": "HAS_MEMBER_ID | VERSION_TWO_BYTES",
    "bits": "BITS_RECORDED | BITS_IS_REMOTE_TAG",
    "dsid": -1,
    "entry_version": 1,
    "region_version": 1,
    "timestamp": 1614444953534,
    "member_id": "558a65d2-3299-4404-946b-a081e7591d4e",
    "prev_member_id": null
  },
  "interest_list": true,
  "cqs": false,
  "event": {
    "member": {
      "id": "Native_dgfccgfbad1",
      "name": "default_GeodeDS",
      "type": "MemberType.LONER_DM_TYPE",
      "flags": "PARTIAL_ID_BIT",
      "address": "::",
      "port": 2,
      "uuid": "00000000-0000-0000-0000-000000000000",
      "weight": 0
    },
    "thread_id": 1,
    "sequence_id": 3,
    "bucket_id": -1,
    "breadcrumb": 0
  }
}


GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION is supposed to be raised if 2 events for the same key are received at the same time. But in this case, they are received at two distant points in time (before the restart and after the restart).
Why is this happening? As you can see both versions have the same "entry_version" (1), so in this case for both Java and native clients what determines whether the exception is thrown is the result of the memberIDs comparison.

So, this explains why, under this scenario, some entries are seemingly at random allowed and some are not: depending on which IDs the two members involved in the events happen to have, the comparison yields a different result (a small sketch of this comparison follows the questions below). Now here are my questions:

  *   Wouldn't it be better in this case to use the timestamp to determine which event is older/newer whenever both have the same entry_version?
  *   Wouldn't it be inaccurate in this scenario to throw ConcurrentCacheModificationException if the check fails? I mean, given that there is no concurrent modification.
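
To make that comparison concrete, here is a small sketch of it as I understand it. This is not Geode's actual VersionTag code; the method and field names are assumptions of mine, the values come from the two decoded events above, and the commented-out line is the timestamp-based tie-break I'm suggesting:

// Illustrative sketch only; not the real client implementation.
final class VersionCheckSketch {

  // Returns true if the incoming event should replace the cached entry.
  static boolean acceptIncoming(int cachedEntryVersion, long cachedTimestamp, String cachedMemberId,
                                int incomingEntryVersion, long incomingTimestamp, String incomingMemberId) {
    if (incomingEntryVersion != cachedEntryVersion) {
      // Different entry versions: the higher one wins.
      return incomingEntryVersion > cachedEntryVersion;
    }
    // Equal entry versions (the post-restart case above): the member ID comparison
    // breaks the tie, so the outcome depends on which IDs the two members happen to have.
    return incomingMemberId.compareTo(cachedMemberId) > 0;
    // The timestamp-based tie-break suggested above would instead be:
    // return incomingTimestamp > cachedTimestamp;
  }

  public static void main(String[] args) {
    // The two events for key "key" decoded above: same entry_version, different member IDs.
    boolean applied = acceptIncoming(
        1, 1614444651497L, "745dac94-74cf-4d18-9f7b-aab3877f99b9",
        1, 1614444953534L, "558a65d2-3299-4404-946b-a081e7591d4e");
    System.out.println("post-restart event applied: " + applied);
  }
}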

Thanks for all,
Mario.

________________________________
From: Mike Martell <ma...@vmware.com>
Sent: Wednesday, March 3, 2021 8:10 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi Mario,

Thanks for discovering this difference in behavior between the geode-native clients (C++ and .NET) and geode Java client regarding the sync'ing of local caches after a restore operation.

It looks like geode-native has no tests for the exception type you're seeing: GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION

We definitely should write a test around this functionality using our new testing framework and then fix the new failing test.

If you're not familiar with our new test framework, I'd be happy to walk you through it. It's much nicer!

Thanks,
Mike.
________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Monday, March 1, 2021 6:38 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi everyone,

After replicating the test with both clients being the Java implementation, I could verify that what @Dan Smith<ma...@vmware.com> pointed out in the documentation is happening: every time subscription redundancy is lost, cached entries are erased. This is clearly not happening in geode-native.
So, I will further investigate to bring this functionality into the native client.

Thanks for the help 🙂
BR/
Mario
________________________________
From: Dan Smith <da...@vmware.com>
Sent: Monday, March 1, 2021 5:27 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

The java client at least should automatically drop its cache when it loses and restores connectivity to all the servers. See https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

-Dan

________________________________
From: Jacob Barrett <ja...@vmware.com>
Sent: Friday, February 26, 2021 9:29 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

For clarification, does the Java client show the same behavior in missing the events after restore, or is this something you are saying is unique to the native clients?

> On Feb 26, 2021, at 4:48 AM, Mario Salazar de Torres <ma...@est.tech> wrote:
>
> Hi,
> These months I have been tackling a series of issues having to do with several inconsistencies in the geode-native client after a cluster restore.
> The later one has to do with subscription notification and cached regions. The scenario is as follows:
>
>  1.  Start the cluster and the clients. For clarification purposes let's say we have a NC and a Java client. Being the NC the local client, and Java client the external one.
> Also note that NC client has subscription notification enabled for its pool and the region has caching enabled.
>  2.  Register interest for all the region entries.
>  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
>  4.  Take a backup of the disk-stores.
>  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
>  6.  Restore the previous backup.
>  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some others not.
> Note that all the entries which notifications were ignored did not exist in the step 4 of the scenario.
>
> The reason why notifications mentioned in step 7 are ignored is due to the following log:
> "Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"
>
> So, first of, I wanted to ask:
>
>  *   Any of you have encountered this issue before? How did you tackle it?
>  *   Is there any mechanism in the Java client to avoid this kind of issues with caching de-sync? Note that I did not found any
>  *   Maybe we should add an option to clear local cached regions after connection is lost towards the cluster in the same way is done with PdxTypeRegistry?
>  *   Maybe any other solution having to do with cluster versioning?
>
> BR,
> Mario.


Re: Cached regions are not synchronized after restore

Posted by Mike Martell <ma...@vmware.com>.
Hi Mario,

Thanks for discovering this difference in behavior between the geode-native clients (C++ and .NET) and geode Java client regarding the sync'ing of local caches after a restore operation.

It looks like geode-native has no tests for the exception type you're seeing: GF_CACHE_CONCURRENT_MODIFICATION_EXCEPTION

We definitely should write a test around this functionality using our new testing framework and then fix the new failing test.

If you're not familiar with our new test framework, I'd be happy to walk you through it. It's much nicer!

Thanks,
Mike.
________________________________
From: Mario Salazar de Torres <ma...@est.tech>
Sent: Monday, March 1, 2021 6:38 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

Hi everyone,

After replicating the test with both clients being the Java implementation, I could verify that what @Dan Smith<ma...@vmware.com> pointed out in the documentation is happening: every time subscription redundancy is lost, cached entries are erased. This is clearly not happening in geode-native.
So, I will further investigate to bring this functionality into the native client.

Thanks for the help 🙂
BR/
Mario
________________________________
From: Dan Smith <da...@vmware.com>
Sent: Monday, March 1, 2021 5:27 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

The java client at least should automatically drop its cache when it loses and restores connectivity to all the servers. See https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

-Dan

________________________________
From: Jacob Barrett <ja...@vmware.com>
Sent: Friday, February 26, 2021 9:29 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

For clarification, does the Java client show the same behavior in missing the events after restore, or is this something you are saying is unique to the native clients?

> On Feb 26, 2021, at 4:48 AM, Mario Salazar de Torres <ma...@est.tech> wrote:
>
> Hi,
> These months I have been tackling a series of issues having to do with several inconsistencies in the geode-native client after a cluster restore.
> The later one has to do with subscription notification and cached regions. The scenario is as follows:
>
>  1.  Start the cluster and the clients. For clarification purposes let's say we have a NC and a Java client. Being the NC the local client, and Java client the external one.
> Also note that NC client has subscription notification enabled for its pool and the region has caching enabled.
>  2.  Register interest for all the region entries.
>  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
>  4.  Take a backup of the disk-stores.
>  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
>  6.  Restore the previous backup.
>  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some others not.
> Note that all the entries which notifications were ignored did not exist in the step 4 of the scenario.
>
> The reason why notifications mentioned in step 7 are ignored is due to the following log:
> "Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"
>
> So, first of, I wanted to ask:
>
>  *   Any of you have encountered this issue before? How did you tackle it?
>  *   Is there any mechanism in the Java client to avoid this kind of issues with caching de-sync? Note that I did not found any
>  *   Maybe we should add an option to clear local cached regions after connection is lost towards the cluster in the same way is done with PdxTypeRegistry?
>  *   Maybe any other solution having to do with cluster versioning?
>
> BR,
> Mario.


Re: Cached regions are not synchronized after restore

Posted by Mario Salazar de Torres <ma...@est.tech>.
Hi everyone,

After replicating the test with both clients being the Java implementation, I could verify that what @Dan Smith<ma...@vmware.com> pointed out in the documentation is happening: every time subscription redundancy is lost, cached entries are erased. This is clearly not happening in geode-native.
So, I will further investigate to bring this functionality into the native client.

Thanks for the help 🙂
BR/
Mario
________________________________
From: Dan Smith <da...@vmware.com>
Sent: Monday, March 1, 2021 5:27 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

The java client at least should automatically drop its cache when it loses and restores connectivity to all the servers. See https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

-Dan

________________________________
From: Jacob Barrett <ja...@vmware.com>
Sent: Friday, February 26, 2021 9:29 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

For clarification, does the Java client show the same behavior in missing the events after restore, or is this something you are saying is unique to the native clients?

> On Feb 26, 2021, at 4:48 AM, Mario Salazar de Torres <ma...@est.tech> wrote:
>
> Hi,
> These months I have been tackling a series of issues having to do with several inconsistencies in the geode-native client after a cluster restore.
> The later one has to do with subscription notification and cached regions. The scenario is as follows:
>
>  1.  Start the cluster and the clients. For clarification purposes let's say we have a NC and a Java client. Being the NC the local client, and Java client the external one.
> Also note that NC client has subscription notification enabled for its pool and the region has caching enabled.
>  2.  Register interest for all the region entries.
>  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
>  4.  Take a backup of the disk-stores.
>  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
>  6.  Restore the previous backup.
>  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some others not.
> Note that all the entries which notifications were ignored did not exist in the step 4 of the scenario.
>
> The reason why notifications mentioned in step 7 are ignored is due to the following log:
> "Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"
>
> So, first of, I wanted to ask:
>
>  *   Any of you have encountered this issue before? How did you tackle it?
>  *   Is there any mechanism in the Java client to avoid this kind of issues with caching de-sync? Note that I did not found any
>  *   Maybe we should add an option to clear local cached regions after connection is lost towards the cluster in the same way is done with PdxTypeRegistry?
>  *   Maybe any other solution having to do with cluster versioning?
>
> BR,
> Mario.


Re: Cached regions are not synchronized after restore

Posted by Dan Smith <da...@vmware.com>.
The java client at least should automatically drop its cache when it loses and restores connectivity to all the servers. See https://geode.apache.org/docs/guide/12/developing/events/how_client_server_distribution_works.html#how_client_server_distribution_works__section_928BB60066414BEB9FAA7FB3120334A3

-Dan

________________________________
From: Jacob Barrett <ja...@vmware.com>
Sent: Friday, February 26, 2021 9:29 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Cached regions are not synchronized after restore

For clarification, does the Java client show the same behavior in missing the events after restore, or is this something you are saying is unique to the native clients?

> On Feb 26, 2021, at 4:48 AM, Mario Salazar de Torres <ma...@est.tech> wrote:
>
> Hi,
> These months I have been tackling a series of issues having to do with several inconsistencies in the geode-native client after a cluster restore.
> The later one has to do with subscription notification and cached regions. The scenario is as follows:
>
>  1.  Start the cluster and the clients. For clarification purposes let's say we have a NC and a Java client. Being the NC the local client, and Java client the external one.
> Also note that NC client has subscription notification enabled for its pool and the region has caching enabled.
>  2.  Register interest for all the region entries.
>  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
>  4.  Take a backup of the disk-stores.
>  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
>  6.  Restore the previous backup.
>  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some others not.
> Note that all the entries which notifications were ignored did not exist in the step 4 of the scenario.
>
> The reason why notifications mentioned in step 7 are ignored is due to the following log:
> "Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"
>
> So, first of, I wanted to ask:
>
>  *   Any of you have encountered this issue before? How did you tackle it?
>  *   Is there any mechanism in the Java client to avoid this kind of issues with caching de-sync? Note that I did not found any
>  *   Maybe we should add an option to clear local cached regions after connection is lost towards the cluster in the same way is done with PdxTypeRegistry?
>  *   Maybe any other solution having to do with cluster versioning?
>
> BR,
> Mario.


Re: Cached regions are not synchronized after restore

Posted by Jacob Barrett <ja...@vmware.com>.
For clarification, does the Java client show the same behavior in missing the events after restore, or is this something you are saying is unique to the native clients?

> On Feb 26, 2021, at 4:48 AM, Mario Salazar de Torres <ma...@est.tech> wrote:
> 
> Hi,
> These months I have been tackling a series of issues having to do with several inconsistencies in the geode-native client after a cluster restore.
> The later one has to do with subscription notification and cached regions. The scenario is as follows:
> 
>  1.  Start the cluster and the clients. For clarification purposes let's say we have a NC and a Java client. Being the NC the local client, and Java client the external one.
> Also note that NC client has subscription notification enabled for its pool and the region has caching enabled.
>  2.  Register interest for all the region entries.
>  3.  Write some entries into the region from the Java client. Notifications are received in the NC client and entries cached.
>  4.  Take a backup of the disk-stores.
>  5.  Write/modify some entries with the Java client. Notifications are received in the NC client and entries cached.
>  6.  Restore the previous backup.
>  7.  Write/modify some entries with the Java client. Some of the notifications are discarded and some others not.
> Note that all the entries which notifications were ignored did not exist in the step 4 of the scenario.
> 
> The reason why notifications mentioned in step 7 are ignored is due to the following log:
> "Region::localUpdate: updateNoThrow<Region::put> for key [<redacted>] failed because the cache already contains an entry with higher version. The cache listener will not be invoked"
> 
> So, first of, I wanted to ask:
> 
>  *   Any of you have encountered this issue before? How did you tackle it?
>  *   Is there any mechanism in the Java client to avoid this kind of issues with caching de-sync? Note that I did not found any
>  *   Maybe we should add an option to clear local cached regions after connection is lost towards the cluster in the same way is done with PdxTypeRegistry?
>  *   Maybe any other solution having to do with cluster versioning?
> 
> BR,
> Mario.