Posted to dev@geode.apache.org by Mario Kevo <ma...@est.tech> on 2020/09/11 16:46:18 UTC

Colocated regions missing some buckets after restart

Hi geode-dev,

We have a system with two servers and a few regions. One region is persistent; the others are not, but they are colocated with the persistent region.
After the servers restart, we can see that some regions do not have any buckets.
gfsh>show metrics --member=server-1 --region=/region1 --categories=partition
Metrics for region:/region1 On Member server-1


Category  |            Metric            | Value
--------- | ---------------------------- | -----
partition | putLocalRate                 | 0.0
          | putRemoteRate                | 0.0
          | putRemoteLatency             | 0
          | putRemoteAvgLatency          | 0
          | bucketCount                  | 0
          | primaryBucketCount           | 0
          | configuredRedundancy         | 1
          | actualRedundancy             | 0
          | numBucketsWithoutRedundancy  | 113
          | totalBucketSize              | 0

gfsh>show metrics --member=server-0 --region=/region1 --categories=partition
Metrics for region:/region1 On Member server-0

Category  |            Metric            | Value
--------- | ---------------------------- | -----
partition | putLocalRate                 | 0.0
          | putRemoteRate                | 0.0
          | putRemoteLatency             | 0
          | putRemoteAvgLatency          | 0
          | bucketCount                  | 113
          | primaryBucketCount           | 56
          | configuredRedundancy         | 1
          | actualRedundancy             | 0
          | numBucketsWithoutRedundancy  | 113
          | totalBucketSize              | 0


The persistent region is OK, but some of the colocated regions have this issue. We also waited for some time, but nothing changed.

Does anyone have an idea what is causing this issue?
The issue can be easily reproduced with two locators, two servers, one persistent region and a few non-persistent regions colocated with the persistent one.
After restarting both servers, run the show metrics command and you will see this issue for some regions.
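
The symptom in the two tables above (bucketCount 0 on one member while numBucketsWithoutRedundancy is 113) can be checked mechanically. The following is a minimal sketch, assuming the "show metrics" output has been captured as text; PartitionMetricsCheck and its methods are hypothetical helpers and are not part of Geode or gfsh.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionMetricsCheck {

    // Parses "Category | Metric | Value" rows into a metric-name -> value map.
    // Header and separator rows are skipped because their value is not numeric.
    static Map<String, Double> parse(String gfshOutput) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        for (String line : gfshOutput.split("\\R")) {
            String[] cols = line.split("\\|");
            if (cols.length != 3) {
                continue; // not a table row
            }
            try {
                metrics.put(cols[1].trim(), Double.parseDouble(cols[2].trim()));
            } catch (NumberFormatException e) {
                // header ("Value") or separator ("-----") row: ignore
            }
        }
        return metrics;
    }

    // True when the member hosts no buckets although the region reports
    // buckets without redundancy (the symptom described in this thread).
    static boolean looksLikeMissingBuckets(Map<String, Double> m) {
        return m.getOrDefault("bucketCount", 0.0) == 0.0
                && m.getOrDefault("numBucketsWithoutRedundancy", 0.0) > 0.0;
    }

    public static void main(String[] args) {
        String server1 = String.join("\n",
                "Category  |            Metric            | Value",
                "--------- | ---------------------------- | -----",
                "partition | putLocalRate                 | 0.0",
                "          | bucketCount                  | 0",
                "          | primaryBucketCount           | 0",
                "          | numBucketsWithoutRedundancy  | 113");
        System.out.println(looksLikeMissingBuckets(parse(server1))); // prints "true"
    }
}
```

Running the same check against the server-0 table would report false, since that member hosts all 113 buckets.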

BR,
Mario


Re: Colocated regions missing some buckets after restart

Posted by Donal Evans <do...@vmware.com>.
Hi Mario,

I've tried using 12 colocated regions, starting the servers within 0.2 seconds of each other (according to the locator logs) and ensuring that the order they're started in is the same as the order they were shut down in, but I'm still unable to reproduce this issue. Is there anything else that I might be missing or doing differently from you?

Donal
________________________________
From: Mario Kevo <ma...@est.tech>
Sent: Monday, September 28, 2020 10:49 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Hi Donal,

Sometimes you need to restart two or three times, but mostly it reproduces on the first restart.
start locator --name=locator1 --port=10334
start locator --name=locator2 --port=10335 --locators=localhost[10334]
start server --name=server1 --locators=127.0.0.1[10334],127.0.0.1[10335] --server-port=40404
start server --name=server2 --locators=127.0.0.1[10334],127.0.0.1[10335] --server-port=40405
I'm putting 10000 entries, but you can use a lower value.

You need to be really quick with the commands. Here is an example from my locator log:
[info 2020/09/29 07:41:52.060 CEST <unicast receiver,mkevo-XPS-15-9570-16115> tid=0x1d] Received a join request from 192.168.0.145(server4:22852):41002
[info 2020/09/29 07:41:52.406 CEST <unicast receiver,mkevo-XPS-15-9570-16115> tid=0x1d] Received a join request from 192.168.0.145(server3:22879):41003

I prepare the start server commands in two terminals, so I can start them at almost the same time.
Sorry, I forgot to mention that you need to note which server was stopped first and start it first (the issue was first reproduced on Kubernetes, and that is how the pods restart the servers).
Also, if you are not able to reproduce the issue, try using 10 or more colocated regions.

BR,
Mario

________________________________
From: Donal Evans <do...@vmware.com>
Sent: September 28, 2020 23:48
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Hi Mario,

I tried to reproduce the issue using the steps you describe, but I wasn't able to. After restarting the servers, all regions have the expected 113 buckets, and the server startup process is not noticeably slower. I have a few questions that might help understand why I'm unable to reproduce this:

  *   Do you see this behaviour 100% of the time with these steps, or is it still only on some restarts that it shows up?
  *   Could you describe in more detail how exactly you're starting the locators/servers? I'm just using the gfsh "start locator" and "start server" commands, only specifying ports, with no other settings, so if you're doing anything different that may be a factor.
  *   How many entries are you putting into the region, and does the issue still reproduce if you use fewer entries? I'm using 10000 entries as described in your earlier email.
  *   How quick do you have to be when restarting the servers in the two terminals at the same time? I'm currently just manually clicking between them and executing the two start server commands within a second of each other, but if that's not fast enough then I should probably be using a script or something.

Hopefully if we can understand what's different between what I'm doing and what you're doing then it will help us understand exactly what's going wrong.

- Donal
________________________________
From: Mario Kevo <ma...@est.tech>
Sent: Monday, September 28, 2020 6:23 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Hi all,

After more investigation, I found that for some buckets there is a problem determining which server is primary.
While doing getPrimary, if the existing primary is null, it waits for a new primary and after some time returns null.

From what I found, while doing setHosting (grabBucket[PartitionedRegionDataStore.java] -> grabFreeBucket[PartitionedRegionDataStore.java] -> setHosting[ProxyBucketRegion.java] -> setHosting[BucketAdvisor.java]), it volunteers for primary and calls sendProfileUpdate to all other servers.
There it calls BucketProfileUpdateMessage.send and gets stuck, as it cannot get a response from the other members.
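
The "wait for a new primary, then return null" behaviour described above can be illustrated with a small standalone sketch. This is not Geode's BucketAdvisor code; PrimaryWaitSketch, setPrimary and waitForPrimary are hypothetical names, and the sketch only assumes that the caller blocks until either a primary is announced or a timeout expires.

```java
public class PrimaryWaitSketch {

    private String primary; // null until some member is announced as primary

    // Called when a member becomes primary (in the thread above, this is the
    // announcement that never arrives because the profile update gets no reply).
    synchronized void setPrimary(String member) {
        primary = member;
        notifyAll();
    }

    // Blocks until a primary is known or the timeout expires.
    // Returns null on timeout, mirroring the behaviour described for getPrimary.
    synchronized String waitForPrimary(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (primary == null) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return null; // gave up waiting
            }
            try {
                // wait(0) would block forever, so wait at least 1 ms
                wait(Math.max(1, remainingNanos / 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return primary;
    }

    public static void main(String[] args) {
        PrimaryWaitSketch bucket = new PrimaryWaitSketch();
        System.out.println(bucket.waitForPrimary(100)); // prints "null": nobody volunteered in time
        bucket.setPrimary("server1");
        System.out.println(bucket.waitForPrimary(100)); // prints "server1"
    }
}
```

If the announcement is never delivered, every caller times out in the same way, which would be consistent with the repeated "waiting for a primary" warnings reported earlier in the thread.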

A ticket is opened on GEODE: https://issues.apache.org/jira/browse/GEODE-8546
How to reproduce the issue:

  1.   Start two locators and two servers.
  2.   Create a PARTITION_REDUNDANT_PERSISTENT region with redundant-copies=1.
  3.   Create a few PARTITION_REDUNDANT regions (I used six regions) colocated with the persistent region, with redundant-copies=1.
  4.   Put some entries.
  5.   Restart the servers (you can simply run "kill -15 <server_pids>" and then start both of them at the same time from two terminals).
  6.   Server startup will take a while to finish, and for the last region bucketCount will be zero on one member.

If someone with more experience with bucket initialization has time to help me with this, I would appreciate it.
For any more info, please contact me.

BR,
Mario


________________________________
From: Mario Kevo <ma...@est.tech>
Sent: September 17, 2020 15:00
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Hi Anil,

The thread dump is attached.
For now we have found a difference between the server logs: the server that has this problem logs "Colocation is incomplete".
So it seems that colocation is not finished for this region on this member. This part of the code can be found at this link: https://github.com/apache/geode/blob/f2ccbc8ae860fc018baba7cc8de7b5e01a22c606/geode-core/src/main/java/org/apache/geode/internal/cache/PartitionedRegionDataStore.java#L660
We will continue the investigation and try to find what causes the issue.

BR,
Mario

________________________________
From: Anilkumar Gingade <ag...@vmware.com>
Sent: September 16, 2020 16:55
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Mario,

Take a thread dump a couple of times at an interval of a minute... See if you can find threads stuck in region creation... This will show if there is any lock contention.

-Anil.


On 9/16/20, 6:29 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi Anil,

    From the server logs we see that some threads are stuck, and on server2 we continuously get the following message (a bucket is missing on server2 for the DfSessions region):
    [warn 2020/09/15 14:25:39.852 CEST <PartitionedRegion Message Processor2> tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor /__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners []


    And on the other server, server1:
    [warn 2020/09/15 14:25:40.852 CEST <ResourceManagerRecoveryThread 1> tid=0xdf] 15 seconds have elapsed while waiting for replies: <FetchPartitionDetailsMessage$FetchPartitionDetailsResponse 3808 waiting for 1 replies from [192.168.0.145(server2:28054)<v6>:41003]> on 192.168.0.145(server1:28031)<v5>:41002 whose current membership list is: [[192.168.0.145(locator1:27244:locator)<ec><v0>:41000, 192.168.0.145(locator2:27343:locator)<ec><v1>:41001, 192.168.0.145(server1:28031)<v5>:41002, 192.168.0.145(server2:28054)<v6>:41003]]

    [warn 2020/09/15 14:27:20.200 CEST <ThreadsMonitor> tid=0x11] Thread 223 (0xdf) is stuck

    [warn 2020/09/15 14:27:20.202 CEST <ThreadsMonitor> tid=0x11] Thread <223> (0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for <115.361 seconds> and number of thread monitor iteration <1>
    Thread Name <ResourceManagerRecoveryThread 1> state <TIMED_WAITING>
    ...
    It seems that this is not a problem with stats.
    We suspect that the problem is with some lock, but we need to investigate it a bit more.

    BR,
    Mario



    ________________________________
    From: Anilkumar Gingade <ag...@vmware.com>
    Sent: September 15, 2020 16:36
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: Colocated regions missing some buckets after restart

    Mario,

    I doubt this has anything to do with the client connections. If it is connection-related, it should be the server/member to server/member connections; in that case the unresponsive member is kicked out of the cluster.

    The recommended configuration is to have persistent regions for both parent and colocated regions (and replicated regions)...

    There could be issues in the stats too... Can you try executing test/validation code on the server side to dump/list the primary and secondary buckets?
    You can do that using helper methods: pr.getDataStore().getAllLocalPrimaryBucketIds();

    -Anil

    On 9/14/20, 12:25 AM, "Mario Kevo" <ma...@est.tech> wrote:

        Hi,


        This problem is usually seen on only one server. The other servers' metrics and bucket counts look fine. Another symptom of this issue is that the max-connections limit is reached on the problematic server if we have a client that tries to reconnect after the server restart. Clients simply get no response from the server, so they try to close the connection, but the connection close is not acknowledged by the server. On the server side we see that the connections are in CLOSE-WAIT state with packets in the socket receive queue. It’s as if the servers just stopped processing packets on the sockets while waiting for a member with the primary bucket.

        So in short, each new client connection is “unresponsive”. The client tries to close it and open a new one, but the socket doesn’t get closed on the server side and the connection is left “hanging” on the server. Clients will keep doing this until max-connections is reached on the servers. This is why we would be unable to add any data to the regions. But IMHO it’s really not dependent on adding data, since this issue happens occasionally (1 out of ~4 restarts) and only on one server.

        The initial problem was observed with a persistent region A (with 10000 key-value pairs inserted) and a non-persistent region B colocated with region A. We did some tests with both regions being persistent. We haven’t observed the same issue yet (although we did only a few restarts), but we observed something that also looks quite worrying. Both servers start up without reporting issues in the logs. But, looking at the server metrics, one server has wrong information about “bucketCount” and is missing primary buckets. E.g.:


        First server:

        Partition | putLocalRate        | 0.0
                  | putRemoteRate       | 0.0
                  | putRemoteLatency    | 0
                  | putRemoteAvgLatency | 0
                  | bucketCount         | 113
                  | primaryBucketCount  | 57

        Second server:

        Partition | putLocalRate        | 0.0
                  | putRemoteRate       | 0.0
                  | putRemoteLatency    | 0
                  | putRemoteAvgLatency | 0
                  | bucketCount         | 111
                  | primaryBucketCount  | 55


        So we are missing a primary bucket without being aware of the issue.
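
The arithmetic behind that statement can be written out as a short sketch. It assumes the region has 113 buckets in total (the count reported elsewhere in this thread), redundant-copies=1 and exactly two servers, so every bucket should have one primary and two copies cluster-wide. BucketAccounting and its methods are hypothetical names used only for illustration.

```java
public class BucketAccounting {

    // Every bucket should have exactly one primary somewhere in the cluster.
    static int missingPrimaries(int totalBuckets, int... primaryCounts) {
        int sum = 0;
        for (int count : primaryCounts) {
            sum += count;
        }
        return totalBuckets - sum;
    }

    // Every bucket should exist (redundantCopies + 1) times across all members.
    static int missingCopies(int totalBuckets, int redundantCopies, int... bucketCounts) {
        int sum = 0;
        for (int count : bucketCounts) {
            sum += count;
        }
        return totalBuckets * (redundantCopies + 1) - sum;
    }

    public static void main(String[] args) {
        // Values taken from the two metric listings above.
        System.out.println(missingPrimaries(113, 57, 55));    // prints 1
        System.out.println(missingCopies(113, 1, 113, 111));  // prints 2
    }
}
```

With these numbers, one bucket has no primary and two bucket copies are absent on the second server, even though neither server logged a problem.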

        BR,
        Mario

        ________________________________
        From: Anilkumar Gingade <ag...@vmware.com>
        Sent: September 11, 2020 20:34
        To: dev@geode.apache.org <de...@geode.apache.org>
        Subject: Re: Colocated regions missing some buckets after restart

        Are you seeing no buckets for persistent regions or for non-persistent ones? The buckets are created dynamically, when data is added to the corresponding buckets...
        When a server is restarted, in the case of in-memory regions the data is not there, so the bucket region may not have been created (my suspicion).
        Can you try adding data and see if the colocated bucket region gets created on the respective nodes/servers?

        -Anil.


Re: Colocated regions missing some buckets after restart

Posted by Donal Evans <do...@vmware.com>.
Hi Mario,

I tried to reproduce the issue using the steps you describe, but I wasn't able to. After restarting the servers, all regions have the expected 113 buckets, and the server startup process is not noticeably slower. I have a few questions that might help understand why I'm unable to reproduce this:

  *   Do you see this behaviour 100% of the time with these steps, or is it still only on some restarts that it shows up?
  *   Could you describe in more detail how exactly you're starting the locators/servers? I'm just using the gfsh "start locator" and "start server" commands, only specifying ports, with no other settings, so if you're doing anything different that may be a factor.
  *   How many entries are you putting into the region, and does the issue still reproduce if you use fewer entries? I'm using 10000 entries as described in your earlier email.
  *   How quickly do you have to restart the servers in the two terminals for them to count as starting at the same time? I'm currently just manually clicking between them and executing the two start server commands within a second of each other, but if that's not fast enough then I should probably be using a script or something.

Hopefully if we can understand what's different between what I'm doing and what you're doing then it will help us understand exactly what's going wrong.

- Donal
________________________________
From: Mario Kevo <ma...@est.tech>
Sent: Monday, September 28, 2020 6:23 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Hi all,

After more investigation I found that for some buckets there is a problem determining which server is primary.
During getPrimary, if the existing primary is null, it waits for a new primary and after some time returns null.

From what I found, during setHosting (grabBucket[PartitionedRegionDataStore.java] -> grabFreeBucket[PartitionedRegionDataStore.java] -> setHosting[ProxyBucketRegion.java] -> setHosting[BucketAdvisor.java]) it volunteers for primary and calls sendProfileUpdate to all other servers.
There it calls BucketProfileUpdateMessage.send and gets stuck, as it cannot get a response from the other members.

A ticket has been opened in the GEODE JIRA: https://issues.apache.org/jira/browse/GEODE-8546
How to reproduce the issue:

  1.   Start two locators and two servers.
  2.   Create a PARTITION_REDUNDANT_PERSISTENT region with redundant-copies=1.
  3.   Create a few PARTITION_REDUNDANT regions (I used six regions) colocated with the persistent region, also with redundant-copies=1.
  4.   Put some entries.
  5.   Restart the servers (you can simply run "kill -15 <server_pids>" and then start both of them at the same time from two terminals).
  6.   Server startup will take some time to finish, and for the latest region bucketCount will be zero on one member.

If someone with more experience with bucket initialization has time to help me with this, I would appreciate it.
For any more info, please contact me.

BR,
Mario





Re: Colocated regions missing some buckets after restart

Posted by Mario Kevo <ma...@est.tech>.
Hi all,

After more investigation I found that for some buckets there is a problem determining which server is primary.
During getPrimary, if the existing primary is null, it waits for a new primary and after some time returns null.

From what I found, during setHosting (grabBucket[PartitionedRegionDataStore.java] -> grabFreeBucket[PartitionedRegionDataStore.java] -> setHosting[ProxyBucketRegion.java] -> setHosting[BucketAdvisor.java]) it volunteers for primary and calls sendProfileUpdate to all other servers.
There it calls BucketProfileUpdateMessage.send and gets stuck, as it cannot get a response from the other members.

A ticket has been opened in the GEODE JIRA: https://issues.apache.org/jira/browse/GEODE-8546
How to reproduce the issue:

  1.   Start two locators and two servers.
  2.   Create a PARTITION_REDUNDANT_PERSISTENT region with redundant-copies=1.
  3.   Create a few PARTITION_REDUNDANT regions (I used six regions) colocated with the persistent region, also with redundant-copies=1.
  4.   Put some entries.
  5.   Restart the servers (you can simply run "kill -15 <server_pids>" and then start both of them at the same time from two terminals).
  6.   Server startup will take some time to finish, and for the latest region bucketCount will be zero on one member.

If someone with more experience with bucket initialization has time to help me with this, I would appreciate it.
For any more info, please contact me.

BR,
Mario


________________________________
From: Mario Kevo <ma...@est.tech>
Sent: September 17, 2020 15:00
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Hi Anil,

The thread dump is attached.
So far we have found one difference between the server logs: the server that has this problem logs the message "Colocation is incomplete".
So it seems that colocation has not finished for this region on this member. The relevant code can be found at this link <https://github.com/apache/geode/blob/f2ccbc8ae860fc018baba7cc8de7b5e01a22c606/geode-core/src/main/java/org/apache/geode/internal/cache/PartitionedRegionDataStore.java#L660>.
We will continue investigating and try to find what causes the issue.

BR,
Mario





Re: Colocated regions missing some buckets after restart

Posted by Mario Kevo <ma...@est.tech>.
Hi Anil,

The thread dump is attached.
So far we have found one difference between the server logs: the server that has this problem logs the message "Colocation is incomplete".
So it seems that colocation has not finished for this region on this member. The relevant code can be found at this link <https://github.com/apache/geode/blob/f2ccbc8ae860fc018baba7cc8de7b5e01a22c606/geode-core/src/main/java/org/apache/geode/internal/cache/PartitionedRegionDataStore.java#L660>.
We will continue investigating and try to find what causes the issue.

BR,
Mario

________________________________
From: Anilkumar Gingade <ag...@vmware.com>
Sent: September 16, 2020 16:55
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Mario,

Take a thread dump a couple of times, at an interval of a minute. See if you can find threads stuck in region creation. This will show whether there is any lock contention.

-Anil.


On 9/16/20, 6:29 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi Anil,

    From the server logs we see that some threads are stuck, and on server2 we continuously get the following message (the bucket is missing on server2 for the DfSessions region):
    [warn 2020/09/15 14:25:39.852 CEST <PartitionedRegion Message Processor2> tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor /__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners []


    And on the other server, server1:
    [warn 2020/09/15 14:25:40.852 CEST <ResourceManagerRecoveryThread 1> tid=0xdf] 15 seconds have elapsed while waiting for replies: <FetchPartitionDetailsMessage$FetchPartitionDetailsResponse 3808 waiting for 1 replies from [192.168.0.145(server2:28054)<v6>:41003]> on 192.168.0.145(server1:28031)<v5>:41002 whose current membership list is: [[192.168.0.145(locator1:27244:locator)<ec><v0>:41000, 192.168.0.145(locator2:27343:locator)<ec><v1>:41001, 192.168.0.145(server1:28031)<v5>:41002, 192.168.0.145(server2:28054)<v6>:41003]]

    [warn 2020/09/15 14:27:20.200 CEST <ThreadsMonitor> tid=0x11] Thread 223 (0xdf) is stuck

    [warn 2020/09/15 14:27:20.202 CEST <ThreadsMonitor> tid=0x11] Thread <223> (0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for <115.361 seconds> and number of thread monitor iteration <1>
    Thread Name <ResourceManagerRecoveryThread 1> state <TIMED_WAITING>
    ...
    It seems that this is not a problem with the stats.
    We suspect that the problem is with some lock, but we need to investigate it a bit more.

    BR,
    Mario
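When many such warnings appear, it can help to pull the affected region and bucket id out of each line. The following is only a sketch: the class and method names are hypothetical, and the regular expression is modelled on the single "waiting for a primary" line quoted in this message, so it is an assumption that other Geode versions format the line the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Geode: scans log lines for the
// "waiting for a primary for bucket" warning quoted in this thread.
public class PrimaryWaitLogScanner {

    // Assumed layout: [BucketAdvisor /__PR/_B__<region>_<bucketId>:<number>: state=<STATE>]
    private static final Pattern WAITING = Pattern.compile(
        "waiting for a primary for bucket \\[BucketAdvisor /__PR/_B__(.+)_(\\d+):\\d+: state=(\\w+)\\]");

    /** Returns one "region/bucketId state" entry per matching log line. */
    public static List<String> bucketsWaitingForPrimary(List<String> logLines) {
        List<String> found = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = WAITING.matcher(line);
            if (m.find()) {
                found.add(m.group(1) + "/" + m.group(2) + " " + m.group(3));
            }
        }
        return found;
    }
}
```

Feeding it the lines of the server2 log would give a list of the buckets that never got a primary, which can then be compared with the bucket counts from show metrics.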



    ________________________________
    From: Anilkumar Gingade <ag...@vmware.com>
    Sent: September 15, 2020 16:36
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: Colocated regions missing some buckets after restart

    Mario,

    I doubt this has anything to do with the client connections. If a connection were the problem, it would be a server/member to server/member connection; in that case the unresponsive member is kicked out of the cluster.

    The recommended configuration is to have persistent regions for both the parent and the co-located regions (and replicated regions)...

    There could be issues in the stats too. Can you try executing some test/validation code on the server side to dump/list the primary and secondary buckets?
    You can do that using helper methods such as: pr.getDataStore().getAllLocalPrimaryBucketIds();

    -Anil

    On 9/14/20, 12:25 AM, "Mario Kevo" <ma...@est.tech> wrote:

        Hi,


        This problem is usually seen on only one server. The other servers' metrics and bucket counts look fine. Another symptom of this issue is that the max-connections limit is reached on the problematic server if we have a client that tries to reconnect after the server restart. Clients simply get no response from the server, so they try to close the connection, but the connection close is not acknowledged by the server. On the server side we see that the connections are in the CLOSE-WAIT state with packets in the socket receive queue. It’s as if the server just stopped processing packets on the sockets while waiting for a member with the primary bucket.



        So in short, each new client connection is “unresponsive”. The client tries to close it and open a new one, but the socket doesn’t get closed on the server side and the connection is left “hanging” on the server. Clients will keep doing this until max-connections is reached on the servers. This is why we would be unable to add any data to the regions. But IMHO it’s really not dependent on adding data, since this issue happens occasionally (1 out of ~4 restarts) and only on one server.



        The initial problem was observed with a persistent region A (with 10000 key-value pairs inserted) and a non-persistent region B colocated with region A. We did some tests with both regions being persistent. We haven’t observed the same issue yet (although we did only a few restarts), but we observed something that also looks quite worrying. Both servers start up without reporting issues in the logs. But, looking at the server metrics, one server has wrong information about “bucketCount” and is missing primary buckets. E.g.:


        First server:

        Partition | putLocalRate                 | 0.0
                  | putRemoteRate                | 0.0
                  | putRemoteLatency             | 0
                  | putRemoteAvgLatency          | 0
                  | bucketCount                  | 113
                  | primaryBucketCount           | 57

        Second server:

        Partition | putLocalRate                 | 0.0
                  | putRemoteRate                | 0.0
                  | putRemoteLatency             | 0
                  | putRemoteAvgLatency          | 0
                  | bucketCount                  | 111
                  | primaryBucketCount           | 55


        So we are missing a primary bucket without being aware of the issue.

        BR,
        Mario
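The arithmetic behind "missing a primary bucket" can be made explicit. This is a minimal sketch with hypothetical names; it assumes the region has 113 buckets in total (the count shown elsewhere in this thread), that all of them have been created, and that each bucket should have exactly one primary across the members.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical check, not a Geode API: compares the sum of the
// primaryBucketCount values reported per member with the total bucket count.
public class PrimaryBucketCheck {

    /** Returns how many buckets have no primary (0 means all are accounted for). */
    public static int missingPrimaries(int totalBuckets, Map<String, Integer> primaryCountByMember) {
        int hosted = 0;
        for (int count : primaryCountByMember.values()) {
            hosted += count;
        }
        return totalBuckets - hosted;
    }

    public static void main(String[] args) {
        // Values copied from the two metric listings above.
        Map<String, Integer> primaries = new LinkedHashMap<>();
        primaries.put("first server", 57);
        primaries.put("second server", 55);
        // 113 - (57 + 55) = 1
        System.out.println("missing primaries: " + missingPrimaries(113, primaries));
    }
}
```

Running the same check after every restart would flag the problem even when the server logs are clean.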

        ________________________________
        From: Anilkumar Gingade <ag...@vmware.com>
        Sent: September 11, 2020 20:34
        To: dev@geode.apache.org <de...@geode.apache.org>
        Subject: Re: Colocated regions missing some buckets after restart

        Are you seeing no buckets for the persistent regions or the non-persistent ones? The buckets are created dynamically, when data is added to the corresponding buckets...
        When a server is restarted, in the case of in-memory regions the data is no longer there, so the bucket region may not have been created (my suspicion).
        Can you try adding data and see if the co-located bucket region gets created on the respective nodes/servers?

        -Anil.


        On 9/11/20, 9:46 AM, "Mario Kevo" <ma...@est.tech> wrote:

            Hi geode-dev,

            We have a system with two servers and a few regions. One region is persistent and the others are not, but they are colocated with the persistent region.
            After the servers restart, we can see that some regions don't have any buckets.
            gfsh>show metrics --member=server-1 --region=/region1 --categories=partition
            Metrics for region:/region1 On Member server-1


            Category  |            Metric            | Value
            --------- | ---------------------------- | -----
            partition | putLocalRate                 | 0.0
                      | putRemoteRate                | 0.0
                      | putRemoteLatency             | 0
                      | putRemoteAvgLatency          | 0
                      | bucketCount                  | 0
                      | primaryBucketCount           | 0
                      | configuredRedundancy         | 1
                      | actualRedundancy             | 0
                      | numBucketsWithoutRedundancy  | 113
                      | totalBucketSize              | 0

            gfsh>show metrics --member=server-0 --region=/region1 --categories=partition
            Metrics for region:/region1 On Member server-0

            Category  |            Metric            | Value
            --------- | ---------------------------- | -----
            partition | putLocalRate                 | 0.0
                      | putRemoteRate                | 0.0
                      | putRemoteLatency             | 0
                      | putRemoteAvgLatency          | 0
                      | bucketCount                  | 113
                      | primaryBucketCount           | 56
                      | configuredRedundancy         | 1
                      | actualRedundancy             | 0
                      | numBucketsWithoutRedundancy  | 113
                      | totalBucketSize              | 0


            The persistent region is ok, but some of these colocated regions has this issue. We also wait some time, but it doesn't change.

            Does anyone have some idea about this problem, what causing the issue?
            The issue can be easily reproduced with two locators, two servers, one persistent region and few non-persistent regions colocated with persistent one.
            After restart both servers and try to do show metrics command you will got this issue for some regions.

            BR,
            Mario





Re: Colocated regions missing some buckets after restart

Posted by Anilkumar Gingade <ag...@vmware.com>.
Mario,

Take a thread dump a couple of times, about a minute apart, and see if you can find threads stuck in region creation. That will show whether there is any lock contention.
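
For reference, the usual way to do this from outside the process is the JDK's jstack tool run against the server pid. The sketch below shows the same idea from inside a JVM using the standard java.lang.management API: take two snapshots and list the threads that look identical in both. The class name, the output format and the idea of running it inside the server (for example from a deployed function) are assumptions for illustration, not something Geode provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Geode: compares two in-JVM thread snapshots.
final class ThreadDumpSampler {

  /** Returns one line per live thread: name, state and the lock it waits on (if any). */
  static List<String> snapshot() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    List<String> lines = new ArrayList<>();
    // false/false: skip locked monitor and synchronizer details to keep the dump cheap.
    for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
      if (info == null) {
        continue; // thread ended while the dump was taken
      }
      String lock = info.getLockName() == null ? "-" : info.getLockName();
      lines.add(info.getThreadName() + " | " + info.getThreadState() + " | waiting on " + lock);
    }
    return lines;
  }

  /**
   * Lines present in both snapshots. Idle pool threads show up here too, so this
   * only narrows down the candidates; it does not prove a thread is stuck.
   */
  static List<String> unchanged(List<String> first, List<String> second) {
    List<String> same = new ArrayList<>(first);
    same.retainAll(second);
    return same;
  }

  public static void main(String[] args) throws InterruptedException {
    // Interval defaults to one minute, as suggested in the thread.
    long intervalMillis = args.length > 0 ? Long.parseLong(args[0]) : 60_000L;
    List<String> first = snapshot();
    Thread.sleep(intervalMillis);
    List<String> second = snapshot();
    unchanged(first, second).forEach(System.out::println);
  }
}
```

Threads that sit in the same state on the same lock across several samples, especially ones with region or bucket creation in their names, are the ones worth looking at in a full jstack dump.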

-Anil.


On 9/16/20, 6:29 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi Anil,

    From the server logs we see that some threads are stuck, and on server2 we continuously get the following message (the bucket is missing on server2 for the DfSessions region):
    [warn 2020/09/15 14:25:39.852 CEST <PartitionedRegion Message Processor2> tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor /__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners []


    And on the other server, server1:
    [warn 2020/09/15 14:25:40.852 CEST <ResourceManagerRecoveryThread 1> tid=0xdf] 15 seconds have elapsed while waiting for replies: <FetchPartitionDetailsMessage$FetchPartitionDetailsResponse 3808 waiting for 1 replies from [192.168.0.145(server2:28054)<v6>:41003]> on 192.168.0.145(server1:28031)<v5>:41002 whose current membership list is: [[192.168.0.145(locator1:27244:locator)<ec><v0>:41000, 192.168.0.145(locator2:27343:locator)<ec><v1>:41001, 192.168.0.145(server1:28031)<v5>:41002, 192.168.0.145(server2:28054)<v6>:41003]]

    [warn 2020/09/15 14:27:20.200 CEST <ThreadsMonitor> tid=0x11] Thread 223 (0xdf) is stuck

    [warn 2020/09/15 14:27:20.202 CEST <ThreadsMonitor> tid=0x11] Thread <223> (0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for <115.361 seconds> and number of thread monitor iteration <1>
    Thread Name <ResourceManagerRecoveryThread 1> state <TIMED_WAITING>
    ...
    It seems that this is not a problem with the stats.
    We suspect that the problem is with some lock, but we need to investigate it a bit more.

    BR,
    Mario




Re: Colocated regions missing some buckets after restart

Posted by Mario Kevo <ma...@est.tech>.
Hi Anil,

From the server logs we see that some threads are stuck, and on server2 we continuously get the following message (the bucket is missing on server2 for the DfSessions region):
[warn 2020/09/15 14:25:39.852 CEST <PartitionedRegion Message Processor2> tid=0x251] 15 secs have elapsed waiting for a primary for bucket [BucketAdvisor /__PR/_B__DfSessions_18:935: state=VOLUNTEERING_HOSTING]. Current bucket owners []


And on the other server, server1:
[warn 2020/09/15 14:25:40.852 CEST <ResourceManagerRecoveryThread 1> tid=0xdf] 15 seconds have elapsed while waiting for replies: <FetchPartitionDetailsMessage$FetchPartitionDetailsResponse 3808 waiting for 1 replies from [192.168.0.145(server2:28054)<v6>:41003]> on 192.168.0.145(server1:28031)<v5>:41002 whose current membership list is: [[192.168.0.145(locator1:27244:locator)<ec><v0>:41000, 192.168.0.145(locator2:27343:locator)<ec><v1>:41001, 192.168.0.145(server1:28031)<v5>:41002, 192.168.0.145(server2:28054)<v6>:41003]]

[warn 2020/09/15 14:27:20.200 CEST <ThreadsMonitor> tid=0x11] Thread 223 (0xdf) is stuck

[warn 2020/09/15 14:27:20.202 CEST <ThreadsMonitor> tid=0x11] Thread <223> (0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> has been stuck for <115.361 seconds> and number of thread monitor iteration <1>
Thread Name <ResourceManagerRecoveryThread 1> state <TIMED_WAITING>
...
It seems that this is not a problem with the stats.
We suspect that the problem is with some lock, but we need to investigate it a bit more.
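
To line up these warnings across both servers, a small parser for the ThreadsMonitor "has been stuck for" line can help. This is only a sketch: the regular expression is derived from the single log line quoted above, and the class and field names are made up for illustration. Other Geode versions may format the message differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts fields from the ThreadsMonitor warning shown in this thread.
final class StuckThreadLogParser {

  /** Fields pulled from one "has been stuck for" warning. */
  static final class StuckThread {
    final String timestamp;    // as printed in the log, e.g. 2020/09/15 14:27:20.202
    final String tid;          // hex id of the stuck thread, e.g. 0xdf
    final double stuckSeconds; // value between "stuck for <" and " seconds>"

    StuckThread(String timestamp, String tid, double stuckSeconds) {
      this.timestamp = timestamp;
      this.tid = tid;
      this.stuckSeconds = stuckSeconds;
    }
  }

  // Pattern assumed from the sample line; the time zone token (CEST) is skipped with \S+.
  private static final Pattern STUCK = Pattern.compile(
      "^\\[warn (\\d{4}/\\d{2}/\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) \\S+ "
          + "<ThreadsMonitor> tid=0x[0-9a-f]+\\] "
          + "Thread <\\d+> \\((0x[0-9a-f]+)\\) .* has been stuck for <([0-9.]+) seconds>.*$");

  /** Returns the parsed warning, or empty if the line is some other log message. */
  static Optional<StuckThread> parse(String line) {
    Matcher m = STUCK.matcher(line);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(
        new StuckThread(m.group(1), m.group(2), Double.parseDouble(m.group(3))));
  }

  public static void main(String[] args) {
    String line = "[warn 2020/09/15 14:27:20.202 CEST <ThreadsMonitor> tid=0x11] "
        + "Thread <223> (0xdf) that was executed at <15 Sep 2020 14:25:24 CEST> "
        + "has been stuck for <115.361 seconds> and number of thread monitor iteration <1>";
    parse(line).ifPresent(s ->
        System.out.println(s.timestamp + " tid=" + s.tid + " stuck=" + s.stuckSeconds + "s"));
  }
}
```

The tid extracted here (0xdf) is the same one that appears in the ResourceManagerRecoveryThread warning, which is how the two messages can be tied together.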

BR,
Mario



________________________________
From: Anilkumar Gingade <ag...@vmware.com>
Sent: 15 September 2020, 16:36
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Mario,

I doubt this has anything to do with the client connections. If connections were the problem, it would be on the server/member to server/member connections, and in that case the unresponsive member is kicked out of the cluster.

The recommended configuration is to have persistent regions for both the parent and the co-located regions (and replicated regions).

There could be issues in the stats too. Can you try executing test/validation code on the server side to dump/list the primary and secondary buckets?
You can do that using helper methods: pr.getDataStore().getAllLocalPrimaryBucketIds();

-Anil


Re: Colocated regions missing some buckets after restart

Posted by Anilkumar Gingade <ag...@vmware.com>.
Mario,

I doubt this has anything to do with the client connections. If connections were the problem, it would be on the server/member to server/member connections, and in that case the unresponsive member is kicked out of the cluster.

The recommended configuration is to have persistent regions for both the parent and the co-located regions (and replicated regions).

There could be issues in the stats too. Can you try executing test/validation code on the server side to dump/list the primary and secondary buckets?
You can do that using helper methods: pr.getDataStore().getAllLocalPrimaryBucketIds();
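
Once the bucket ids have been collected from each server, the comparison itself is plain set arithmetic. The sketch below assumes the id sets were obtained with the helper mentioned above (and a similar call for all locally hosted buckets, which should be checked against the data store class in your Geode version). The class name, member names and example ids are made up. Since buckets are created on demand, an id without a primary is only suspicious if that bucket is expected to hold data.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: works on plain id sets, so it has no Geode dependency.
final class PrimaryBucketCheck {

  /** Bucket ids in [0, totalBuckets) that no member reports as primary. */
  static SortedSet<Integer> withoutPrimary(
      int totalBuckets, Map<String, Set<Integer>> primariesByMember) {
    SortedSet<Integer> missing = new TreeSet<>();
    for (int id = 0; id < totalBuckets; id++) {
      missing.add(id);
    }
    for (Set<Integer> ids : primariesByMember.values()) {
      missing.removeAll(ids);
    }
    return missing;
  }

  /** Bucket ids hosted on a member that are not primary there, i.e. its secondaries. */
  static SortedSet<Integer> secondaries(Set<Integer> allLocal, Set<Integer> localPrimaries) {
    SortedSet<Integer> result = new TreeSet<>(allLocal);
    result.removeAll(localPrimaries);
    return result;
  }

  public static void main(String[] args) {
    // Made-up example with 4 buckets; a real region in this thread reports 113.
    Map<String, Set<Integer>> primaries =
        Map.of("server1", Set.of(0, 2), "server2", Set.of(1));
    System.out.println("no primary: " + withoutPrimary(4, primaries));
    System.out.println("server1 secondaries: " + secondaries(Set.of(0, 1, 2), Set.of(0, 2)));
  }
}
```

Comparing these sets with the bucketCount and primaryBucketCount values from show metrics would tell whether the stats or the actual bucket assignment is off.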

-Anil

On 9/14/20, 12:25 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi,


    This problem is usually seen on only one server. The other servers' metrics and bucket counts look fine. Another symptom of this issue is that the max-connections limit is reached on the problematic server if we have a client that tries to reconnect after the server restart. Clients simply get no response from the server so they try to close the connection, but the connection close is not acknowledged by the server. On the server side we see that the connections are in CLOSE-WAIT state with packets in the socket receive queue. It’s as if the servers just stopped processing packets on the sockets while waiting for a member with the primary bucket.



    So in short, each new client connection is “unresponsive”. The client tries to close it and open a new one, but the socket doesn’t get closed on the server side and the connection is left “hanging” on the server. Clients will try to do this until max-connections is reached on the servers. This is why we would be unable to add any data to the regions. But IMHO it’s really not dependent on adding data, since this issue happens occasionally (1 out of ~4 restarts) and only on one server.



    The initial problem was observed with a persistent region A (with 10000 key-value pairs inserted) and a non-persistent region B colocated with region A. We did some tests with both regions being persistent. We haven’t observed the same issue yet (although we did only a few restarts), but we observed something that also looks quite worrying. Both servers start up without reporting issues in the logs. But, looking at the server metrics, one server has wrong information about “bucketCount” and is missing primary buckets. E.g.:


    First server:

    Partition | putLocalRate                 | 0.0
              | putRemoteRate                | 0.0
              | putRemoteLatency             | 0
              | putRemoteAvgLatency          | 0
              | bucketCount                  | 113
              | primaryBucketCount           | 57



    Second server:

    Partition | putLocalRate                 | 0.0
              | putRemoteRate                | 0.0
              | putRemoteLatency             | 0
              | putRemoteAvgLatency          | 0
              | bucketCount                  | 111
              | primaryBucketCount           | 55


    So we are missing a primary bucket without being aware of the issue.

    BR,
    Mario


Re: Colocated regions missing some buckets after restart

Posted by Mario Kevo <ma...@est.tech>.
Hi,


This problem is usually seen on only one server. The other servers' metrics and bucket counts look fine. Another symptom of this issue is that the max-connections limit is reached on the problematic server if we have a client that tries to reconnect after the server restart. Clients simply get no response from the server so they try to close the connection, but the connection close is not acknowledged by the server. On the server side we see that the connections are in CLOSE-WAIT state with packets in the socket receive queue. It’s as if the servers just stopped processing packets on the sockets while waiting for a member with the primary bucket.



So in short, each new client connection is “unresponsive”. The client tries to close it and open a new one, but the socket doesn’t get closed on the server side and the connection is left “hanging” on the server. Clients will try to do this until max-connections is reached on the servers. This is why we would be unable to add any data to the regions. But IMHO it’s really not dependent on adding data, since this issue happens occasionally (1 out of ~4 restarts) and only on one server.
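
For anyone who wants to watch the CLOSE-WAIT build-up described above, here is a rough sketch that counts such sockets on the cache server port by reading /proc/net/tcp and /proc/net/tcp6. It assumes a Linux host, where state 08 in those files means CLOSE_WAIT and the rx_queue half of the tx_queue:rx_queue column is the number of unread bytes. The class name is made up, and port 40404 is only Geode's default server port, so pass the real one as an argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for Linux hosts: counts CLOSE_WAIT sockets on one local port.
final class CloseWaitCounter {

  private static final String CLOSE_WAIT = "08"; // TCP_CLOSE_WAIT in /proc/net/tcp

  /** True if the row is a CLOSE_WAIT socket whose local port matches. */
  static boolean isCloseWaitOnPort(String row, int localPort) {
    String[] f = row.trim().split("\\s+");
    if (f.length < 5 || !f[0].endsWith(":")) {
      return false; // header line or malformed row
    }
    // local_address is HEXIP:HEXPORT; the port is the part after the last colon.
    int port = Integer.parseInt(f[1].substring(f[1].lastIndexOf(':') + 1), 16);
    return port == localPort && CLOSE_WAIT.equals(f[3]);
  }

  /** Unread bytes in the receive queue; call only for data rows. */
  static long rxQueueBytes(String row) {
    String[] f = row.trim().split("\\s+");
    return Long.parseLong(f[4].substring(f[4].indexOf(':') + 1), 16);
  }

  public static void main(String[] args) throws IOException {
    int port = args.length > 0 ? Integer.parseInt(args[0]) : 40404;
    int count = 0;
    long pendingBytes = 0;
    // Java servers often listen on dual-stack sockets, so check the IPv6 table too.
    for (String file : new String[] {"/proc/net/tcp", "/proc/net/tcp6"}) {
      Path path = Paths.get(file);
      if (!Files.isReadable(path)) {
        continue;
      }
      for (String row : Files.readAllLines(path)) {
        if (isCloseWaitOnPort(row, port)) {
          count++;
          pendingBytes += rxQueueBytes(row);
        }
      }
    }
    System.out.println(count + " CLOSE_WAIT sockets on port " + port
        + ", " + pendingBytes + " unread bytes");
  }
}
```

A count that keeps growing toward max-connections while the unread byte total stays non-zero would match the behaviour we see.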



The initial problem was observed with a persistent region A (with 10000 key-value pairs inserted) and a non-persistent region B colocated with region A. We did some tests with both regions being persistent. We haven’t observed the same issue yet (although we did only a few restarts), but we observed something that also looks quite worrying. Both servers start up without reporting issues in the logs. But, looking at the server metrics, one server has wrong information about “bucketCount” and is missing primary buckets. E.g.:


First server:

Partition | putLocalRate                 | 0.0
          | putRemoteRate                | 0.0
          | putRemoteLatency             | 0
          | putRemoteAvgLatency          | 0
          | bucketCount                  | 113
          | primaryBucketCount           | 57



Second server:

Partition | putLocalRate                 | 0.0
          | putRemoteRate                | 0.0
          | putRemoteLatency             | 0
          | putRemoteAvgLatency          | 0
          | bucketCount                  | 111
          | primaryBucketCount           | 55


So we are missing a primary bucket without being aware of the issue.

BR,
Mario

________________________________
From: Anilkumar Gingade <ag...@vmware.com>
Sent: 11 September 2020, 20:34
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Colocated regions missing some buckets after restart

Are you seeing no buckets for the persistent regions or the non-persistent ones? Buckets are created dynamically, when data is added to the corresponding buckets.
When a server is restarted, the data of in-memory regions is gone, so the bucket regions may not have been created (my suspicion).
Can you try adding data and see if the co-located bucket regions get created on the respective nodes/servers?

-Anil.



Re: Colocated regions missing some buckets after restart

Posted by Anilkumar Gingade <ag...@vmware.com>.
Are you seeing no buckets for the persistent regions or the non-persistent ones? Buckets are created dynamically, when data is added to the corresponding buckets.
When a server is restarted, the data of in-memory regions is gone, so the bucket regions may not have been created (my suspicion).
Can you try adding data and see if the co-located bucket regions get created on the respective nodes/servers?

-Anil.


On 9/11/20, 9:46 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi geode-dev,

    We have a system with two servers and a few regions. One region is persistent and the others are not, but they are colocated with this persistent region.
    After a server restart we can see that some regions don't have any buckets.
    gfsh>show metrics --member=server-1 --region=/region1 --categories=partition
    Metrics for region:/region1 On Member server-1


    Category  |            Metric            | Value
    --------- | ---------------------------- | -----
    partition | putLocalRate                 | 0.0
              | putRemoteRate                | 0.0
              | putRemoteLatency             | 0
              | putRemoteAvgLatency          | 0
              | bucketCount                  | 0
              | primaryBucketCount           | 0
              | configuredRedundancy         | 1
              | actualRedundancy             | 0
              | numBucketsWithoutRedundancy  | 113
              | totalBucketSize              | 0

    gfsh>show metrics --member=server-0 --region=/region1 --categories=partition
    Metrics for region:/region1 On Member server-0

    Category  |            Metric            | Value
    --------- | ---------------------------- | -----
    partition | putLocalRate                 | 0.0
              | putRemoteRate                | 0.0
              | putRemoteLatency             | 0
              | putRemoteAvgLatency          | 0
              | bucketCount                  | 113
              | primaryBucketCount           | 56
              | configuredRedundancy         | 1
              | actualRedundancy             | 0
              | numBucketsWithoutRedundancy  | 113
              | totalBucketSize              | 0


    The persistent region is ok, but some of the colocated regions have this issue. We also waited some time, but it doesn't change.

    Does anyone have an idea what is causing the issue?
    The issue can be easily reproduced with two locators, two servers, one persistent region and a few non-persistent regions colocated with the persistent one.
    After restarting both servers and running the show metrics command you will get this issue for some regions.

    BR,
    Mario