Posted to user@geode.apache.org by Philippe CEROU <ph...@gfi.fr> on 2019/01/30 08:19:41 UTC

Is there a way to boost all cluster startup when persistence is ON ?

Hi,

New test, new problem 😊

I have a 30-node Apache Geode cluster (30 x 4 CPU / 16 GB RAM) with 200 000 000 partitioned & replicated (redundant-copies=2) "PDX-ized" rows.

When I stop the whole cluster (servers then locators) and do a full restart (locators then servers), I run into two possible problems:

1. The warm-up time is very long (about 15 minutes): each disk store is waiting on another node's disk store, and until everything finishes the service is down (more precisely, the cluster is there but the region is not).

Is there a way to make the region available immediately and start accepting at least INSERTs?

The best would be to serve the region for INSERTs right away and, if needed, fall back to disk for READs 😊

2. Sometimes when I start up my 30 servers, some of them crash after roughly 1 to 2 minutes with these lines:

[info 2019/01/30 08:03:02.603 UTC nx130b-srv <main> tid=0x1] Initialization of region PdxTypes completed

[info 2019/01/30 08:04:17.606 UTC nx130b-srv <Geode Failure Detection Scheduler1> tid=0x1c] Failure detection is now watching 10.200.6.107(nx107c-srv:19000)<v5>:41000

[info 2019/01/30 08:04:17.613 UTC nx130b-srv <Geode Failure Detection thread 3> tid=0xe2] Failure detection is now watching 10.200.4.112(nx112b-srv:18872)<v5>:41000

[error 2019/01/30 08:05:02.942 UTC nx130b-srv <main> tid=0x1] Cache initialization for GemFireCache[id = 2125470482; isClosing = false; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.

[info 2019/01/30 08:05:02.961 UTC nx130b-srv <main> tid=0x1] GemFireCache[id = 2125470482; isClosing = true; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

[info 2019/01/30 08:05:03.325 UTC nx130b-srv <main> tid=0x1] Shutting down DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000.

[info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Now closing distribution for 10.200.4.130(nx130b-srv:3065)<v5>:41000

[info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Stopping membership services

[info 2019/01/30 08:05:03.435 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor server socket is closed in stopServices().

[info 2019/01/30 08:05:03.435 UTC nx130b-srv <Geode Failure Detection Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting

[info 2019/01/30 08:05:03.436 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor serverSocketExecutor is terminated

[info 2019/01/30 08:05:03.475 UTC nx130b-srv <main> tid=0x1] DistributionManager stopped in 150ms.

[info 2019/01/30 08:05:03.476 UTC nx130b-srv <main> tid=0x1] Marking DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000 as closed.

Any idea ?
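
One idea I had: instead of stopping the servers one by one, would a cluster-wide shutdown let the members agree on who holds the latest data, so that they do not wait on each other at the next restart? Something like this (locator list as in my earlier scripts):

gfsh \
-e "connect --locator='nx101c[1234],nx102a[1234],nx103b[1234]'" \
-e "shutdown --include-locators=true"

Then, at restart, start the locators first and launch all 30 servers in parallel rather than one after the other.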

Best regards,

—
Gfi Informatique
Philippe Cerou
Architecte & Expert Système
GFI Production / Toulouse
philippe.cerou @gfi.fr
—
1 Rond-point du Général Eisenhower, 31400 Toulouse
Tél. : +33 (0)5.62.85.11.55
Mob. : +33 (0)6.03.56.48.62
www.gfi.world
—


From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Thursday, January 24, 2019 13:23
To: user@geode.apache.org
Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

Answering my own question,

Finally, after shutting down and restarting the cluster, it was "just" reloading data from the disks, and it took 1h34m to bring the region back (for 200 000 000 persisted rows across 10 nodes).

I still have some questions 😊

The memory usage reported for the cluster is very different from the memory used by the single region it serves, see screenshots.

Cluster memory usage (205 GB):

[screenshot]

But the region consumes 311 GB:

[screenshot]

What I do not understand:

  *   If I consume 205 GB, with 35 GB for "region disk caching", what are the remaining 205 GB - 35 GB = 170 GB used for?
  *   If I really use only 205 GB out of the 305 GB of heap, and my region is about 311 GB, why is so little memory used to cache the region?
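
To try to see where the memory actually goes, I was planning to pull the region and member metrics from gfsh, for example:

gfsh>show metrics --region=/ksdata-benchmark
gfsh>show metrics --member=nx101c-srv

but I am not sure how these figures relate to the ones Pulse shows.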

If I launch a simple query --query='select count(*) from /ksdata-benchmark where C001="AD6909"' from gfsh, I still have no response after 10 minutes, even though the requested column is indexed!

gfsh>list indexes
Member Name |                Member ID                 |    Region Path    |         Name          | Type  | Indexed Expression |    From Clause    | Valid Index
----------- | ---------------------------------------- | ----------------- | --------------------- | ----- | ------------------ | ----------------- | -----------
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
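
One thing I want to double-check about that query: I wrote the literal with double quotes (C001="AD6909"), while OQL string literals are normally written with single quotes, so maybe the predicate is not hitting the RANGE index the way I expect. I will retry it like this:

gfsh>query --query="select count(*) from /ksdata-benchmark where C001 = 'AD6909'"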

Best regards,
Philippe Cerou



From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Thursday, January 24, 2019 11:21
To: user@geode.apache.org<ma...@geode.apache.org>
Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

I understand that this is a question of heap memory consumption, but I think we are doing something wrong, because:


  1.  I have retried with 10 AWS C5.4XLARGE nodes (8 CPU + 64 GB RAM + 50 GB SSD disks); these nodes have 4x more memory than in the previous test, which gives us a Geode cluster with 305 GB of memory (Pulse / Total heap).



  2.  My input data is a CSV with 200 000 000 rows of about 150 bytes each, split into 19 LONG & STRING columns; with the region's redundant-copies=2 this gives me about 3 x 200 000 000 x 150 = 83.81 GB. I also have 4 indexes on 4 LONG columns. My memory-to-data ratio (with the 80% heap threshold, eviction-heap-percentage=80) is therefore about (305 GB x 0.8) / 83.81 GB = 2.91, which is not very good ☹

So I think I have a problem with the OVERFLOW settings, even though I did the same as the documentation shows.

Locator launch command:

gfsh -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='${LLOCATORS}'"

PDX configuration :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "configure pdx --disk-store=DEFAULT --read-serialized=true"

Servers launch command:

WMEM=28000
gfsh \
-e "start server --name=${WSNAME} --initial-heap=${WMEM}M --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"

Region declaration :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
-e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"

Another "new" problem: if I shut down the whole cluster and then restart it (locators then servers), I no longer see my region, but heap memory is consumed (I attached a Pulse screenshot).


If I look at the server logs I see this after one hour (multiple times, on multiple servers, with different details):

10.200.2.108: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [19, 67, 101, 103, 110, 112] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.2.108:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
10.200.2.108:   Name: nx108a-srv
10.200.2.108:   Location: /10.200.2.108:/opt/app/geode/node/nx108a-srv/ksdata
10.200.2.108: Offline members with potentially new data:[
10.200.2.108:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.2.108:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.2.108:   Buckets: [19, 110]
10.200.2.108: ,
10.200.2.108:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
10.200.2.108:   Location: /10.200.6.107:/opt/app/geode/node/nx107c-srv/ksdata
10.200.2.108:   Buckets: [19, 67, 101, 112]
10.200.2.108: ,
10.200.2.108:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
10.200.2.108:   Location: /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata
10.200.2.108:   Buckets: [103]
10.200.2.108: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
10.200.6.104: ...........................................................................................................................................
10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 31, 44, 65, 99] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
10.200.6.104:   Name: nx104c-srv
10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
10.200.6.104: Offline members with potentially new data:[
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [7]
10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
10.200.6.104: ..............................
10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 8, 31, 44, 52, 65, 95, 99] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
10.200.6.104:   Name: nx104c-srv
10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
10.200.6.104: Offline members with potentially new data:[
10.200.6.104:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
10.200.6.104:   Location: /10.200.2.102:/opt/app/geode/node/nx102a-srv/ksdata
10.200.6.104:   Buckets: [8, 52, 95]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [7]
10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.

Everyone seems to be waiting for something from everyone else, even though all 10 of my nodes are UP.

If I use the "gfsh show missing-disk-stores" command, there are between 5 and 20 missing ones.

gfsh>show missing-disk-stores
Missing Disk Stores


           Disk Store ID             |     Host      | Directory
------------------------------------ | ------------- | -------------------------------------
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 | /opt/app/geode/node/nx105a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 | /opt/app/geode/node/nx107c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 | /opt/app/geode/node/nx101c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 | /opt/app/geode/node/nx110c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata

Very strange...

Does it mean that some data is lost even with 3 copies (redundant-copies=2)?

I saw in the documentation that "missing disk stores" can be revoked, but it is not clear whether data is ultimately lost or not ☹
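
If I have understood correctly, revoking only tells the online members to stop waiting for that store and to carry on with the data they already have, so updates that existed only on the revoked store would be dropped. The sequence would be something like this (the ID below is just one of those listed above, as an example):

gfsh>show missing-disk-stores
gfsh>revoke missing-disk-store --id=26112834-88bc-4653-94a9-10db18d5ebb4

Can someone confirm whether data can really be lost in that case?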

Best regards,
Philippe Cerou



From: Anthony Baker [mailto:abaker@pivotal.io]
Sent: Wednesday, January 23, 2019 18:28
To: user@geode.apache.org<ma...@geode.apache.org>
Subject: Re: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

When a cluster member becomes unresponsive, Geode may fence off the member in order to preserve consistency and availability.  The question to investigate is *why* the member got into this state.

Questions to investigate:

- How much heap memory is your data consuming?
- How much data is overflowed to disk vs in heap memory?
- How much data is being read from disk vs memory?
- Is GC activity consuming significant cpu resources?
- Are there other processes running on the system causing swapping behavior?
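
For the last two points, a quick way to check on each server host (assuming the JDK tools are on the PATH; <pid> is the Geode server's process id):

jstat -gcutil <pid> 5000     # GC occupancy and accumulated GC time, sampled every 5 seconds
vmstat 5                     # the si/so columns show whether the box is swapping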

Anthony


On Jan 23, 2019, at 8:45 AM, Philippe CEROU <ph...@gfi.fr> wrote:

Hi,

I think I have found the problem,

When we use a region with OVERFLOW to disk, once the configured memory percentage is reached the Geode servers become very, very slow ☹

In the end, even though we have a lot of nodes, the cluster as a whole cannot write/ingest as fast as the clients send data and everything collapses; we went from 140 000 rows per second (in memory) to less than 10 000 rows per second (once overflow started)...

Is the product able to overflow to disk without such a big drop in throughput?

For information, here are my launch commands (9 nodes) :

3 x (one per locator node):

gfsh \
-e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='nx101c[1234],nc102a[1234],nx103b[1234]'"

For PDX :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "configure pdx --disk-store=DEFAULT --read-serialized=true"

9 x (one per server node):

gfsh \
-e "start server --name=${WSNAME} --initial-heap=6000M --max-heap=6000M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"

For disks & regions :

gfsh \
                -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
                -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"

gfsh \
                -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
                -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
                -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"

Best regards,
Philippe Cerou



From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Wednesday, January 23, 2019 08:40
To: user@geode.apache.org<ma...@geode.apache.org>
Subject: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

Following my Geode study, I now have this situation.

I have 6 nodes, 3 have a locator and 6 have a server.

When I try to do a massive insertion (200 million PDX-ized rows), after some hours I get this error on the client side for all threads:

...
Exception in thread "Thread-20" org.apache.geode.cache.client.ServerOperationException: remote server on nxmaster(28523:loner):35042:efcecc76: Region /ksdata-benchmark putAll at server applied partial keys due to exception.
        at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9542)
        at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9446)
        at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9458)
        at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.TableInsert(CxDrvGEODE.java:144)
        at com.gfi.rt.lib.database.connectors.CxTable.insert(CxTable.java:175)
        at com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(BenchmarkObject.java:279)
        at com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(BenchmarkThread.java:84)
        at com.gfi.rt.bin.database.dbbench.BenchmarkThread.run(BenchmarkThread.java:67)
Caused by: org.apache.geode.cache.persistence.PartitionOfflineException: Region /ksdata-benchmark bucket 48 has persistent data that is no longer online stored at these locations: [/10.200.6.101:/opt/app/geode/node/nx101c-srv/ksdata created at timestamp 1548181352906 version 0 diskStoreId 3650a4bc61f447a3-bc9cba70cf59e514 name null, /10.200.4.106:/opt/app/geode/node/nx106b-srv/ksdata created at timestamp 1548181352865 version 0 diskStoreId a7de7988708b44d2-b4b09d91c1f536b6 name null, /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata created at timestamp 1548181353100 version 0 diskStoreId 72dcab08ae12413c-b099fbb7f9ab740b name null]
        at org.apache.geode.internal.cache.ProxyBucketRegion.checkBucketRedundancyBeforeGrab(ProxyBucketRegion.java:590)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.lockRedundancyLock(PartitionedRegionDataStore.java:595)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:440)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2858)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.handleManageBucketRequest(PartitionedRegionDataStore.java:1014)
        at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketOnMember(PRHARedundancyProvider.java:1233)
        at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketInstance(PRHARedundancyProvider.java:416)
        at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketAtomically(PRHARedundancyProvider.java:604)
        at org.apache.geode.internal.cache.PartitionedRegion.createBucket(PartitionedRegion.java:3310)
        at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2055)
        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
        at org.apache.geode.internal.cache.PartitionedRegion.performPutAllEntry(PartitionedRegion.java:2124)
        at org.apache.geode.internal.cache.LocalRegion.basicEntryPutAll(LocalRegion.java:10060)
        at org.apache.geode.internal.cache.LocalRegion.access$100(LocalRegion.java:231)
        at org.apache.geode.internal.cache.LocalRegion$2.run(LocalRegion.java:9639)
        at org.apache.geode.internal.cache.event.NonDistributedEventTracker.syncBulkOp(NonDistributedEventTracker.java:107)
        at org.apache.geode.internal.cache.LocalRegion.syncBulkOp(LocalRegion.java:6085)
        at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9657)
        at org.apache.geode.internal.cache.LocalRegion.basicBridgePutAll(LocalRegion.java:9367)
        at org.apache.geode.internal.cache.tier.sockets.command.PutAll80.cmdExecute(PutAll80.java:270)
        at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:178)
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:844)
        at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:74)
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:594)
        at org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
        at java.lang.Thread.run(Thread.java:748)
...

When I check the cluster, 4 of the 6 servers have disappeared; when I check their logs I see this:

...
[info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure Detection Scheduler1> tid=0x1b] Failure detection is now watching 10.200.4.103(nx103b-srv:21892)<v4>:41001

[info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure Detection Scheduler1> tid=0x1b] Failure detection is now watching 10.200.2.102(nx102a-srv:23781)<v4>:41001

[info 2019/01/23 01:28:35.326 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] Membership received a request to remove 10.200.2.102(nx102a-srv:23781)<v4>:41001 from 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000 reason=Member isn't responding to heartbeat requests

[severe 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] Membership service failure: Member isn't responding to heartbeat requests
org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
        at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
        at org.jgroups.JChannel.up(JChannel.java:741)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
        at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
        at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
        at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
        at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
        at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
        at org.jgroups.protocols.TP.receive(TP.java:1714)
        at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
        at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
        at java.lang.Thread.run(Thread.java:748)

[info 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] CacheServer configuration saved

[info 2019/01/23 01:28:35.341 UTC nx102a-srv <DisconnectThread> tid=0x1cd] Stopping membership services

[info 2019/01/23 01:28:35.341 UTC nx102a-srv <DisconnectThread> tid=0x1cd] GMSHealthMonitor server socket is closed in stopServices().

[info 2019/01/23 01:28:35.342 UTC nx102a-srv <Geode Failure Detection thread 126> tid=0x1ca] Failure detection is now watching 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000

[info 2019/01/23 01:28:35.343 UTC nx102a-srv <Geode Failure Detection Server thread 1> tid=0x1e] GMSHealthMonitor server thread exiting

[info 2019/01/23 01:28:35.344 UTC nx102a-srv <DisconnectThread> tid=0x1cd] GMSHealthMonitor serverSocketExecutor is terminated

[info 2019/01/23 01:28:35.347 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Disconnecting old DistributedSystem to prepare for a reconnect attempt

[info 2019/01/23 01:28:35.351 UTC nx102a-srv <ReconnectThread> tid=0x1cd] GemFireCache[id = 82825098; isClosing = true; isShutDownAll = false; created = Tue Jan 22 18:22:30 UTC 2019; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

[info 2019/01/23 01:28:35.352 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Cache server on port 40404 is shutting down.

[severe 2019/01/23 01:28:45.005 UTC nx102a-srv <EvictorThread8> tid=0x92] Uncaught exception in thread Thread[EvictorThread8,10,main]
org.apache.geode.distributed.DistributedSystemDisconnectedException: Distribution manager on 10.200.2.102(nx102a-srv:23781)<v4>:41001 started at Tue Jan 22 18:22:30 UTC 2019: Member isn't responding to heartbeat requests, caused by org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:3926)
        at org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:966)
        at org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1547)
        at org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
        at org.apache.geode.internal.cache.GemFireCacheImpl.getInternalResourceManager(GemFireCacheImpl.java:4330)
        at org.apache.geode.internal.cache.GemFireCacheImpl.getResourceManager(GemFireCacheImpl.java:4319)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.getAllRegionList(HeapEvictor.java:138)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.getAllSortedRegionList(HeapEvictor.java:171)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.createAndSubmitWeightedRegionEvictionTasks(HeapEvictor.java:215)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.access$200(HeapEvictor.java:53)
        at org.apache.geode.internal.cache.eviction.HeapEvictor$1.run(HeapEvictor.java:357)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
        at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
        at org.jgroups.JChannel.up(JChannel.java:741)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
        at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
        at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
        at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
        at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
        at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
        at org.jgroups.protocols.TP.receive(TP.java:1714)
        at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
        at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
        ... 1 more

[info 2019/01/23 01:30:05.874 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Shutting down DistributionManager 10.200.2.102(nx102a-srv:23781)<v4>:41001. At least one Exception occurred.
...

Any idea ?

Best regards,
Philippe Cerou



From: Philippe CEROU
Sent: Wednesday, January 23, 2019 08:19
To: user@geode.apache.org<ma...@geode.apache.org>
Subject: RE: Multi-threaded Java client exception.

Hi,

Thanks to Anthony for his help; I modified my code to have only one cache & region, shared between threads.

To get something running well I had to use synchronized blocks to handle cache connection and close, and use a dedicated region creation/sharing function, as follows (hope it can help someone):

...
public class CxDrvGEODE extends CxObjNOSQL {
       static ClientCache oCache = null;
       static final String oSync = "x";
       static boolean IsConnected = false;
       static int nbThreads = 0;
       static final HashMap<String, Region<Long, CxDrvGEODERow>> hmRegions = new HashMap<String, Region<Long, CxDrvGEODERow>>();

...

       public void DoConnect() {
             synchronized (oSync) {
                    if (!IsConnected) {
                           // first caller creates the shared ClientCache (PDX auto-serialization for the row class)
                           ReflectionBasedAutoSerializer oRBAS = new ReflectionBasedAutoSerializer("com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
                           oCache = new ClientCacheFactory().addPoolLocator(this.GetNode(), Integer.valueOf(this.Port)).set("log-level", "WARN").setPdxSerializer(oRBAS).create();
                           IsConnected = true;
                    }
                    nbThreads++;
             }
       }

       public void DoClose() {
             synchronized (oSync) {
                    if (nbThreads > 0) {
                           if (nbThreads == 1 && oCache != null) {
                                  // close every cached region, then clear the map in one go
                                  // (removing while iterating keySet() would throw ConcurrentModificationException)
                                  for (Region<Long, CxDrvGEODERow> oRegion : hmRegions.values()) {
                                        oRegion.close();
                                  }
                                  hmRegions.clear();
                                  oCache.close();
                                  oCache = null;
                                  IsConnected = false;
                           }
                           nbThreads--;
                    }
             }
       }

...

       private Region<Long, CxDrvGEODERow> getCache(String CTable) {
             synchronized (oSync) {
                    Region<Long, CxDrvGEODERow> oRegion = null;
                    if (hmRegions.containsKey(CTable)) {
                           oRegion = hmRegions.get(CTable);
                    } else {
                           oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(ClientRegionShortcut.PROXY).create(this.Base + '-'+ CTable);
                           hmRegions.put(CTable, oRegion);
                    }
                    return oRegion;
             }
       }

...

       public boolean TableInsert(String CTable, String[][] TColumns, Object[][] TOValues,boolean BCommit, boolean BForceBlocMode) {
...
             Region<Long, CxDrvGEODERow> oRegion = getCache(CTable);
...
       }

...

}



Best regards,
Philippe Cerou



From: Anthony Baker [mailto:abaker@pivotal.io]
Sent: Tuesday, January 22, 2019 16:59
To: user@geode.apache.org<ma...@geode.apache.org>
Subject: Re: Multi-threaded Java client exception.

You only need one ClientCache in each JVM.  You can create the cache and region once and then pass it to each worker thread.

Anthony

On Jan 22, 2019, at 1:25 AM, Philippe CEROU <ph...@gfi.fr> wrote:

Hi,

We are trying to tune a single program that uses multi-threading for the storage interface.

The problem we have is that if we launch this code with THREADS=1 everything runs well, but starting with THREADS=2 we always get this exception when we create the connection (on the ClientCacheFactory().create() line in DoConnect below).

Exception in thread "main" java.lang.IllegalStateException: Existing cache's default pool was not compatible
        at org.apache.geode.internal.cache.GemFireCacheImpl.validatePoolFactory(GemFireCacheImpl.java:2933)
        at org.apache.geode.cache.client.ClientCacheFactory.basicCreate(ClientCacheFactory.java:252)
        at org.apache.geode.cache.client.ClientCacheFactory.create(ClientCacheFactory.java:213)
        at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.DoConnect(CxDrvGEODE.java:35)
        at com.gfi.rt.lib.database.connectors.CxObj.DoConnect(CxObj.java:121)
        at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:91)
        at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:149)
       ...

Here is the data interface code.

package com.gfi.rt.lib.database.connectors.nosql;

import java.util.HashMap;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;
import org.apache.geode.pdx.ReflectionBasedAutoSerializer;

public class CxDrvGEODEThread extends CxObjNOSQL {
       ClientCache oCache = null;
       boolean IsConnected = false;

       // Connection

       public CxDrvGEODEThread() {
             Connector = "geode-native";
       }

       public void DoConnect() {
             if (!IsConnected) {
                     ReflectionBasedAutoSerializer rbas = new ReflectionBasedAutoSerializer("com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
                    oCache = new ClientCacheFactory().addPoolLocator(this.GetNode(), Integer.valueOf(this.Port)).set("log-level", "WARN").setPdxSerializer(rbas).create();
                    IsConnected = true;
             }
       }

       // Close the current connection to the database

       public void DoClose() {
             if (oCache != null) {
                    oCache.close();
                    oCache = null;
                    IsConnected = false;
             }
       }

       // Data insertion

       public boolean TableInsert(String CTable, String[][] TColumns, Object[][] TOValues,boolean BCommit, boolean BForceBlocMode) {
             boolean BResult = false;
             final HashMap<Long, CxDrvGEODERow> mrows = new HashMap<Long, CxDrvGEODERow>();

             ...

             if (!mrows.isEmpty()) {
                           Region<Long, CxDrvGEODERow> oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(ClientRegionShortcut.PROXY).create(this.Base + '-' +CTable);
                    if (oRegion != null) {
                           oRegion.putAll(mrows);
                           oRegion.close();
                    }
             }
             mrows.clear();
             BResult = true;
             return BResult;
       }
}

Note that the cluster, disk stores, regions and indexes are pre-created on the Geode side.

Every thread is isolated, creates its own CxDrvGEODEThread class instance and does DoConnect --> N x TableInsert --> DoClose.

Here is a master thread class call example:

        private DataThread launchOneDataThread(long LNbProcess, long LNbLines, int LBatchSize, long LProcessID, String BenchId) {
                final CxObj POCX = CXI.Connect(CXO.Connector, CXO.Server, CXO.Port, CXO.Base, CXO.User, CXO.Password, BTrace);
                final DataThread BT = new DataThread(LNbProcess, LNbLines, LBatchSize, LProcessID, new DataObject(POCX, BenchId,CBenchParams, CTable));
                new Thread(new Runnable() {
                        @Override
                        public void run() {
                                BT.start();
                        }
                }).start();
                return BT;
        }

I'm sure we are doing something really bad, any idea?

Best regards,
Philippe Cerou



RE: Is there a way to boost all cluster startup when persistence is ON ?

Posted by Philippe CEROU <ph...@gfi.fr>.
Hi,

For the test we use only one GP2 SSD per server with no IOPS provisioned at all.

We do not have any EBS metrics in CloudWatch; we will try to configure that tomorrow to see more.
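
Once that is in place, I suppose we can pull the per-volume figures with the AWS CLI, for example the queue length of one data volume (the volume id is a placeholder):

aws cloudwatch get-metric-statistics --namespace AWS/EBS \
  --metric-name VolumeQueueLength \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2019-01-31T00:00:00Z --end-time 2019-01-31T12:00:00Z \
  --period 300 --statistics Average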

Best regards,
Philippe Cerou



From: Michael Stolz [mailto:mstolz@pivotal.io]
Sent: Thursday, January 31, 2019 18:43
To: user@geode.apache.org
Subject: Re: Is there a way to boost all cluster startup when persistence is ON ?


Quick question...are you starting the servers in parallel or one at a time?

If you did increase the number of servers each would have its own portion of the diskstore, so you should get some scaling effect, but because of contention for the disk itself, maybe not linear.
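For what it's worth, a parallel start can be as simple as the sketch below, run from a control host, assuming passwordless ssh and that each node already has a launch script wrapping its usual "gfsh start server ..." line (the /opt/app/geode/bin/start-server.sh path is only a placeholder):

for H in nx101c nx102a nx103b nx104c nx105a; do
  # launch each node's server in the background instead of waiting for it
  ssh "$H" '/opt/app/geode/bin/start-server.sh' &
done
wait   # return once every remote start has completed

With persistent regions the members wait for each other's disk stores during recovery, so starting them one at a time tends to serialize those waits, which is why the distinction matters.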

--
Mike Stolz
Principal Engineer - Gemfire Product Manager
Mobile: 631-835-4771



From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Thursday, January 24, 2019 13:23
To: user@geode.apache.org
Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

Answering my own question:

Finally, after shutting down and restarting the cluster, it turned out it was « just » re-importing data from the disks, and it took 1h34m to bring the region back (for 200 000 000 persisted rows across 10 nodes).

I still have some questions 😊

The cluster's reported memory usage is really different from that of the single region it serves, see the screenshots below.

Cluster memory usage (205Gb):

[Pulse screenshot]

But the region consumes 311Gb:

[Pulse screenshot]

What I do not understand:

  *   If I consume 205Gb, of which 35Gb is « region disk caching », what are the remaining 205Gb - 35Gb = 170Gb used for?
  *   If I really use only 205Gb out of 305Gb, and my region is about 311Gb, why is so little memory used to cache the region?

If I launch a simple « query  --query='select count(*) from /ksdata-benchmark where C001="AD6909"' » from GFSH, I still have no response after 10 minutes, even though the requested column is indexed!

gfsh>list indexes
Member Name |                Member ID                 |    Region Path    |         Name          | Type  | Indexed Expression |    From Clause    | Valid Index
----------- | ---------------------------------------- | ----------------- | --------------------- | ----- | ------------------ | ----------------- | -----------
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true

Cordialement,



From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Thursday, January 24, 2019 11:21
To: user@geode.apache.org
Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

I understand; here it is a question of heap memory consumption, but I think we are doing something wrong, because:


  1.  I have retried with 10 AWS C5.4XLARGE nodes (8 CPU + 64 Gb RAM + 50Gb SSD disks); these nodes have 4x more memory than in the previous test, which gives us a Geode cluster with 305Gb of memory (Pulse / Total heap).



  2.  My INPUT data is a CSV with 200 000 000 rows of about 150 bytes each, divided into 19 LONG & STRING columns; with « redundant-copies=2 » this gives me about 3 x 200 000 000 x 150 = 83.81Gb. I also have 4 indexes on 4 LONG columns, so my operational data cost ratio (with the 80% heap threshold « eviction-heap-percentage=80 ») is about (305Gb x 0.8) / 83.81Gb = 2.91, not very good ☹ (the arithmetic is spelled out just below).
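For reference, the arithmetic behind those two figures (using the raw 150 bytes per row, so ignoring PDX, key and index overhead):

3 copies x 200 000 000 rows x 150 bytes = 90 000 000 000 bytes ≈ 83.81 Gb
usable heap ≈ 305 Gb x 0.80 = 244 Gb
ratio ≈ 244 / 83.81 ≈ 2.91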

So, I think I have a problem with the OVERFLOW settings, even though I did the same as the documentation shows.

Locator launch command:

gfsh -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='${LLOCATORS}'"

PDX configuration :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "configure pdx --disk-store=DEFAULT --read-serialized=true"

Servers launch command:

WMEM=28000
gfsh \
-e "start server --name=${WSNAME} --initial-heap=${WMEM}M --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"

Region declaration :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
-e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"

Another « new » problem: if I shut down the whole cluster and then restart it (locators then servers), I do not see my region anymore, but heap memory is still consumed (I attached a Pulse screenshot).

[Pulse screenshot]

If I look at the server logs, I see the following after one hour (multiple times, on multiple servers, with different details):

10.200.2.108: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [19, 67, 101, 103, 110, 112] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.2.108:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
10.200.2.108:   Name: nx108a-srv
10.200.2.108:   Location: /10.200.2.108:/opt/app/geode/node/nx108a-srv/ksdata
10.200.2.108: Offline members with potentially new data:[
10.200.2.108:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.2.108:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.2.108:   Buckets: [19, 110]
10.200.2.108: ,
10.200.2.108:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
10.200.2.108:   Location: /10.200.6.107:/opt/app/geode/node/nx107c-srv/ksdata
10.200.2.108:   Buckets: [19, 67, 101, 112]
10.200.2.108: ,
10.200.2.108:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
10.200.2.108:   Location: /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata
10.200.2.108:   Buckets: [103]
10.200.2.108: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
10.200.6.104: ...........................................................................................................................................
10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 31, 44, 65, 99] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
10.200.6.104:   Name: nx104c-srv
10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
10.200.6.104: Offline members with potentially new data:[
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [7]
10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
10.200.6.104: ..............................
10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 8, 31, 44, 52, 65, 95, 99] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
10.200.6.104:   Name: nx104c-srv
10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
10.200.6.104: Offline members with potentially new data:[
10.200.6.104:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
10.200.6.104:   Location: /10.200.2.102:/opt/app/geode/node/nx102a-srv/ksdata
10.200.6.104:   Buckets: [8, 52, 95]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [7]
10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.

Everyone seems to be waiting for something from everyone else, even though all 10 of my nodes are UP.

If I use the « gfsh show missing-disk-stores » command, there are between 5 and 20 missing ones.

gfsh>show missing-disk-stores
Missing Disk Stores


           Disk Store ID             |     Host      | Directory
------------------------------------ | ------------- | -------------------------------------
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 | /opt/app/geode/node/nx105a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 | /opt/app/geode/node/nx107c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 | /opt/app/geode/node/nx101c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 | /opt/app/geode/node/nx110c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata

Very strange...

Does it mean that some data is lost, even with three copies of each entry (redundant-copies=2)?

I saw in the documentation that « missing disk stores » can be revoked, but it is not clear whether the data is ultimately lost or not ☹
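For reference, the revoke documented for gfsh takes one of the IDs from the table above, for example:

gfsh>revoke missing-disk-store --id=26112834-88bc-4653-94a9-10db18d5ebb4

As far as I can tell from the documentation, revoking tells the waiting members to stop waiting and carry on with the data they already have, which would mean that any newer copies that existed only on the revoked store are lost, but I would like confirmation.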

Cordialement,



From: Anthony Baker [mailto:abaker@pivotal.io]
Sent: Wednesday, January 23, 2019 18:28
To: user@geode.apache.org
Subject: Re: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

When a cluster member becomes unresponsive, Geode may fence off the member in order to preserve consistency and availability.  The question to investigate is *why* the member got into this state.

Questions to investigate:

- How much heap memory is your data consuming?
- How much data is overflowed disk vs in heap memory?
- How much data is being read from disk vs memory?
- Is GC activity consuming significant cpu resources?
- Are there other processes running on the system causing swapping behavior?

Anthony


On Jan 23, 2019, at 8:45 AM, Philippe CEROU <ph...@gfi.fr>> wrote:

Hi,

I think I have found the problem.

When we use a region with OVERFLOW to disk, once the configured memory percentage is reached the Geode server becomes very, very slow ☹

In the end, even with a lot of nodes, the overall cluster cannot ingest data as fast as the clients send it and everything collapses: we went from 140 000 rows per second (in memory) to less than 10 000 rows per second (once OVERFLOW started)...

Is the product able to overflow to disk without such a large drop in throughput?

For information, here are my launch commands (9 nodes) :

3 x :

gfsh \
-e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='nx101c[1234],nc102a[1234],nx103b[1234]'"

For PDX :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "configure pdx --disk-store=DEFAULT --read-serialized=true"

9 x :

gfsh \
-e "start server --name=${WSNAME} --initial-heap=6000M --max-heap=6000M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"

For disks & regions :

gfsh \
                -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
                -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"

gfsh \
                -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
                -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
                -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"

Cordialement,



From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Wednesday, January 23, 2019 08:40
To: user@geode.apache.org
Subject: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

Following my Geode study, I now have this situation.

I have 6 nodes: 3 of them run a locator and all 6 run a server.

When I try to do a massive insertion (200 million PDX-ized rows), after some hours I get this error on the client side, for all threads:

...
Exception in thread "Thread-20" org.apache.geode.cache.client.ServerOperationException: remote server on nxmaster(28523:loner):35042:efcecc76: Region /ksdata-benchmark putAll at server applied partial keys due to exception.
        at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9542)
        at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9446)
        at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9458)
        at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.TableInsert(CxDrvGEODE.java:144)
        at com.gfi.rt.lib.database.connectors.CxTable.insert(CxTable.java:175)
        at com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(BenchmarkObject.java:279)
        at com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(Ben
...

Re: Is there a way to boost all cluster startup when persistence is ON ?

Posted by Anthony Baker <ab...@pivotal.io>.
What kind of AWS volumes are you using, and what throughput and IOPS are you seeing from CloudWatch?

Anthony


> On Jan 31, 2019, at 9:43 AM, Michael Stolz <ms...@pivotal.io> wrote:
> 
> Quick question...are you starting the servers in parallel or one at a time?
> 
> If you did increase the number of servers each would have its own portion of the diskstore, so you should get some scaling effect, but because of contention for the disk itself, maybe not linear.
> 
> --
> Mike Stolz
> Principal Engineer - Gemfire Product Manager
> Mobile: 631-835-4771
> 
> 
> On Jan 30, 2019 3:20 AM, "Philippe CEROU" <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
> Hi,
> 
>  
> 
> New test, new problem 😊
> 
>  
> 
> I have a 30 nodes Apache Geode cluster (30 x 4 CPU / 16Gb RAM) with 200 000 000 partitionend & replicated (C=2) «PDX-ized » rows.
> 
>  
> 
> When I stop all my cluster (Servers then Locators) and I do a full restart (Locators then Servers) I have in fact two possible problems :
> 
>  
> 
> 1. The warm-up time is very long (About 15 minutes), each disk store is standing another node one and the time everything finalize rhe service is down (exactly the cluster is there but the region is not there).
> 
>  
> 
> Is there a way to make the region available and to directly start accepting at least INSERTS ?
> 
>  
> 
> The best would be to directly serve the region for INSERTs and fallback potentially READS to disk 😊
> 
>  
> 
> 2. Sometimes when I startup my 30 servers some of them crash with this line after near 1 to 2 minutes :
> 
>  
> 
> [info 2019/01/30 08:03:02.603 UTC nx130b-srv <main> tid=0x1] Initialization of region PdxTypes completed
> 
>  
> 
> [info 2019/01/30 08:04:17.606 UTC nx130b-srv <Geode Failure Detection Scheduler1> tid=0x1c] Failure detection is now watching 10.200.6.107(nx107c-srv:19000)<v5>:41000
> 
>  
> 
> [info 2019/01/30 08:04:17.613 UTC nx130b-srv <Geode Failure Detection thread 3> tid=0xe2] Failure detection is now watching 10.200.4.112(nx112b-srv:18872)<v5>:41000
> 
>  
> 
> [error 2019/01/30 08:05:02.942 UTC nx130b-srv <main> tid=0x1] Cache initialization for GemFireCache[id = 2125470482; isClosing = false; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.
> 
>  
> 
> [info 2019/01/30 08:05:02.961 UTC nx130b-srv <main> tid=0x1] GemFireCache[id = 2125470482; isClosing = true; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
> 
>  
> 
> [info 2019/01/30 08:05:03.325 UTC nx130b-srv <main> tid=0x1] Shutting down DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000.
> 
>  
> 
> [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Now closing distribution for 10.200.4.130(nx130b-srv:3065)<v5>:41000
> 
>  
> 
> [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Stopping membership services
> 
>  
> 
> [info 2019/01/30 08:05:03.435 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor server socket is closed in stopServices().
> 
>  
> 
> [info 2019/01/30 08:05:03.435 UTC nx130b-srv <Geode Failure Detection Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting
> 
>  
> 
> [info 2019/01/30 08:05:03.436 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor serverSocketExecutor is terminated
> 
>  
> 
> [info 2019/01/30 08:05:03.475 UTC nx130b-srv <main> tid=0x1] DistributionManager stopped in 150ms.
> 
>  
> 
> [info 2019/01/30 08:05:03.476 UTC nx130b-srv <main> tid=0x1] Marking DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000 as closed.
> 
>  
> 
> Any idea ?
> 
>  
> 
> Cordialement,
> 
> —
> NOTE : n/a
> —
> Gfi Informatique
> Philippe Cerou
> Architecte & Expert SystĂšme
> GFI Production / Toulouse
> philippe.cerou @gfi.fr <http://gfi.fr/>
> —
> 
> 1 Rond-point du Général Eisenhower, 31400 Toulouse <https://maps.google.com/?q=1+Rond-point+du+G%C3%A9n%C3%A9ral+Eisenhower,+31400+Toulouse&entry=gmail&source=g>
> TĂ©l. : +33 (0)5.62.85.11.55
> Mob. : +33 (0)6.03.56.48.62
> www.gfi.world <http://www.gfi.world/> 
> 
> — 
> 
> <image001.png> <https://www.facebook.com/gfiinformatique> <image002.png> <https://twitter.com/gfiinformatique> <image003.png> <https://www.instagram.com/gfiinformatique/> <image004.png> <https://www.linkedin.com/company/gfi-informatique> <image005.png> <https://www.youtube.com/user/GFIinformatique>
> —
> <image006.jpg> <http://www.gfi.world/>
>  
> 
>  
> 
> De : Philippe CEROU [mailto:philippe.cerou@gfi.fr <ma...@gfi.fr>] 
> Envoyé : jeudi 24 janvier 2019 13:23
> À : user@geode.apache.org <ma...@geode.apache.org>
> Objet : RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> 
>  
> 
> Hi,
> 
>  
> 
> Own-response,
> 
>  
> 
> Finally, after shutting down and restarting the cluster, this one was « just » reimporting data from disks and it took 1h34m to make region back (for persisted 200 000 000 rows among 10 nodes).
> 
>  
> 
> I still have some questions 😊
> 
>  
> 
> My cluster reported memory usage is really different than the unique region it serve, see screenshots.
> 
>  
> 
> Cluster memory usage (205Gb):
> 
>  
> 
> <image008.jpg>
> 
>  
> 
> But region consume 311Gb :
> 
>  
> 
> <image009.jpg>
> 
>  
> 
> What I do not understand :
> 
> If I consume 205Gb with 35Gb for « region disk caching » what are the 205Gb - 35Gb = 170Gb used for ?
> If I really have 205Gb used by 305Gb, I understand my region is about 311Gb, why so little memory is used to cache region ?
>  
> 
> If I launch a simple « query  --query='select count(*) from /ksdata-benchmark where C001="AD6909"' » from GFSH I still have no response after 10 minutes while the requested column is indexd !
> 
>  
> 
> gfsh>list indexes
> 
> Member Name |                Member ID                 |    Region Path    |         Name          | Type  | Indexed Expression |    From Clause    | Valid Index
> 
> ----------- | ---------------------------------------- | ----------------- | --------------------- | ----- | ------------------ | ----------------- | -----------
> 
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> 
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> 
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> 
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> 
>  
> 
> Cordialement,
> 
> —
> NOTE : n/a
> —
> Gfi Informatique
> Philippe Cerou
> Architecte & Expert SystĂšme
> GFI Production / Toulouse
> philippe.cerou @gfi.fr <http://gfi.fr/>
> —
> 
> 1 Rond-point du Général Eisenhower, 31400 Toulouse <https://maps.google.com/?q=1+Rond-point+du+G%C3%A9n%C3%A9ral+Eisenhower,+31400+Toulouse&entry=gmail&source=g>
> TĂ©l. : +33 (0)5.62.85.11.55
> Mob. : +33 (0)6.03.56.48.62
> www.gfi.world <http://www.gfi.world/> 
> 
> — 
> 
> <image001.png> <https://www.facebook.com/gfiinformatique> <image002.png> <https://twitter.com/gfiinformatique> <image003.png> <https://www.instagram.com/gfiinformatique/> <image004.png> <https://www.linkedin.com/company/gfi-informatique> <image005.png> <https://www.youtube.com/user/GFIinformatique>
> —
> <image006.jpg> <http://www.gfi.world/>
>  
> 
>  
> 
> De : Philippe CEROU [mailto:philippe.cerou@gfi.fr <ma...@gfi.fr>] 
> Envoyé : jeudi 24 janvier 2019 11:21
> À : user@geode.apache.org <ma...@geode.apache.org>
> Objet : RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> 
>  
> 
> Hi,
> 
>  
> 
> I understand, here it is a question of heap memory consumption, but I think we do something wrong because :
> 
>  
> 
> I have retried with a 10 AWS C5.4XLARGE nodes (8 CPU + 64 Gb RAM + 50Gb SSD disks), this nodes have 4 x more memory than previous test which give us a GEODE server with 305Gb MEMORY (Pulse/Total heap).
>  
> 
> My INPUT data is a CSV with 200 000 000 rows about 150 bytes each divided in 19 LONG & STRING columns, this give me (With region « redundant-copies=2 ») about 3 x 200 000 000 x 150 = 83.81Gb. I have 4 indexes too on 4 LONG columns, my operational product data cost ratio (With 80% of HEAP threshold « eviction-heap-percentage=80 ») is so about (305Gb x 0.8) / 83.81Gb = 2.91, not very good â˜č
>  
> 
> So, I think I have a problem with OVERFLOW settings even if I did the same as documentation show.
> 
>  
> 
> Locators launch command :
> 
>  
> 
> gfsh -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='${LLOCATORS}'
> 
>  
> 
> PDX configuration :
> 
>  
> 
> gfsh \
> 
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> 
> -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
> 
>  
> 
> Servers launch command:
> 
>  
> 
> WMEM=28000
> 
> gfsh \
> 
> -e "start server --name=${WSNAME} --initial-heap=${WMEM}M --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
> 
>  
> 
> Region declaration :
> 
>  
> 
> gfsh \
> 
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> 
> -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"
> 
>  
> 
> gfsh \
> 
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> 
> -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
> 
> -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
> 
> -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
> 
> -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
> 
> -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"
> 
>  
> 
> Other « new » problem, if I shutdown the cluster (everything) then I restart it (locators then servers) I do not see my region anymore, but HEAP memory is consumed (I attached a PULSE screenshot).
> 
>  
> 
> <image010.jpg>
> 
>  
> 
> If I look at servers logs I see that, after one hour (multiple times on multiple servers with different details):
> 
>  
> 
> 10.200.2.108 <http://10.200.2.108/>: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [19, 67, 101, 103, 110, 112] are waiting for another offline member to recover the latest data.My persistent id is:
> 
> 10.200.2.108 <http://10.200.2.108/>:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
> 
> 10.200.2.108 <http://10.200.2.108/>:   Name: nx108a-srv
> 
> 10.200.2.108 <http://10.200.2.108/>:   Location: /10.200.2.108:/opt/app/geode/node/nx108a-srv/ksdata
> 
> 10.200.2.108 <http://10.200.2.108/>: Offline members with potentially new data:[
> 
> 10.200.2.108 <http://10.200.2.108/>:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 
> 10.200.2.108 <http://10.200.2.108/>:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 
> 10.200.2.108 <http://10.200.2.108/>:   Buckets: [19, 110]
> 
> 10.200.2.108 <http://10.200.2.108/>: ,
> 
> 10.200.2.108 <http://10.200.2.108/>:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
> 
> 10.200.2.108 <http://10.200.2.108/>:   Location: /10.200.6.107:/opt/app/geode/node/nx107c-srv/ksdata
> 
> 10.200.2.108 <http://10.200.2.108/>:   Buckets: [19, 67, 101, 112]
> 
> 10.200.2.108 <http://10.200.2.108/>: ,
> 
> 10.200.2.108 <http://10.200.2.108/>:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
> 
> 10.200.2.108 <http://10.200.2.108/>:   Location: /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata
> 
> 10.200.2.108 <http://10.200.2.108/>:   Buckets: [103]
> 
> 10.200.2.108 <http://10.200.2.108/>: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
> 
> 10.200.6.104 <http://10.200.6.104/>: ...........................................................................................................................................
> 
> 10.200.6.104 <http://10.200.6.104/>: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 31, 44, 65, 99] are waiting for another offline member to recover the latest data.My persistent id is:
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
> 
> 10.200.6.104 <http://10.200.6.104/>:   Name: nx104c-srv
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>: Offline members with potentially new data:[
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>:   Buckets: [3, 31, 44, 65, 99]
> 
> 10.200.6.104 <http://10.200.6.104/>: ,
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>:   Buckets: [7]
> 
> 10.200.6.104 <http://10.200.6.104/>: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
> 
> 10.200.6.104 <http://10.200.6.104/>: ..............................
> 
> 10.200.6.104 <http://10.200.6.104/>: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 8, 31, 44, 52, 65, 95, 99] are waiting for another offline member to recover the latest data.My persistent id is:
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
> 
> 10.200.6.104 <http://10.200.6.104/>:   Name: nx104c-srv
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>: Offline members with potentially new data:[
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.2.102:/opt/app/geode/node/nx102a-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>:   Buckets: [8, 52, 95]
> 
> 10.200.6.104 <http://10.200.6.104/>: ,
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>:   Buckets: [3, 31, 44, 65, 99]
> 
> 10.200.6.104 <http://10.200.6.104/>: ,
> 
> 10.200.6.104 <http://10.200.6.104/>:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 
> 10.200.6.104 <http://10.200.6.104/>:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 
> 10.200.6.104 <http://10.200.6.104/>:   Buckets: [7]
> 
> 10.200.6.104 <http://10.200.6.104/>: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
> 
>  
> 
> Eveyone seems waiting something from everyone, even if all my 10 nodes are UP.
> 
>  
> 
> If I use the « gfsh show missing-disk-stores » command there are from 5 to 20 missing ones.
> 
>  
> 
> gfsh>show missing-disk-stores
> 
> Missing Disk Stores
> 
>  
> 
>  
> 
>            Disk Store ID             |     Host      | Directory
> 
> ------------------------------------ | ------------- | -------------------------------------
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 <http://10.200.4.109/> | /opt/app/geode/node/nx109b-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
> e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 <http://10.200.2.105/> | /opt/app/geode/node/nx105a-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
> 338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 <http://10.200.6.107/> | /opt/app/geode/node/nx107c-srv/ksdata
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> 184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 <http://10.200.6.101/> | /opt/app/geode/node/nx101c-srv/ksdata
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
> 0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 <http://10.200.6.104/> | /opt/app/geode/node/nx104c-srv/ksdata
> 
> 68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 <http://10.200.6.110/> | /opt/app/geode/node/nx110c-srv/ksdata
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 <http://10.200.4.109/> | /opt/app/geode/node/nx109b-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
> 0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 <http://10.200.6.104/> | /opt/app/geode/node/nx104c-srv/ksdata
> 
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 <http://10.200.2.102/> | /opt/app/geode/node/nx102a-srv/ksdata
> 
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 <http://10.200.4.103/> | /opt/app/geode/node/nx103b-srv/ksdata
> 
>  
> 
> Very strange...
> 
>  
> 
> Does-it meens that some DATA is lost even with a redundancy of #3 ?
> 
>  
> 
> I saw in documentation that « missing disk stores » can be revoked but it is not clear on the fact that data is finally lost or not â˜č
> 
>  
> 
> Cordialement,
> 
>  
> 
>  
> 
> De : Anthony Baker [mailto:abaker@pivotal.io <ma...@pivotal.io>] 
> Envoyé : mercredi 23 janvier 2019 18:28
> À : user@geode.apache.org <ma...@geode.apache.org>
> Objet : Re: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> 
>  
> 
> When a cluster member member becomes unresponsive, Geode may fence off the member in order to preserve consistency and availability.  The question to investigate is *why* the member got into this state.
> 
>  
> 
> Questions to investigate:
> 
>  
> 
> - How much heap memory is your data consuming?
> 
> - How much data is overflowed disk vs in heap memory?
> 
> - How much data is being read from disk vs memory?
> 
> - Is GC activity consuming significant cpu resources?
> 
> - Are there other processes running on the system causing swapping behavior?
> 
>  
> 
> Anthony
> 
>  
> 
>  
> 
> On Jan 23, 2019, at 8:45 AM, Philippe CEROU <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
> 
>  
> 
> Hi,
> 
>  
> 
> I think I have found the problem,
> 
>  
> 
> When we use REGION with OVERFLOW to disk once the percentage of memory configure dis reached the geode server become very very slow â˜č
> 
>  
> 
> At the end even if we have a lot of nodes the overall cluster do not reach to write/acquire as far as client send data and everything fall, we past from 140 000 rows per second (memory) to less than 10 000 rows per second (once OVERFLOW is started)...
> 
>  
> 
> Is the product able to overflow on disk without a so high bandwidth reduce ?
> 
>  
> 
> For information, here are my launch commands (9 nodes) :
> 
>  
> 
> 3 x :
> 
>  
> 
> gfsh \
> 
> -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='nx101c[1234],nc102a[1234],nx103b[1234]'"
> 
>  
> 
> For PDX :
> 
>  
> 
> gfsh \
> 
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> 
> -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
> 
>  
> 
> 9 x :
> 
>  
> 
> gfsh \
> 
> -e "start server --name=${WSNAME} --initial-heap=6000M --max-heap=6000M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
> 
>  
> 
> For disks & regions :
> 
>  
> 
> gfsh \
> 
>                 -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> 
>                 -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"
> 
>  
> 
> gfsh \
> 
>                 -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> 
>                 -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
> 
>                 -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
> 
>                 -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
> 
>                 -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
> 
>                 -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"
> 
>  
> 
> Cordialement,
> 
>  
> 
>  
> 
> De : Philippe CEROU [mailto:philippe.cerou@gfi.fr <ma...@gfi.fr>] 
> Envoyé : mercredi 23 janvier 2019 08:40
> À : user@geode.apache.org <ma...@geode.apache.org>
> Objet : Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
> 
>  
> 
> Hi,
> 
>  
> 
> Following my GEODE study i now have this situation.
> 
>  
> 
> I have 6 nodes, 3 have a locator and 6 have a server.
> 
>  
> 
> When I try to do massive insertion (200 million PDX-ized rows) I have after some hours this error on client side for all threads :
> 
>  
> 
> ...
> 
> Exception in thread "Thread-20" org.apache.geode.cache.client.ServerOperationException: remote server on nxmaster(28523:loner):35042:efcecc76: Region /ksdata-benchmark putAll at server applied partial keys due to exception.
> 
>         at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9542)
> 
>         at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9446)
> 
>         at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9458)
> 
>         at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.TableInsert(CxDrvGEODE.java:144)
> 
>         at com.gfi.rt.lib.database.connectors.CxTable.insert(CxTable.java:175)
> 
>         at com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(BenchmarkObject.java:279)
> 
>         at com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(Ben
> 
> ...


Re: Is there a way to boost all cluster startup when persistence is ON ?

Posted by Michael Stolz <ms...@pivotal.io>.
Quick question... are you starting the servers in parallel or one at a time?

If you increased the number of servers, each would own its own portion of
the disk store, so you should get some scaling effect, though because of
contention for the disk itself it may not be linear.
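
For what it's worth, a parallel start usually matters a lot with persistent regions, because the members of a disk store wait on each other to work out who holds the latest copy of each bucket; a member started on its own can sit waiting until its peers come up. A minimal sketch of a parallel launch, assuming a per-host launch script (the host list and the script path are placeholders):

#!/bin/sh
# Start every server at once instead of waiting for each
# "gfsh start server" to return before launching the next one.
HOSTS="nx101c nx102a nx103b nx104c nx105a nx106b nx107c nx108a nx109b nx110c"
for h in $HOSTS; do
  ssh "$h" '/opt/app/geode/bin/start-server.sh' &   # background each start
done
wait   # return only once every member has finished (or failed) starting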

--
Mike Stolz
Principal Engineer - Gemfire Product Manager
Mobile: 631-835-4771

On Jan 30, 2019 3:20 AM, "Philippe CEROU" <ph...@gfi.fr> wrote:

> Hi,
>
>
>
> New test, new problem 😊
>
>
>
> I have a 30 nodes Apache Geode cluster (30 x 4 CPU / 16Gb RAM) with
> 200 000 000 partitionend & replicated (C=2) «PDX-ized » rows.
>
>
>
> When I stop all my cluster (Servers then Locators) and I do a full restart
> (Locators then Servers) I have in fact two possible problems :
>
>
>
> 1. The warm-up time is very long (About 15 minutes), each disk store is
> standing another node one and the time everything finalize rhe service is
> down (exactly the cluster is there but the region is not there).
>
>
>
> Is there a way to make the region available and to directly start
> accepting at least INSERTS ?
>
>
>
> The best would be to directly serve the region for INSERTs and fallback
> potentially READS to disk 😊
>
>
>
> 2. Sometimes when I startup my 30 servers some of them crash with this
> line after near 1 to 2 minutes :
>
>
>
> [info 2019/01/30 08:03:02.603 UTC nx130b-srv <main> tid=0x1]
> Initialization of region PdxTypes completed
>
>
>
> [info 2019/01/30 08:04:17.606 UTC nx130b-srv <Geode Failure Detection
> Scheduler1> tid=0x1c] Failure detection is now watching
> 10.200.6.107(nx107c-srv:19000)<v5>:41000
>
>
>
> [info 2019/01/30 08:04:17.613 UTC nx130b-srv <Geode Failure Detection
> thread 3> tid=0xe2] Failure detection is now watching
> 10.200.4.112(nx112b-srv:18872)<v5>:41000
>
>
>
> [error 2019/01/30 08:05:02.942 UTC nx130b-srv <main> tid=0x1] Cache
> initialization for GemFireCache[id = 2125470482; isClosing = false;
> isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server =
> false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed
> because: org.apache.geode.GemFireIOException: While starting cache server
> CacheServer on port=40404 client subscription config policy=none client
> subscription config capacity=1 client subscription config overflow
> directory=.
>
>
>
> [info 2019/01/30 08:05:02.961 UTC nx130b-srv <main> tid=0x1]
> GemFireCache[id = 2125470482; isClosing = true; isShutDownAll = false;
> created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false;
> lockLease = 120; lockTimeout = 60]: Now closing.
>
>
>
> [info 2019/01/30 08:05:03.325 UTC nx130b-srv <main> tid=0x1] Shutting down
> DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000.
>
>
>
> [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Now closing
> distribution for 10.200.4.130(nx130b-srv:3065)<v5>:41000
>
>
>
> [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Stopping
> membership services
>
>
>
> [info 2019/01/30 08:05:03.435 UTC nx130b-srv <main> tid=0x1]
> GMSHealthMonitor server socket is closed in stopServices().
>
>
>
> [info 2019/01/30 08:05:03.435 UTC nx130b-srv <Geode Failure Detection
> Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting
>
>
>
> [info 2019/01/30 08:05:03.436 UTC nx130b-srv <main> tid=0x1]
> GMSHealthMonitor serverSocketExecutor is terminated
>
>
>
> [info 2019/01/30 08:05:03.475 UTC nx130b-srv <main> tid=0x1]
> DistributionManager stopped in 150ms.
>
>
>
> [info 2019/01/30 08:05:03.476 UTC nx130b-srv <main> tid=0x1] Marking
> DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000 as closed.
>
>
>
> Any idea ?
>
>
>
> Cordialement,
>
>
>
>
>
>
>
> *De :* Philippe CEROU [mailto:philippe.cerou@gfi.fr]
> *Envoyé :* jeudi 24 janvier 2019 13:23
> *À :* user@geode.apache.org
> *Objet :* RE: Over-activity cause nodes to crash/disconnect with error :
> org.apache.geode.ForcedDisconnectException: Member isn't responding to
> heartbeat requests
>
>
>
> Hi,
>
>
>
> Own-response,
>
>
>
> Finally, after shutting down and restarting the cluster, this one was
> « just » reimporting data from disks and it took 1h34m to make region back
> (for persisted 200 000 000 rows among 10 nodes).
>
>
>
> I still have some questions 😊
>
>
>
> My cluster reported memory usage is really different than the unique
> region it serve, see screenshots.
>
>
>
> Cluster memory usage (205Gb):
>
>
>
>
>
> But region consume 311Gb :
>
>
>
>
>
> What I do not understand :
>
>    - If I consume 205Gb with 35Gb for « region disk caching » what are
>    the 205Gb - 35Gb = 170Gb used for ?
>    - If I really have 205Gb used by 305Gb, I understand my region is
>    about 311Gb, why so little memory is used to cache region ?
>
>
>
> If I launch a simple « query  --query='select count(*) from
> /ksdata-benchmark where C001="AD6909"' » from GFSH I still have no
> response after 10 minutes while the requested column is indexd !
>
>
>
> gfsh>list indexes
>
> Member Name |                Member ID                 |    Region Path
> |         Name          | Type  | Indexed Expression |    From Clause    |
> Valid Index
>
> ----------- | ---------------------------------------- |
> ----------------- | --------------------- | ----- | ------------------ |
> ----------------- | -----------
>
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               |
> /ksdata-benchmark | true
>
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               |
> /ksdata-benchmark | true
>
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               |
> /ksdata-benchmark | true
>
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  |
> /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               |
> /ksdata-benchmark | true
>
>
>
> Cordialement,
>
>
>
>
>
>
>
> *De :* Philippe CEROU [mailto:philippe.cerou@gfi.fr
> <ph...@gfi.fr>]
> *Envoyé :* jeudi 24 janvier 2019 11:21
> *À :* user@geode.apache.org
> *Objet :* RE: Over-activity cause nodes to crash/disconnect with error :
> org.apache.geode.ForcedDisconnectException: Member isn't responding to
> heartbeat requests
>
>
>
> Hi,
>
>
>
> I understand, here it is a question of heap memory consumption, but I
> think we do something wrong because :
>
>
>
>    1. I have retried with a 10 AWS C5.4XLARGE nodes (8 CPU + 64 Gb RAM +
>    50Gb SSD disks), this nodes have 4 x more memory than previous test which
>    give us a GEODE server with 305Gb MEMORY (Pulse/Total heap).
>
>
>
>    1. My INPUT data is a CSV with 200 000 000 rows about 150 bytes each
>    divided in 19 LONG & STRING columns, this give me (With region
>    « redundant-copies=2 ») about 3 x 200 000 000 x 150 = 83.81Gb. I have 4
>    indexes too on 4 LONG columns, my operational product data cost ratio (With
>    80% of HEAP threshold « eviction-heap-percentage=80 ») is so about
>    (305Gb x 0.8) / 83.81Gb = 2.91, not very good â˜č
>
>
>
> So, I think I have a problem with OVERFLOW settings even if I did the same
> as documentation show.
>
>
>
> Locators launch command :
>
>
>
> gfsh -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true
> --port=1234 --mcast-port=0 --locators='${LLOCATORS}'
>
>
>
> PDX configuration :
>
>
>
> gfsh \
>
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
> -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
>
>
>
> Servers launch command:
>
>
>
> WMEM=28000
>
> gfsh \
>
> -e "start server --name=${WSNAME} --initial-heap=${WMEM}M
> --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup
> --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],
> nx102a[1234],nx103b[1234]'"
>
>
>
> Region declaration :
>
>
>
> gfsh \
>
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
> -e "create disk-store --name ksdata --allow-force-compaction=true
> --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"
>
>
>
> gfsh \
>
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
> -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup
> --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata
> --redundant-copies=2 --eviction-action=overflow-to-disk" \
>
> -e "create index --name=ksdata-benchmark-C001 --expression=C001
> --region=ksdata-benchmark --group=ksgroup" \
>
> -e "create index --name=ksdata-benchmark-C002 --expression=C002
> --region=ksdata-benchmark --group=ksgroup" \
>
> -e "create index --name=ksdata-benchmark-C003 --expression=C003
> --region=ksdata-benchmark --group=ksgroup" \
>
> -e "create index --name=ksdata-benchmark-C011 --expression=C011
> --region=ksdata-benchmark --group=ksgroup"
>
>
>
> Other « new » problem, if I shutdown the cluster (everything) then I
> restart it (locators then servers) I do not see my region anymore, but HEAP
> memory is consumed (I attached a PULSE screenshot).
>
>
>
>
>
> If I look at servers logs I see that, after one hour (multiple times on
> multiple servers with different details):
>
>
>
> 10.200.2.108: Region /ksdata-benchmark (and any colocated sub-regions)
> has potentially stale data.  Buckets [19, 67, 101, 103, 110, 112] are
> waiting for another offline member to recover the latest data.My persistent
> id is:
>
> 10.200.2.108:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
>
> 10.200.2.108:   Name: nx108a-srv
>
> 10.200.2.108:   Location: /10.200.2.108:/opt/app/geode/
> node/nx108a-srv/ksdata
>
> 10.200.2.108: Offline members with potentially new data:[
>
> 10.200.2.108:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
> 10.200.2.108:   Location: /10.200.4.103:/opt/app/geode/
> node/nx103b-srv/ksdata
>
> 10.200.2.108:   Buckets: [19, 110]
>
> 10.200.2.108: ,
>
> 10.200.2.108:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
>
> 10.200.2.108:   Location: /10.200.6.107:/opt/app/geode/
> node/nx107c-srv/ksdata
>
> 10.200.2.108:   Buckets: [19, 67, 101, 112]
>
> 10.200.2.108: ,
>
> 10.200.2.108:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
>
> 10.200.2.108:   Location: /10.200.2.105:/opt/app/geode/
> node/nx105a-srv/ksdata
>
> 10.200.2.108:   Buckets: [103]
>
> 10.200.2.108: ]Use the gfsh show missing-disk-stores command to see all
> disk stores that are being waited on by other members.
>
> 10.200.6.104: ............................................................
> ............................................................
> ...................
>
> 10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions)
> has potentially stale data.  Buckets [3, 7, 31, 44, 65, 99] are waiting for
> another offline member to recover the latest data.My persistent id is:
>
> 10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
>
> 10.200.6.104:   Name: nx104c-srv
>
> 10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/
> node/nx104c-srv/ksdata
>
> 10.200.6.104: Offline members with potentially new data:[
>
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/
> node/nx103b-srv/ksdata
>
> 10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
>
> 10.200.6.104: ,
>
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/
> node/nx103b-srv/ksdata
>
> 10.200.6.104:   Buckets: [7]
>
> 10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all
> disk stores that are being waited on by other members.
>
> 10.200.6.104: ..............................
>
> 10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions)
> has potentially stale data.  Buckets [3, 7, 8, 31, 44, 52, 65, 95, 99] are
> waiting for another offline member to recover the latest data.My persistent
> id is:
>
> 10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
>
> 10.200.6.104:   Name: nx104c-srv
>
> 10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/
> node/nx104c-srv/ksdata
>
> 10.200.6.104: Offline members with potentially new data:[
>
> 10.200.6.104:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
>
> 10.200.6.104:   Location: /10.200.2.102:/opt/app/geode/
> node/nx102a-srv/ksdata
>
> 10.200.6.104:   Buckets: [8, 52, 95]
>
> 10.200.6.104: ,
>
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/
> node/nx103b-srv/ksdata
>
> 10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
>
> 10.200.6.104: ,
>
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/
> node/nx103b-srv/ksdata
>
> 10.200.6.104:   Buckets: [7]
>
> 10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all
> disk stores that are being waited on by other members.
>
>
>
> Eveyone seems waiting something from everyone, even if all my 10 nodes are
> UP.
>
>
>
> If I use the « gfsh show missing-disk-stores » command there are from 5
> to 20 missing ones.
>
>
>
> gfsh>show missing-disk-stores
>
> Missing Disk Stores
>
>
>
>
>
>            Disk Store ID             |     Host      | Directory
>
> ------------------------------------ | ------------- |
> -------------------------------------
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 |
> /opt/app/geode/node/nx109b-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
> e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 |
> /opt/app/geode/node/nx105a-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
> 338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 |
> /opt/app/geode/node/nx107c-srv/ksdata
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> 184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 |
> /opt/app/geode/node/nx101c-srv/ksdata
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
> 0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 |
> /opt/app/geode/node/nx104c-srv/ksdata
>
> 68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 |
> /opt/app/geode/node/nx110c-srv/ksdata
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 |
> /opt/app/geode/node/nx109b-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
> 0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 |
> /opt/app/geode/node/nx104c-srv/ksdata
>
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 |
> /opt/app/geode/node/nx102a-srv/ksdata
>
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 |
> /opt/app/geode/node/nx103b-srv/ksdata
>
>
>
> Very strange...
>
>
>
> Does-it meens that some DATA is lost even with a redundancy of #3 ?
>
>
>
> I saw in documentation that « missing disk stores » can be revoked but it
> is not clear on the fact that data is finally lost or not â˜č
>
>
>
> Cordialement,
>
>
>
>
>
>
>
> *De :* Anthony Baker [mailto:abaker@pivotal.io <ab...@pivotal.io>]
> *Envoyé :* mercredi 23 janvier 2019 18:28
> *À :* user@geode.apache.org
> *Objet :* Re: Over-activity cause nodes to crash/disconnect with error :
> org.apache.geode.ForcedDisconnectException: Member isn't responding to
> heartbeat requests
>
>
>
> When a cluster member member becomes unresponsive, Geode may fence off the
> member in order to preserve consistency and availability.  The question to
> investigate is *why* the member got into this state.
>
>
>
> Questions to investigate:
>
>
>
> - How much heap memory is your data consuming?
>
> - How much data is overflowed disk vs in heap memory?
>
> - How much data is being read from disk vs memory?
>
> - Is GC activity consuming significant cpu resources?
>
> - Are there other processes running on the system causing swapping
> behavior?
>
>
>
> Anthony
>
>
>
>
>
> On Jan 23, 2019, at 8:45 AM, Philippe CEROU <ph...@gfi.fr> wrote:
>
>
>
> Hi,
>
>
>
> I think I have found the problem,
>
>
>
> When we use REGION with OVERFLOW to disk once the percentage of memory
> configure dis reached the geode server become very very slow â˜č
>
>
>
> At the end even if we have a lot of nodes the overall cluster do not reach
> to write/acquire as far as client send data and everything fall, we past
> from 140 000 rows per second (memory) to less than 10 000 rows per second
> (once OVERFLOW is started)...
>
>
>
> Is the product able to overflow on disk without a so high bandwidth
> reduce ?
>
>
>
> For information, here are my launch commands (9 nodes) :
>
>
>
> 3 x :
>
>
>
> gfsh \
>
> -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true
> --port=1234 --mcast-port=0 --locators='nx101c[1234],
> nc102a[1234],nx103b[1234]'"
>
>
>
> For PDX :
>
>
>
> gfsh \
>
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
> -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
>
>
>
> 9 x :
>
>
>
> gfsh \
>
> -e "start server --name=${WSNAME} --initial-heap=6000M --max-heap=6000M
> --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true
> --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
>
>
>
> For disks & regions :
>
>
>
> gfsh \
>
>                 -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'"
> \
>
>                 -e "create disk-store --name ksdata
> --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata
> --group=ksgroup"
>
>
>
> gfsh \
>
>                 -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'"
> \
>
>                 -e "create region --if-not-exists --name=ksdata-benchmark
> --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
> --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk"
> \
>
>                 -e "create index --name=ksdata-benchmark-C001
> --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
>
>                 -e "create index --name=ksdata-benchmark-C002
> --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
>
>                 -e "create index --name=ksdata-benchmark-C003
> --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
>
>                 -e "create index --name=ksdata-benchmark-C011
> --expression=C011 --region=ksdata-benchmark --group=ksgroup"
>
>
>
> Cordialement,
>
>
>
>
>
>
>
> *De :* Philippe CEROU [mailto:philippe.cerou@gfi.fr
> <ph...@gfi.fr>]
> *Envoyé :* mercredi 23 janvier 2019 08:40
> *À :* user@geode.apache.org
> *Objet :* Over-activity cause nodes to crash/disconnect with error :
> org.apache.geode.ForcedDisconnectException: Member isn't responding to
> heartbeat requests
>
>
>
> Hi,
>
>
>
> Following my GEODE study i now have this situation.
>
>
>
> I have 6 nodes, 3 have a locator and 6 have a server.
>
>
>
> When I try to do massive insertion (200 million PDX-ized rows) I have
> after some hours this error on client side for all threads :
>
>
>
> ...
>
> Exception in thread "Thread-20" org.apache.geode.cache.client.ServerOperationException:
> remote server on nxmaster(28523:loner):35042:efcecc76: Region
> /ksdata-benchmark putAll at server applied partial keys due to exception.
>
>         at org.apache.geode.internal.cache.LocalRegion.basicPutAll(
> LocalRegion.java:9542)
>
>         at org.apache.geode.internal.cache.LocalRegion.putAll(
> LocalRegion.java:9446)
>
>         at org.apache.geode.internal.cache.LocalRegion.putAll(
> LocalRegion.java:9458)
>
>         at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.
> TableInsert(CxDrvGEODE.java:144)
>
>         at com.gfi.rt.lib.database.connectors.CxTable.insert(
> CxTable.java:175)
>
>         at com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(
> BenchmarkObject.java:279)
>
>         at com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(Ben
>
> ...

Re: Is there a way to boost all cluster startup when persistence is ON ?

Posted by Udo Kohlmeyer <ud...@apache.org>.
Hi there Philippe,

Can you confirm whether you have 3 (AZs) x 30 servers, or 3 (AZs) x 10 
servers?

Also, if your servers sit in different AZs, I think it would be safer to 
consider them "split by WAN" and use WAN 
GatewaySenders/GatewayReceivers 
<https://geode.apache.org/docs/guide/18/topologies_and_comm/multi_site_configuration/chapter_overview.html> 
to keep them in sync.
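
For context, a minimal sketch of what such a WAN link could look like in gfsh; the distributed-system IDs, locator host names and ports below are placeholders, and each site runs its own locators and servers:

# Site A locator, identifying itself as distributed system 1 and
# pointing at site B's locator (placeholder host names and ports):
gfsh -e "start locator --name=locA1 --port=1234 --mcast-port=0 \
  --J=-Dgemfire.distributed-system-id=1 \
  --J=-Dgemfire.remote-locators=nxB01[1234]"

# On site A: a sender towards site B, a receiver for traffic from B,
# and a region that forwards its events through the sender.
gfsh \
  -e "connect --locator='nxA01[1234]'" \
  -e "create gateway-sender --id=to-site-b --remote-distributed-system-id=2 --parallel=true" \
  -e "create gateway-receiver" \
  -e "create region --name=ksdata-benchmark --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --gateway-sender-id=to-site-b"

Site B would mirror this with distributed-system-id=2 and a sender pointing back at site A.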

Starting a cluster in "insert-only" mode until it has fully started up 
is not a feature that is currently supported. If there is a real need 
for it, I believe the Geode community would be willing to look into it 
(we also accept PRs ;) )

As for startup performance, how large are your keys and values 
(bytes/kbytes)?

Also, as per your previous deployment description, are your servers 
split across AZs? THAT would cause a real issue on startup. If you 
would like, you could share your cluster config (deployment model) and 
Geode config (regions, indexes, disk stores, etc.).

That way we could provide recommendations better targeted to your use 
case, rather than answers that may not be applicable in your situation.
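
As an aside, since the launch commands earlier in the thread enable cluster configuration, the shared configuration can be exported as a zip and attached; the file name here is just an example:

gfsh \
  -e "connect --locator='nx101c[1234],nx102a[1234],nx103b[1234]'" \
  -e "export cluster-configuration --zip-file-name=ksgroup-config.zip"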

--Udo

On 1/31/19 08:55, Philippe CEROU wrote:
>
> Hi,
>
> Thanks for the link. I tried tuning the disk store as follows, but the 
> startup time is still close to 18 minutes before the region is UP & 
> RUNNING (startup at 15:22 -> « Region /ksdata-benchmark has successfully 
> completed waiting for other members to recover » at 15:40).
>
> Here is what I configured for the disk store:
>
> create disk-store --name ksdata --dir=/opt/app/geode/data/ksdata 
> --group=ksgroup --allow-force-compaction=true --auto-compact=true 
> --compaction-threshold=30 --max-oplog-size=64 --queue-size=1000 
> --time-interval=5000 --write-buffer-size=1048576
>
> The goal of the test was to secure 200 000 000 PDX-ized rows, kept in 3 
> copies across 3 AWS AZs (data centers), on 30 « m4.xlarge » servers (3 
> x 10 x [4 CPU, 16Gb RAM]), and the result is far from our requirement 
> (the whole solution up [cache + products + databases] in less than 5 
> minutes).
>
> That is why I asked whether it is possible to start serving the 
> region before index recovery has finished :
>
>   * INSERTS : you said yourself that everything goes to an append-only
>     log, so adding new rows while recovering should not be a problem;
>     the new rows can simply be processed at the end.
>   * READS : it does not matter if the indexes are not yet available;
>     we can accept a full disk scan and inefficient read responses.
>   * DELETES : not in our case; it is a cache with a TTL, so if old rows
>     end up being deleted 2 or 3 hours after their expiration
>     timestamp, it is not a problem.
>
> If that is not possible, can you at least tell me whether platform 
> efficiency, and recovery in particular, could be improved by running 
> multiple servers per machine, for example migrating from 30 to 150 
> Geode servers on the same infrastructure:
>
>   * Before : 30 x [1 x machine -> 1 x 15Gb geode node]
>   * After : 30 x [1 x machine -> 5 x 3Gb geode node]
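
For reference, a minimal sketch of what several members per machine could look like with gfsh; the member names, ports, directories and heap sizes are placeholders, and each member needs its own --name, --server-port and --dir so the processes do not collide:

# Five smaller servers on one host instead of a single large one.
for i in 1 2 3 4 5; do
  gfsh -e "start server --name=nx101c-srv-${i} \
    --server-port=4040${i} --dir=/opt/app/geode/node/nx101c-srv-${i} \
    --initial-heap=3000M --max-heap=3000M --eviction-heap-percentage=80 \
    --group=ksgroup --use-cluster-configuration=true --mcast-port=0 \
    --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
done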
>
> Cordialement,
>
>
> *De :* Anthony Baker [mailto:abaker@pivotal.io]
> *Envoyé :* jeudi 31 janvier 2019 16:51
> *À :* user@geode.apache.org
> *Objet :* Re: Is there a way to boost all cluster startup when 
> persistence is ON ?
>
> 1) Regarding recovery and startup time:
>
> When using persistent regions, Geode records updates using append-only 
> logs.  This optimizes write performance.  Unlike a relational 
> database, we don’t need to acquire multiple page/buffer locks in order 
> to update an index structure (such as B* tree).  The tradeoff is that 
> we need to scan all the data at startup time to ensure we know where 
> the most recent copy of the data is on disk.
>
> We do a number of optimizations to speed recovery, including lazily 
> faulting in values—we only have to recover the keys in order for a 
> region to be “online”.  However, if the region defines indexes, we do 
> have to recover all the values in order to rebuild the index in memory.
>
> Here are some details on compaction of the log files:
>
> https://geode.apache.org/docs/guide/11/managing/disk_storage/compacting_disk_stores.html
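
For completeness, the compaction described on that page can also be driven from gfsh; a minimal sketch against the disk store used in this thread (the offline variant must be run while the member is stopped, and the directory is the one passed to create disk-store earlier):

# Online compaction of the ksdata disk store across the cluster:
gfsh \
  -e "connect --locator='nx101c[1234],nx102a[1234],nx103b[1234]'" \
  -e "compact disk-store --name=ksdata"

# Offline compaction of one member's copy while that member is down:
gfsh -e "compact offline-disk-store --name=ksdata \
  --disk-dirs=/opt/app/geode/data/ksdata --max-oplog-size=64"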
>
> HTH,
>
> Anthony
>
>
>
>     On Jan 30, 2019, at 12:19 AM, Philippe CEROU
>     <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
>
>     Hi,
>
>     New test, new problem 😊
>
>     I have a 30 nodes Apache Geode cluster (30 x 4 CPU / 16Gb RAM)
>     with200 000 000 <tel:200%C2%A0000%C2%A0000>partitionend &
>     replicated (C=2) «PDX-ized » rows.
>
>     When I stop all my cluster (Servers then Locators) and I do a full
>     restart (Locators then Servers) I have in fact two possible problems :
>
>     1. The warm-up time is very long (About 15 minutes), each disk
>     store is standing another node one and the time everything
>     finalize rhe service is down (exactly the cluster is there but the
>     region is not there).
>
>     Is there a way to make the region available and to directly start
>     accepting at least INSERTS ?
>
>     The best would be to directly serve the region for INSERTs and
>     fallback potentially READS to disk 😊
>
>     2. Sometimes when I startup my 30 servers some of them crash with
>     this line after near 1 to 2 minutes :
>
>     [info 2019/01/30 08:03:02.603 UTC nx130b-srv <main> tid=0x1]
>     Initialization of region PdxTypes completed
>
>     [info 2019/01/30 08:04:17.606 UTC nx130b-srv <Geode Failure
>     Detection Scheduler1> tid=0x1c] Failure detection is now watching
>     10.200.6.107(nx107c-srv:19000)<v5>:41000
>
>     [info 2019/01/30 08:04:17.613 UTC nx130b-srv <Geode Failure
>     Detection thread 3> tid=0xe2] Failure detection is now watching
>     10.200.4.112(nx112b-srv:18872)<v5>:41000
>
>     [error 2019/01/30 08:05:02.942 UTC nx130b-srv <main> tid=0x1]
>     Cache initialization for GemFireCache[id =2125470482
>     <tel:2125470482>; isClosing = false; isShutDownAll = false;
>     created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead
>     = false; lockLease = 120; lockTimeout = 60] failed because:
>     org.apache.geode.GemFireIOException: While starting cache server
>     CacheServer on port=40404 client subscription config policy=none
>     client subscription config capacity=1 client subscription config
>     overflow directory=.
>
>     [info 2019/01/30 08:05:02.961 UTC nx130b-srv <main> tid=0x1]
>     GemFireCache[id =2125470482 <tel:2125470482>; isClosing = true;
>     isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019;
>     server = false; copyOnRead = false; lockLease = 120; lockTimeout =
>     60]: Now closing.
>
>     [info 2019/01/30 08:05:03.325 UTC nx130b-srv <main> tid=0x1]
>     Shutting down DistributionManager
>     10.200.4.130(nx130b-srv:3065)<v5>:41000.
>
>     [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Now
>     closing distribution for 10.200.4.130(nx130b-srv:3065)<v5>:41000
>
>     [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1]
>     Stopping membership services
>
>     [info 2019/01/30 08:05:03.435 UTC nx130b-srv <main> tid=0x1]
>     GMSHealthMonitor server socket is closed in stopServices().
>
>     [info 2019/01/30 08:05:03.435 UTC nx130b-srv <Geode Failure
>     Detection Server thread 1> tid=0x1f] GMSHealthMonitor server
>     thread exiting
>
>     [info 2019/01/30 08:05:03.436 UTC nx130b-srv <main> tid=0x1]
>     GMSHealthMonitor serverSocketExecutor is terminated
>
>     [info 2019/01/30 08:05:03.475 UTC nx130b-srv <main> tid=0x1]
>     DistributionManager stopped in 150ms.
>
>     [info 2019/01/30 08:05:03.476 UTC nx130b-srv <main> tid=0x1]
>     Marking DistributionManager
>     10.200.4.130(nx130b-srv:3065)<v5>:41000 as closed.
>
>     Any idea ?
>
>     Cordialement,
>
>
>     *De :*Philippe CEROU [mailto:philippe.cerou@gfi.fr]
>     *Envoyé :*jeudi 24 janvier 2019 13:23
>     *À :*user@geode.apache.org <ma...@geode.apache.org>
>     *Objet :*RE: Over-activity cause nodes to crash/disconnect with
>     error : org.apache.geode.ForcedDisconnectException: Member isn't
>     responding to heartbeat requests
>
>     Hi,
>
>     Own-response,
>
>     Finally, after shutting down and restarting the cluster, this one
>     was « just » reimporting data from disks and it took 1h34m to make
>     region back (for persisted200 000 000
>     <tel:200%C2%A0000%C2%A0000>rows among 10 nodes).
>
>     I still have some questions 😊
>
>     My cluster reported memory usage is really different than the
>     unique region it serve, see screenshots.
>
>     Cluster memory usage (205Gb):
>
>     <image008.jpg>
>
>     But region consume 311Gb :
>
>     <image009.jpg>
>
>     What I do not understand :
>
>       * If I consume 205Gb with 35Gb for « region disk caching » what
>         are the 205Gb - 35Gb = 170Gb used for ?
>       * If I really have 205Gb used by 305Gb, I understand my region
>         is about 311Gb, why so little memory is used to cache region ?
>
>     If I launch a simple « query  --query='select count(*) from
>     /ksdata-benchmark where C001="AD6909"' » from GFSH I still have no
>     response after 10 minutes while the requested column is indexd !
>
>     gfsh>list indexes
>
>     Member Name |                Member ID                 |    Region
>     Path    | Name          | Type  | Indexed Expression | From
>     Clause    | Valid Index
>
>     ----------- | ---------------------------------------- |
>     ----------------- | --------------------- | ----- |
>     ------------------ | ----------------- | -----------
>
>     nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C001 | RANGE |
>     C001               | /ksdata-benchmark | true
>
>     nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C002 | RANGE |
>     C002               | /ksdata-benchmark | true
>
>     nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C003 | RANGE |
>     C003               | /ksdata-benchmark | true
>
>     nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C011 | RANGE |
>     C011               | /ksdata-benchmark | true
>
>     nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
>     /ksdata-benchmark | ksdata-benchmark-C001 | RANGE |
>     C001               | /ksdata-benchmark | true
>
>     nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
>     /ksdata-benchmark | ksdata-benchmark-C002 | RANGE |
>     C002               | /ksdata-benchmark | true
>
>     nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
>     /ksdata-benchmark | ksdata-benchmark-C003 | RANGE |
>     C003               | /ksdata-benchmark | true
>
>     nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  |
>     /ksdata-benchmark | ksdata-benchmark-C011 | RANGE |
>     C011               | /ksdata-benchmark | true
>
>     nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C001 | RANGE |
>     C001               | /ksdata-benchmark | true
>
>     nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C002 | RANGE |
>     C002               | /ksdata-benchmark | true
>
>     nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C003 | RANGE |
>     C003               | /ksdata-benchmark | true
>
>     nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 |
>     /ksdata-benchmark | ksdata-benchmark-C011 | RANGE |
>     C011               | /ksdata-benchmark | true
>
>     nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
>     /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001      
>             | /ksdata-benchmark | true
>
>     nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
>     /ksdata-benchmark | ksdata-benchmark-C002 | RANGE |
>     C002               | /ksdata-benchmark | true
>
>     nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
>     /ksdata-benchmark | ksdata-benchmark-C003 | RANGE |
>     C003               | /ksdata-benchmark | true
>
>     nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 |
>     /ksdata-benchmark | ksdata-benchmark-C011 | RANGE |
>     C011               | /ksdata-benchmark | true
>
>     nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 |
>     /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001 | /ksdata-benchmark | true
>     nx105a-srv | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002 | /ksdata-benchmark | true
>     nx105a-srv | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003 | /ksdata-benchmark | true
>     nx105a-srv | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011 | /ksdata-benchmark | true
>     nx106b-srv | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001 | /ksdata-benchmark | true
>     nx106b-srv | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002 | /ksdata-benchmark | true
>     nx106b-srv | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003 | /ksdata-benchmark | true
>     nx106b-srv | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011 | /ksdata-benchmark | true
>     nx107c-srv | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001 | /ksdata-benchmark | true
>     nx107c-srv | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002 | /ksdata-benchmark | true
>     nx107c-srv | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003 | /ksdata-benchmark | true
>     nx107c-srv | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011 | /ksdata-benchmark | true
>     nx108a-srv | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001 | /ksdata-benchmark | true
>     nx108a-srv | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002 | /ksdata-benchmark | true
>     nx108a-srv | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003 | /ksdata-benchmark | true
>     nx108a-srv | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011 | /ksdata-benchmark | true
>     nx109b-srv | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001 | /ksdata-benchmark | true
>     nx109b-srv | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002 | /ksdata-benchmark | true
>     nx109b-srv | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003 | /ksdata-benchmark | true
>     nx109b-srv | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011 | /ksdata-benchmark | true
>     nx110c-srv | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001 | /ksdata-benchmark | true
>     nx110c-srv | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002 | /ksdata-benchmark | true
>     nx110c-srv | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003 | /ksdata-benchmark | true
>     nx110c-srv | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011 | /ksdata-benchmark | true
>
>     Cordialement,
>
>     —
>     NOTE : n/a
>     —
>     Gfi Informatique
>     Philippe Cerou
>     Architecte & Expert Système
>     GFI Production / Toulouse
>     philippe.cerou @gfi.fr
>     —
>     1 Rond-point du Général Eisenhower, 31400 Toulouse
>     Tél. : +33 (0)5.62.85.11.55
>     Mob. : +33 (0)6.03.56.48.62
>     www.gfi.world <http://www.gfi.world/>
>     —
>
>     From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
>     Sent: Thursday, January 24, 2019 11:21
>     To: user@geode.apache.org <ma...@geode.apache.org>
>     Subject: RE: Over-activity cause nodes to crash/disconnect with
>     error : org.apache.geode.ForcedDisconnectException: Member isn't
>     responding to heartbeat requests
>
>     Hi,
>
>     I understand; here it is a question of heap memory consumption,
>     but I think we are doing something wrong, because:
>
>     1. I have retried with 10 AWS C5.4XLARGE nodes (8 CPU + 64 GB RAM
>     + 50 GB SSD disks). These nodes have 4 x more memory than in the
>     previous test, which gives us a Geode cluster with 305 GB of heap
>     (Pulse / Total heap).
>
>     2. My input data is a CSV with 200 000 000 rows of about 150
>     bytes each, split into 19 LONG and STRING columns. With the
>     region set to « redundant-copies=2 » this gives about
>     3 x 200 000 000 x 150 bytes = 90 x 10^9 bytes, i.e. about
>     83.81 GB. I also have 4 indexes on 4 LONG columns, so my usable
>     capacity ratio (with the 80% heap threshold
>     « eviction-heap-percentage=80 ») is about
>     (305 GB x 0.8) / 83.81 GB = 2.91, which is not very good ☹
>
>     So I think I have a problem with my OVERFLOW settings, even
>     though I did the same as the documentation shows.
>
>     Locators launch command :
>
>     gfsh -e "start locator --name=${WSNAME} --group=ksgroup
>     --enable-cluster-configuration=true --port=1234 --mcast-port=0
>     --locators='${LLOCATORS}'"
>
>     PDX configuration :
>
>     gfsh \
>
>     -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
>     -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
>
>     Servers launch command:
>
>     WMEM=28000
>
>     gfsh \
>
>     -e "start server --name=${WSNAME} --initial-heap=${WMEM}M
>     --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup
>     --use-cluster-configuration=true --mcast-port=0
>     --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
>
>     Region declaration :
>
>     gfsh \
>
>     -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
>     -e "create disk-store --name ksdata --allow-force-compaction=true
>     --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"
>
>     gfsh \
>
>     -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
>     -e "create region --if-not-exists --name=ksdata-benchmark
>     --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
>     --disk-store=ksdata --redundant-copies=2
>     --eviction-action=overflow-to-disk" \
>
>     -e "create index --name=ksdata-benchmark-C001 --expression=C001
>     --region=ksdata-benchmark --group=ksgroup" \
>
>     -e "create index --name=ksdata-benchmark-C002 --expression=C002
>     --region=ksdata-benchmark --group=ksgroup" \
>
>     -e "create index --name=ksdata-benchmark-C003 --expression=C003
>     --region=ksdata-benchmark --group=ksgroup" \
>
>     -e "create index --name=ksdata-benchmark-C011 --expression=C011
>     --region=ksdata-benchmark --group=ksgroup"
>
>     Another « new » problem: if I shut down the whole cluster and
>     then restart it (locators then servers), I do not see my region
>     anymore, but heap memory is still consumed (I attached a Pulse
>     screenshot).
>
>     <image010.jpg>
>
>     If I look at the server logs I see the following after one hour
>     (repeated multiple times on multiple servers, with different
>     details):
>
>     10.200.2.108: Region /ksdata-benchmark (and any colocated
>     sub-regions) has potentially stale data.  Buckets [19, 67, 101,
>     103, 110, 112] are waiting for another offline member to recover
>     the latest data.My persistent id is:
>
>     10.200.2.108:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
>
>     10.200.2.108:   Name: nx108a-srv
>
>     10.200.2.108:   Location:
>     /10.200.2.108:/opt/app/geode/node/nx108a-srv/ksdata
>
>     10.200.2.108: Offline members with potentially new data:[
>
>     10.200.2.108:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
>     10.200.2.108:   Location:
>     /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
>
>     10.200.2.108:   Buckets: [19, 110]
>
>     10.200.2.108: ,
>
>     10.200.2.108:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
>
>     10.200.2.108:   Location:
>     /10.200.6.107:/opt/app/geode/node/nx107c-srv/ksdata
>
>     10.200.2.108:   Buckets: [19, 67, 101, 112]
>
>     10.200.2.108: ,
>
>     10.200.2.108:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
>
>     10.200.2.108:   Location:
>     /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata
>
>     10.200.2.108:   Buckets: [103]
>
>     10.200.2.108: ]Use the gfsh show missing-disk-stores command to
>     see all disk stores that are being waited on by other members.
>
>     10.200.6.104:
>     ...........................................................................................................................................
>
>     10.200.6.104: Region /ksdata-benchmark (and any colocated
>     sub-regions) has potentially stale data.  Buckets [3, 7, 31, 44,
>     65, 99] are waiting for another offline member to recover the
>     latest data.My persistent id is:
>
>     10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
>
>     10.200.6.104:   Name: nx104c-srv
>
>     10.200.6.104:   Location:
>     /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
>
>     10.200.6.104: Offline members with potentially new data:[
>
>     10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
>     10.200.6.104:   Location:
>     /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
>
>     10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
>
>     10.200.6.104: ,
>
>     10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
>     10.200.6.104:   Location:
>     /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
>
>     10.200.6.104:   Buckets: [7]
>
>     10.200.6.104: ]Use the gfsh show missing-disk-stores command to
>     see all disk stores that are being waited on by other members.
>
>     10.200.6.104: ..............................
>
>     10.200.6.104: Region /ksdata-benchmark (and any colocated
>     sub-regions) has potentially stale data.  Buckets [3, 7, 8, 31,
>     44, 52, 65, 95, 99] are waiting for another offline member to
>     recover the latest data.My persistent id is:
>
>     10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
>
>     10.200.6.104:   Name: nx104c-srv
>
>     10.200.6.104:   Location:
>     /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
>
>     10.200.6.104: Offline members with potentially new data:[
>
>     10.200.6.104:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
>
>     10.200.6.104:   Location:
>     /10.200.2.102:/opt/app/geode/node/nx102a-srv/ksdata
>
>     10.200.6.104:   Buckets: [8, 52, 95]
>
>     10.200.6.104: ,
>
>     10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
>     10.200.6.104:   Location:
>     /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
>
>     10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
>
>     10.200.6.104: ,
>
>     10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
>
>     10.200.6.104:   Location:
>     /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
>
>     10.200.6.104:   Buckets: [7]
>
>     10.200.6.104: ]Use the gfsh show missing-disk-stores command to
>     see all disk stores that are being waited on by other members.
>
>     Everyone seems to be waiting for something from everyone else,
>     even though all my 10 nodes are UP.
>
>     If I use the « gfsh show missing-disk-stores » command, there are
>     from 5 to 20 missing entries.
>
>     gfsh>show missing-disk-stores
>
>     Missing Disk Stores
>
>     Disk Store ID                        | Host          | Directory
>     ------------------------------------ | ------------- | -------------------------------------
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>     e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 | /opt/app/geode/node/nx105a-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>     338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 | /opt/app/geode/node/nx107c-srv/ksdata
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 | /opt/app/geode/node/nx101c-srv/ksdata
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>     0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
>     68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 | /opt/app/geode/node/nx110c-srv/ksdata
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>     0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
>     9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
>     26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>
>     Very strange...
>
>     Does it mean that some data is lost, even with a redundancy of 3?
>
>     I saw in the documentation that « missing disk stores » can be
>     revoked, but it is not clear whether the data is ultimately lost
>     or not ☹
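>
>     (For reference, a minimal sketch of the revoke flow as I
>     understand it from the docs, to be confirmed; the ID below is
>     just one taken from the listing above, and revoking a store means
>     the waiting members stop waiting for it, so any newer data it
>     held is given up:)
>
>     gfsh \
>     -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>     -e "show missing-disk-stores" \
>     -e "revoke missing-disk-store --id=26112834-88bc-4653-94a9-10db18d5ebb4"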
>
>     Cordialement,
>
>
>     From: Anthony Baker [mailto:abaker@pivotal.io]
>     Sent: Wednesday, January 23, 2019 18:28
>     To: user@geode.apache.org <ma...@geode.apache.org>
>     Subject: Re: Over-activity cause nodes to crash/disconnect with
>     error : org.apache.geode.ForcedDisconnectException: Member isn't
>     responding to heartbeat requests
>
>     When a cluster member becomes unresponsive, Geode may fence
>     off the member in order to preserve consistency and availability.
>      The question to investigate is *why* the member got into this state.
>
>     Questions to investigate:
>
>     - How much heap memory is your data consuming?
>
>     - How much data is overflowed disk vs in heap memory?
>
>     - How much data is being read from disk vs memory?
>
>     - Is GC activity consuming significant cpu resources?
>
>     - Are there other processes running on the system causing swapping
>     behavior?
>
>     Anthony
>
>         On Jan 23, 2019, at 8:45 AM, Philippe CEROU
>         <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
>
>         Hi,
>
>         When we use a region with OVERFLOW to disk, once the
>         configured memory percentage is reached the Geode servers
>         become very, very slow ☹
>
>         In the end, even with a lot of nodes, the cluster cannot
>         write/absorb data as fast as the clients send it and
>         everything falls over: we went from 140 000 rows per second
>         (in memory) to less than 10 000 rows per second (once
>         OVERFLOW has started)...
>
>         Is the product able to overflow to disk without such a large
>         throughput reduction?
>
>         For information, here are my launch commands (9 nodes) :
>
>         3 x :
>
>         gfsh \
>
>         -e "start locator --name=${WSNAME} --group=ksgroup
>         --enable-cluster-configuration=true --port=1234 --mcast-port=0
>         --locators='nx101c[1234],nc102a[1234],nx103b[1234]'"
>
>         For PDX :
>
>         gfsh \
>
>         -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
>         -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
>
>         9 x :
>
>         gfsh \
>
>         -e "start server --name=${WSNAME} --initial-heap=6000M
>         --max-heap=6000M --eviction-heap-percentage=80 --group=ksgroup
>         --use-cluster-configuration=true --mcast-port=0
>         --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
>
>         For disks & regions :
>
>         gfsh \
>
>                         -e "connect
>         --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
>                         -e "create disk-store --name ksdata
>         --allow-force-compaction=true --auto-compact=true
>         --dir=/opt/app/geode/data/ksdata --group=ksgroup"
>
>         gfsh \
>
>                         -e "connect
>         --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>
>                         -e "create region --if-not-exists
>         --name=ksdata-benchmark --group=ksgroup
>         --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
>         --disk-store=ksdata --redundant-copies=2
>         --eviction-action=overflow-to-disk" \
>
>                         -e "create index --name=ksdata-benchmark-C001
>         --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
>
>                         -e "create index --name=ksdata-benchmark-C002
>         --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
>
>                         -e "create index --name=ksdata-benchmark-C003
>         --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
>
>                         -e "create index --name=ksdata-benchmark-C011
>         --expression=C011 --region=ksdata-benchmark --group=ksgroup"
>
>         Cordialement,
>
>
>         From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
>         Sent: Wednesday, January 23, 2019 08:40
>         To: user@geode.apache.org <ma...@geode.apache.org>
>         Subject: Over-activity cause nodes to crash/disconnect with
>         error : org.apache.geode.ForcedDisconnectException: Member
>         isn't responding to heartbeat requests
>
>         Hi,
>
>         Following my Geode study I now have this situation.
>
>         I have 6 nodes, 3 have a locator and 6 have a server.
>
>         When I try to do a massive insertion (200 million PDX-ized
>         rows) I get, after some hours, this error on the client side
>         for all threads:
>
>         ...
>
>         Exception in thread "Thread-20"
>         org.apache.geode.cache.client.ServerOperationException: remote
>         server on nxmaster(28523:loner):35042:efcecc76: Region
>         /ksdata-benchmark putAll at server applied partial keys due to
>         exception.
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9542)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9446)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9458)
>
>                 at
>         com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.TableInsert(CxDrvGEODE.java:144)
>
>                 at
>         com.gfi.rt.lib.database.connectors.CxTable.insert(CxTable.java:175)
>
>                 at
>         com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(BenchmarkObject.java:279)
>
>                 at
>         com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(BenchmarkThread.java:84)
>
>                 at
>         com.gfi.rt.bin.database.dbbench.BenchmarkThread.run(BenchmarkThread.java:67)
>
>         Caused by:
>         org.apache.geode.cache.persistence.PartitionOfflineException:
>         Region /ksdata-benchmark bucket 48 has persistent data that is
>         no longer online stored at these locations:
>         [/10.200.6.101:/opt/app/geode/node/nx101c-srv/ksdata created
>         at timestamp 1548181352906 version 0
>         diskStoreId 3650a4bc61f447a3-bc9cba70cf59e514 name null,
>         /10.200.4.106:/opt/app/geode/node/nx106b-srv/ksdata created at
>         timestamp 1548181352865 version 0
>         diskStoreId a7de7988708b44d2-b4b09d91c1f536b6 name null,
>         /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata created at
>         timestamp 1548181353100 version 0
>         diskStoreId 72dcab08ae12413c-b099fbb7f9ab740b name null]
>
>                 at
>         org.apache.geode.internal.cache.ProxyBucketRegion.checkBucketRedundancyBeforeGrab(ProxyBucketRegion.java:590)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegionDataStore.lockRedundancyLock(PartitionedRegionDataStore.java:595)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:440)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2858)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegionDataStore.handleManageBucketRequest(PartitionedRegionDataStore.java:1014)
>
>                 at
>         org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketOnMember(PRHARedundancyProvider.java:1233)
>
>                 at
>         org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketInstance(PRHARedundancyProvider.java:416)
>
>                 at
>         org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketAtomically(PRHARedundancyProvider.java:604)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegion.createBucket(PartitionedRegion.java:3310)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2055)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
>
>                 at
>         org.apache.geode.internal.cache.PartitionedRegion.performPutAllEntry(PartitionedRegion.java:2124)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.basicEntryPutAll(LocalRegion.java:10060)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.access$100(LocalRegion.java:231)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion$2.run(LocalRegion.java:9639)
>
>                 at
>         org.apache.geode.internal.cache.event.NonDistributedEventTracker.syncBulkOp(NonDistributedEventTracker.java:107)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.syncBulkOp(LocalRegion.java:6085)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9657)
>
>                 at
>         org.apache.geode.internal.cache.LocalRegion.basicBridgePutAll(LocalRegion.java:9367)
>
>                 at
>         org.apache.geode.internal.cache.tier.sockets.command.PutAll80.cmdExecute(PutAll80.java:270)
>
>                 at
>         org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:178)
>
>                 at
>         org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:844)
>
>                 at
>         org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:74)
>
>                 at
>         org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1214)
>
>                 at
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>                 at
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>                 at
>         org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:594)
>
>                 at
>         org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
>
>                 at java.lang.Thread.run(Thread.java:748)
>
>         ...
>
>         When I check the cluster, 4 of the 6 servers have
>         disappeared; when I check their logs I see this:
>
>         ...
>
>         [info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure
>         Detection Scheduler1> tid=0x1b] Failure detection is now
>         watching 10.200.4.103(nx103b-srv:21892)<v4>:41001
>
>         [info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure
>         Detection Scheduler1> tid=0x1b] Failure detection is now
>         watching 10.200.2.102(nx102a-srv:23781)<v4>:41001
>
>         [info 2019/01/23 01:28:35.326 UTC nx102a-srv <unicast
>         receiver,nx102a-34886> tid=0x1a] Membership received a request
>         to remove 10.200.2.102(nx102a-srv:23781)<v4>:41001 from
>         10.200.2.102(nx102a:23617:locator)<ec><v0>:41000 reason=Member
>         isn't responding to heartbeat requests
>
>         [severe 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast
>         receiver,nx102a-34886> tid=0x1a] Membership service failure:
>         Member isn't responding to heartbeat requests
>
>         org.apache.geode.ForcedDisconnectException: Member isn't
>         responding to heartbeat requests
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
>
>                 at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>
>                 at org.jgroups.JChannel.up(JChannel.java:741)
>
>                 at
>         org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>
>                 at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>
>                 at
>         org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>
>                 at
>         org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
>
>                 at
>         org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
>
>                 at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>
>                 at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
>
>                 at
>         org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
>
>                 at
>         org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>
>                 at
>         org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
>
>                 at org.jgroups.protocols.TP.receive(TP.java:1714)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
>
>                 at
>         org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>
>                 at java.lang.Thread.run(Thread.java:748)
>
>         [info 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast
>         receiver,nx102a-34886> tid=0x1a] CacheServer configuration saved
>
>         [info 2019/01/23 01:28:35.341 UTC nx102a-srv
>         <DisconnectThread> tid=0x1cd] Stopping membership services
>
>         [info 2019/01/23 01:28:35.341 UTC nx102a-srv
>         <DisconnectThread> tid=0x1cd] GMSHealthMonitor server socket
>         is closed in stopServices().
>
>         [info 2019/01/23 01:28:35.342 UTC nx102a-srv <Geode Failure
>         Detection thread 126> tid=0x1ca] Failure detection is now
>         watching 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000
>
>         [info 2019/01/23 01:28:35.343 UTC nx102a-srv <Geode Failure
>         Detection Server thread 1> tid=0x1e] GMSHealthMonitor server
>         thread exiting
>
>         [info 2019/01/23 01:28:35.344 UTC nx102a-srv
>         <DisconnectThread> tid=0x1cd] GMSHealthMonitor
>         serverSocketExecutor is terminated
>
>         [info 2019/01/23 01:28:35.347 UTC nx102a-srv <ReconnectThread>
>         tid=0x1cd] Disconnecting old DistributedSystem to prepare for
>         a reconnect attempt
>
>         [info 2019/01/23 01:28:35.351 UTC nx102a-srv <ReconnectThread>
>         tid=0x1cd] GemFireCache[id = 82825098; isClosing = true;
>         isShutDownAll = false; created = Tue Jan 22 18:22:30 UTC 2019;
>         server = true; copyOnRead = false; lockLease = 120;
>         lockTimeout = 60]: Now closing.
>
>         [info 2019/01/23 01:28:35.352 UTC nx102a-srv <ReconnectThread>
>         tid=0x1cd] Cache server on port 40404 is shutting down.
>
>         [severe 2019/01/23 01:28:45.005 UTC nx102a-srv
>         <EvictorThread8> tid=0x92] Uncaught exception in thread
>         Thread[EvictorThread8,10,main]
>
>         org.apache.geode.distributed.DistributedSystemDisconnectedException:
>         Distribution manager on
>         10.200.2.102(nx102a-srv:23781)<v4>:41001 started at Tue Jan 22
>         18:22:30 UTC 2019: Member isn't responding to heartbeat
>         requests, caused by
>         org.apache.geode.ForcedDisconnectException: Member isn't
>         responding to heartbeat requests
>
>                 at
>         org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:3926)
>
>                 at
>         org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:966)
>
>                 at
>         org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1547)
>
>                 at
>         org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
>
>                 at
>         org.apache.geode.internal.cache.GemFireCacheImpl.getInternalResourceManager(GemFireCacheImpl.java:4330)
>
>                 at
>         org.apache.geode.internal.cache.GemFireCacheImpl.getResourceManager(GemFireCacheImpl.java:4319)
>
>                 at
>         org.apache.geode.internal.cache.eviction.HeapEvictor.getAllRegionList(HeapEvictor.java:138)
>
>                 at
>         org.apache.geode.internal.cache.eviction.HeapEvictor.getAllSortedRegionList(HeapEvictor.java:171)
>
>                 at
>         org.apache.geode.internal.cache.eviction.HeapEvictor.createAndSubmitWeightedRegionEvictionTasks(HeapEvictor.java:215)
>
>                 at
>         org.apache.geode.internal.cache.eviction.HeapEvictor.access$200(HeapEvictor.java:53)
>
>                 at
>         org.apache.geode.internal.cache.eviction.HeapEvictor$1.run(HeapEvictor.java:357)
>
>                 at
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>                 at
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>                 at java.lang.Thread.run(Thread.java:748)
>
>         Caused by: org.apache.geode.ForcedDisconnectException: Member
>         isn't responding to heartbeat requests
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
>
>                 at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>
>                 at org.jgroups.JChannel.up(JChannel.java:741)
>
>                 at
>         org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>
>                 at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>
>                 at
>         org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>
>                 at
>         org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
>
>                 at
>         org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
>
>                 at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>
>                 at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
>
>                 at
>         org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
>
>                 at
>         org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>
>                 at
>         org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
>
>                 at org.jgroups.protocols.TP.receive(TP.java:1714)
>
>                 at
>         org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
>
>                 at
>         org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>
>                 ... 1 more
>
>         [info 2019/01/23 01:30:05.874 UTC nx102a-srv <ReconnectThread>
>         tid=0x1cd] Shutting down DistributionManager
>         10.200.2.102(nx102a-srv:23781)<v4>:41001. At least one
>         Exception occurred.
>
>         ...
>
>         Any idea ?
>
>         Cordialement,
>
>
>         From: Philippe CEROU
>         Sent: Wednesday, January 23, 2019 08:19
>         To: user@geode.apache.org <ma...@geode.apache.org>
>         Subject: RE: Multi-threaded Java client exception.
>
>         Hi,
>
>         Thanks to Anthony for his help. I modified my code so that
>         there is only one cache and one region, shared between
>         threads.
>
>         To get something running well I had to use synchronized
>         blocks to handle cache connection and close, and a dedicated
>         region creation/sharing function, as follows (hope it can
>         help someone):
>
>         ...
>
>         public class CxDrvGEODE extends CxObjNOSQL {
>
>             static ClientCache oCache = null;
>             static final String oSync = "x";
>             static boolean IsConnected = false;
>             static int nbThreads = 0;
>             static final HashMap<String, Region<Long, CxDrvGEODERow>> hmRegions =
>                 new HashMap<String, Region<Long, CxDrvGEODERow>>();
>
>         ...
>
>             public void DoConnect() {
>                 synchronized (oSync) {
>                     if (!IsConnected) {
>                         ReflectionBasedAutoSerializer oRBAS = new ReflectionBasedAutoSerializer(
>                             "com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
>                         oCache = new ClientCacheFactory()
>                             .addPoolLocator(this.GetNode(), Integer.valueOf(this.Port))
>                             .set("log-level", "WARN")
>                             .setPdxSerializer(oRBAS)
>                             .create();
>                         IsConnected = true;
>                     }
>                     nbThreads++;
>                 }
>             }
>
>             public void DoClose() {
>                 synchronized (oSync) {
>                     if (nbThreads > 0) {
>                         if (nbThreads == 1 && oCache != null) {
>                             // Close every cached region, then clear the map
>                             // (avoids removing entries while iterating over them).
>                             for (Region<Long, CxDrvGEODERow> oRegion : hmRegions.values()) {
>                                 oRegion.close();
>                             }
>                             hmRegions.clear();
>                             oCache.close();
>                             oCache = null;
>                             IsConnected = false;
>                         }
>                         nbThreads--;
>                     }
>                 }
>             }
>
>         ...
>
>             private Region<Long, CxDrvGEODERow> getCache(String CTable) {
>                 synchronized (oSync) {
>                     Region<Long, CxDrvGEODERow> oRegion = null;
>                     if (hmRegions.containsKey(CTable)) {
>                         oRegion = hmRegions.get(CTable);
>                     } else {
>                         oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(
>                             ClientRegionShortcut.PROXY).create(this.Base + '-' + CTable);
>                         hmRegions.put(CTable, oRegion);
>                     }
>                     return oRegion;
>                 }
>             }
>
>         ...
>
>             public boolean TableInsert(String CTable, String[][] TColumns,
>                     Object[][] TOValues, boolean BCommit, boolean BForceBlocMode) {
>
>         ...
>
>                 Region<Long, CxDrvGEODERow> oRegion = getCache(CTable);
>
>         ...
>
>             }
>
>         ...
>
>         }
>
>         Cordialement,
>
>
>         From: Anthony Baker [mailto:abaker@pivotal.io]
>         Sent: Tuesday, January 22, 2019 16:59
>         To: user@geode.apache.org <ma...@geode.apache.org>
>         Subject: Re: Multi-threaded Java client exception.
>
>         You only need one ClientCache in each JVM.  You can create the
>         cache and region once and then pass it to each worker thread.
>
>         Anthony
>
>             On Jan 22, 2019, at 1:25 AM, Philippe CEROU
>             <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
>
>             Hi,
>
>             We are trying to tune a single program that uses
>             multi-threading for the storage interface.
>
>             The problem is that if we launch this code with THREADS=1
>             everything runs well, but starting with THREADS=2 we
>             always get this exception when we create the connection
>             (on the highlighted row):
>
>             Exception in thread "main" java.lang.IllegalStateException:
>             Existing cache's default pool was not compatible
>
>                     at org.apache.geode.internal.cache.GemFireCacheImpl.validatePoolFactory(GemFireCacheImpl.java:2933)
>                     at org.apache.geode.cache.client.ClientCacheFactory.basicCreate(ClientCacheFactory.java:252)
>                     at org.apache.geode.cache.client.ClientCacheFactory.create(ClientCacheFactory.java:213)
>                     at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.DoConnect(CxDrvGEODE.java:35)
>                     at com.gfi.rt.lib.database.connectors.CxObj.DoConnect(CxObj.java:121)
>                     at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:91)
>                     at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:149)
>                    ...
>
>             Here is the data interface code.
>
>             package com.gfi.rt.lib.database.connectors.nosql;
>
>             import java.util.HashMap;
>
>             import org.apache.geode.cache.Region;
>             import org.apache.geode.cache.client.ClientCache;
>             import org.apache.geode.cache.client.ClientCacheFactory;
>             import org.apache.geode.cache.client.ClientRegionShortcut;
>             import org.apache.geode.pdx.ReflectionBasedAutoSerializer;
>
>             public class CxDrvGEODEThread extends CxObjNOSQL {
>
>                 ClientCache oCache = null;
>                 boolean IsConnected = false;
>
>                 // Connection
>                 public CxDrvGEODEThread() {
>                     Connector = "geode-native";
>                 }
>
>                 public void DoConnect() {
>                     if (!IsConnected) {
>                         ReflectionBasedAutoSerializer rbas = new ReflectionBasedAutoSerializer(
>                             "com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
>                         oCache = new ClientCacheFactory()
>                             .addPoolLocator(this.GetNode(), Integer.valueOf(this.Port))
>                             .set("log-level", "WARN")
>                             .setPdxSerializer(rbas)
>                             .create();
>                         IsConnected = true;
>                     }
>                 }
>
>                 // Close the current connection to the database
>                 public void DoClose() {
>                     if (oCache != null) {
>                         oCache.close();
>                         oCache = null;
>                         IsConnected = false;
>                     }
>                 }
>
>                 // Data insertion
>                 public boolean TableInsert(String CTable, String[][] TColumns,
>                         Object[][] TOValues, boolean BCommit, boolean BForceBlocMode) {
>
>                     boolean BResult = false;
>                     final HashMap<Long, CxDrvGEODERow> mrows = new HashMap<Long, CxDrvGEODERow>();
>
>             ...
>
>                     if (!mrows.isEmpty()) {
>                         Region<Long, CxDrvGEODERow> oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(
>                             ClientRegionShortcut.PROXY).create(this.Base + '-' + CTable);
>                         if (oRegion != null) {
>                             oRegion.putAll(mrows);
>                             oRegion.close();
>                         }
>                     }
>
>                     mrows.clear();
>                     BResult = true;
>                     return BResult;
>                 }
>             }
>
>             Note that the cluster, disk stores, regions and indexes
>             are pre-created on Geode.
>
>             Every thread is isolated, creates its own
>             CxDrvGEODEThread class instance and does « DoConnect » ->
>             N x « TableInsert » -> « DoClose ».
>
>             Here is a master thread class call example:
>
>             private DataThread launchOneDataThread(long LNbProcess, long LNbLines,
>                     int LBatchSize, long LProcessID, String BenchId) {
>
>                 final CxObj POCX = CXI.Connect(CXO.Connector, CXO.Server, CXO.Port,
>                     CXO.Base, CXO.User, CXO.Password, BTrace);
>
>                 final DataThread BT = new DataThread(LNbProcess, LNbLines, LBatchSize,
>                     LProcessID, new DataObject(POCX, BenchId, CBenchParams, CTable));
>
>                 new Thread(new Runnable() {
>                     @Override
>                     public void run() {
>                         BT.start();
>                     }
>                 }).start();
>
>                 return BT;
>             }
>
>             I’m sure we are doing something really bad, any idea ?
>
>             Cordialement,
>
>

RE: Is there a way to boost all cluster startup when persistence is ON ?

Posted by Philippe CEROU <ph...@gfi.fr>.
Hi,

Thanks for the link, but I tried to tune the disk store as follows and the startup time is still near 18 minutes before the region is UP & RUNNING (startup: 15:22 -> « Region /ksdata-benchmark has successfully completed waiting for other members to recover » log message: 15:40).

Here is what I configured for the disk store:

create disk-store --name ksdata --dir=/opt/app/geode/data/ksdata --group=ksgroup --allow-force-compaction=true --auto-compact=true --compaction-threshold=30 --max-oplog-size=64 --queue-size=1000 --time-interval=5000 --write-buffer-size=1048576

The test here was to secure 200 000 000 PDX-ized rows with 3 copies across 3 AWS AZs (datacenters) on 30 « m4.xlarge » servers (3 x 10 x [4 CPU, 16Gb RAM]), and the result is far from our requirement (less than 5 minutes for the whole solution [cache + products + databases] to be UP).

That is why I asked whether it is possible to start serving the region before index recovery is finalized:

  *   INSERTS : You said yourself that it is all a write-ahead log, so adding new rows should not be a problem while recovering; the new rows can be processed at the end.
  *   READS : It does not matter if there are no indexes yet; we can accept a quick full disk scan and inefficient read responses.
  *   DELETES : Not an issue in our case; it is a cache with TTL, so if old rows only get deleted 2 or 3 hours after their expiration timestamp it is not a problem.

If that is not possible, can you tell me whether platform efficiency, and recovery in particular, could be improved by installing multiple servers on the same machine, for example migrating from 30 to 150 Geode servers on the same infrastructure (see the sketch after this list):

  *   Before : 30 x [1 x machine -> 1 x 15Gb geode node]
  *   After : 30 x [1 x machine -> 5 x 3Gb geode node]
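
(A rough sketch of what that could look like, not tested: each extra server instance on a machine needs its own name, working directory, disk-store directory and server port; the heap sizes and ports below are illustrative only.)

mkdir -p /opt/app/geode/node/${WSNAME}-{1..5}
for i in 1 2 3 4 5; do
  gfsh -e "start server --name=${WSNAME}-${i} --dir=/opt/app/geode/node/${WSNAME}-${i} \
    --server-port=$((40404 + i)) --initial-heap=3000M --max-heap=3000M \
    --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true \
    --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
done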

Cordialement,

—
NOTE : n/a
—
Gfi Informatique
Philippe Cerou
Architecte & Expert SystĂšme
GFI Production / Toulouse
philippe.cerou @gfi.fr
—
1 Rond-point du Général Eisenhower, 31400 Toulouse
TĂ©l. : +33 (0)5.62.85.11.55
Mob. : +33 (0)6.03.56.48.62
www.gfi.world<http://www.gfi.world/>
—
[Facebook]<https://www.facebook.com/gfiinformatique> [Twitter] <https://twitter.com/gfiinformatique>  [Instagram] <https://www.instagram.com/gfiinformatique/>  [LinkedIn] <https://www.linkedin.com/company/gfi-informatique>  [YouTube] <https://www.youtube.com/user/GFIinformatique>
—


From: Anthony Baker [mailto:abaker@pivotal.io]
Sent: Thursday, January 31, 2019 16:51
To: user@geode.apache.org
Subject: Re: Is there a way to boost all cluster startup when persistence is ON ?

1) Regarding recovery and startup time:

When using persistent regions, Geode records updates using append-only logs.  This optimizes write performance.  Unlike a relational database, we don’t need to acquire multiple page/buffer locks in order to update an index structure (such as B* tree).  The tradeoff is that we need to scan all the data at startup time to ensure we know where the most recent copy of the data is on disk.

We do a number of optimizations to speed recovery, including lazily faulting in values—we only have to recover the keys in order for a region to be “online”.  However, if the region defines indexes, we do have to recover all the values in order to rebuild the index in memory.

Here are some details on compaction of the log files:
https://geode.apache.org/docs/guide/11/managing/disk_storage/compacting_disk_stores.html
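
(One possible workaround that follows from the index point above, sketched and not verified on this data set: keep the region persistent but drop its indexes before a planned shutdown and recreate them once the members have finished recovering, so that startup only has to recover keys; whether this saves time overall depends on how long the index rebuild takes afterwards. Using the same gfsh commands as earlier in the thread, roughly:)

# before the planned shutdown
gfsh -e "connect --locator='nx101c[1234],nx102a[1234],nx103b[1234]'" \
     -e "destroy index --name=ksdata-benchmark-C001 --region=ksdata-benchmark" \
     -e "destroy index --name=ksdata-benchmark-C002 --region=ksdata-benchmark" \
     -e "destroy index --name=ksdata-benchmark-C003 --region=ksdata-benchmark" \
     -e "destroy index --name=ksdata-benchmark-C011 --region=ksdata-benchmark"

# after restart, once the region reports it has finished waiting for other members
gfsh -e "connect --locator='nx101c[1234],nx102a[1234],nx103b[1234]'" \
     -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
     -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
     -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
     -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"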


HTH,
Anthony



On Jan 30, 2019, at 12:19 AM, Philippe CEROU <ph...@gfi.fr>> wrote:

Hi,

New test, new problem 😊

I have a 30 nodes Apache Geode cluster (30 x 4 CPU / 16Gb RAM) with 200 000 000 partitionend & replicated (C=2) «PDX-ized » rows.

When I stop all my cluster (Servers then Locators) and I do a full restart (Locators then Servers) I have in fact two possible problems :

1. The warm-up time is very long (About 15 minutes), each disk store is standing another node one and the time everything finalize rhe service is down (exactly the cluster is there but the region is not there).

Is there a way to make the region available and to directly start accepting at least INSERTS ?

The best would be to directly serve the region for INSERTs and fallback potentially READS to disk 😊

2. Sometimes when I startup my 30 servers some of them crash with this line after near 1 to 2 minutes :

[info 2019/01/30 08:03:02.603 UTC nx130b-srv <main> tid=0x1] Initialization of region PdxTypes completed

[info 2019/01/30 08:04:17.606 UTC nx130b-srv <Geode Failure Detection Scheduler1> tid=0x1c] Failure detection is now watching 10.200.6.107(nx107c-srv:19000)<v5>:41000

[info 2019/01/30 08:04:17.613 UTC nx130b-srv <Geode Failure Detection thread 3> tid=0xe2] Failure detection is now watching 10.200.4.112(nx112b-srv:18872)<v5>:41000

[error 2019/01/30 08:05:02.942 UTC nx130b-srv <main> tid=0x1] Cache initialization for GemFireCache[id = 2125470482<tel:2125470482>; isClosing = false; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.

[info 2019/01/30 08:05:02.961 UTC nx130b-srv <main> tid=0x1] GemFireCache[id = 2125470482<tel:2125470482>; isClosing = true; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

[info 2019/01/30 08:05:03.325 UTC nx130b-srv <main> tid=0x1] Shutting down DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000.

[info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Now closing distribution for 10.200.4.130(nx130b-srv:3065)<v5>:41000

[info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Stopping membership services

[info 2019/01/30 08:05:03.435 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor server socket is closed in stopServices().

[info 2019/01/30 08:05:03.435 UTC nx130b-srv <Geode Failure Detection Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting

[info 2019/01/30 08:05:03.436 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor serverSocketExecutor is terminated

[info 2019/01/30 08:05:03.475 UTC nx130b-srv <main> tid=0x1] DistributionManager stopped in 150ms.

[info 2019/01/30 08:05:03.476 UTC nx130b-srv <main> tid=0x1] Marking DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000 as closed.

Any idea ?

Cordialement,

—
NOTE : n/a
—
Gfi Informatique
Philippe Cerou
Architecte & Expert SystĂšme
GFI Production / Toulouse
philippe.cerou @gfi.fr
—
1 Rond-point du Général Eisenhower, 31400 Toulouse
TĂ©l. : +33 (0)5.62.85.11.55<tel:+33%205.62.85.11.55>
Mob. : +33 (0)6.03.56.48.62<tel:+33%206.03.56.48.62>
www.gfi.world<http://www.gfi.world/>
—
<image001.png><https://www.facebook.com/gfiinformatique> <image002.png><https://twitter.com/gfiinformatique> <image003.png><https://www.instagram.com/gfiinformatique/> <image004.png><https://www.linkedin.com/company/gfi-informatique> <image005.png><https://www.youtube.com/user/GFIinformatique>
—
<image006.jpg><http://www.gfi.world/>


From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Thursday, January 24, 2019 13:23
To: user@geode.apache.org
Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

Answering my own question:

Finally, after shutting down and restarting the cluster, it was « just » re-importing data from disk, and it took 1h34m to bring the region back (for 200 000 000 persisted rows across 10 nodes).

I still have some questions 😊

The memory usage reported for my cluster is quite different from that of the single region it serves, see the screenshots.

Cluster memory usage (205Gb):

<image008.jpg>

But the region consumes 311Gb :

<image009.jpg>

What I do not understand :

  *   If I am using 205Gb, of which 35Gb is for « region disk caching », what are the remaining 205Gb - 35Gb = 170Gb used for ?
  *   If I really have 205Gb used out of 305Gb, and my region is about 311Gb, why is so little memory used to cache the region ?

If I launch a simple « query  --query='select count(*) from /ksdata-benchmark where C001="AD6909"' » from GFSH, I still have no response after 10 minutes, even though the requested column is indexed (see the <trace> sketch after the index listing below) !

gfsh>list indexes
Member Name |                Member ID                 |    Region Path    |         Name          | Type  | Indexed Expression |    From Clause    | Valid Index
----------- | ---------------------------------------- | ----------------- | --------------------- | ----- | ------------------ | ----------------- | -----------
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
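To check whether these indexes are actually being used, I plan to rerun the query with the <trace> hint, which should make Geode log the query execution time and which indexes were used (a sketch, not yet run) :

gfsh>query --query="<trace> select count(*) from /ksdata-benchmark where C001 = 'AD6909'"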

Regards,

Philippe Cerou


From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Thursday, January 24, 2019 11:21
To: user@geode.apache.org
Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

I understand, it is a question of heap memory consumption here, but I think we are doing something wrong, because :

1.       I retried with 10 AWS C5.4XLARGE nodes (8 CPU + 64Gb RAM + 50Gb SSD disks); these nodes have 4x more memory than in the previous test, which gives us a GEODE cluster with 305Gb of memory (Pulse / Total heap).

2.       My input data is a CSV with 200 000 000 rows of about 150 bytes each, divided into 19 LONG & STRING columns; with « redundant-copies=2 » on the region this gives about 3 x 200 000 000 x 150 = 83.81Gb. I also have 4 indexes on 4 LONG columns. My usable-memory-to-data ratio (with the 80% heap threshold, « eviction-heap-percentage=80 ») is therefore about (305Gb x 0.8) / 83.81Gb = 2.91, which is not very good ☹

So I think I have a problem with the OVERFLOW settings, even though I did the same as the documentation shows.

Locators launch command :

gfsh -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='${LLOCATORS}'

PDX configuration :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "configure pdx --disk-store=DEFAULT --read-serialized=true"

Servers launch command:

WMEM=28000
gfsh \
-e "start server --name=${WSNAME} --initial-heap=${WMEM}M --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"

Region declaration :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
-e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
-e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"

Another « new » problem: if I shut down the whole cluster and then restart it (locators then servers), I no longer see my region, yet heap memory is consumed (I attached a Pulse screenshot).

<image010.jpg>

If I look at the server logs I see the following after one hour (it appears multiple times, on multiple servers, with different details):

10.200.2.108: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [19, 67, 101, 103, 110, 112] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.2.108:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
10.200.2.108:   Name: nx108a-srv
10.200.2.108:   Location: /10.200.2.108:/opt/app/geode/node/nx108a-srv/ksdata
10.200.2.108: Offline members with potentially new data:[
10.200.2.108:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.2.108:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.2.108:   Buckets: [19, 110]
10.200.2.108: ,
10.200.2.108:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
10.200.2.108:   Location: /10.200.6.107:/opt/app/geode/node/nx107c-srv/ksdata
10.200.2.108:   Buckets: [19, 67, 101, 112]
10.200.2.108: ,
10.200.2.108:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
10.200.2.108:   Location: /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata
10.200.2.108:   Buckets: [103]
10.200.2.108: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
10.200.6.104: ...........................................................................................................................................
10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 31, 44, 65, 99] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
10.200.6.104:   Name: nx104c-srv
10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
10.200.6.104: Offline members with potentially new data:[
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [7]
10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
10.200.6.104: ..............................
10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 8, 31, 44, 52, 65, 95, 99] are waiting for another offline member to recover the latest data.My persistent id is:
10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
10.200.6.104:   Name: nx104c-srv
10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
10.200.6.104: Offline members with potentially new data:[
10.200.6.104:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
10.200.6.104:   Location: /10.200.2.102:/opt/app/geode/node/nx102a-srv/ksdata
10.200.6.104:   Buckets: [8, 52, 95]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
10.200.6.104: ,
10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
10.200.6.104:   Buckets: [7]
10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.

Everyone seems to be waiting for something from everyone else, even though all my 10 nodes are UP.

If I use the « gfsh show missing-disk-stores » command, there are between 5 and 20 missing ones.

gfsh>show missing-disk-stores
Missing Disk Stores


           Disk Store ID             |     Host      | Directory
------------------------------------ | ------------- | -------------------------------------
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 | /opt/app/geode/node/nx105a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 | /opt/app/geode/node/nx107c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 | /opt/app/geode/node/nx101c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 | /opt/app/geode/node/nx110c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata

Very strange...

Does it mean that some data is lost, even with a redundancy of 3 copies ?

I saw in the documentation that « missing disk stores » can be revoked, but it is not clear whether the data is ultimately lost or not ☹
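From what I understand of the documentation, revoking is done per disk store id, taken from the show missing-disk-stores output above (a sketch, not something I have run yet); my open question is precisely whether the buckets that only existed on the revoked store are then lost:

gfsh>show missing-disk-stores
gfsh>revoke missing-disk-store --id=26112834-88bc-4653-94a9-10db18d5ebb4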

Regards,

Philippe Cerou


From: Anthony Baker [mailto:abaker@pivotal.io]
Sent: Wednesday, January 23, 2019 18:28
To: user@geode.apache.org
Subject: Re: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

When a cluster member becomes unresponsive, Geode may fence off the member in order to preserve consistency and availability.  The question to investigate is *why* the member got into this state.

Questions to investigate (a quick way to gather these numbers is sketched after the list):

- How much heap memory is your data consuming?
- How much data is overflowed to disk vs. held in heap memory?
- How much data is being read from disk vs memory?
- Is GC activity consuming significant cpu resources?
- Are there other processes running on the system causing swapping behavior?
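For example (a sketch; nx102a-srv is just one of your servers, and <geode-server-pid> is the Geode server process on that host):

gfsh>show metrics --member=nx102a-srv

and, on the server host itself, GC overhead can be watched with:

jstat -gcutil <geode-server-pid> 5s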

Anthony


On Jan 23, 2019, at 8:45 AM, Philippe CEROU <ph...@gfi.fr>> wrote:

Hi,

I think I have found the problem,

When we use a region with OVERFLOW to disk, once the configured memory percentage is reached the Geode servers become very, very slow ☹

In the end, even with a lot of nodes, the cluster as a whole cannot write/absorb data as fast as the clients send it and everything collapses; we went from 140 000 rows per second (in memory) to less than 10 000 rows per second (once OVERFLOW kicks in)...

Is the product able to overflow to disk without such a large drop in throughput ?

For information, here are my launch commands (9 nodes) :

3 x :

gfsh \
-e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='nx101c[1234],nc102a[1234],nx103b[1234]'"

For PDX :

gfsh \
-e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
-e "configure pdx --disk-store=DEFAULT --read-serialized=true"

9 x :

gfsh \
-e "start server --name=${WSNAME} --initial-heap=6000M --max-heap=6000M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"

For disks & regions :

gfsh \
                -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
                -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"

gfsh \
                -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
                -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
                -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
                -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"

Regards,

Philippe Cerou


From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
Sent: Wednesday, January 23, 2019 08:40
To: user@geode.apache.org
Subject: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

Hi,

Following my GEODE study, I now have the following situation.

I have 6 nodes, 3 have a locator and 6 have a server.

When I try to do a massive insertion (200 million PDX-ized rows), after a few hours I get this error on the client side, for all threads :

...
Exception in thread "Thread-20" org.apache.geode.cache.client.ServerOperationException: remote server on nxmaster(28523:loner):35042:efcecc76: Region /ksdata-benchmark putAll at server applied partial keys due to exception.
        at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9542)
        at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9446)
        at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9458)
        at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.TableInsert(CxDrvGEODE.java:144)
        at com.gfi.rt.lib.database.connectors.CxTable.insert(CxTable.java:175)
        at com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(BenchmarkObject.java:279)
        at com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(BenchmarkThread.java:84)
        at com.gfi.rt.bin.database.dbbench.BenchmarkThread.run(BenchmarkThread.java:67)
Caused by: org.apache.geode.cache.persistence.PartitionOfflineException: Region /ksdata-benchmark bucket 48 has persistent data that is no longer online stored at these locations: [/10.200.6.101:/opt/app/geode/node/nx101c-srv/ksdata created at timestamp 1548181352906 version 0 diskStoreId 3650a4bc61f447a3-bc9cba70cf59e514 name null, /10.200.4.106:/opt/app/geode/node/nx106b-srv/ksdata created at timestamp 1548181352865 version 0 diskStoreId a7de7988708b44d2-b4b09d91c1f536b6 name null, /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata created at timestamp 1548181353100 version 0 diskStoreId 72dcab08ae12413c-b099fbb7f9ab740b name null]
        at org.apache.geode.internal.cache.ProxyBucketRegion.checkBucketRedundancyBeforeGrab(ProxyBucketRegion.java:590)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.lockRedundancyLock(PartitionedRegionDataStore.java:595)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:440)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2858)
        at org.apache.geode.internal.cache.PartitionedRegionDataStore.handleManageBucketRequest(PartitionedRegionDataStore.java:1014)
        at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketOnMember(PRHARedundancyProvider.java:1233)
        at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketInstance(PRHARedundancyProvider.java:416)
        at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketAtomically(PRHARedundancyProvider.java:604)
        at org.apache.geode.internal.cache.PartitionedRegion.createBucket(PartitionedRegion.java:3310)
        at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2055)
        at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
        at org.apache.geode.internal.cache.PartitionedRegion.performPutAllEntry(PartitionedRegion.java:2124)
        at org.apache.geode.internal.cache.LocalRegion.basicEntryPutAll(LocalRegion.java:10060)
        at org.apache.geode.internal.cache.LocalRegion.access$100(LocalRegion.java:231)
        at org.apache.geode.internal.cache.LocalRegion$2.run(LocalRegion.java:9639)
        at org.apache.geode.internal.cache.event.NonDistributedEventTracker.syncBulkOp(NonDistributedEventTracker.java:107)
        at org.apache.geode.internal.cache.LocalRegion.syncBulkOp(LocalRegion.java:6085)
        at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9657)
        at org.apache.geode.internal.cache.LocalRegion.basicBridgePutAll(LocalRegion.java:9367)
        at org.apache.geode.internal.cache.tier.sockets.command.PutAll80.cmdExecute(PutAll80.java:270)
        at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:178)
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:844)
        at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:74)
        at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:594)
        at org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
        at java.lang.Thread.run(Thread.java:748)
...

When I check the cluster, 4 of the 6 servers have disappeared; when I check their logs I see this :

...
[info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure Detection Scheduler1> tid=0x1b] Failure detection is now watching 10.200.4.103(nx103b-srv:21892)<v4>:41001

[info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure Detection Scheduler1> tid=0x1b] Failure detection is now watching 10.200.2.102(nx102a-srv:23781)<v4>:41001

[info 2019/01/23 01:28:35.326 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] Membership received a request to remove 10.200.2.102(nx102a-srv:23781)<v4>:41001 from 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000 reason=Member isn't responding to heartbeat requests

[severe 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] Membership service failure: Member isn't responding to heartbeat requests
org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
        at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
        at org.jgroups.JChannel.up(JChannel.java:741)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
        at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
        at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
        at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
        at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
        at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
        at org.jgroups.protocols.TP.receive(TP.java:1714)
        at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
        at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
        at java.lang.Thread.run(Thread.java:748)

[info 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] CacheServer configuration saved

[info 2019/01/23 01:28:35.341 UTC nx102a-srv <DisconnectThread> tid=0x1cd] Stopping membership services

[info 2019/01/23 01:28:35.341 UTC nx102a-srv <DisconnectThread> tid=0x1cd] GMSHealthMonitor server socket is closed in stopServices().

[info 2019/01/23 01:28:35.342 UTC nx102a-srv <Geode Failure Detection thread 126> tid=0x1ca] Failure detection is now watching 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000

[info 2019/01/23 01:28:35.343 UTC nx102a-srv <Geode Failure Detection Server thread 1> tid=0x1e] GMSHealthMonitor server thread exiting

[info 2019/01/23 01:28:35.344 UTC nx102a-srv <DisconnectThread> tid=0x1cd] GMSHealthMonitor serverSocketExecutor is terminated

[info 2019/01/23 01:28:35.347 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Disconnecting old DistributedSystem to prepare for a reconnect attempt

[info 2019/01/23 01:28:35.351 UTC nx102a-srv <ReconnectThread> tid=0x1cd] GemFireCache[id = 82825098; isClosing = true; isShutDownAll = false; created = Tue Jan 22 18:22:30 UTC 2019; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

[info 2019/01/23 01:28:35.352 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Cache server on port 40404 is shutting down.

[severe 2019/01/23 01:28:45.005 UTC nx102a-srv <EvictorThread8> tid=0x92] Uncaught exception in thread Thread[EvictorThread8,10,main]
org.apache.geode.distributed.DistributedSystemDisconnectedException: Distribution manager on 10.200.2.102(nx102a-srv:23781)<v4>:41001 started at Tue Jan 22 18:22:30 UTC 2019: Member isn't responding to heartbeat requests, caused by org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:3926)
        at org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:966)
        at org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1547)
        at org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
        at org.apache.geode.internal.cache.GemFireCacheImpl.getInternalResourceManager(GemFireCacheImpl.java:4330)
        at org.apache.geode.internal.cache.GemFireCacheImpl.getResourceManager(GemFireCacheImpl.java:4319)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.getAllRegionList(HeapEvictor.java:138)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.getAllSortedRegionList(HeapEvictor.java:171)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.createAndSubmitWeightedRegionEvictionTasks(HeapEvictor.java:215)
        at org.apache.geode.internal.cache.eviction.HeapEvictor.access$200(HeapEvictor.java:53)
        at org.apache.geode.internal.cache.eviction.HeapEvictor$1.run(HeapEvictor.java:357)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
        at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
        at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
        at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
        at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
        at org.jgroups.JChannel.up(JChannel.java:741)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
        at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
        at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
        at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
        at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
        at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
        at org.jgroups.protocols.TP.receive(TP.java:1714)
        at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
        at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
        ... 1 more

[info 2019/01/23 01:30:05.874 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Shutting down DistributionManager 10.200.2.102(nx102a-srv:23781)<v4>:41001. At least one Exception occurred.
...

Any idea ?

Regards,

Philippe Cerou


From: Philippe CEROU
Sent: Wednesday, January 23, 2019 08:19
To: user@geode.apache.org
Subject: RE: Multi-threaded Java client exception.

Hi,

Thanks to Anthony for his help; I modified my code so that there is only one cache & region, shared between threads.

To get this running well I had to use synchronized blocks to handle opening and closing the cache connection, plus a dedicated region creation/sharing function, as follows (hope it can help someone) :

...
public class CxDrvGEODE extends CxObjNOSQL {
       static ClientCache oCache = null;
       static final String oSync = "x";
       static boolean IsConnected = false;
       static int nbThreads = 0;
       static final HashMap<String, Region<Long, CxDrvGEODERow>> hmRegions = new HashMap<String, Region<Long, CxDrvGEODERow>>();

...

       public void DoConnect() {
             synchronized (oSync) {
                    if (!IsConnected) {
                            ReflectionBasedAutoSerializer oRBAS = new ReflectionBasedAutoSerializer("com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
                            oCache = new ClientCacheFactory().addPoolLocator(this.GetNode(), Integer.valueOf(this.Port)).set("log-level", "WARN").setPdxSerializer(oRBAS).create();
                           IsConnected = true;
                    }
                    nbThreads++;
             }
       }

       public void DoClose() {
             synchronized (oSync) {
                    if (nbThreads > 0) {
                           if (nbThreads == 1 && oCache != null) {
                                   // Close all cached regions, then clear the map.
                                   // (Removing entries while iterating keySet() would throw ConcurrentModificationException.)
                                   for (Region<Long, CxDrvGEODERow> oRegion : hmRegions.values()) {
                                         oRegion.close();
                                   }
                                   hmRegions.clear();
                                  oCache.close();
                                  oCache = null;
                                  IsConnected = false;
                           }
                           nbThreads--;
                    }
             }
       }

...

       private Region<Long, CxDrvGEODERow> getCache(String CTable) {
             synchronized (oSync) {
                    Region<Long, CxDrvGEODERow> oRegion = null;
                    if (hmRegions.containsKey(CTable)) {
                           oRegion = hmRegions.get(CTable);
                    } else {
                           oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(ClientRegionShortcut.PROXY).create(this.Base + '-'+ CTable);
                           hmRegions.put(CTable, oRegion);
                    }
                    return oRegion;
             }
       }

...

       public boolean TableInsert(String CTable, String[][] TColumns, Object[][] TOValues,boolean BCommit, boolean BForceBlocMode) {
...
             Region<Long, CxDrvGEODERow> oRegion = getCache(CTable);
...
       }

...

}



Regards,

Philippe Cerou


From: Anthony Baker [mailto:abaker@pivotal.io]
Sent: Tuesday, January 22, 2019 16:59
To: user@geode.apache.org
Subject: Re: Multi-threaded Java client exception.

You only need one ClientCache in each JVM.  You can create the cache and region once and then pass it to each worker thread.

Anthony

On Jan 22, 2019, at 1:25 AM, Philippe CEROU <ph...@gfi.fr>> wrote:

Hi,

We are trying to tune a single program that uses multi-threading for the storage interface.

The problem we have is that if we launch this code with THREADS=1 everything runs well, but starting with THREADS=2 we always get this exception when we create the connection (on the highlighted row) :

Exception in thread "main" java.lang.IllegalStateException: Existing cache's default pool was not compatible
        at org.apache.geode.internal.cache.GemFireCacheImpl.validatePoolFactory(GemFireCacheImpl.java:2933)
        at org.apache.geode.cache.client.ClientCacheFactory.basicCreate(ClientCacheFactory.java:252)
        at org.apache.geode.cache.client.ClientCacheFactory.create(ClientCacheFactory.java:213)
        at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.DoConnect(CxDrvGEODE.java:35)
        at com.gfi.rt.lib.database.connectors.CxObj.DoConnect(CxObj.java:121)
        at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:91)
        at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:149)
       ...

Here is the data interface code.

package com.gfi.rt.lib.database.connectors.nosql;

import java.util.HashMap;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;
import org.apache.geode.pdx.ReflectionBasedAutoSerializer;

public class CxDrvGEODEThread extends CxObjNOSQL {
       ClientCache oCache = null;
       boolean IsConnected = false;

       // Connection

       public CxDrvGEODEThread() {
             Connector = "geode-native";
       }

       public void DoConnect() {
             if (!IsConnected) {
                     ReflectionBasedAutoSerializer rbas = new ReflectionBasedAutoSerializer("com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
                    oCache = new ClientCacheFactory().addPoolLocator(this.GetNode(), Integer.valueOf(this.Port)).set("log-level", "WARN").setPdxSerializer(rbas).create();
                    IsConnected = true;
             }
       }

       // Close the current database connection

       public void DoClose() {
             if (oCache != null) {
                    oCache.close();
                    oCache = null;
                    IsConnected = false;
             }
       }

       // Data insertion

       public boolean TableInsert(String CTable, String[][] TColumns, Object[][] TOValues,boolean BCommit, boolean BForceBlocMode) {
             boolean BResult = false;
             final HashMap<Long, CxDrvGEODERow> mrows = new HashMap<Long, CxDrvGEODERow>();

             ...

             if (!mrows.isEmpty()) {
                           Region<Long, CxDrvGEODERow> oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(ClientRegionShortcut.PROXY).create(this.Base + '-' +CTable);
                    if (oRegion != null) {
                           oRegion.putAll(mrows);
                           oRegion.close();
                    }
             }
             mrows.clear();
             BResult = true;
             return BResult;
       }
}

Note that cluster, disk stores, regions and indexes are pre-created on GEODE.

Every thread is isolated: it creates its own CxDrvGEODEThread instance and does « DoConnect » --> N x « TableInsert » --> « DoClose ».

Here is a master thread class call example:

        private DataThread launchOneDataThread(long LNbProcess, long LNbLines, int LBatchSize, long LProcessID, String BenchId) {
                final CxObj POCX = CXI.Connect(CXO.Connector, CXO.Server, CXO.Port, CXO.Base, CXO.User, CXO.Password, BTrace);
                final DataThread BT = new DataThread(LNbProcess, LNbLines, LBatchSize, LProcessID, new DataObject(POCX, BenchId,CBenchParams, CTable));
                new Thread(new Runnable() {
                        @Override
                        public void run() {
                                BT.start();
                        }
                }).start();
                return BT;
        }

I’m sure we are doing something really bad, any idea ?

Regards,

Philippe Cerou


Re: Is there a way to boost all cluster startup when persistence is ON ?

Posted by Anthony Baker <ab...@pivotal.io>.
1) Regarding recovery and startup time:

When using persistent regions, Geode records updates using append-only logs.  This optimizes write performance.  Unlike a relational database, we don’t need to acquire multiple page/buffer locks in order to update an index structure (such as B* tree).  The tradeoff is that we need to scan all the data at startup time to ensure we know where the most recent copy of the data is on disk.

We do a number of optimizations to speed recovery, including lazily faulting in values—we only have to recover the keys in order for a region to be “online”.  However, if the region defines indexes, we do have to recover all the values in order to rebuild the index in memory.

Here are some details on compaction of the log files:
https://geode.apache.org/docs/guide/11/managing/disk_storage/compacting_disk_stores.html <https://geode.apache.org/docs/guide/11/managing/disk_storage/compacting_disk_stores.html>


HTH,
Anthony


> On Jan 30, 2019, at 12:19 AM, Philippe CEROU <ph...@gfi.fr> wrote:
> 
> Hi,
>  
> New test, new problem 😊
>  
> I have a 30 nodes Apache Geode cluster (30 x 4 CPU / 16Gb RAM) with 200 000 000 <tel:200%C2%A0000%C2%A0000> partitionend & replicated (C=2) «PDX-ized » rows.
>  
> When I stop all my cluster (Servers then Locators) and I do a full restart (Locators then Servers) I have in fact two possible problems :
>  
> 1. The warm-up time is very long (About 15 minutes), each disk store is standing another node one and the time everything finalize rhe service is down (exactly the cluster is there but the region is not there).
>  
> Is there a way to make the region available and to directly start accepting at least INSERTS ?
>  
> The best would be to directly serve the region for INSERTs and fallback potentially READS to disk 😊
>  
> 2. Sometimes when I startup my 30 servers some of them crash with this line after near 1 to 2 minutes :
>  
> [info 2019/01/30 08:03:02.603 UTC nx130b-srv <main> tid=0x1] Initialization of region PdxTypes completed
>  
> [info 2019/01/30 08:04:17.606 UTC nx130b-srv <Geode Failure Detection Scheduler1> tid=0x1c] Failure detection is now watching 10.200.6.107(nx107c-srv:19000)<v5>:41000
>  
> [info 2019/01/30 08:04:17.613 UTC nx130b-srv <Geode Failure Detection thread 3> tid=0xe2] Failure detection is now watching 10.200.4.112(nx112b-srv:18872)<v5>:41000
>  
> [error 2019/01/30 08:05:02.942 UTC nx130b-srv <main> tid=0x1] Cache initialization for GemFireCache[id = 2125470482 <tel:2125470482>; isClosing = false; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60] failed because: org.apache.geode.GemFireIOException: While starting cache server CacheServer on port=40404 client subscription config policy=none client subscription config capacity=1 client subscription config overflow directory=.
>  
> [info 2019/01/30 08:05:02.961 UTC nx130b-srv <main> tid=0x1] GemFireCache[id = 2125470482 <tel:2125470482>; isClosing = true; isShutDownAll = false; created = Wed Jan 30 08:02:58 UTC 2019; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
>  
> [info 2019/01/30 08:05:03.325 UTC nx130b-srv <main> tid=0x1] Shutting down DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000.
>  
> [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Now closing distribution for 10.200.4.130(nx130b-srv:3065)<v5>:41000
>  
> [info 2019/01/30 08:05:03.433 UTC nx130b-srv <main> tid=0x1] Stopping membership services
>  
> [info 2019/01/30 08:05:03.435 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor server socket is closed in stopServices().
>  
> [info 2019/01/30 08:05:03.435 UTC nx130b-srv <Geode Failure Detection Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting
>  
> [info 2019/01/30 08:05:03.436 UTC nx130b-srv <main> tid=0x1] GMSHealthMonitor serverSocketExecutor is terminated
>  
> [info 2019/01/30 08:05:03.475 UTC nx130b-srv <main> tid=0x1] DistributionManager stopped in 150ms.
>  
> [info 2019/01/30 08:05:03.476 UTC nx130b-srv <main> tid=0x1] Marking DistributionManager 10.200.4.130(nx130b-srv:3065)<v5>:41000 as closed.
>  
> Any idea ?
>  
> Cordialement,
> 
>  
>  
> From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
> Sent: Thursday, January 24, 2019 13:23
> To: user@geode.apache.org
> Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>  
> Hi,
>  
> Answering my own question,
>  
> Finally, after shutting down and restarting the cluster, it was « just » re-importing data from disk, and it took 1h34m to bring the region back (for 200 000 000 persisted rows across 10 nodes).
>  
> I still have some questions 😊
>  
> The cluster's reported memory usage is very different from that of the single region it serves, see screenshots.
>  
> Cluster memory usage (205Gb):
>  
> <image008.jpg>
>  
> But the region consumes 311Gb:
>  
> <image009.jpg>
>  
> What I do not understand:
> If I consume 205Gb, with 35Gb of that being « region disk caching », what is the remaining 205Gb - 35Gb = 170Gb used for?
> If only 205Gb of the 305Gb heap is really used, and my region is about 311Gb, why is so little memory used to cache the region?
>  
> If I run a simple « query  --query='select count(*) from /ksdata-benchmark where C001="AD6909"' » from GFSH, I still have no response after 10 minutes, even though the queried column is indexed!
>  
> gfsh>list indexes
> Member Name |                Member ID                 |    Region Path    |         Name          | Type  | Indexed Expression |    From Clause    | Valid Index
> ----------- | ---------------------------------------- | ----------------- | --------------------- | ----- | ------------------ | ----------------- | -----------
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx101c-srv  | 10.200.6.101(nx101c-srv:18983)<v2>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx102a-srv  | 10.200.2.102(nx102a-srv:4043)<v6>:41001  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx103b-srv  | 10.200.4.103(nx103b-srv:22141)<v8>:41001 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx104c-srv  | 10.200.6.104(nx104c-srv:14996)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx105a-srv  | 10.200.2.105(nx105a-srv:15236)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx106b-srv  | 10.200.4.106(nx106b-srv:15146)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx107c-srv  | 10.200.6.107(nx107c-srv:15145)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx108a-srv  | 10.200.2.108(nx108a-srv:6702)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx109b-srv  | 10.200.4.109(nx109b-srv:15153)<v2>:41000 | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C001 | RANGE | C001               | /ksdata-benchmark | true
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C002 | RANGE | C002               | /ksdata-benchmark | true
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C003 | RANGE | C003               | /ksdata-benchmark | true
> nx110c-srv  | 10.200.6.110(nx110c-srv:7833)<v2>:41000  | /ksdata-benchmark | ksdata-benchmark-C011 | RANGE | C011               | /ksdata-benchmark | true
>  
> Best regards,
> 
>  
>  
> From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
> Sent: Thursday, January 24, 2019 11:21
> To: user@geode.apache.org
> Subject: RE: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>  
> Hi,
>  
> I understand; here it is a question of heap memory consumption, but I think we are doing something wrong, because:
>  
> I retried with 10 AWS C5.4XLARGE nodes (8 CPU + 64Gb RAM + 50Gb SSD disks); these nodes have 4x more memory than in the previous test, which gives us a Geode cluster with 305Gb of memory (Pulse / Total heap).
>  
> My input data is a CSV with 200 000 000 rows of about 150 bytes each, divided into 19 LONG & STRING columns. With the region's « redundant-copies=2 » this gives about 3 x 200 000 000 x 150 bytes = 83.81Gb. I also have 4 indexes on 4 LONG columns. My operational data cost ratio (with an 80% heap threshold, « eviction-heap-percentage=80 ») is therefore about (305Gb x 0.8) / 83.81Gb = 2.91, not very good ☹
>  
> So I think I have a problem with the OVERFLOW settings, even though I did the same as the documentation shows.
>  
> Locators launch command :
>  
> gfsh -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='${LLOCATORS}'"
>  
> PDX configuration :
>  
> gfsh \
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
>  
> Servers launch command:
>  
> WMEM=28000
> gfsh \
> -e "start server --name=${WSNAME} --initial-heap=${WMEM}M --max-heap=${WMEM}M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
>  
> Region declaration :
>  
> gfsh \
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"
>  
> gfsh \
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
> -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
> -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
> -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
> -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"
>  
> Another « new » problem: if I shut down the whole cluster and then restart it (locators then servers), I no longer see my region, but heap memory is still consumed (I attached a Pulse screenshot).
>  
> <image010.jpg>
>  
> If I look at the server logs I see this, after one hour (multiple times on multiple servers, with different details):
>  
> 10.200.2.108: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [19, 67, 101, 103, 110, 112] are waiting for another offline member to recover the latest data.My persistent id is:
> 10.200.2.108:   DiskStore ID: 0c16c9b9-eff3-4fe1-84b1-f2ad0b7d19de
> 10.200.2.108:   Name: nx108a-srv
> 10.200.2.108:   Location: /10.200.2.108:/opt/app/geode/node/nx108a-srv/ksdata
> 10.200.2.108: Offline members with potentially new data:[
> 10.200.2.108:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 10.200.2.108:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 10.200.2.108:   Buckets: [19, 110]
> 10.200.2.108: ,
> 10.200.2.108:   DiskStore ID: 338fd13a-e564-444b-bfd9-377bec060897
> 10.200.2.108:   Location: /10.200.6.107:/opt/app/geode/node/nx107c-srv/ksdata
> 10.200.2.108:   Buckets: [19, 67, 101, 112]
> 10.200.2.108: ,
> 10.200.2.108:   DiskStore ID: e086e935-62df-4401-99dd-16475adf1f01
> 10.200.2.108:   Location: /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata
> 10.200.2.108:   Buckets: [103]
> 10.200.2.108: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
> 10.200.6.104: ...........................................................................................................................................
> 10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 31, 44, 65, 99] are waiting for another offline member to recover the latest data.My persistent id is:
> 10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
> 10.200.6.104:   Name: nx104c-srv
> 10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
> 10.200.6.104: Offline members with potentially new data:[
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
> 10.200.6.104: ,
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 10.200.6.104:   Buckets: [7]
> 10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
> 10.200.6.104: ..............................
> 10.200.6.104: Region /ksdata-benchmark (and any colocated sub-regions) has potentially stale data.  Buckets [3, 7, 8, 31, 44, 52, 65, 95, 99] are waiting for another offline member to recover the latest data.My persistent id is:
> 10.200.6.104:   DiskStore ID: 0c305067-7193-42ea-a0e6-f6c20795308c
> 10.200.6.104:   Name: nx104c-srv
> 10.200.6.104:   Location: /10.200.6.104:/opt/app/geode/node/nx104c-srv/ksdata
> 10.200.6.104: Offline members with potentially new data:[
> 10.200.6.104:   DiskStore ID: 9bd60b3e-f640-4d95-b9a5-be14fccb5f91
> 10.200.6.104:   Location: /10.200.2.102:/opt/app/geode/node/nx102a-srv/ksdata
> 10.200.6.104:   Buckets: [8, 52, 95]
> 10.200.6.104: ,
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 10.200.6.104:   Buckets: [3, 31, 44, 65, 99]
> 10.200.6.104: ,
> 10.200.6.104:   DiskStore ID: 26112834-88bc-4653-94a9-10db18d5ebb4
> 10.200.6.104:   Location: /10.200.4.103:/opt/app/geode/node/nx103b-srv/ksdata
> 10.200.6.104:   Buckets: [7]
> 10.200.6.104: ]Use the gfsh show missing-disk-stores command to see all disk stores that are being waited on by other members.
>  
> Everyone seems to be waiting for something from everyone else, even though all my 10 nodes are UP.
>  
> If I use the « gfsh show missing-disk-stores » command, there are from 5 to 20 missing ones.
>  
> gfsh>show missing-disk-stores
> Missing Disk Stores
>  
>  
>            Disk Store ID             |     Host      | Directory
> ------------------------------------ | ------------- | -------------------------------------
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
> e086e935-62df-4401-99dd-16475adf1f01 | /10.200.2.105 | /opt/app/geode/node/nx105a-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
> 338fd13a-e564-444b-bfd9-377bec060897 | /10.200.6.107 | /opt/app/geode/node/nx107c-srv/ksdata
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> 184f8d33-a041-4635-915a-9e64cf9c007c | /10.200.6.101 | /opt/app/geode/node/nx101c-srv/ksdata
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
> 0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
> 68c79259-ffcb-4b3b-a6a4-ef8bff6190be | /10.200.6.110 | /opt/app/geode/node/nx110c-srv/ksdata
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> a88c7940-7e92-4728-b9bb-043c251e0fd0 | /10.200.4.109 | /opt/app/geode/node/nx109b-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
> 0c305067-7193-42ea-a0e6-f6c20795308c | /10.200.6.104 | /opt/app/geode/node/nx104c-srv/ksdata
> 9bd60b3e-f640-4d95-b9a5-be14fccb5f91 | /10.200.2.102 | /opt/app/geode/node/nx102a-srv/ksdata
> 26112834-88bc-4653-94a9-10db18d5ebb4 | /10.200.4.103 | /opt/app/geode/node/nx103b-srv/ksdata
>  
> Very strange...
> 
> Does it mean that some data is lost, even with a redundancy of 3 copies?
> 
> I saw in the documentation that « missing disk stores » can be revoked, but it is not clear whether the data is ultimately lost or not ☹
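> 
> For reference, a minimal sketch of the revoke workflow (the disk store ID is just one taken from the listing above). Note that revoking tells the cluster to stop waiting for that store, so any bucket whose latest copy lived only there would come back without that data:
> 
> gfsh>show missing-disk-stores
> gfsh>revoke missing-disk-store --id=26112834-88bc-4653-94a9-10db18d5ebb4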
>  
> Best regards,
> 
>  
>  
> From: Anthony Baker [mailto:abaker@pivotal.io]
> Sent: Wednesday, January 23, 2019 18:28
> To: user@geode.apache.org
> Subject: Re: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>  
> When a cluster member becomes unresponsive, Geode may fence off the member in order to preserve consistency and availability.  The question to investigate is *why* the member got into this state.
>  
> Questions to investigate:
>  
> - How much heap memory is your data consuming?
> - How much data is overflowed disk vs in heap memory?
> - How much data is being read from disk vs memory?
> - Is GC activity consuming significant cpu resources?
> - Are there other processes running on the system causing swapping behavior?
>  
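> A minimal gfsh sketch for gathering some of these numbers (member and region names are taken from this thread; the exact output varies by version):
> 
> gfsh>show metrics --member=nx102a-srv
> gfsh>show metrics --region=/ksdata-benchmark
> 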
> Anthony
>  
>  
> 
> On Jan 23, 2019, at 8:45 AM, Philippe CEROU <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
>  
> Hi,
>  
> I think I have found the problem,
>  
> When we use a region with overflow to disk, once the configured memory percentage is reached the Geode server becomes very, very slow ☹
>  
> In the end, even with a lot of nodes, the cluster as a whole cannot write/acknowledge as fast as the clients send data and everything collapses; we went from 140 000 rows per second (in memory) to less than 10 000 rows per second (once overflow started)...
>  
> Is the product able to overflow to disk without such a large drop in throughput?
>  
> For information, here are my launch commands (9 nodes) :
>  
> 3 x :
>  
> gfsh \
> -e "start locator --name=${WSNAME} --group=ksgroup --enable-cluster-configuration=true --port=1234 --mcast-port=0 --locators='nx101c[1234],nc102a[1234],nx103b[1234]'"
>  
> For PDX :
>  
> gfsh \
> -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
> -e "configure pdx --disk-store=DEFAULT --read-serialized=true"
>  
> 9 x :
>  
> gfsh \
> -e "start server --name=${WSNAME} --initial-heap=6000M --max-heap=6000M --eviction-heap-percentage=80 --group=ksgroup --use-cluster-configuration=true --mcast-port=0 --locators='nx101c[1234],nx102a[1234],nx103b[1234]'"
>  
> For disks & regions :
>  
> gfsh \
>                 -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>                 -e "create disk-store --name ksdata --allow-force-compaction=true --auto-compact=true --dir=/opt/app/geode/data/ksdata --group=ksgroup"
>  
> gfsh \
>                 -e "connect --locator='nx101c[1234],nc102a[1234],nx103b[1234]'" \
>                 -e "create region --if-not-exists --name=ksdata-benchmark --group=ksgroup --type=PARTITION_REDUNDANT_PERSISTENT_OVERFLOW --disk-store=ksdata --redundant-copies=2 --eviction-action=overflow-to-disk" \
>                 -e "create index --name=ksdata-benchmark-C001 --expression=C001 --region=ksdata-benchmark --group=ksgroup" \
>                 -e "create index --name=ksdata-benchmark-C002 --expression=C002 --region=ksdata-benchmark --group=ksgroup" \
>                 -e "create index --name=ksdata-benchmark-C003 --expression=C003 --region=ksdata-benchmark --group=ksgroup" \
>                 -e "create index --name=ksdata-benchmark-C011 --expression=C011 --region=ksdata-benchmark --group=ksgroup"
>  
> Best regards,
> 
>  
>  
> From: Philippe CEROU [mailto:philippe.cerou@gfi.fr]
> Sent: Wednesday, January 23, 2019 08:40
> To: user@geode.apache.org
> Subject: Over-activity cause nodes to crash/disconnect with error : org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>  
> Hi,
>  
> Following my Geode study, I now have this situation.
>  
> I have 6 nodes; 3 run a locator and all 6 run a server.
>  
> When I try to do a massive insertion (200 million PDX-ized rows), after some hours I get this error on the client side, for all threads:
>  
> ...
> Exception in thread "Thread-20" org.apache.geode.cache.client.ServerOperationException: remote server on nxmaster(28523:loner):35042:efcecc76: Region /ksdata-benchmark putAll at server applied partial keys due to exception.
>         at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9542)
>         at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9446)
>         at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:9458)
>         at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.TableInsert(CxDrvGEODE.java:144)
>         at com.gfi.rt.lib.database.connectors.CxTable.insert(CxTable.java:175)
>         at com.gfi.rt.bin.database.dbbench.BenchmarkObject.run(BenchmarkObject.java:279)
>         at com.gfi.rt.bin.database.dbbench.BenchmarkThread.DoIt(BenchmarkThread.java:84)
>         at com.gfi.rt.bin.database.dbbench.BenchmarkThread.run(BenchmarkThread.java:67)
> Caused by: org.apache.geode.cache.persistence.PartitionOfflineException: Region /ksdata-benchmark bucket 48 has persistent data that is no longer online stored at these locations: [/10.200.6.101:/opt/app/geode/node/nx101c-srv/ksdata created at timestamp 1548181352906 version 0 diskStoreId 3650a4bc61f447a3-bc9cba70cf59e514 name null, /10.200.4.106:/opt/app/geode/node/nx106b-srv/ksdata created at timestamp 1548181352865 version 0 diskStoreId a7de7988708b44d2-b4b09d91c1f536b6 name null, /10.200.2.105:/opt/app/geode/node/nx105a-srv/ksdata created at timestamp 1548181353100 version 0 diskStoreId 72dcab08ae12413c-b099fbb7f9ab740b name null]
>         at org.apache.geode.internal.cache.ProxyBucketRegion.checkBucketRedundancyBeforeGrab(ProxyBucketRegion.java:590)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.lockRedundancyLock(PartitionedRegionDataStore.java:595)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:440)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2858)
>         at org.apache.geode.internal.cache.PartitionedRegionDataStore.handleManageBucketRequest(PartitionedRegionDataStore.java:1014)
>         at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketOnMember(PRHARedundancyProvider.java:1233)
>         at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketInstance(PRHARedundancyProvider.java:416)
>         at org.apache.geode.internal.cache.PRHARedundancyProvider.createBucketAtomically(PRHARedundancyProvider.java:604)
>         at org.apache.geode.internal.cache.PartitionedRegion.createBucket(PartitionedRegion.java:3310)
>         at org.apache.geode.internal.cache.PartitionedRegion.virtualPut(PartitionedRegion.java:2055)
>         at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
>         at org.apache.geode.internal.cache.PartitionedRegion.performPutAllEntry(PartitionedRegion.java:2124)
>         at org.apache.geode.internal.cache.LocalRegion.basicEntryPutAll(LocalRegion.java:10060)
>         at org.apache.geode.internal.cache.LocalRegion.access$100(LocalRegion.java:231)
>         at org.apache.geode.internal.cache.LocalRegion$2.run(LocalRegion.java:9639)
>         at org.apache.geode.internal.cache.event.NonDistributedEventTracker.syncBulkOp(NonDistributedEventTracker.java:107)
>         at org.apache.geode.internal.cache.LocalRegion.syncBulkOp(LocalRegion.java:6085)
>         at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9657)
>         at org.apache.geode.internal.cache.LocalRegion.basicBridgePutAll(LocalRegion.java:9367)
>         at org.apache.geode.internal.cache.tier.sockets.command.PutAll80.cmdExecute(PutAll80.java:270)
>         at org.apache.geode.internal.cache.tier.sockets.BaseCommand.execute(BaseCommand.java:178)
>         at org.apache.geode.internal.cache.tier.sockets.ServerConnection.doNormalMessage(ServerConnection.java:844)
>         at org.apache.geode.internal.cache.tier.sockets.OriginalServerConnection.doOneMessage(OriginalServerConnection.java:74)
>         at org.apache.geode.internal.cache.tier.sockets.ServerConnection.run(ServerConnection.java:1214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.geode.internal.cache.tier.sockets.AcceptorImpl.lambda$initializeServerConnectionThreadPool$3(AcceptorImpl.java:594)
>         at org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
>         at java.lang.Thread.run(Thread.java:748)
> ...
>  
> When I check the cluster, 4 of the 6 servers have disappeared; when I check their logs I see this:
>  
> ...
> [info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure Detection Scheduler1> tid=0x1b] Failure detection is now watching 10.200.4.103(nx103b-srv:21892)<v4>:41001
>  
> [info 2019/01/23 01:28:35.326 UTC nx102a-srv <Geode Failure Detection Scheduler1> tid=0x1b] Failure detection is now watching 10.200.2.102(nx102a-srv:23781)<v4>:41001
>  
> [info 2019/01/23 01:28:35.326 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] Membership received a request to remove 10.200.2.102(nx102a-srv:23781)<v4>:41001 from 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000 reason=Member isn't responding to heartbeat requests
>  
> [severe 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
>         at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>         at org.jgroups.JChannel.up(JChannel.java:741)
>         at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>         at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>         at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>         at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
>         at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
>         at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>         at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
>         at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
>         at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>         at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
>         at org.jgroups.protocols.TP.receive(TP.java:1714)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
>         at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>         at java.lang.Thread.run(Thread.java:748)
>  
> [info 2019/01/23 01:28:35.327 UTC nx102a-srv <unicast receiver,nx102a-34886> tid=0x1a] CacheServer configuration saved
>  
> [info 2019/01/23 01:28:35.341 UTC nx102a-srv <DisconnectThread> tid=0x1cd] Stopping membership services
>  
> [info 2019/01/23 01:28:35.341 UTC nx102a-srv <DisconnectThread> tid=0x1cd] GMSHealthMonitor server socket is closed in stopServices().
>  
> [info 2019/01/23 01:28:35.342 UTC nx102a-srv <Geode Failure Detection thread 126> tid=0x1ca] Failure detection is now watching 10.200.2.102(nx102a:23617:locator)<ec><v0>:41000
>  
> [info 2019/01/23 01:28:35.343 UTC nx102a-srv <Geode Failure Detection Server thread 1> tid=0x1e] GMSHealthMonitor server thread exiting
>  
> [info 2019/01/23 01:28:35.344 UTC nx102a-srv <DisconnectThread> tid=0x1cd] GMSHealthMonitor serverSocketExecutor is terminated
>  
> [info 2019/01/23 01:28:35.347 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Disconnecting old DistributedSystem to prepare for a reconnect attempt
>  
> [info 2019/01/23 01:28:35.351 UTC nx102a-srv <ReconnectThread> tid=0x1cd] GemFireCache[id = 82825098; isClosing = true; isShutDownAll = false; created = Tue Jan 22 18:22:30 UTC 2019; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
>  
> [info 2019/01/23 01:28:35.352 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Cache server on port 40404 is shutting down.
>  
> [severe 2019/01/23 01:28:45.005 UTC nx102a-srv <EvictorThread8> tid=0x92] Uncaught exception in thread Thread[EvictorThread8,10,main]
> org.apache.geode.distributed.DistributedSystemDisconnectedException: Distribution manager on 10.200.2.102(nx102a-srv:23781)<v4>:41001 started at Tue Jan 22 18:22:30 UTC 2019: Member isn't responding to heartbeat requests, caused by org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.ClusterDistributionManager$Stopper.generateCancelledException(ClusterDistributionManager.java:3926)
>         at org.apache.geode.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:966)
>         at org.apache.geode.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1547)
>         at org.apache.geode.CancelCriterion.checkCancelInProgress(CancelCriterion.java:83)
>         at org.apache.geode.internal.cache.GemFireCacheImpl.getInternalResourceManager(GemFireCacheImpl.java:4330)
>         at org.apache.geode.internal.cache.GemFireCacheImpl.getResourceManager(GemFireCacheImpl.java:4319)
>         at org.apache.geode.internal.cache.eviction.HeapEvictor.getAllRegionList(HeapEvictor.java:138)
>         at org.apache.geode.internal.cache.eviction.HeapEvictor.getAllSortedRegionList(HeapEvictor.java:171)
>         at org.apache.geode.internal.cache.eviction.HeapEvictor.createAndSubmitWeightedRegionEvictionTasks(HeapEvictor.java:215)
>         at org.apache.geode.internal.cache.eviction.HeapEvictor.access$200(HeapEvictor.java:53)
>         at org.apache.geode.internal.cache.eviction.HeapEvictor$1.run(HeapEvictor.java:357)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>         at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2503)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1049)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:654)
>         at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1810)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1301)
>         at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>         at org.jgroups.JChannel.up(JChannel.java:741)
>         at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>         at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>         at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>         at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
>         at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
>         at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>         at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
>         at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
>         at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>         at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
>         at org.jgroups.protocols.TP.receive(TP.java:1714)
>         at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
>         at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>         ... 1 more
>  
> [info 2019/01/23 01:30:05.874 UTC nx102a-srv <ReconnectThread> tid=0x1cd] Shutting down DistributionManager 10.200.2.102(nx102a-srv:23781)<v4>:41001. At least one Exception occurred.
> ...
>  
> Any idea ?
>  
> Best regards,
> 
>  
>  
> From: Philippe CEROU
> Sent: Wednesday, January 23, 2019 08:19
> To: user@geode.apache.org
> Subject: RE: Multi-threaded Java client exception.
>  
> Hi,
>  
> Thanks to Anthony for his help; I modified my code so that a single cache & region are shared between threads.
>  
> To get this running well I had to use synchronized blocks to handle cache connection and close, and a dedicated region creation/sharing function, as follows (hope it can help someone):
>  
> ...
> public class CxDrvGEODE extends CxObjNOSQL {
>        static ClientCache oCache = null;
>        static final String oSync = "x";
>        static boolean IsConnected = false;
>        static int nbThreads = 0;
>        static final HashMap<String, Region<Long, CxDrvGEODERow>> hmRegions = new HashMap<String, Region<Long, CxDrvGEODERow>>();
>  
> ...
>  
>        public void DoConnect() {
>              synchronized (oSync) {
>                     if (!IsConnected) {
>                            ReflectionBasedAutoSerializer oRBAS = new ReflectionBasedAutoSerializer("com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
>                            oCache = new ClientCacheFactory().addPoolLocator(this.GetNode(), Integer.valueOf(this.Port)).set("log-level", "WARN").setPdxSerializer(oRBAS).create();
>                            IsConnected = true;
>                     }
>                     nbThreads++;
>              }
>        }
>  
>        public void DoClose() {
>              synchronized (oSync) {
>                     if (nbThreads > 0) {
>                            if (nbThreads == 1 && oCache != null) {
>                                   // Close every cached region first, then clear the map
>                                   // (removing from keySet() while iterating it risks ConcurrentModificationException).
>                                   for (Region<Long, CxDrvGEODERow> oRegion : hmRegions.values()) {
>                                         oRegion.close();
>                                   }
>                                   hmRegions.clear();
>                                   oCache.close();
>                                   oCache = null;
>                                   IsConnected = false;
>                            }
>                            nbThreads--;
>                     }
>              }
>        }
>  
> ...
>  
>        private Region<Long, CxDrvGEODERow> getCache(String CTable) {
>              synchronized (oSync) {
>                     Region<Long, CxDrvGEODERow> oRegion = null;
>                     if (hmRegions.containsKey(CTable)) {
>                            oRegion = hmRegions.get(CTable);
>                     } else {
>                            oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(ClientRegionShortcut.PROXY).create(this.Base + '-'+ CTable);
>                            hmRegions.put(CTable, oRegion);
>                     }
>                     return oRegion;
>              }
>        }
>  
> ...
>  
>        public boolean TableInsert(String CTable, String[][] TColumns, Object[][] TOValues,boolean BCommit, boolean BForceBlocMode) {
> ...
>              Region<Long, CxDrvGEODERow> oRegion = getCache(CTable);
> ...
>        }
>  
> ...
>  
> }
>  
>  
>  
> Best regards,
> 
>  
>  
> From: Anthony Baker [mailto:abaker@pivotal.io]
> Sent: Tuesday, January 22, 2019 16:59
> To: user@geode.apache.org
> Subject: Re: Multi-threaded Java client exception.
>  
> You only need one ClientCache in each JVM.  You can create the cache and region once and then pass it to each worker thread.
>  
> Anthony
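> 
> A minimal sketch of that pattern (class name, key and value types are simplified placeholders, not the code from this thread; only the region name and locator host come from the earlier messages):
> 
> import org.apache.geode.cache.Region;
> import org.apache.geode.cache.client.ClientCache;
> import org.apache.geode.cache.client.ClientCacheFactory;
> import org.apache.geode.cache.client.ClientRegionShortcut;
> 
> public class SharedCacheExample {
>        public static void main(String[] args) throws InterruptedException {
>              // One ClientCache per JVM, created once.
>              ClientCache oCache = new ClientCacheFactory().addPoolLocator("nx101c", 1234).create();
>              // One PROXY region handle, shared by every worker thread.
>              Region<Long, String> oRegion = oCache.<Long, String>createClientRegionFactory(ClientRegionShortcut.PROXY).create("ksdata-benchmark");
> 
>              Runnable oWorker = () -> oRegion.put(Thread.currentThread().getId(), "row");
>              Thread oT1 = new Thread(oWorker);
>              Thread oT2 = new Thread(oWorker);
>              oT1.start();
>              oT2.start();
>              oT1.join();
>              oT2.join();
> 
>              // Close once, after all workers are done.
>              oCache.close();
>        }
> }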
>  
> 
> On Jan 22, 2019, at 1:25 AM, Philippe CEROU <philippe.cerou@gfi.fr <ma...@gfi.fr>> wrote:
>  
> Hi,
>  
> We are trying to tune a single program that uses multi-threading for the storage interface.
>  
> The problem we have is that if we launch this code with THREADS=1 everything runs well; starting with THREADS=2 we always get this exception when we create the connection (the highlighted line, i.e. the ClientCacheFactory.create() call below).
>  
> Exception in thread "main" java.lang.IllegalStateException: Existing cache's default pool was not compatible
>         at org.apache.geode.internal.cache.GemFireCacheImpl.validatePoolFactory(GemFireCacheImpl.java:2933)
>         at org.apache.geode.cache.client.ClientCacheFactory.basicCreate(ClientCacheFactory.java:252)
>         at org.apache.geode.cache.client.ClientCacheFactory.create(ClientCacheFactory.java:213)
>         at com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODE.DoConnect(CxDrvGEODE.java:35)
>         at com.gfi.rt.lib.database.connectors.CxObj.DoConnect(CxObj.java:121)
>         at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:91)
>         at com.gfi.rt.lib.database.connectors.CxInterface.Connect(CxInterface.java:149)
>        ...
>  
> Here is the data interface code.
>  
> package com.gfi.rt.lib.database.connectors.nosql;
>  
> import java.util.HashMap;
> import org.apache.geode.cache.Region;
> import org.apache.geode.cache.client.ClientCache;
> import org.apache.geode.cache.client.ClientCacheFactory;
> import org.apache.geode.cache.client.ClientRegionShortcut;
> import org.apache.geode.pdx.ReflectionBasedAutoSerializer;
>  
> public class CxDrvGEODEThread extends CxObjNOSQL {
>        ClientCache oCache = null;
>        boolean IsConnected = false;
>  
>        // Connection
>  
>        public CxDrvGEODEThread() {
>              Connector = "geode-native";
>        }
>  
>        public void DoConnect() {
>              if (!IsConnected) {
>                     ReflectionBasedAutoSerializer rbas = new ReflectionBasedAutoSerializer("com.gfi.rt.lib.database.connectors.nosql.CxDrvGEODERow");
>                     oCache = new ClientCacheFactory().addPoolLocator(this.GetNode(), Integer.valueOf(this.Port)).set("log-level", "WARN").setPdxSerializer(rbas).create();
>                     IsConnected = true;
>              }
>        }
>  
>        // Close the current database connection
>  
>        public void DoClose() {
>              if (oCache != null) {
>                     oCache.close();
>                     oCache = null;
>                     IsConnected = false;
>              }
>        }
>  
>        // Data insertion
>  
>        public boolean TableInsert(String CTable, String[][] TColumns, Object[][] TOValues,boolean BCommit, boolean BForceBlocMode) {
>              boolean BResult = false;
>              final HashMap<Long, CxDrvGEODERow> mrows = new HashMap<Long, CxDrvGEODERow>();
>  
>              ...
>  
>              if (!mrows.isEmpty()) {
>                            Region<Long, CxDrvGEODERow> oRegion = oCache.<Long, CxDrvGEODERow>createClientRegionFactory(ClientRegionShortcut.PROXY).create(this.Base + '-' +CTable);
>                     if (oRegion != null) {
>                            oRegion.putAll(mrows);
>                            oRegion.close();
>                     }
>              }
>              mrows.clear();
>              BResult = true;
>              return BResult;
>        }
> }
>  
> Note that the cluster, disk stores, regions and indexes are pre-created in Geode.
>  
> Every thread is isolated, creates its own CxDrvGEODEThread instance and does « DoConnect » -> N x « TableInsert » -> « DoClose ».
>  
> Here is a master thread class call example:
>  
>         private DataThread launchOneDataThread(long LNbProcess, long LNbLines, int LBatchSize, long LProcessID, String BenchId) {
>                 final CxObj POCX = CXI.Connect(CXO.Connector, CXO.Server, CXO.Port, CXO.Base, CXO.User, CXO.Password, BTrace);
>                 final DataThread BT = new DataThread(LNbProcess, LNbLines, LBatchSize, LProcessID, new DataObject(POCX, BenchId,CBenchParams, CTable));
>                 new Thread(new Runnable() {
>                         @Override
>                         public void run() {
>                                 BT.start();
>                         }
>                 }).start();
>                 return BT;
>         }
>  
> I’m sure we are doing something really bad. Any idea?
>  
> Best regards,
> 