Posted to user@geode.apache.org by Roger Vandusen <ro...@ticketmaster.com> on 2017/03/09 20:55:32 UTC

Unknown Pdx Type use case found, bug or expected?

Hey Geode,

We have a 3-node server cluster running with PDX read-serialized, disk store persistence for all regions, and replication-factor=2.

We do not use cluster configuration; we use these property overrides:

#configuration settings used
enable-cluster-configuration=false
use-cluster-configuration=false
cache-xml-file=geode-cache.xml

#property default overrides
distributed-system-id=1
log-level=config
enforce-unique-host=true
locator-wait-time=60
conserve-sockets=false
log-file-size-limit=64
mcast-port=0

We use these stop/start scripts:

STOP:

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "stop server --name=$SERVER_NAME"

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "stop locator --name=$LOCATOR_NAME"

START:

gfsh start locator \
  --properties-file=$CONF_DIR/geode.properties \
  --name=$LOCATOR_NAME \
  --port=$LOCATOR_PORT \
  --log-level=config \
  --include-system-classpath=true \
  --classpath=$CLASSPATH \
  --enable-cluster-configuration=false \
  --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
  --J=-Dgemfire.jmx-manager=true \
  --J=-Dgemfire.jmx-manager-start=true \
  --J=-Xms512m \
  --J=-Xmx512m

gfsh start server \
  --properties-file=$CONF_DIR/geode.properties \
  --cache-xml-file=$CONF_DIR/geode-cache.xml \
  --name=$SERVER_NAME \
  --server-port=$SERVER_PORT \
  --include-system-classpath=true \
  --classpath=$CLASSPATH \
  --start-rest-api=true \
  --use-cluster-configuration=false \
  --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
  --J=-Dgemfire.disk.recoverValues=false \
  --J=-Dgemfire.jmx-manager=false \
  --J=-Dgemfire.jmx-manager-start=false \
  --J=-Xms6g \
  --J=-Xmx6g


There were active proxy clients (1.0.0-incubating/GFE 9.0) connected while we proceeded to upgrade the Geode version from 1.0.0-incubating to 1.1.0.

We did a scripted rolling Geode version upgrade by serially stopping, redeploying, and restarting each server node.
We hit the issue below, which we have seen before and still find difficult to solve:
‘Region /xxxx has potentially stale data. It is waiting for another member to recover the latest data.’
The first server (node1) hung on restart, blocking our serial rolling redeployment.
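
For anyone reproducing this, one way to see which persistent members the hung region is waiting on is gfsh's show missing-disk-stores command; a minimal sketch, reusing the $HOSTNAME/$LOCATOR_PORT variables from the scripts above (it assumes the surviving locator is still reachable):

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "show missing-disk-stores"

The output lists the disk store IDs, hosts, and directories the waiting member expects to find; if one of them is permanently gone it can be revoked (see further down the thread).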

After failing (again) to resolve this serial rolling update problem, we decided to delete all the data (currently just cached lookup tables and dev WIP/POC data),
redeploy the new Geode version, and restart from scratch: we deleted all the diskstores (including the pdx disk store) and restarted the cluster.

REMINDER: the clients were all still connected and were not restarted (see the link below for our awareness now of this CLIENT-SIDE error state).
These clients then put data into the server cluster; the puts succeeded, and the server regions show they have the data.

BUT a gfsh query of this server region data now fails with ‘Unknown pdx type’, and restarting the clients fails on connecting to these regions with the same error.
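
The failing read is just a plain OQL query through gfsh, along these lines (the region name /xxxx is the placeholder used above, and the limit is only there to keep the output small):

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "query --query='select * from /xxxx limit 10'"

Any entry whose PDX type id is no longer in the (now wiped) type registry fails to deserialize with ‘Unknown pdx type’.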

We are seeking GEODE-USER feedback regarding:


1)       We need to find a working enterprise deployment solution for the rolling restart problem, where stale-data alerts block cluster config/version updates.

2)       We don’t believe the problem we saw was related to the version upgrade itself. Is that right?

3)       We find it very concerning that connected clients can CORRUPT SERVER-SIDE region data, and that their puts don’t update the PDX registry and disk store.

Wouldn’t a FAIL of the client-side proxy region.put make more sense?

Why didn’t the PDX types cached on the client get registered and written back to the servers’ diskstores?

The client puts DID write data into the server regions, but that data is now corrupted and unreadable as ‘Unknown pdx type’.

That is a major issue, even though we acknowledge that we would NOT be deleting active diskstores from running clusters in production, assuming we can solve the rolling updates problem.

We are now aware of this CLIENT-SIDE error state and can see how it might be related to our redeployment use case above, but we now have corrupted SERVER-SIDE data written in the server regions:
https://discuss.pivotal.io/hc/en-us/articles/206357497-IllegalStateException-Unknown-PDX-Type-on-Client-Side


-Roger


Re: Unknown Pdx Type use case found, bug or expected?

Posted by Anilkumar Gingade <ag...@pivotal.io>.
Hi Roger,

Thanks for the info.

During a rolling upgrade, since all the other nodes are live except the node
being restarted, there should not be a disk-store dependency... The restarted
node should get the latest data (initial image) from the other nodes in the
cluster... (does anyone have comments on this?)

>>We had this issue below, which we’ve seen before and still find difficult
to solve:

>> ‘Region /xxxx has potentially stale data. It is waiting for another
member to recover the latest data.’

-Anil.




Re: Unknown Pdx Type use case found, bug or expected?

Posted by Roger Vandusen <ro...@ticketmaster.com>.
Anil,

We stop, deploy updates, and restart each node (one locator and one server per node) serially, one at a time, and each node has its own local diskstores.
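
In outline, the per-node roll is roughly the following (the host and script names are placeholders; the real commands are the STOP/START gfsh snippets quoted earlier in the thread):

for NODE in node1 node2 node3; do
  ssh $NODE ./stop-node.sh     # gfsh: stop server, then stop locator
  ssh $NODE ./deploy-geode.sh  # lay down the new Geode version on that host
  ssh $NODE ./start-node.sh    # gfsh: start locator, then start server
  # move to the next node only after this member has rejoined and finished recovery
done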

-Roger


Re: Unknown Pdx Type use case found, bug or expected?

Posted by Anilkumar Gingade <ag...@pivotal.io>.
Roger,

If you are bringing down locators and servers one after the other, and making
sure the restarted node is part of the cluster before bringing down the next,
you should not see these issues...
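
A sketch of one way to gate the roll on that condition, reusing the $HOSTNAME, $LOCATOR_PORT, and $SERVER_NAME variables from the scripts earlier in the thread (the polling loop is only illustrative):

until gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
           -e "list members" | grep -q "$SERVER_NAME"; do
  sleep 5  # wait until the restarted member shows up before touching the next node
done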

What is your rolling upgrade procedure... do you use a common disk-store
(from one server) to restart the other nodes?

-Anil.









Re: Unknown Pdx Type use case found, bug or expected?

Posted by Roger Vandusen <ro...@ticketmaster.com>.
Hi Hitesh, thanks for the reply.

I’ll take a look at your links.

Yes, we did try to revoke the disk-stores manually with gfsh, but this isn’t manageable going into production.
I can’t recall the details of the revoke outcome, but it did not solve our problem. I think the disk-store we revoked was the pdx disk store, which would still potentially lead to ‘unknown pdx type’, right?
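
For the record, the manual steps were along these lines (the id value is a placeholder taken from the show missing-disk-stores output):

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "show missing-disk-stores"

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "revoke missing-disk-store --id=<missing-disk-store-id>"

Revoking tells the waiting members to stop waiting for that disk store and proceed with the most recent data they still have.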

Our main concern, in our scenario, was the corruption of server-side data by the client puts (unknown pdx types that were never registered or persisted).

-Roger


Re: Unknown Pdx Type use case found, bug or expected?

Posted by Hitesh Khamesra <hi...@yahoo.com.INVALID>.
Hi Roger:

Sorry to hear about this. There is a system property on the client side to clean the pdx-registry when the client disconnects from the server. You can find details here: https://discuss.pivotal.io/hc/en-us/articles/221351508-Getting-Stale-PDXType-error-on-client-after-clean-start-up-of-servers.

I think we should clean the pdx-registry when the client disconnects. I will file a ticket to track this issue.

For the disk issue, here are some guidelines: https://discuss.pivotal.io/hc/en-us/community/posts/208792347-Region-regionA-has-potentially-stale-data-It-is-waiting-for-another-member-to-recover-the-latest-data-.

Did you try to revoke the disk store?

Thanks.
hitesh


Re: Unknown Pdx Type use case found, bug or expected?

Posted by Udo Kohlmeyer <uk...@pivotal.io>.
Does anyone know why we introduced the property
ON_DISCONNECT_CLEAR_PDXTYPEIDS for the client and then made it FALSE by
default?

Surely we'd actually want to reset the pdx registry when a client 
disconnects from the servers.... GEODE-1037
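
Assuming the flag behaves the way GEODE-1037 and the KB article describe, a client would opt in with a JVM system property at startup, roughly like this (the main class, the classpath variable, and the gemfire. prefix on the property are assumptions for illustration, not verified here):

java -Dgemfire.ON_DISCONNECT_CLEAR_PDXTYPEIDS=true \
     -cp $CLIENT_CLASSPATH com.example.MyGeodeClient

With that set, the client should drop its cached PDX type ids when its pool disconnects, instead of reusing ids that freshly wiped servers no longer know about.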

