Posted to user@geode.apache.org by "evaristo.camarero@yahoo.es" <ev...@yahoo.es> on 2019/10/22 20:23:48 UTC

RECOVERING A WAN INSTALLATION

Hi there,

We are planning to use an installation with 2 Geode clusters connected via WAN, using gateway senders/receivers to keep them updated. The main reason is resiliency against disasters in a data center.

It is not clear to us how to recover a data center in case of disaster. This is the use case:

- One of the data centers has a problem (natural catastrophe)

- The other data center keeps running traffic and filling the gateway sender queues, which need to be stopped at some point to avoid filling up the disk resources.

At some point in time, the data center is ready to start recovery, which will require synchronizing the Geode copy. The procedure should be something like:

- Drain gateway service queues in the copy providing service

- Start gateway senders

- Make a copy

- Transfer the copy to the data center that will be recovered

- Import the copy

- Allow the data center to catch up via replication

- Start the copy again.

Does it make sense? Or is there a better way to do it? In case the answer is yes, is there any way to drain the gateway senders' queues (both for parallel and serial GWs)?

Thanks in advance,

/Evaristo

Re: RECOVERING A WAN INSTALLATION

Posted by Barry Oglesby <bo...@pivotal.io>.

> - When having multiple regions, the processes should be repeated for each
> region, and senders should be started when all regions "have finished" to
> ingest data and consume events, right?

Yes. I haven't tried it, but you could have multiple concurrent CQs running
- one for each region. When the last one has processed the initial result
set, then run the rest of the steps. This will minimize the duplicates. If
you want to minimize the heap requirement of the client, then serial CQs
would be better.
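
The "when the last one has processed the initial result set" condition can be sketched in plain Java. This is only a coordination sketch under assumptions: it does not call the Geode API, copyInitialResults is a hypothetical stand-in for executing one region's CQ with initial results and putting them into site2, and the region names are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class RegionMigrationCoordinator {

    // Hypothetical stand-in for "execute the CQ with initial results for one
    // region and put them into site2". Returns the number of entries copied.
    static int copyInitialResults(String regionName) {
        return regionName.length(); // placeholder work
    }

    // Runs one copy task per region on a pool of the given size and blocks
    // until every region has finished. parallelism = 1 gives the serial
    // variant, where only one copy task runs at a time.
    static Map<String, Integer> migrateAll(List<String> regions, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        Map<String, Integer> copied = new ConcurrentHashMap<>();
        try {
            // allOf(...).join() is the "last one has finished" barrier.
            CompletableFuture.allOf(regions.stream()
                    .map(r -> CompletableFuture.runAsync(
                            () -> copied.put(r, copyInitialResults(r)), pool))
                    .toArray(CompletableFuture[]::new)).join();
        } finally {
            pool.shutdown();
        }
        return copied;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = migrateAll(List.of("customers", "orders"), 2);
        // Only after this point would the remaining steps (restarting the
        // senders) be run.
        System.out.println("Regions migrated: " + result.keySet());
    }
}
```

Calling migrateAll with parallelism 1 corresponds to the serial CQs mentioned above, trading a longer migration for a smaller client heap.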

> - I am not sure if this approach scales well when clusters are big.

You're definitely right about that. The client has to hold the entire
initial result set with this approach. You don't say how much of your data
is off-heap, but that is potentially a fatal flaw for this approach.

Another idea you can think about is using a function instead of a CQ. You
can use an onRegion function to stream the values back to the caller. The
function would iterate the entries and batch them back to the client using
ResultSender.sendResult (and lastResult for the final batch). A
ResultCollector on the client would process each batch in addResult and do
the same thing the CqListener is doing.
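
A minimal sketch of the batching idea, in plain Java rather than against the Geode API. Assumptions: the source region is modeled as a Map, the Consumer plays the role of the result sender on the server and of addResult on the client, and the batch size of 2 is arbitrary.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

final class BatchedRegionCopy {

    // Server side of the sketch: walk the entries and hand them to the sender
    // in batches of at most batchSize, so the caller never has to hold the
    // whole region at once. Returns the number of batches sent.
    static <K, V> int streamInBatches(Map<K, V> source, int batchSize,
                                      Consumer<Map<K, V>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        Map<K, V> batch = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : source.entrySet()) {
            batch.put(e.getKey(), e.getValue());
            if (batch.size() == batchSize) {
                sender.accept(batch);
                batches++;
                batch = new LinkedHashMap<>(); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) { // final, possibly smaller, batch
            sender.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Map<String, String> site1 = new LinkedHashMap<>();
        for (int i = 0; i < 5; i++) {
            site1.put("key" + i, "value" + i);
        }
        Map<String, String> site2 = new LinkedHashMap<>();
        // Client side: each received batch is applied to site2 with putAll.
        int batches = streamInBatches(site1, 2, site2::putAll);
        System.out.println(batches + " batches, site2 size " + site2.size());
        // prints "3 batches, site2 size 5"
    }
}
```

With this shape the client only ever holds one batch, which is the point of the function approach compared with holding the entire CQ initial result set.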

Thanks,
Barry Oglesby



On Mon, Oct 28, 2019 at 1:32 AM evaristo.camarero@yahoo.es <
evaristo.camarero@yahoo.es> wrote:

> Thanks a lot Barry and others,
>
>  You are right some info is missing in my previous mails. Let me detail a
> bit further:
>
>  - We have both PARTITIONED AND REPLICATION regions (both of them are
> PERSISTENT) -> We have both parallel and serial senders with overflow to
> disk
>
> - We are using off-heap
>
> - We are using PDX serialization
>
> - Our default setup is an active / active setup with some data stickiness
> (In normal circumstances the same data is handled always in the same Geode
> instance). An instance is taking all the traffic under network split
> between WAN instances or failure of a Geode cluster.
>
> - We have custom conflict resolution, to minimize consistency issues.
>
>
> I was checking your proposal and I have several comments / questions, so
> any feedback is really appreciated:
>
> - When having multiple regions, the processes should be repeated for each
> region, and senders should be started when all regions "have finished" to
> ingest data and consume events, right?
>
> - I am not sure if this approach scales well when clusters are big. We
> were thinking more on an export data / transfer / import data approach. I
> am not 100 % sure what is best. We will do some testing and we can find the
> best option. Your approach has the benefit that time in which events are
> duplicated is much more reduced and I think that could avoid potential
> consistency issues.
>
>
>
> Thanks,
>
> /Evaristo

Re: RECOVERING A WAN INSTALLATION

Posted by "evaristo.camarero@yahoo.es" <ev...@yahoo.es>.

Thanks a lot Barry and others,

You are right, some info was missing in my previous mails. Let me give some more detail:

- We have both PARTITIONED and REPLICATED regions (both of them are PERSISTENT) -> We have both parallel and serial senders with overflow to disk

- We are using off-heap

- We are using PDX serialization

- Our default setup is active/active with some data stickiness (in normal circumstances the same data is always handled in the same Geode instance). One instance takes all the traffic in case of a network split between the WAN instances or failure of a Geode cluster.

- We have custom conflict resolution to minimize consistency issues.

I was checking your proposal and I have several comments / questions, so any feedback is really appreciated:

- When having multiple regions, the process should be repeated for each region, and senders should be started when all regions have "finished" ingesting data and consuming events, right?

- I am not sure if this approach scales well when clusters are big. We were thinking more of an export data / transfer / import data approach. I am not 100% sure what is best. We will do some testing to find the best option. Your approach has the benefit that the time during which events are duplicated is much shorter, and I think that could avoid potential consistency issues.

Thanks,

/Evaristo

On Thursday, October 24, 2019 at 23:47:03 CEST, Barry Oglesby <bo...@pivotal.io> wrote:

You could look into a blue-green-type strategy to re-populate the second WAN site.

This idea uses a Geode durable client that is connected to both sites. It connects to site1 using CQ and site2 using a proxy region. It basically takes initial results and events from site1 and puts them into site2.

If you're in the state where site1 is up and site2 is down, then here are steps:

1. Stop gateway sender in site1 so that no events are queued for site2. You can use gfsh stop gateway-sender to do this.

2. Restart locators and servers in site2

3. Stop gateway sender in site2 so that events from the durable client are not sent back to site1. You can use gfsh stop gateway-sender to do this.

At this point, the two sites are not really connected by the WAN.

4. Start durable client (set durable-client-id=migration-client)

This:
- creates a CQ connected to site1
- executes the CQ with initial results
- adds those results to site2 using the proxy region
- sends ready for events which starts the events flowing to the MigrationListener. Events received by the MigrationListener are added to site2 using the proxy region.

When steady state is achieved (meaning all the initial results are processed and only the MigrationListener is processing events):

5. Restart gateway sender in site1

6. Stop durable client

After you restart the gateway sender in site1 but before you stop the durable client, both will be sending events to site2. This will result in duplicate events in site2, so the shorter the time between these actions, the fewer duplicate events.

7. After the durable client has been stopped, restart the gateway sender in site2.

Notes / Caveats:

I attached the MigrationClient, MigrationListener and configuration files.

If you're using PDX serialization, you might have to work around JIRA GEODE-6271:

https://issues.apache.org/jira/browse/GEODE-6271

The MigrationClient does this in registerPdxTypesOnAllPools. If you're not using PDX serialization, you can remove this code.

You don't mention if any of your entities are persistent.

If your PdxTypes are persistent in site2, you won't need to work around JIRA GEODE-6271

If your senders are persistent, you may need to delete the disk files before restarting the senders.

Thanks,
Barry Oglesby


Re: RECOVERING A WAN INSTALLATION

Posted by Barry Oglesby <bo...@pivotal.io>.

You could look into a blue-green-type strategy to re-populate the second
WAN site.

This idea uses a Geode durable client that is connected to both sites. It
connects to site1 using CQ and site2 using a proxy region. It basically
takes initial results and events from site1 and puts them into site2.

If you're in the state where site1 is up and site2 is down, then here are
steps:

1. Stop gateway sender in site1 so that no events are queued for site2. You
can use gfsh stop gateway-sender to do this.

2. Restart locators and servers in site2

3. Stop gateway sender in site2 so that events from the durable client are
not sent back to site1. You can use gfsh stop gateway-sender to do this.

At this point, the two sites are not really connected by the WAN.

4. Start durable client (set durable-client-id=migration-client)

This:
- creates a CQ connected to site1
- executes the CQ with initial results
- adds those results to site2 using the proxy region
- sends ready for events which starts the events flowing to the
MigrationListener. Events received by the MigrationListener are added to
site2 using the proxy region.
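
The ordering in step 4 can be modeled in plain Java. This is a toy model, not the attached MigrationClient: it assumes events for the durable client are held until it signals that it is ready, the held queue is kept locally here purely for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

final class MigrationFlowSketch {
    // Stand-in for the site2 proxy region.
    final Map<String, String> site2 = new LinkedHashMap<>();
    // Events that arrive before the client has signalled that it is ready.
    private final Queue<Map.Entry<String, String>> held = new ArrayDeque<>();
    private boolean ready = false;

    // Called for every CQ event coming from site1.
    void onEvent(String key, String value) {
        if (ready) {
            site2.put(key, value);
        } else {
            held.add(Map.entry(key, value));
        }
    }

    // Applies the CQ's initial result set to site2.
    void applyInitialResults(Map<String, String> initialResults) {
        site2.putAll(initialResults);
    }

    // Models "sends ready for events": replay held events in arrival order,
    // after which later events are applied directly.
    void readyForEvents() {
        ready = true;
        Map.Entry<String, String> e;
        while ((e = held.poll()) != null) {
            site2.put(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        MigrationFlowSketch client = new MigrationFlowSketch();
        client.onEvent("k1", "updated"); // update arrives during the copy
        client.applyInitialResults(Map.of("k1", "old", "k2", "v2"));
        client.readyForEvents();
        System.out.println(client.site2.get("k1")); // prints "updated"
    }
}
```

Because held events are replayed only after the initial results have been applied, an update made in site1 during the copy is not overwritten by the older value from the initial result set.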

When steady state is achieved (meaning all the initial results are
processed and only the MigrationListener is processing events):

5. Restart gateway sender in site1

6. Stop durable client

After you restart the gateway sender in site1 but before you stop the
durable client, both will be sending events to site2. This will result in
duplicate events in site2, so the shorter the time between these actions,
the fewer duplicate events.

7. After the durable client has been stopped, restart the gateway sender in
site2.
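
The seven steps above can be written down as an ordered runbook. This is just a sketch that prints the plan; it does not execute anything. The sender ids are hypothetical and must match your own configuration; stop gateway-sender and start gateway-sender with --id are the gfsh commands referred to in the steps.

```java
import java.util.List;

final class Site2RecoveryRunbook {

    // Returns the seven steps in the order given above. Each entry names
    // where the action runs (site1, site2 or the migration client).
    static List<String> plan(String site1SenderId, String site2SenderId) {
        return List.of(
            "site1: gfsh stop gateway-sender --id=" + site1SenderId,
            "site2: restart locators and servers",
            "site2: gfsh stop gateway-sender --id=" + site2SenderId,
            "client: start durable client (durable-client-id=migration-client)",
            // Wait here until only the MigrationListener is processing events.
            "site1: gfsh start gateway-sender --id=" + site1SenderId,
            "client: stop durable client",
            "site2: gfsh start gateway-sender --id=" + site2SenderId);
    }

    public static void main(String[] args) {
        List<String> steps = plan("sender-to-site2", "sender-to-site1");
        for (int i = 0; i < steps.size(); i++) {
            System.out.println((i + 1) + ". " + steps.get(i));
        }
    }
}
```

Keeping steps 5 and 6 adjacent in the plan reflects the point about duplicates: the window in which both the site1 sender and the durable client deliver to site2 should be as short as possible.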

Notes / Caveats:

I attached the MigrationClient, MigrationListener and configuration files.

If you're using PDX serialization, you might have to work around JIRA
GEODE-6271:

https://issues.apache.org/jira/browse/GEODE-6271

The MigrationClient does this in registerPdxTypesOnAllPools. If you're not
using PDX serialization, you can remove this code.

You don't mention if any of your entities are persistent.

If your PdxTypes are persistent in site2, you won't need to work around
JIRA GEODE-6271

If your senders are persistent, you may need to delete the disk files
before restarting the senders.

Thanks,
Barry Oglesby

On Wed, Oct 23, 2019 at 10:36 PM evaristo.camarero@yahoo.es <
evaristo.camarero@yahoo.es> wrote:

> Thanks a lot. We will try this.
>
> On Wed, Oct 23, 2019 at 23:35, Jason Huynh <jh...@pivotal.io> wrote:
>
> Hi Evaristo,
>
> I spoke with another committer, Anil, and from what we understand, the
> process you describe would work.  I am not sure if this is the
> recommended way to do a restart, but we believe the steps outlined would
> get the intended outcome.
>
> To clear a serial gateway sender, I believe stopping the sender will
> clear its queue. However, for a parallel gateway sender I think the
> parallel queue gets cleared once the sender is restarted (so a stop and
> then a start).  There may be other ways, such as destroying the gateway
> sender, but you'd probably have to detach it from the region first.
>
> It sounds like a WAN GII (get initial image) feature would be useful and
> help reduce the steps in this use case.
>
> Please chime in if this response is wrong or can be improved.
>
> Thanks,
> -Jason

Re: RECOVERING A WAN INSTALLATION

Posted by "evaristo.camarero@yahoo.es" <ev...@yahoo.es>.

Thanks a lot. We will try this.


Re: RECOVERING A WAN INSTALLATION

Posted by Jason Huynh <jh...@pivotal.io>.

Hi Evaristo,

I spoke with another committer, Anil, and from what we understand, the
process you describe would work.  I am not sure if this is the
recommended way to do a restart, but we believe the steps outlined would get
the intended outcome.

To clear a serial gateway sender, I believe stopping the sender will clear
its queue. However, for a parallel gateway sender I think the parallel
queue gets cleared once the sender is restarted (so a stop and then a
start).  There may be other ways, such as destroying the gateway sender, but
you'd probably have to detach it from the region first.

It sounds like a WAN GII (get initial image) feature would be useful and
help reduce the steps in this use case.

Please chime in if this response is wrong or can be improved.

Thanks,
-Jason
