Posted to dev@lucene.apache.org by Anshum Gupta <an...@anshumgupta.net> on 2020/12/04 16:22:09 UTC

[DISCUSS] Cross Data-Center Replication in Apache Solr

Hi everyone,


Large-scale Solr installations often require cross data-center replication,
both for access-latency reasons and for disaster recovery. In the past,
users have either designed their own solutions to deal with this or have
tried to rely on the now-deprecated CDCR.


It would be really good to have support for cross data-center replication
within Solr that is offered and supported by the community. This would
allow the effort around this shared problem to converge.


I’d like to propose a new solution based on my experiences at my day job.
The key points about this approach:

   1. It uses an external, configurable messaging system in the middle for
   the actual replication/mirroring.
   2. We offer an abstraction and some default implementations based on
   what we can support and what users really want; Kafka would be an
   example. See the sketch after this list.
   3. It would live in a separate repository, allowing it to have its own
   release cadence. We shouldn’t have to release it with every Solr
   release, as the overlap is limited to SolrJ interactions.
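
To make the abstraction in point 2 concrete, here is a minimal, purely
illustrative sketch; none of these names exist in Solr or SolrJ today:

// Illustrative only: one possible shape for the pluggable mirroring
// abstraction. The interface and method names are hypothetical.
import java.io.Closeable;
import java.io.IOException;
import org.apache.solr.client.solrj.request.UpdateRequest;

/** Hands locally received updates to an external messaging system. */
public interface MirroringSink extends Closeable {
  /** Enqueue an update for asynchronous delivery to remote clusters. */
  void submit(UpdateRequest request) throws IOException;
}

A Kafka-backed implementation would be one default; the SolrJ
UpdateRequest in the signature reflects that the overlap with Solr
itself is limited to SolrJ.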


I’ll share a more detailed, evolving document soon with the design for
everyone to contribute to, but I wanted to share this now, as I’m starting
to work on it and want to avoid parallel efforts towards the same end goal.

-- 
Anshum Gupta

Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by Marcus Eagan <ma...@gmail.com>.
Looking forward to the doc. Thanks

Marcus

--
Marcus Eagan

Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by Anshum Gupta <an...@anshumgupta.net>.
Sorry, forgot to send this email earlier. Here's a draft of the SIP.
There are still open questions that we'll handle as we go along, based on
the usage patterns that emerge from the community.

An important thing to note here is that we could give users the option
to not have Solr forward the requests, and instead always consume from
the queue (a rough sketch follows below). This would be useful for folks
who can manage versioning themselves and don't need an actual hot-hot
setup.

https://cwiki.apache.org/confluence/display/SOLR/SIP-13%3A+Cross+Data+Center+Replication
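
As a rough sketch of that consume-only mode, assuming Kafka as the
messaging system (the topic name, group id, and document encoding below
are placeholders, not part of the SIP):

import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical consume-only mirror: read updates from a local topic and
// index them into the local cluster, bypassing Solr-side forwarding.
public class MirrorConsumer {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "solr-mirror");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         CloudSolrClient solr = new CloudSolrClient.Builder(
             List.of("localhost:9983"), Optional.empty()).build()) {
      consumer.subscribe(List.of("solr-mirror-updates"));
      while (true) {
        for (ConsumerRecord<String, String> rec :
            consumer.poll(Duration.ofSeconds(1))) {
          // Placeholder codec: the real format must match whatever the
          // producing side wrote to the topic.
          SolrInputDocument doc = new SolrInputDocument();
          doc.setField("id", rec.key());
          doc.setField("payload_s", rec.value());
          solr.add("mirror_collection", doc);
        }
        consumer.commitSync(); // committed offsets -> at-least-once delivery
      }
    }
  }
}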


-- 
Anshum Gupta

Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by Anshum Gupta <an...@anshumgupta.net>.
Thank you everyone for showing interest and sharing your thoughts.

The overall approach I've been thinking about involves:
1. A consumer app that reads from a (local) message queue and writes to a
local Solr instance.
2. An Update Request Processor (URP) in Solr that forwards the updates to
an outgoing local queue.

The messaging system in the middle will be responsible for replicating
the local source queue to remote destination queues that are local to the
remote clusters. This will allow the individual clusters to be agnostic
of the remote cluster locations. I'll need a few more days before I share
the document, but I just wanted to give an idea of the data flow we've
been using for a few years.
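
To make component 2 above concrete, here is a hypothetical sketch
assuming Kafka; the class name, topic, and serialization are
placeholders, not the actual implementation:

import java.io.IOException;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical URP that mirrors each add into the outgoing local topic
// before passing it down the chain for normal local indexing.
public class MirroringUpdateProcessor extends UpdateRequestProcessor {
  private final Producer<String, String> producer;

  public MirroringUpdateProcessor(Producer<String, String> producer,
                                  UpdateRequestProcessor next) {
    super(next);
    this.producer = producer;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    // Placeholder serialization; the cross-DC messaging layer replicates
    // this local topic to topics local to the remote clusters.
    producer.send(new ProducerRecord<>("solr-mirror-updates",
        cmd.getPrintableId(), cmd.getSolrInputDocument().toString()));
    super.processAdd(cmd); // continue normal local indexing
  }
}

A corresponding UpdateRequestProcessorFactory would be registered in the
update request processor chain in solrconfig.xml, as with any URP.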

As expected, there are a few caveats and restrictions with this approach,
which I'll include in the document :)

-Anshum


-- 
Anshum Gupta

Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by Bram Van Dam <br...@intix.eu>.
We've had some experience with this. As per usual, customers tend to
want the holy grail: no data loss when one data centre blows up, but no
increased latency when updating data. This then somehow, magically, has
to work over a slow uplink between two data centres without saturating
the link.

Currently we use NRT replicas across data centres. This does add some
latency, but consistency is a bit more important for us. Overall, this
works pretty well.

The biggest problems we've experienced have all been related to
recovering replicas across a slow data centre uplink. A saturated link
can cause multiple replicas to lag behind, and when this gets too bad,
they too will go into recovery, and then shit really hits the fan.

I'm not sure whether there are any easy ways of improving that
behaviour. Limiting max bandwidth per Solr instance during recovery?
Slow recovery is better than destructive recovery.

External tools like Kafka add a lot of operational overhead. One of the
great things about SolrCloud is how simple the whole replication setup is.





Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by Erick Erickson <er...@gmail.com>.
Anshum:

I know I’ve been recommending something like this to clients for a while.
Do you think a call to the community for people who’ve already put
something in the middle might net us some good info on the lurking
gremlins? Mind you, “recommend” hasn’t actually involved me _doing_ it,
so I don’t have any actual experience there…

But yeah, absolutely +1 for something making this easier for clients...

Erick



Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by Ilan Ginzburg <il...@gmail.com>.
That's an interesting initiative Anshum!

I can see at least two different approaches here; your mention of SolrJ
seems to hint at the first one:
1. Get the data as it comes from the client and fork it to local and remote
data centers,
2. Create an (asynchronous) stream replicating local data center data to
remote.

Option 1 is strongly consistent but adds latency and potential blocking
on the critical path.
Option 2 could look like remote PULL replicas; it might have a lower impact
on the local data center, but it has to deal with the remote data center
always being somewhat behind. If the client application can handle that,
the performance and efficiency gain (as well as a possibly simpler
implementation, since it doesn't require another persistence layer) might
be worth it...
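
To illustrate option 1, a client-side fork could be as small as the
following sketch (hosts and collection names are made up; real code would
need error handling for the remote write):

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Illustrative client-side fork (option 1): every update goes to both the
// local and the remote cluster before the call returns.
public class DualWriteClient {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient local = new CloudSolrClient.Builder(
             List.of("zk-local:2181"), Optional.empty()).build();
         CloudSolrClient remote = new CloudSolrClient.Builder(
             List.of("zk-remote:2181"), Optional.empty()).build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.setField("id", "1");
      local.add("collection1", doc);
      // The remote write is on the critical path: strongly consistent, but
      // the client pays cross-DC latency and must handle remote outages.
      remote.add("collection1", doc);
    }
  }
}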

Ilan


Re: [DISCUSS] Cross Data-Center Replication in Apache Solr

Posted by David Smiley <ds...@apache.org>.
Sounds great Anshum!

I also think what you propose might have an additional purpose beyond
CDCR.  A persistent/durable queue (esp. Kafka) can also be used to make
writes from the external source durable into the search tier
(Kafka+SolrCloud) without SolrCloud needing to provide that immediate
durability.  Today (without Kafka), the immediate durability of an update
request to Solr is satisfied by a distributed UpdateLog.  It's worth
exploring an option with no UpdateLog [1], relying on Kafka for durability
instead.  Assuming the client writes directly to the queue, it can return
quickly and know the updates won't be lost.  Then, on the other side of
Kafka, a client can keep Solr up to date.
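
A sketch of that write path, assuming Kafka (the topic name and
configuration values are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative ingest client that treats the Kafka ack, not Solr, as the
// durability point; a separate consumer indexes into Solr asynchronously.
public class DurableIngest {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("acks", "all"); // wait for the full in-sync replica set
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Once get() returns, the update is durable in Kafka and the client
      // can move on without waiting for Solr to index it.
      producer.send(new ProducerRecord<>("solr-ingest", "doc-1",
          "{\"id\":\"doc-1\",\"title_s\":\"hello\"}")).get();
    }
  }
}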

Some things are lost with this external queue, be it used for CDCR or for
what I describe above:
* lack of error notifications if a document can't be indexed (bogus field
or bad format).
* any requirement an indexing client might specify to make documents
visible (searchable).  Perhaps that could be added, with some complexity,
by waiting for a commit message to make it all the way through.

[1] https://issues.apache.org/jira/browse/SOLR-14778
    Also, this short-cuts a lot of complexity in Solr pertaining to the
UpdateLog and versions.  It provides a total ordering of updates to all
replicas, a useful property for other things, like synchronizing where
segment boundaries occur, which makes peer-replication-based recovery
more efficient.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

