Posted to solr-user@lucene.apache.org by Dan Davis <da...@gmail.com> on 2015/01/25 06:56:16 UTC

solr replication vs. rsync

When I polled the various projects already using Solr at my organization, I
was greatly surprised that none of them were using Solr replication,
because they had talked about "replicating" the data.

But we are not Pinterest, and do not expect to be taking in changes one
post at a time (at least the engineers don't - just wait until it's used for
a CRUD app that wants full-text search on a description field!). Still,
rsync can be very, very fast with the right options (-W for gigabit
ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over
GigE previously.
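
[Editor's sketch] To make the rsync options above concrete, here is a
minimal Python sketch that builds such a command line and estimates copy
time at the 48 MB/s figure. The paths, the host name and the 10 GB index
size are assumptions for illustration only, and the sketch does not deal
with taking a consistent snapshot of an index that is still being written.

```python
import shlex


def rsync_command(src, dest, whole_file=True, sparse=False):
    """Build an rsync argv list for copying an index directory."""
    args = ["rsync", "-a"]   # -a: archive mode (recursive, keeps times and permissions)
    if whole_file:
        args.append("-W")    # -W / --whole-file: skip the delta-transfer algorithm
    if sparse:
        args.append("-S")    # -S / --sparse: handle sparse files efficiently
    args += [src, dest]
    return args


def transfer_seconds(index_bytes, rate_mb_per_s=48.0):
    """Estimate copy time at a sustained rate, with 1 MB taken as 10**6 bytes."""
    return index_bytes / (rate_mb_per_s * 1_000_000)


if __name__ == "__main__":
    # Hypothetical source directory and destination host.
    cmd = rsync_command("/var/solr/data/index/",
                        "replica-host:/var/solr/data/index/")
    print(shlex.join(cmd))
    minutes = transfer_seconds(10 * 10**9) / 60
    print(f"10 GB at 48 MB/s: about {minutes:.1f} minutes")
```

The argv list can be handed to subprocess.run() as-is. Whether -W actually
helps depends on whether the network or the disks are the bottleneck.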

Does anyone have any numbers for how fast Solr replication goes, and what
to do to tune it?

I'm not eager to give up recently tested cluster stability for a
home-grown mess, but I am interested in any numbers that are out there.

Re: solr replication vs. rsync

Posted by Erick Erickson <er...@gmail.com>.
bq: I thought SolrCloud replicas were replication, and you imply parallel
indexing

Absolutely! You couldn't get near-real-time indexing if you relied on
replication a la 3.x. And you also couldn't guarantee consistency.

Say you have 1 shard with a leader and a follower (i.e. 2 replicas). Now you
throw a doc at it to be indexed. The sequence is:
1. leader gets the doc
2. leader forwards the doc to the follower
3. leader and follower both add the doc to their local index (and tlog)
4. follower acks back to leader
5. leader acks back to client

So yes, the raw document is forwarded to all replicas before the leader
responds to the client, the docs all get written to the tlogs, etc. That's
the only way to guarantee that if the leader goes down, the follower can
take over without losing documents.
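
[Editor's sketch] The sequence above can be modeled with a toy in-memory
simulation. This is not SolrCloud code: the Replica and Leader classes are
hypothetical stand-ins, and the real system forwards documents over HTTP
and handles failures and recovery that are not modeled here.

```python
class Replica:
    """Toy stand-in for one replica: a transaction log plus a local index."""

    def __init__(self, name):
        self.name = name
        self.tlog = []    # append-only log of raw documents
        self.index = {}   # doc id -> document

    def add_locally(self, doc):
        """Write the doc to the tlog and the local index, then ack."""
        self.tlog.append(doc)
        self.index[doc["id"]] = doc
        return True


class Leader(Replica):
    """The replica that receives updates and forwards them to followers."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def handle_update(self, doc):
        # Forward the raw document to every follower and collect their acks.
        acks = [follower.add_locally(doc) for follower in self.followers]
        # The leader adds the same document to its own tlog and index.
        self.add_locally(doc)
        # Ack the client only if every follower has acked.
        return all(acks)


if __name__ == "__main__":
    follower = Replica("follower")
    leader = Leader("leader", [follower])
    acked = leader.handle_update({"id": "1", "title": "hello"})
    print(acked, leader.index == follower.index)
```

Because the client is acked only after the follower has the document in
its tlog, the follower holds every acknowledged document if the leader
disappears.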

Best,
Erick

On Sun, Jan 25, 2015 at 6:15 PM, Dan Davis <da...@gmail.com> wrote:

> @Erick,
>
> Problem space is not constant indexing.   I thought SolrCloud replicas were
> replication, and you imply parallel indexing.  Good to know.

Re: solr replication vs. rsync

Posted by Dan Davis <da...@gmail.com>.
@Erick,

Problem space is not constant indexing.   I thought SolrCloud replicas were
replication, and you imply parallel indexing.  Good to know.

On Sunday, January 25, 2015, Erick Erickson <er...@gmail.com> wrote:

> @Shawn: Cool table, thanks!
>
> @Dan:
> Just to throw a different spin on it, if you migrate to SolrCloud, then
> this question becomes moot as the raw documents are sent to each of the
> replicas so you very rarely have to copy the full index. Kind of a tradeoff
> between constant load because you're sending the raw documents around
> whenever you index and peak usage when the index replicates.
>
> There are a bunch of other reasons to go to SolrCloud, but you know your
> problem space best.
>
> FWIW,
> Erick

Re: solr replication vs. rsync

Posted by Dan Davis <da...@gmail.com>.
Thanks!

On Sunday, January 25, 2015, Erick Erickson <er...@gmail.com> wrote:

> @Shawn: Cool table, thanks!
>
> @Dan:
> Just to throw a different spin on it, if you migrate to SolrCloud, then
> this question becomes moot as the raw documents are sent to each of the
> replicas so you very rarely have to copy the full index. Kind of a tradeoff
> between constant load because you're sending the raw documents around
> whenever you index and peak usage when the index replicates.
>
> There are a bunch of other reasons to go to SolrCloud, but you know your
> problem space best.
>
> FWIW,
> Erick

Re: solr replication vs. rsync

Posted by Erick Erickson <er...@gmail.com>.
@Shawn: Cool table, thanks!

@Dan:
Just to throw a different spin on it, if you migrate to SolrCloud, then
this question becomes moot as the raw documents are sent to each of the
replicas so you very rarely have to copy the full index. Kind of a tradeoff
between constant load because you're sending the raw documents around
whenever you index and peak usage when the index replicates.

There are a bunch of other reasons to go to SolrCloud, but you know your
problem space best.

FWIW,
Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey <ap...@elyograg.org> wrote:

> Numbers are included on the Solr replication wiki page, both in graph
> and numeric form.  Gathering these numbers must have been pretty easy --
> before the HTTP replication made it into Solr, Solr used to contain an
> rsync-based implementation.
>
> http://wiki.apache.org/solr/SolrReplication#Performance_numbers
>
> Other data on that wiki page discusses the replication config.  There's
> not a lot to tune.
>
> I run a redundant non-SolrCloud index myself through a different method
> -- my indexing program indexes each index copy completely independently.
>  There is no replication.  This separation allows me to upgrade any
> component, or change any part of solrconfig or schema, on either copy of
> the index without affecting the other copy at all.  With replication, if
> something is changed on the master or the slave, you might find that the
> slave no longer works, because it will be handling an index created by
> different software or a different config.
>
> Thanks,
> Shawn
>
>

Re: solr replication vs. rsync

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/24/2015 10:56 PM, Dan Davis wrote:
> When I polled the various projects already using Solr at my organization, I
> was greatly surprised that none of them were using Solr replication,
> because they had talked about "replicating" the data.
> 
> But we are not Pinterest, and do not expect to be taking in changes one
> post at a time (at least the engineers don't - just wait until its used for
> a Crud app that wants full-text search on a description field!).    Still,
> rsync can be very, very fast with the right options (-W for gigabit
> ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s over
> GigE previously.
> 
> Does anyone have any numbers for how fast Solr replication goes, and what
> to do to tune it?
> 
> I'm not enthusiastic to give-up recently tested cluster stability for a
> home grown mess, but I am interested in numbers that are out there.

Numbers are included on the Solr replication wiki page, both in graph
and numeric form.  Gathering these numbers must have been pretty easy --
before the HTTP replication made it into Solr, Solr used to contain an
rsync-based implementation.

http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config.  There's
not a lot to tune.
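
[Editor's sketch] For anyone who wants to collect their own numbers, one
option is to query the replication handler over HTTP and time a fetch.
The sketch below only builds the request URLs and converts a measured
size and duration into MB/s. The base URL and the core name are
assumptions, and the helper names are hypothetical; "details" and
"fetchindex" are commands of Solr's replication handler.

```python
from urllib.parse import urlencode


def replication_url(base_url, core, command, **params):
    """Build a URL for a core's /replication handler."""
    query = urlencode({"command": command, "wt": "json", **params})
    return f"{base_url.rstrip('/')}/{core}/replication?{query}"


def throughput_mb_per_s(bytes_copied, seconds):
    """Convert a measured transfer into MB/s, with 1 MB taken as 10**6 bytes."""
    return bytes_copied / 1_000_000 / seconds


if __name__ == "__main__":
    base = "http://localhost:8983/solr"   # hypothetical Solr instance
    print(replication_url(base, "core1", "details"))
    print(replication_url(base, "core1", "fetchindex"))
```

Timing a full fetch this way gives a figure that can be compared with the
rsync rate quoted at the top of the thread.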

I run a redundant non-SolrCloud index myself through a different method
-- my indexing program indexes each index copy completely independently.
There is no replication.  This separation allows me to upgrade any
component, or change any part of solrconfig or schema, on either copy of
the index without affecting the other copy at all.  With replication, if
something is changed on the master or the slave, you might find that the
slave no longer works, because it will be handling an index created by
different software or a different config.
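
[Editor's sketch] The "index each copy independently" approach could look
roughly like the Python below. This is not Shawn's indexing program, which
is not shown in the thread. The update URLs are hypothetical, and the
sketch assumes each copy accepts a JSON array of documents on its update
handler.

```python
import json
from urllib.request import Request, urlopen


def post_json(url, docs):
    """POST a JSON array of documents to an update handler URL."""
    req = Request(url,
                  data=json.dumps(docs).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.status


def index_to_all_copies(update_urls, docs, send=post_json):
    """Send the same batch to every copy; a failure on one does not stop the rest."""
    results = {}
    for url in update_urls:
        try:
            results[url] = send(url, docs)
        except OSError as exc:   # urllib's URLError and HTTPError subclass OSError
            results[url] = exc
    return results


if __name__ == "__main__":
    # Hypothetical update URLs for two independent copies of the index.
    copies = [
        "http://solr-a:8983/solr/core1/update?commit=true",
        "http://solr-b:8983/solr/core1/update?commit=true",
    ]
    print(index_to_all_copies(copies, [{"id": "1", "description": "example"}]))
```

A real indexing program would also need to retry or re-queue the batches
that failed, so that a copy that was down can catch up later.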

Thanks,
Shawn