Posted to solr-user@lucene.apache.org by John Davis <jo...@gmail.com> on 2017/12/15 17:03:19 UTC

SolrCloud

Hello,
We are thinking about migrating to SolrCloud. Our current setup is:
1. Multiple replicas and shards.
2. Each query typically hits a single shard only.
3. We have an external system that assigns a document to a shard based on
its origin; Solr clients also use it when querying to find the correct
shard.

It looks like the biggest advantage of SolrCloud is #3: routing documents
to the correct shard and replicas when indexing, and routing queries similarly.
Given we already have a fairly reliable system to do this, are there other
benefits from migrating to SolrCloud?

Thanks,
John

Re: SolrCloud

Posted by John Davis <jo...@gmail.com>.
Thanks Erick. I agree SolrCloud is better than master/slave; however, we
have some questions about managing replicas ourselves vs. with SolrCloud.
For example, how much memory/CPU/disk overhead do SolrCloud nodes incur in
order to sync pending index updates to other replicas? What monitoring and
safeguards are in place out of the box so that too many pending updates for
unreachable replicas don't make the live ones fall over, or so that a new
replica doesn't overwhelm an existing one?

Of course everything works great when things are running well, but when
things go south our first priority is for Solr not to fall over.

On Fri, Dec 15, 2017 at 9:41 AM, Erick Erickson <er...@gmail.com>
wrote:

> The main advantage of SolrCloud in your setup is HA/DR. You say you
> have multiple replicas and shards. Either you have to index to each
> replica separately or you use master/slave replication. In either case
> you have to manage and fix the case where some node goes down. If
> you're using master/slave, if the master goes down you need to get in
> there and fix it, reassign the master, make config changes, restart
> Solr to pick them up, make sure you pick up any missed updates and all
> that.
>
> In SolrCloud that is managed for you. Plus, let's say you want to
> increase QPS capacity. In SolrCloud all you do is use the Collections
> API ADDREPLICA command and you're done. The replica gets created (and you
> can specify exactly which node if you want), the index gets copied, new
> updates are automatically routed to it, and it starts serving requests
> once it's synchronized, all automagically. Symmetrically, you can use
> DELETEREPLICA if you have too much capacity.
>
> The price, admittedly, is that you have to get comfortable with
> maintaining ZooKeeper.
>
> Also, in the 7.x world you have different types of replicas (TLOG, PULL
> and NRT) that combine some of the features of master/slave with
> SolrCloud.
>
> Generally, my rule of thumb is that the minute you get beyond a single
> shard you should move to SolrCloud. If all your data fits in one Solr core
> then it's less clear-cut; master/slave can work just fine. It Depends
> (tm) of course.
>
> Your use case is "implicit" (being renamed "manual") routing, which you
> select when you create your Solr collection. There are pros and cons here,
> but that's beyond the scope of your question. Your infrastructure should
> port pretty directly to SolrCloud. The short form is that with manual
> routing, all your indexing and/or querying happens on a single node
> rather than in parallel. Of course, executing parallel sub-queries
> imposes its own overhead.
>
> If your reason for keeping these on a single shard is to segregate
> the data by some set (say, users), you might want to consider just
> using separate _collections_ in SolrCloud where old_shard ==
> new_collection; basically, all your routing stays the same. You can create
> aliases pointing to multiple collections or specify multiple
> collections on the query. I don't know if that fits your use case or not,
> though.
>
>
> Best,
> Erick

Re: SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
The main advantage of SolrCloud in your setup is HA/DR. You say you
have multiple replicas and shards. Either you have to index to each
replica separately or you use master/slave replication. In either case
you have to manage and fix the case where some node goes down. If
you're using master/slave, if the master goes down you need to get in
there and fix it, reassign the master, make config changes, restart
Solr to pick them up, make sure you pick up any missed updates and all
that.

In SolrCloud that is managed for you. Plus, let's say you want to
increase QPS capacity. In SolrCloud all you do is use the Collections
API ADDREPLICA command and you're done. The replica gets created (and you
can specify exactly which node if you want), the index gets copied, new
updates are automatically routed to it, and it starts serving requests
once it's synchronized, all automagically. Symmetrically, you can use
DELETEREPLICA if you have too much capacity.
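
For illustration, here is a minimal sketch of what those two Collections API calls could look like as URLs. The base URL, the collection name "docs", the node name, and the replica name "core_node3" are all assumptions made up for the example; the code only builds the URLs and does not contact Solr.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local SolrCloud node


def collections_api_url(action, **params):
    """Build a Collections API URL; nothing is sent to Solr here."""
    query = {"action": action}
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{SOLR_BASE}/admin/collections?{urlencode(query)}"


# Add a replica of shard1; "node" is optional and pins it to a specific host.
add_url = collections_api_url(
    "ADDREPLICA",
    collection="docs",       # hypothetical collection name
    shard="shard1",
    node="host2:8983_solr",  # hypothetical node name
)

# Remove a replica again; "core_node3" is a hypothetical replica name.
delete_url = collections_api_url(
    "DELETEREPLICA", collection="docs", shard="shard1", replica="core_node3"
)

print(add_url)
print(delete_url)
```

Either URL can be issued with any HTTP client (curl, a browser, urllib); replica and node names for a real cluster would come from CLUSTERSTATUS or the admin UI.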

The price, admittedly, is that you have to get comfortable with
maintaining ZooKeeper.

Also, in the 7.x world you have different types of replicas (TLOG, PULL
and NRT) that combine some of the features of master/slave with
SolrCloud.
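
As a rough sketch, a 7.x collection with a mix of replica types might be requested as below. The parameter names (nrtReplicas, tlogReplicas, pullReplicas) are given as I understand the 7.x Collections API, so check them against the reference guide for your version; the host, collection name, and counts are assumptions.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local SolrCloud node


def create_collection_url(name, num_shards, nrt=0, tlog=0, pull=0):
    """Build a Collections API CREATE URL with an explicit replica-type mix."""
    params = {"action": "CREATE", "name": name, "numShards": num_shards}
    counts = {
        "nrtReplicas": nrt,    # index locally, near-real-time search
        "tlogReplicas": tlog,  # keep a transaction log, copy segments from the leader
        "pullReplicas": pull,  # only copy segments from the leader, serve queries
    }
    # Leave out any type with a zero count so Solr's defaults apply to it.
    params.update({key: n for key, n in counts.items() if n > 0})
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


# Hypothetical layout: per shard, two TLOG replicas plus one PULL replica.
url = create_collection_url("docs", num_shards=2, tlog=2, pull=1)
print(url)
```

TLOG plus PULL is the combination closest to the old master/slave behaviour, since only the shard leader does the indexing work.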

Generally, my rule of thumb is that the minute you get beyond a single
shard you should move to SolrCloud. If all your data fits in one Solr core
then it's less clear-cut; master/slave can work just fine. It Depends
(tm) of course.

Your use case is "implicit" (being renamed "manual") routing, which you
select when you create your Solr collection. There are pros and cons here,
but that's beyond the scope of your question. Your infrastructure should
port pretty directly to SolrCloud. The short form is that with manual
routing, all your indexing and/or querying happens on a single node
rather than in parallel. Of course, executing parallel sub-queries
imposes its own overhead.
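
To make the porting concrete, here is a hedged sketch of implicit routing driven by an external origin-to-shard lookup. The lookup table, shard names, and collection name are hypothetical stand-ins for the existing system; router.name=implicit, shards, and the _route_ request parameter are the Solr pieces as I understand them, so verify them for your version.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local SolrCloud node

# Hypothetical stand-in for the external origin-to-shard lookup.
ORIGIN_TO_SHARD = {"eu-site": "shard_eu", "us-site": "shard_us"}


def create_url(collection):
    """CREATE a collection whose shards are named explicitly (implicit router)."""
    params = {
        "action": "CREATE",
        "name": collection,
        "router.name": "implicit",
        "shards": ",".join(sorted(set(ORIGIN_TO_SHARD.values()))),
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def routed_url(collection, handler, origin, **params):
    """Build an update or select URL pinned to the shard for this origin."""
    params["_route_"] = ORIGIN_TO_SHARD[origin]
    return f"{SOLR_BASE}/{collection}/{handler}?{urlencode(params)}"


print(create_url("docs"))
print(routed_url("docs", "update", "eu-site", commit="true"))
print(routed_url("docs", "select", "us-site", q="origin:us-site"))
```

The external system keeps deciding which shard owns a document; SolrCloud then takes care of getting the update to that shard's leader and its replicas.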

If your reason for keeping these on a single shard is to segregate
the data by some set (say, users), you might want to consider just
using separate _collections_ in SolrCloud where old_shard ==
new_collection; basically, all your routing stays the same. You can create
aliases pointing to multiple collections or specify multiple
collections on the query. I don't know if that fits your use case or not,
though.
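
A small sketch of the two options, assuming two hypothetical per-user collections named users_a and users_b and a local Solr node. It only builds the request URLs: one CREATEALIAS call, and one query that lists several collections in the "collection" parameter.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local SolrCloud node


def create_alias_url(alias, collections):
    """Build a CREATEALIAS call that points one alias at several collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{SOLR_BASE}/admin/collections?{urlencode(params)}"


def multi_collection_query_url(collections, q):
    """Query several collections at once without defining an alias."""
    # The request goes to the first collection; "collection" lists all targets.
    params = {"q": q, "collection": ",".join(collections)}
    return f"{SOLR_BASE}/{collections[0]}/select?{urlencode(params)}"


alias_url = create_alias_url("all_users", ["users_a", "users_b"])
query_url = multi_collection_query_url(["users_a", "users_b"], "name:smith")

print(alias_url)
print(query_url)
```

Once the alias exists, clients can query /solr/all_users/select as if it were a single collection, while single-collection queries keep going straight to users_a or users_b.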


Best,
Erick
