Posted to solr-user@lucene.apache.org by John Blythe <jo...@gmail.com> on 2017/09/29 12:34:15 UTC

tipping point for using solrcloud—or not?

hi all.

complete noob as to solrcloud here. almost-non-noob on solr in general.

we're experiencing growing pains in our data and i'm thinking through moving
to solrcloud as a result. i'm hoping to find out if it seems like a good
strategy or if we need to get other areas of interest handled first before
introducing new complexities.

here's a rundown of things:
- we are on a 30g ram aws instance
- we have ~30g tucked away in the ../solr/server/ dir
- our largest core is 6.8g w/ ~25 segments at any given time. this is also
the core that our business directly runs off of, users interact with, etc.
- 5g is for a logs type of dataset that analytics can be built off of to
help inform the primary core above
- 3g are taken up by 3 different third party sources that we use solr to
warehouse and have available for query for the sake of linking items in our
primary core to these cores for data enrichment
- several others take up < 1g each
- and then we have dev- and demo- flavors for some of these

we had been operating on a 16gb machine till a few weeks ago (actually
bumped while at lucene revolution bc i hadn't noticed how much we'd
outgrown our cache sizing needs till the week before!). the load when doing
an import or running our heavier operations is much better now, and the machine
doesn't buckle under the weight of those operations like it had been.

we have no master/slave replication. all of our data is 'replicated' by the
fact that it exists in mysql. if solr were to go down it'd be a nice big
fire but one we could recover from within a couple hours by simply
reimporting.

i'd like to have a more sophisticated setup in place for fault tolerance
than that, of course. i'd also like to see our heavy, many-query operations
be speedier and more capable of handling concurrent multi-threaded runs
w/ ease.

is this a matter of getting still more ram on the machine? cpus for faster
processing? splitting up the read/write operations between master/slave?
going full steam into a solrcloud configuration?

one more note. per discussion at the conference i'm combing through our
configs to make sure we trim any fat we can. also wanting to get
optimization scheduled more regularly to help out w/ segmentation and
heap garbage. not sure how far those two alone will get us, though.
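
for concreteness, the kind of regular optimize i'm picturing is roughly the
sketch below (solrj, with a placeholder url/core and made-up timing, just to
show the shape of it):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class NightlyOptimize {
      public static void main(String[] args) {
        // placeholder url and core name
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/primary").build();

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // force-merge segments once a day during our quietest window
        scheduler.scheduleAtFixedRate(() -> {
          try {
            solr.optimize();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }, 0, 24, TimeUnit.HOURS);
      }
    }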

thanks for any thoughts!

--
John Blythe

Re: tipping point for using solrcloud—or not?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/29/2017 6:34 AM, John Blythe wrote:
> complete noob as to solrcloud here. almost-non-noob on solr in general.
>
> we're experiencing growing pains in our data and am thinking through moving
> to solrcloud as a result. i'm hoping to find out if it seems like a good
> strategy or if we need to get other areas of interest handled first before
> introducing new complexities.

SolrCloud's main advantages are in automation, centralization, and 
eliminating single points of failure. Indexing to multiple replicas works 
very differently in cloud than in master/slave, a difference that can 
have both advantages and disadvantages.  It is advantageous in *most* 
situations, but master/slave might have an edge in *some* situations.

For most *new* production setups requiring high availability, I would in 
almost every case recommend SolrCloud. Master/slave is a system that 
works, but the master represents a single point of failure.  If the 
master dies, manual reconfiguration of all machines is usually required 
in order to define a new master.  If you're willing to do some tricks 
with DNS, it might be possible to avoid manual Solr reconfiguration, but 
it is not seamless like SolrCloud, which is a true cluster that has no 
masters and no slaves.

I do not use SolrCloud in most of my setups.  This is only because when 
those setups were designed, SolrCloud was a development dream, something 
that was being worked on in a development branch.  SolrCloud did not 
arrive in a released version until 4.0.0-ALPHA.  If I were designing a 
setup from scratch now, I would definitely build it with SolrCloud.

> here's a rundown of things:
> - we are on a 30g ram aws instance
> - we have ~30g tucked away in the ../solr/server/ dir
> - our largest core is 6.8g w/ ~25 segments at any given time. this is also
> the core that our business directly runs off of, users interact with, etc.
> - 5g is for a logs type of dataset that analytics can be built off of to
> help inform the primary core above
> - 3g are taken up by 3 different third party sources that we use solr to
> warehouse and have available for query for the sake of linking items in our
> primary core to these cores for data enrichment
> - several others take up < 1g each
> - and then we have dev- and demo- flavors for some of these
>
> we had been operating on a 16gb machine till a few weeks ago (actually
> bumped while at lucene revolution bc i hadn't noticed how much we'd
> outgrown the cache size's needs till the week before!). the load when doing
> an import or running our heavier operations is much better and doesn't fall
> under the weight of the operations like it had been doing.
>
> we have no master/slave replica. all of our data is 'replicated' by the
> fact that it exists in mysql. if solr were to go down it'd be a nice big
> fire but one we could recover from within a couple hours by simply
> reimporting.

If your business model can tolerate a two hour outage, I am envious.  
That is not something that most businesses can tolerate.  Also, many 
setups cannot do a full rebuild in two hours.  Some kind of replication 
is required for a fault tolerant installation.

> i'd like to have a more sophisticated set up in place for fault tolerance
> than that, of course. i'd also like to see our heavy, many-query based
> operations be speedier and better capable of handling multi-threaded runs
> at once w/ ease.
>
> is this a matter of getting still more ram on the machine? cpus for faster
> processing? splitting up the read/write operations between master/slave?
> going full steam into a solrcloud configuration?
>
> one more note. per discussion at the conference i'm combing through our
> configs to make sure we trim any fat we can. also wanting to get
> optimization scheduled more regularly to help out w segmentation and
> garbage heap. not sure how far those two alone will get us, though

The desire to scale an index, either in size or query load, is not by 
itself a reason to switch to SolrCloud.  Scaling is generally easier to 
manage with cloud, because you just fire up another server, and it is 
immediately part of the cloud, ready for whatever collection changes or 
additions you might need, most of which can be done with requests via 
the HTTP API.  Although performance can improve with SolrCloud, it is 
not usually a *significant* improvement, assuming that the distribution 
of data and the number/configuration of servers are similar between 
master/slave and SolrCloud.
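
As a very rough SolrJ sketch of that kind of change (the collection name,
configset name, and ZooKeeper address are placeholders, and the exact builder
signature varies a bit between SolrJ versions):

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class GrowCluster {
      public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper address for the cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

          // Create a 2-shard collection with 2 replicas per shard from an
          // already-uploaded configset.
          CollectionAdminRequest.createCollection("primary", "primaryConf", 2, 2)
              .process(client);

          // After firing up another server, grow capacity by adding a replica.
          CollectionAdminRequest.addReplicaToShard("primary", "shard1")
              .process(client);
        }
      }
    }

Under the hood these are plain HTTP calls to the Collections API, so a curl
script works just as well as SolrJ.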

If you rearrange the data or upgrade/add server hardware *with* the 
switch to SolrCloud, then any significant performance improvement is 
probably not attributable to SolrCloud, but to the other changes.

If all your homegrown tools are designed around non-cloud setups, you 
might find it very painful to switch.  Some things require different 
HTTP APIs, and the APIs that you might already use could have different 
responses or require slightly different information in the request.
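
To give a flavor of the sort of change involved, here is a minimal,
hypothetical SolrJ sketch (URLs, ZooKeeper address, and core/collection names
are made up).  The query and update calls mostly stay the same; it is the
client construction and some response details that differ:

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ClientSwitch {
      public static void main(String[] args) throws Exception {
        // Non-cloud: point directly at one Solr instance and core.
        try (HttpSolrClient standalone =
                 new HttpSolrClient.Builder("http://solr1:8983/solr/primary").build();
             // SolrCloud: talk to ZooKeeper and let the client route to replicas.
             CloudSolrClient cloud = new CloudSolrClient.Builder(
                 Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

          cloud.setDefaultCollection("primary");

          // The query itself is identical either way.
          SolrQuery q = new SolrQuery("*:*");
          System.out.println(standalone.query(q).getResults().getNumFound());
          System.out.println(cloud.query(q).getResults().getNumFound());
        }
      }
    }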

RAM is the resource with the most impact on Solr performance.  CPU is 
certainly important, but increasing the available RAM will usually give 
the biggest boost.  If there is sufficient RAM, disk speed will have 
very little effect on performance.  Disk speed only becomes a major 
factor when you do not have enough memory to effectively cache the index.

Thanks,
Shawn


Re: tipping point for using solrcloud—or not?

Posted by Emir Arnautović <em...@sematext.com>.
Hi John,
Your data volume does not require SolrCloud, especially if you isolate the core that is related to your business from the other cores. You mentioned that the second largest is a logs core used for analytics - not sure what sort of logs, but if it is write-intensive logging, you might want to isolate it. It is probably better to have two 15GB instances than one 30GB instance and dedicate one instance to your main core. If you do not see the size going up in the near future, you can go with an even smaller one. It may also be better to invest some money into instances with SSDs. You may consider sending logs to a centralised logging solution (one such is our Logsene: http://sematext.com/logsene ).
When it comes to fault tolerance, you can still have it with the M/S model by introducing slaves. That can also be one way to isolate the cores your users are facing - they will query only the slaves, and the only replicated core will be the main core.
It is hard to tell more without knowing your ingestion/query rate, query types, NRT requirements…
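
As a rough illustration of that read/write split, assuming SolrJ and
placeholder host/core names (not a drop-in config):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ReadWriteSplit {
      public static void main(String[] args) throws Exception {
        // Imports and updates go only to the master...
        try (HttpSolrClient master =
                 new HttpSolrClient.Builder("http://master:8983/solr/primary").build();
             // ...while user-facing queries go only to a slave replicating from it.
             HttpSolrClient slave =
                 new HttpSolrClient.Builder("http://slave1:8983/solr/primary").build()) {

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "example-1");
          master.add(doc);
          master.commit();

          // The slave serves this once it has polled and pulled the new index.
          slave.query(new SolrQuery("id:example-1"));
        }
      }
    }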

HTH,
Emir


Re: tipping point for using solrcloud—or not?

Posted by John Blythe <jo...@gmail.com>.
Nope, NRT is within seconds at most in several cases. Sounds like cloud
needs to be what we plan for.

Thanks!

-- 
John Blythe

Re: tipping point for using solrcloud—or not?

Posted by Erick Erickson <er...@gmail.com>.
Short form: Use SolrCloud from what you've described.

NRT and M/S are simply oil and water. The _very_ best you can do when
searching slaves is
master's commit interval + slave polling interval + time to transmit
the index to the slave + autowarming time on the slave.
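
(For instance, with a 60 second commit interval on the master, 60 second
polling on the slave, 30 seconds to pull the changed segments and 30 seconds
of autowarming, the best case is roughly three minutes behind the master;
those numbers are purely illustrative.)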

Now, that said, if when you say NRT it's really "10 minutes is OK", then
M/S will work for you.

But otherwise I'd be using SolrCloud.

Best,
Erick


Re: tipping point for using solrcloud—or not?

Posted by John Blythe <jo...@gmail.com>.
thanks for the responses, guys.

erick: we do need NRT in several cases. also in need of HA, depending on where
the line is drawn. we do need it relatively speaking, i.e. w/i our user
base. if the largest of our cores falters then our business is completely
stopped till we can get everything reindexed.

is there a general rule when it comes to query rate and efficiency between
Cloud and M/S? in either case we need to add complexity to the system so,
if it's a jump ball, that will be the thing that likely tips in favor.

emir: the logs aren't write intensive. what are the core benefits to
splitting up the machine if we aren't currently experiencing a jvm load issue?

i can def provide more info that could help in the discussion. let me know
the best things to send if you can, please.

thanks again for the help guys-

--
John Blythe


Re: tipping point for using solrcloud—or not?

Posted by Erick Erickson <er...@gmail.com>.
SolrCloud. SolrCloud. SolrCloud.

Well, it actually depends. I recommend people go to SolrCloud when any
of the following apply:

- The instant you need to break any collection up into shards because you're running into the constraints of your hardware (you can't just keep adding memory to the JVM forever).

- You need NRT searching and need multiple replicas for either your traffic rate or HA purposes.

- You find yourself dealing with lots of administrative complexity for various indexes. You have what sounds like 6-10 cores laying around. You can move them to different machines without going to SolrCloud, but then something has to keep track of where they all are and route requests appropriately. If that gets onerous, SolrCloud will simplify it.

If none of the above apply, master/slave is just fine. Since you can
rebuild in a couple of hours, most of the difficulty with M/S when the
master goes down is manageable. With a master and several slaves, you
have HA, and a load balancer will see to it that all of them are used.
There's no real need to exclusively search on the slaves, I've seen
situations where the master is used for queries as well as indexing.
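
You can also do the balancing client-side; a rough SolrJ sketch, assuming a
version that has the LBHttpSolrClient builder (the slave URLs and core name
are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.LBHttpSolrClient;

    public class BalancedSlaves {
      public static void main(String[] args) throws Exception {
        // Round-robins requests over the listed slaves and skips dead ones
        // until they come back.
        try (LBHttpSolrClient slaves = new LBHttpSolrClient.Builder()
                 .withBaseSolrUrls("http://slave1:8983/solr/primary",
                                   "http://slave2:8983/solr/primary")
                 .build()) {
          slaves.query(new SolrQuery("*:*"));
        }
      }
    }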

To increase your query rate, you can just add more slaves to the hot
index, assuming you're content with the latency between indexing and
being able to search newly indexed documents.

SolrCloud, of course, comes with the added complexity of ZooKeeper.

Best,
Erick


