Posted to solr-user@lucene.apache.org by Chris Troullis <cp...@gmail.com> on 2017/05/06 12:12:13 UTC

Multiple collections vs multiple shards for multitenancy

Hi,

I use Solr to serve multiple tenants, and currently all tenants' data
resides in one large collection, and queries have a tenant identifier. This
works fine with aggressive autowarming, but I have a need to reduce my NRT
search capabilities to seconds as opposed to the minutes it is at now,
which will mean drastically reducing if not eliminating my autowarming. As
such I am considering splitting my index out by tenant so that when one
tenant modifies their data, it doesn't blow away all of the searcher-based
caches for all tenants on soft commit.
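
For concreteness, the current setup is a single collection where every query
carries a tenant filter, with visibility governed by autoSoftCommit. Roughly
like this (collection name, field name, and interval are just placeholders
for illustration):

  /solr/docs/select?q=some+text&fq=tenant_id:12345&rows=10

  <!-- solrconfig.xml -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>  <!-- ~1 minute before changes become visible -->
  </autoSoftCommit>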

I have done a lot of research on the subject and it seems like Solr Cloud
can have problems handling large numbers of collections. I'm obviously
going to have to run some tests to see how it performs, but my main
question is this: are there pros and cons to splitting the index into
multiple collections vs having 1 collection but splitting into multiple
shards? In my case I would have a shard per tenant and use implicit routing
to route to that specific shard. As I understand it, a shard is basically
its own Lucene index, so I would still be eating that overhead with either
approach. What I don't know is if there are any other overheads involved
WRT collections vs shards, routing, zookeeper, etc.
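
For example, what I have in mind is something along these lines (collection
and shard names are hypothetical, and this is just a sketch from the docs,
not something I've run yet):

  /solr/admin/collections?action=CREATE&name=tenants&router.name=implicit
      &shards=tenant_001,tenant_002,tenant_003&collection.configName=myconf

  # updates and queries then name the target shard explicitly
  /solr/tenants/update?_route_=tenant_001
  /solr/tenants/select?q=*:*&_route_=tenant_001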

Thanks,

Chris

Re: Multiple collections vs multiple shards for multitenancy

Posted by Chris Troullis <cp...@gmail.com>.
Thanks for the great advice, Erick. I will experiment with your suggestions
and see how it goes!

Chris

Re: Multiple collections vs multiple shards for multitenancy

Posted by Erick Erickson <er...@gmail.com>.
Well, you've been doing your homework ;).

bq: I am a little confused on this statement you made:

> Plus you can't commit
> individually, a commit on one will _still_ commit on all so you're
> right back where you started.

Never mind. Autocommit kicks off on a per-replica basis. IOW, when a
new doc is indexed to a shard (really, any replica), the timer is
started. So if replica 1_1 gets a doc and replica 2_1 doesn't, there
is no commit on replica 2_1. My comment was mainly directed at the
idea that you might issue commits from the client, which are
distributed to all replicas. However, even in that case, a replica
that has received no updates won't do anything.
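
i.e., the timers live in each core's solrconfig.xml, something like this
(numbers are made up for illustration):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>           <!-- hard commit, per core -->
      <openSearcher>false</openSearcher> <!-- don't open a searcher on hard commit -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>5000</maxTime>            <!-- only fires on cores that received updates -->
    </autoSoftCommit>
  </updateHandler>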

About the hybrid approach. I've seen situations where essentially you
partition clients along "size" lines. So something like "put clients
on a shared single-shard collection as long as the aggregate number of
records is < X". The theory is that the update frequency is roughly
the same if you have 10 clients with 100K docs each vs. one client
with 1M docs. So the pain of opening a new searcher is roughly the
same. "X" here is experimentally determined.

Do note that moving from master/slave to SolrCloud will reduce
latency. In M/S, the time it takes for a change to become searchable is
autocommit + polling interval + autowarm time. Going to SolrCloud will
remove the "polling interval" from the equation. Not sure how much that
helps....

There should be an autowarm statistic in the Solr logs BTW, or some
messages about "opening searcher" (some hex stuff) and another message
about when it's registered as active, along with timestamps. That'll
tell you how long it takes to autowarm.

OK. "straw man" strategy for your case. Create a collection per
tenant. What you want to balance is where the collections are hosted.
Host a number of small tenants on the same Solr instance and fewer
larger tenants on other hardware. FWIW, I expect at least 25M docs per
Solr JVM (very hardware dependent of course), although testing is
important.
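
If you want to steer placement when you create each collection,
createNodeSet lets you pin it to particular nodes (node and collection
names below are made up):

  /solr/admin/collections?action=CREATE&name=tenant_small_017&numShards=1
      &replicationFactor=2&createNodeSet=host1:8983_solr,host2:8983_solr

  /solr/admin/collections?action=CREATE&name=tenant_big_001&numShards=4
      &replicationFactor=2&createNodeSet=host7:8983_solr,host8:8983_solr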

Under the covers, each Solr instance establishes "watchers" on the
collections it hosts. So if a particular Solr instance hosts replicas for,
say, 10 collections, it establishes 10 watchers on the corresponding
state.json znodes in ZooKeeper. 300 collections isn't all that much in
recent Solr installations. All that is filtered through how beefy your
hardware is, of course.

Startup is an interesting case, but I've put 1,600 replicas on 4 Solr
instances on a Mac Pro (400 each). You can configure the number of
startup threads if starting up is too painful.
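
If I remember right, that's coreLoadThreads in solr.xml, e.g.:

  <solr>
    <int name="coreLoadThreads">8</int>  <!-- cores loaded in parallel at startup -->
    <!-- rest of solr.xml as usual -->
  </solr>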

So a cluster with 300 collections isn't really straining things. Some
of the literature is talking about thousands of collections.

Good luck!
Erick

Re: Multiple collections vs multiple shards for multitenancy

Posted by Chris Troullis <cp...@gmail.com>.
Hi Erick,

Thanks for the reply, I really appreciate it.

To answer your questions, we have a little over 300 tenants, and a couple
of different collections, the largest of which has ~11 million documents
(so not terribly large). We are currently running standard Solr with simple
master/slave replication, so all of the documents are in a single Solr
core. We are planning to move to SolrCloud for various reasons, and as
discussed previously, I am trying to find the best way to distribute the
documents to serve a more NRT-focused search case.

I totally get your point on pushing back on NRT requirements, and I have
done so for as long as possible. Currently our auto softcommit is set to 1
minute and we are able to achieve great query times with autowarming.
Unfortunately, due to the nature of our application, our customers expect
any changes they make to be visible almost immediately in search, and we
have recently been getting a lot of complaints in this area, leading to an
initiative to drive down the time it takes for documents to become visible
in search, which leaves me where I am now: trying to find the right balance
between document visibility and reasonable, stable query times.

Regarding autowarming, our autowarming times aren't too crazy. We are
warming a max of 100 entries from the filter cache and it takes around 5-10
seconds to complete on average. I suspect our biggest slowdown during
autowarming is the static warming query that we have that runs 10+ facets
over the entire index. Our searches are very facet-intensive; we use the
JSON facet API to do some decently complex faceting (block joins, etc.), and
for whatever reason, even though we use docValues for all of our facet
fields, simply warming the filter cache doesn't seem to prevent a giant
drop-off in performance whenever a new searcher is opened. The only way I
could find to prevent the giant performance drop-off was to run a static
warming query on a new searcher that runs all of our facets over the whole
index. I haven't found a good way of telling how long this takes, as the
JMX hooks for monitoring autowarming times don't seem to include static
warming queries (as far as I can tell).
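
For reference, the shape of that setup in solrconfig.xml is roughly the
following (field names changed; the real listener runs our full JSON facet
request):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="100"/>

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="json.facet">{ categories: { type: terms, field: category_id } }</str>
      </lst>
    </arr>
  </listener>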

Through experimentation I've found that by sharding my index in SolrCloud,
I can pretty much eliminate autowarming entirely and still achieve
reasonable query times once the shards reach a small enough size (around 1
million docs per shard). This is great; however, your assumptions about our
tenant size distribution were spot on. Because of this, sharding naturally
using the compositeId router with the tenant ID as the key yields an
uneven distribution of documents across shards. Basically, whatever unlucky
small tenants happen to get stuck on the same shard as a gigantic tenant
will suffer in performance because of it. That's why I was exploring the
idea of having a tenant per collection or per shard, as a way of isolating
tenants from a performance perspective.
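
(By "sharding naturally" I mean compositeId document IDs like the made-up
examples below. The /bits syntax can spread a large tenant over more of the
hash range, but the small tenants still land wherever the hash puts them.)

  tenant042!doc9876     <- everything for tenant042 hashes into one shard's range
  bigtenant/2!doc1234   <- 2 bits from the tenant hash, the rest from the doc id,
                           so bigtenant spreads over ~1/4 of the collection's shards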

I am a little confused on this statement you made:

> Plus you can't commit
> individually, a commit on one will _still_ commit on all so you're
> right back where you started.


We don't commit manually at all; we rely on auto softcommit to commit for
us. My understanding was that since a shard is basically its own Solr core
under the covers, indexing a document to a single shard would only
open a new searcher (and thus invalidate caches) on that one shard, and
thus separating tenants into their own shards would mean that tenant A (shard
A) updating its documents would not require tenant B (shard B) to have all
of its caches invalidated. Is this not correct?

I'm also not sure I understand what you are saying regarding the hybrid
approach you mentioned. You say to experiment with how many documents are
ideal for a collection, but isn't the number of documents per shard the
more meaningful number WRT performance? I apologize if I am being dense;
maybe I'm not 100% clear on my terminology. My understanding was that a
collection is a logical abstraction consisting of multiple shards/replicas,
and that the shards/replicas are actual physical Solr cores. So for
example, what is the difference between having 1000 collections with 1
shard each vs. 1 collection with 1000 shards? Both cases will end up with
the same number of physical Solr cores, right? Or am I completely off base?

Thanks again,

Chris

Re: Multiple collections vs multiple shards for multitenancy

Posted by Erick Erickson <er...@gmail.com>.
Well, it's not either/or. And you haven't said how many tenants we're
talking about here. Startup times for a single _instance_ of Solr can be
slow when there are thousands of collections.

But note what I am talking about here: A single Solr on a single node
where there are hundreds and hundreds of collections (or replicas for
that matter). I know of very large installations with 100s of
thousands of _replicas_ that run. Admittedly with a lot of care and
feeding...

Sharding a single large collection and using custom routing to push
tenants to a single shard will be an administrative problem for you.
I'm assuming you have the typical multi-tenant problems: a bunch of
tenants have around N docs, some smaller percentage have 3N, and a few
have 100N. Now you're having to keep track of how many docs are on
each shard, do the routing yourself, etc. Plus you can't commit
individually, a commit on one will _still_ commit on all so you're
right back where you started.

I've seen people use a hybrid approach: experiment with how many
_documents_ you can have in a collection (however you partition that
up) and use the multi-tenant approach. So you have N collections and
each collection has a (varying) number of tenants. This also tends to
flatten out the update process on the assumption that your smaller
tenants also don't update their data as often.

However, I really have to question one of your basic statements:

"This works fine with aggressive autowarming, but I have a need to reduce my NRT
search capabilities to seconds as opposed to the minutes it is at now,"...

The implication here is that your autowarming takes minutes. Very
often people severely overdo the warmup by setting their autowarm
counts to 100s or 1000s. This is rarely necessary, especially if you
use docValues fields appropriately. Very often much of autowarming is
"uninverting" fields (look in your Solr log). Essentially for any
field you see this, use docValues and loading will be much faster.
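
i.e., in the schema, something like (field name is just an example):

  <field name="category_id" type="string" indexed="true" stored="false"
         docValues="true"/>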

You also haven't said how many documents you have in a shard at
present. This is actually the metric I use most often to size
hardware. I claim you can find a sweet spot where minimal autowarming
will give you good enough performance, and that number is what you
should design to. Of course YMMV.

Finally: push back really hard on how aggressive NRT support needs to
be. Often "requirements" like this are made without much thought as
"faster is better, let's make it 1 second!". There are situations
where that's true, but it comes at a cost. Users may be better served
by a predictable but fast system than one that's fast but
unpredictable. "Documents may take up to 5 minutes to appear and
searches will usually take less than a second" is nice and concise. I
have my expectations. "Documents are searchable in 1 second, but the
results may not come back for between 1 and 10 seconds" is much more
frustrating.

FWIW,
Erick
