Posted to solr-user@lucene.apache.org by Chamil Jeewantha <kd...@gmail.com> on 2016/08/26 19:13:05 UTC

Solr for Multi Tenant architecture

Dear Solr Members,

We are using SolrCloud as the search provider for a multi-tenant, cloud-based
application. We have one schema for all the tenants. The indexes will hold a
large number (millions) of documents.

Based on our research, we have two options:

   - One large collection for all the tenants and use Composite-ID routing
   - Collection per tenant

The mail below says:


https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201403.mbox/%3C5324CD4B.2020309@protulae.com%3E

SolrCloud is *more scalable in terms of index size*. Plus you get
redundancy which can't be underestimated in a hosted solution.


AND

The issue is management. 1000s of cores/collections require a level of
automation. On the other hand, having a single core/collection means if
you make one change to the schema or solrconfig, it affects everyone.


Based on the above, we think one large collection will be the way to go.

Questions:

   1. Is that the right way to go?
   2. Will it be a hassle when we need to do reindexing?
   3. What is the chance of the entire collection crashing? (In that case all
   tenants will be affected and reindexing will be painful.)

Thank you in advance for your kind opinion.

Best Regards,
Chamil

-- 
http://kavimalla.blgospot.com
http://kdchamil.blogspot.com

Re: Solr for Multi Tenant architecture

Posted by Chamil Jeewantha <kd...@gmail.com>.
Dear all,

Thank you for all your advice.

This comment says:

"SolrCloud starts to have serious problems when you create a lot of
collections.
We are aware of the scalability issues, but they are not easy to fix."

http://lucene.472066.n3.nabble.com/Fwd-Solr-Cloud-6-0-0-hangs-when-creating-large-amount-of-collections-and-node-fails-to-recover-aftert-tp4276364p4276404.html

So I am worried that this will affect us when our system grows beyond
thousands of tenants.

One approach I can think of is adding a custom load-balancing mechanism that
routes tenants to different Solr clusters. Is there an easier way of dealing
with this situation?
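
A minimal sketch of such a tenant-to-cluster router, in Python. The cluster
URLs, the pinned-tenant table and the function name are all made up for
illustration; nothing here is a Solr API.

```python
import hashlib

# Hypothetical cluster base URLs; replace with your own.
CLUSTERS = [
    "http://solr-cluster-a.example.com:8983/solr",
    "http://solr-cluster-b.example.com:8983/solr",
    "http://solr-cluster-c.example.com:8983/solr",
]

# Explicit placements override the hash, e.g. for very large tenants.
PINNED = {"big-tenant": "http://solr-cluster-c.example.com:8983/solr"}


def cluster_for(tenant_id, clusters=CLUSTERS, pinned=PINNED):
    """Return the base URL of the cluster that serves tenant_id."""
    if tenant_id in pinned:
        return pinned[tenant_id]
    # Use a stable hash so the assignment is the same in every process
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha1(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


print(cluster_for("tenant-42"))
```

Plain modulo hashing moves most tenants when a cluster is added, so in
practice a stored tenant-to-cluster lookup table is probably easier to
operate; the hash is only a default for tenants that are not yet placed.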

Best Regards,
Chamil


Re: Solr for Multi Tenant architecture

Posted by Emir Arnautovic <em...@sematext.com>.
Hi Chamil,

One thing to consider is relevancy, especially if the tenants' domains
are different (e.g. one is tech and another is pharmacy). If you go with one
collection and use the same field (e.g. desc) for all tenants, the term
statistics for that field are shared across tenants, which could skew result
ordering if you sort by score (e.g. the word 'cream' might be infrequent in
the tech tenant but become frequent overall because of a large pharmacy
tenant).
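
To make the skew concrete, here is a small sketch that computes the inverse
document frequency (IDF) of 'cream' for the tech tenant alone and for a shared
index. It assumes a BM25-style IDF, ln(1 + (N - n + 0.5) / (n + 0.5)); which
similarity your Solr actually uses depends on version and schema. All document
counts are invented.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Invented numbers: 'cream' is rare for the tech tenant, common in pharmacy.
tech_docs, tech_cream = 1_000_000, 100
pharma_docs, pharma_cream = 9_000_000, 900_000

idf_tech_only = bm25_idf(tech_docs, tech_cream)
idf_shared = bm25_idf(tech_docs + pharma_docs, tech_cream + pharma_cream)

print(f"IDF of 'cream', tech tenant alone: {idf_tech_only:.2f}")
print(f"IDF of 'cream', shared index:      {idf_shared:.2f}")
```

In the shared index the term looks far more common, so a match on 'cream'
contributes less to the score for a tech-tenant query than it would in a
tech-only index.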

On the other hand, having a large number of collections could also be
problematic. You can address that by splitting tenants across multiple
clusters, or by having dedicated collections for large tenants and grouping
smaller tenants by domain.

Make sure that you use routing by tenant ID in the case of a multi-tenant
collection.
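
A rough sketch of what routing by tenant ID can look like with the compositeId
router: the document ID is prefixed with the tenant ID and a '!', and queries
pass the same prefix in the _route_ parameter. The helper names and the
tenant_id field are assumptions for illustration, not part of any schema
discussed here.

```python
def composite_id(tenant_id, doc_id):
    """Build a compositeId-style document ID: '<tenant>!<doc>'."""
    if "!" in tenant_id:
        raise ValueError("tenant_id must not contain '!'")
    return f"{tenant_id}!{doc_id}"


def tenant_query_params(tenant_id, user_query):
    """Query parameters that restrict a search to a single tenant."""
    return {
        "q": user_query,
        # Still filter by tenant: one shard can hold several tenants' docs.
        # 'tenant_id' is an assumed field name in the shared schema.
        "fq": f"{{!term f=tenant_id}}{tenant_id}",
        # _route_ limits the request to the shard(s) owning this prefix.
        "_route_": f"{tenant_id}!",
    }


print(composite_id("acme", "42"))
print(tenant_query_params("acme", "desc:cream"))
```

The filter query is what keeps tenants' data apart; _route_ only avoids
fanning the request out to shards that cannot contain the tenant's documents.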

HTH,
Emir



-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Solr for Multi Tenant architecture

Posted by Walter Underwood <wu...@wunderwood.org>.
Apple gave a presentation on massive multi-tenancy. I haven't watched it yet, but it might help.

https://www.youtube.com/watch?v=_Erkln5WWLw

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Solr for Multi Tenant architecture

Posted by Chamil Jeewantha <kd...@gmail.com>.
Thank you everyone for your great support.

I will update you with our final approach.

Best regards,
Chamil


Re: Solr for Multi Tenant architecture

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
In my own work, the risk to the business if every single client cannot
access search is so great that we would never consider putting everything
in one collection.  You should certainly ask that question of the business
stakeholders before you decide.

For that reason, I might recommend that each of the multiple collections
suggested above by Erick could also be on a separate SolrCloud (or single
Solr instance) so that no single failure can ever take down every tenant's
ability to search -- only those on that particular SolrCloud...


Re: Solr for Multi Tenant architecture

Posted by Erick Erickson <er...@gmail.com>.
There's no one right answer here. I've also seen a hybrid approach
where there are multiple collections, each of which has some
number of tenants resident. Eventually, you need to think about some
kind of partitioning; my rough limit for the number of documents in a
single core is 50M (NOTE: I've seen between 10M and 300M docs fit in a core).

All that said, you may also be interested in the "transient cores"
option, see: https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
and the transient and transientCacheSize settings (the latter in solr.xml).
Note that this is stand-alone only, so you can't carry that concept over to
SolrCloud if you eventually go there.
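
As a rough illustration of the hybrid approach, the sketch below groups
tenants so that each group stays under a document budget. It uses the 50M
figure as the default and treats each group as one single-shard collection,
which is a simplifying assumption; the function name and tenant sizes are
invented.

```python
def group_tenants(doc_counts, max_docs=50_000_000):
    """Greedy first-fit-decreasing grouping of tenants under a doc budget.

    doc_counts maps tenant ID -> expected document count. Returns a list of
    groups, each a list of tenant IDs. A tenant bigger than max_docs ends up
    alone in its own group and would need a multi-shard collection.
    """
    groups = []  # each entry is [total_docs, [tenant_ids]]
    # Largest tenants first; ties broken by tenant ID for a stable result.
    ordered = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for tenant, count in ordered:
        for group in groups:
            if group[0] + count <= max_docs:
                group[0] += count
                group[1].append(tenant)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([count, [tenant]])
    return [tenants for _, tenants in groups]


sizes = {"a": 40_000_000, "b": 30_000_000, "c": 20_000_000, "d": 10_000_000}
print(group_tenants(sizes))
```

The real limit depends on document size, query load and hardware, so the
budget should come from testing rather than from the default used here.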

Best,
Erick


Re: Solr for Multi Tenant architecture

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/26/2016 1:13 PM, Chamil Jeewantha wrote:
> We are using SolrCloud as the search provider of a multi-tenant cloud based
> application. We have one schema for all the tenants. The indexes will have
> large number(millions) of documents.
>
> As of our research, we have two options,
>
>    - One large collection for all the tenants and use Composite-ID routing
>    - Collection per tenant

I would tend to agree that you should use SolrCloud.  And to avoid
potential problems, each tenant should have their own collection or
collections.

You probably also need to put a smart load balancer in front of Solr
that can restrict access to URL paths containing the collection names to
the source addresses for each tenant.  The tenants should have no access
to the admin UI, because it's not possible to keep people using the
admin UI from seeing collections that aren't theirs.  Developing that
kind of security could be possible, but won't be easy at all.
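
The kind of check such a load balancer would apply can be sketched as
follows. The tenant table, the networks and the set of allowed handlers are
hypothetical, and a real deployment would express this in the load balancer's
own configuration rather than in Python.

```python
import ipaddress

# Hypothetical mapping: tenant -> allowed source networks and collections.
TENANTS = {
    "acme": {
        "networks": [ipaddress.ip_network("203.0.113.0/24")],
        "collections": {"acme_products"},
    },
    "globex": {
        "networks": [ipaddress.ip_network("198.51.100.0/24")],
        "collections": {"globex_docs", "globex_logs"},
    },
}

# Assumption: tenants may only search; updates go through another channel.
ALLOWED_HANDLERS = {"select"}


def is_allowed(source_ip, path, tenants=TENANTS):
    """Allow only /solr/<collection>/<handler> for the tenant owning source_ip."""
    parts = [p for p in path.split("?")[0].split("/") if p]
    # Anything that is not exactly /solr/<collection>/<handler> is denied,
    # which also covers the admin UI and the admin APIs.
    if len(parts) != 3 or parts[0] != "solr" or parts[2] not in ALLOWED_HANDLERS:
        return False
    addr = ipaddress.ip_address(source_ip)
    for tenant in tenants.values():
        if any(addr in net for net in tenant["networks"]):
            return parts[1] in tenant["collections"]
    return False  # unknown source address


print(is_allowed("203.0.113.5", "/solr/acme_products/select?q=*:*"))
print(is_allowed("203.0.113.5", "/solr/globex_docs/select?q=*:*"))
```

Denying by default and allowing only known collection paths is what keeps the
admin UI and other tenants' collections out of reach.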

If access to the admin UI is something that your customers demand, then
I think you'll need to have an entire cloud per tenant -- which probably
means you're going to want to delve into virtualization, possibly using
one of the lightweight implementations like Docker.  Note that if you
take this path, you're going to need a LOT of RAM -- much more than you
might imagine.

Thanks,
Shawn