Posted to solr-user@lucene.apache.org by "Vikram B. Kumar" <vi...@gmail.com> on 2009/02/26 12:44:43 UTC

What is the best scalable scheme to support multiple users?

Hi All,

Our web-based document management system has a few thousand users and is 
growing rapidly. As with any SaaS, while we support a lot of customers, 
only a few of them (those logged in) will be reading their index, and only 
a subset of those logged in (those adding documents) will be writing 
to their index.

i.e. TU > L > U

and TU ~ 100 x L

where TU is the total number of users, L is the number of logged-in users 
who are searching, and U is the number of uploaders who are updating 
their index.

We have been using Lucene behind a simple RESTful server for searching. 
Indexing is currently done by a regular Java SE based setup rather than 
a server. We are thinking about moving to Solr to scale better and to 
get rid of the latency associated with our non-live Java SE based 
indexer. We have a custom Analyzer/Filter that adds a payload to each 
term to support our web-based service.

My question is about how best to partition the index to support 
multiple users.

Hardware: The servers I have are 64-bit, 1.7 GHz, 2 x dual core (i.e. 4 
cores in total) with 1/2 TB disks. By my estimate, 1/2 TB can support 
8000-10000 users before I need to start sharding them across multiple hosts.
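
As a rough check of the estimate above, the per-user disk budget implied by
those figures can be worked out as follows. This is only a sketch: it assumes
"1/2 TB" means 500 GB in decimal units and that the disk holds nothing but
index data. The function name is illustrative.

```python
def per_user_budget_mb(disk_gb: float, users: int) -> float:
    """Average index space per user in MB, assuming 1 GB = 1000 MB."""
    return disk_gb * 1000 / users


DISK_GB = 500  # assumption: "1/2 TB" read as 500 GB

for users in (8000, 10000):
    budget = per_user_budget_mb(DISK_GB, users)
    print(f"{users} users -> {budget:.1f} MB per user")
```

Under these assumptions the estimate corresponds to roughly 50 to 62.5 MB of
index per user.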

I have thought of the following options:

1. One monolithic index, but index files segmented by user_id field.

2. MultiCore - One core per user.

3. Multiple Solr instances - Non scalable.

4. Don't use Solr, but enhance our Lucene + RESTful server model to 
support indexing as well. This is the least favored approach, as we would 
be redoing a lot of things that Solr already does (replication, live 
add/update/delete). Most of what we are doing can be done with 
Solr's pluggable query handlers. (I guess this is not a true option at all.)

I am currently favouring Option 2, though I want to try out whether 
Option 1 works as well.
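
For option 1, one common way to keep each user's results separate in a single
index is to add a filter query on the user field to every search. A minimal
sketch, assuming an indexed field named user_id as in the list above and
Solr's standard q and fq request parameters. The host, port and path in
BASE_URL are placeholders, not part of the setup described here.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"  # placeholder host and path


def user_search_url(user_id: int, query: str) -> str:
    """Build a search URL restricted to one user's documents."""
    params = {
        "q": query,
        "fq": f"user_id:{user_id}",  # assumes an indexed user_id field
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(user_search_url(42, "quarterly report"))
```

Because the filter is added on the server side of the application, a user
cannot widen a search to other users' documents by editing the query text.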

It looks like one of the most obvious problems with MultiCore is the "too 
many open files" error, which can be handled with hardware and 
software limits (properly closing an index after updating it and after 
users log out).
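
The idea of bounding open files by closing idle indexes can be sketched as a
small least-recently-used cache of open handles. This is an illustration
only: OpenIndexCache, open_fn and close_fn are hypothetical names, not Solr
APIs, and the sketch says nothing about how Solr itself manages cores.

```python
from collections import OrderedDict
from typing import Any, Callable


class OpenIndexCache:
    """Keeps at most max_open per-user index handles open at once."""

    def __init__(self, max_open: int,
                 open_fn: Callable[[int], Any],
                 close_fn: Callable[[Any], None]) -> None:
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self._max_open = max_open
        self._open_fn = open_fn      # hypothetical: opens a user's index
        self._close_fn = close_fn    # hypothetical: closes an open index
        self._handles: "OrderedDict[int, Any]" = OrderedDict()

    def get(self, user_id: int) -> Any:
        """Return the user's handle, opening it if needed."""
        if user_id in self._handles:
            self._handles.move_to_end(user_id)  # mark as recently used
            return self._handles[user_id]
        handle = self._open_fn(user_id)
        self._handles[user_id] = handle
        # Close least recently used handles until we are within the limit.
        while len(self._handles) > self._max_open:
            _, oldest = self._handles.popitem(last=False)
            self._close_fn(oldest)
        return handle

    def release(self, user_id: int) -> None:
        """Close a user's handle explicitly, for example on logout."""
        handle = self._handles.pop(user_id, None)
        if handle is not None:
            self._close_fn(handle)

    def open_count(self) -> int:
        return len(self._handles)


closed = []
cache = OpenIndexCache(2, open_fn=lambda uid: f"index-{uid}",
                       close_fn=closed.append)
for uid in (1, 2, 1, 3):  # opening user 3 closes user 2, the idle one
    cache.get(uid)
```

The limit would be chosen to stay below the operating system's open file
limit, leaving room for the several files each open index uses.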

My questions:

1. Can our analyzers/filters be plugged into Solr at indexing time?
2. Does option 2 fit the above needs? Has anybody run option 2 with 
thousands of cores in a Solr instance?
3. Does option 2 support horizontal scaling (sharding)?

Thanks,
Vikram



Re: What is the best scalable scheme to support multiple users?

Posted by Walter Underwood <wu...@netflix.com>.
With five servers, assign 1/5 of the user_ids to each server. Choose
the number of servers to handle the number of logged-in users.
Each user's searches go to the single server with their data.

Partitioning by user_id is common with relational databases.
We do this to hold our two billion movie ratings from ten
million customers.
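
The routing described above can be sketched in a few lines. This assumes
integer user ids and a simple modulo assignment (user_id % N). The server
URLs are made-up placeholders.

```python
# Placeholder URLs for five Solr servers; not real hosts.
SERVERS = [f"http://solr{i}.example.com:8983/solr" for i in range(5)]


def server_for(user_id: int, servers=SERVERS) -> str:
    """Pick the single server that holds this user's data."""
    return servers[user_id % len(servers)]


print(server_for(7))   # user 7 always goes to the same server
print(server_for(10))
```

Every search and update for a given user goes to the one server returned, so
no query has to fan out across machines.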

wunder

On 2/26/09 8:21 AM, "Vikram Kumar" <vi...@gmail.com> wrote:

> Hi Wunder,
> Can you please elaborate?
> 
> Vikram
> 
> On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
> <wu...@netflix.com> wrote:
> 
>> 1a. Multiple Solr instances partitioned by user_id%N, with index
>> files segmented by user_id field.
>> 
>> That can scale rather gracefully, though it does need reindexing
>> to add a server.
>> 
>> wunder


Re: What is the best scalable scheme to support multiple users?

Posted by Vikram Kumar <vi...@gmail.com>.
Hi Wunder,
Can you please elaborate?

Vikram

On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
<wu...@netflix.com> wrote:

> 1a. Multiple Solr instances partitioned by user_id%N, with index
> files segmented by user_id field.
>
> That can scale rather gracefully, though it does need reindexing
> to add a server.
>
> wunder

Re: What is the best scalable scheme to support multiple users?

Posted by Walter Underwood <wu...@netflix.com>.
1a. Multiple Solr instances partitioned by user_id%N, with index
files segmented by user_id field.

That can scale rather gracefully, though it does need reindexing
to add a server.
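
To see why adding a server means reindexing, compare where each user lands
under user_id%N before and after N grows. A small sketch, assuming sequential
integer user ids. The 10,000 user count is illustrative.

```python
def moved_fraction(num_users: int, old_n: int, new_n: int) -> float:
    """Fraction of users whose server changes when N goes from old_n to new_n."""
    moved = sum(1 for uid in range(num_users)
                if uid % old_n != uid % new_n)
    return moved / num_users


# Going from 5 to 6 servers relocates most users, so their
# documents have to be indexed again on a different server.
print(f"{moved_fraction(10000, 5, 6):.1%} of users move")
```

With plain modulo assignment most users change servers whenever N changes,
which is the reindexing cost mentioned above.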

wunder
