Posted to solr-user@lucene.apache.org by "Vikram B. Kumar" <vi...@gmail.com> on 2009/02/26 12:44:43 UTC
What is the best scalable scheme to support multiple users?
Hi All,
Our web-based document management system has a few thousand users and is
growing rapidly. Like any SaaS, while we support a lot of customers,
only a few of them (those logged in) will be reading their index and only
a subset of those logged in (who are adding documents) will be writing
to their index.
i.e., TU > L > U
and TU ~ 100 x L
where TU is the total number of users, L is the logged-in users who are
searching, and U is the uploaders who are updating their index.
We have been using Lucene over a simple RESTful server for searching.
Indexing is currently done using a regular Java SE based setup, instead of
a server. We are thinking about moving to Solr to scale better and to
get rid of the latency associated with our non-live JavaSE based
indexer. We have a custom Analyzer/Filter that adds some payload to each
term to support our web based service.
My message is about how best to partition the index to support
multiple users.
Hardware: The servers I have are 64-bit, 1.7GHz, 2 x dual core (i.e., 4
cores in total) with 1/2 TB disks. By my estimate, 1/2 TB can support
8000-10000 users before I need to start sharding them across multiple hosts.
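As a rough worked example of the ratios above: combining TU ~ 100 x L with the 8000-10000 users-per-host estimate gives the number of logged-in (searching) users a single host would see at once. This is only a sketch of that arithmetic; the function name is made up for illustration.

```python
# Ratio taken from the message: TU ~ 100 x L
USERS_PER_LOGGED_IN_USER = 100


def logged_in_estimate(total_users: int) -> int:
    """Estimate concurrently logged-in users (L) from total users (TU)."""
    return total_users // USERS_PER_LOGGED_IN_USER


# The per-host range quoted in the message
for total in (8000, 10000):
    print(total, "users on a host ->", logged_in_estimate(total), "logged in")
```

So a host at capacity would be serving searches for roughly 80 to 100 users at a time, and fewer still would be uploading.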
I have thought of the following options:
1. One Monolithic index, but index files segmented by user_id field.
2. MultiCore - One core per user.
3. Multiple Solr instances - Non scalable.
4. Don't use Solr, but enhance our Lucene + RESTful server model to
support indexing as well. - Least favored approach as we will be doing a
lot of things that Solr already does (replication, live
add/update/delete). Most of the things we are doing can be done with
Solr's pluggable query handlers. (I guess this is not a true option at all).
I am currently favouring Option 2, though I want to try out whether 1 works
as well.
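To make the difference between options 1 and 2 concrete, here is a minimal sketch of how a search request might be addressed in each case. It assumes option 1 is implemented with a Solr filter query (`fq`) on the `user_id` field and option 2 with one named core per user; the host, port, and `user_<id>` core naming are hypothetical.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # hypothetical host and port


def shared_index_url(user_id: int, query: str) -> str:
    """Option 1: one shared index, restricted to a user by a filter query."""
    params = urlencode({"q": query, "fq": f"user_id:{user_id}"})
    return f"{SOLR_BASE}/select?{params}"


def per_core_url(user_id: int, query: str) -> str:
    """Option 2: one core per user, selected by the core name in the path."""
    params = urlencode({"q": query})
    return f"{SOLR_BASE}/user_{user_id}/select?{params}"


print(shared_index_url(42, "invoice"))
print(per_core_url(42, "invoice"))
```

With option 1 every request must carry the user filter, while with option 2 the isolation comes from the core itself.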
It looks like the most obvious problem with MultiCore is "too many open
files" errors, which can be handled with hardware and software limits
(properly closing the index after updating and after users log out).
My questions:
1. Can our analyzers/filters be plugged into Solr at indexing time?
2. Does option 2 fit the above needs? Has anybody done option 2 with
thousands of cores in a Solr instance?
3. Does option 2 support horizontal scaling (sharding)?
Thanks,
Vikram
Re: What is the best scalable scheme to support multiple users?
Posted by Walter Underwood <wu...@netflix.com>.
With five servers, assign 1/5 of user_id's to each server. Choose
the number of servers to handle the number of logged-in users.
Each user's searches go to the single server with their data.
Partitioning by user_id is common with relational databases.
We do this to hold our two billion movie ratings from ten
million customers.
wunder
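A minimal sketch of the routing described above, assuming integer user ids and a fixed list of five servers. The server URLs are placeholders, not real hosts.

```python
# Hypothetical server list; five servers as in the example above
SERVERS = [f"http://solr{i}.example.com:8983/solr" for i in range(5)]


def server_for(user_id: int, servers=SERVERS) -> str:
    """Return the single server that holds this user's documents."""
    return servers[user_id % len(servers)]


print(server_for(12))  # 12 % 5 == 2, so this user lives on solr2
```

Both indexing and searching for a user go through the same function, so each request touches exactly one server.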
On 2/26/09 8:21 AM, "Vikram Kumar" <vi...@gmail.com> wrote:
> Hi Wunder,
> Can you please elaborate?
>
> Vikram
>
> On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
> <wu...@netflix.com> wrote:
>
>> 1a. Multiple Solr instances partitioned by user_id%N, with index
>> files segmented by user_id field.
>>
>> That can scale rather gracefully, though it does need reindexing
>> to add a server.
>>
>> wunder
Re: What is the best scalable scheme to support multiple users?
Posted by Vikram Kumar <vi...@gmail.com>.
Hi Wunder,
Can you please elaborate?
Vikram
On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
<wu...@netflix.com> wrote:
> 1a. Multiple Solr instances partitioned by user_id%N, with index
> files segmented by user_id field.
>
> That can scale rather gracefully, though it does need reindexing
> to add a server.
>
> wunder
Re: What is the best scalable scheme to support multiple users?
Posted by Walter Underwood <wu...@netflix.com>.
1a. Multiple Solr instances partitioned by user_id%N, with index
files segmented by user_id field.
That can scale rather gracefully, though it does need reindexing
to add a server.
wunder
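The reindexing cost mentioned above can be estimated with a short sketch: under user_id%N partitioning, a user has to be reindexed on another server whenever the remainder changes as N grows. The function name and the sample range of ids are illustrative only.

```python
def moved_fraction(user_ids, old_n: int, new_n: int) -> float:
    """Fraction of users whose server changes when N goes from old_n to new_n."""
    moved = sum(1 for u in user_ids if u % old_n != u % new_n)
    return moved / len(user_ids)


# Growing from 5 to 6 servers over a sample of 30 consecutive ids
print(moved_fraction(range(30), 5, 6))
```

For consecutive ids, going from 5 to 6 servers moves about five users in six, which is why adding a server under this scheme means reindexing most of the data.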