Posted to solr-user@lucene.apache.org by Nico Heid <ni...@gmx.com> on 2008/04/29 10:10:09 UTC

Index splitting

Hi,
Let me first roughly describe the scenario :-)

We're trying to index data stored online for several thousand users.
The schema.xml has a custom identifier field for the user, so a filter query
(fq) can be applied and results are restricted to that user (more importantly,
a user never gets to see results from data that doesn't belong to him).
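
For example (the field name user_id and the value are just placeholders for
illustration), every request we send carries a filter query like:

    http://localhost:8983/solr/select?q=report&fq=user_id:12345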

Unfortunately, the index might become quite big (we're indexing more than
50 TB of data, all kinds of files: full text where possible (indexed only, not
stored), otherwise file info (size, date) and metadata where available).

So, our plan is this:

We're thinking of starting out with multiple Solr instances (either in their
own containers or via MultiCore; that's probably not the important point) on 1
to n machines. Let's just pretend we take the user number modulo 5 and assign
each user to one of five index masters. Each index then gets replicated to
query slaves (1 to m, depending on need).
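
To illustrate the idea (just a sketch; the method name and shard count are
made up, not from any Solr API):

    // Route a user to one of numShards index masters by taking the
    // user number modulo numShards.
    static int shardFor(long userId, int numShards) {
        return (int) (userId % numShards);
    }
    // e.g. shardFor(12, 5) == 2, so user 12's documents go to master #2.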

So now the question:
Is there a way to split an index that has grown too big into smaller ones? Or
do I have to create more instances at the beginning so that I don't run out of
capacity and space (which would add quite a bit of data redundancy)? Let's say
I miscalculated and used only 2 indices, but now I see I need at least 4.

Any idea will be very welcome,

Thanks,
Nico



Re: Index splitting

Posted by Norberto Meijome <fr...@meijome.net>.
On Tue, 29 Apr 2008 10:10:09 +0200
"Nico Heid" <ni...@gmx.com> wrote:

> So now the question:
> Is there a way to split an index that has grown too big into smaller ones? Or
> do I have to create more instances at the beginning so that I don't run out of
> capacity and space (which would add quite a bit of data redundancy)? Let's say
> I miscalculated and used only 2 indices, but now I see I need at least 4.

Hi Nico,
being able to split the index without having to reindex the lot would be a
nice option :)

One approach we use in a project I am working on is to split the full extent
of your domain (user IDs) into equal parts from the start. With this we get n
clusters, which is as much as we will ever need to grow outwards. Then we grow
each cluster in depth as needed.

It obviously helps if you have an equal (or random) distribution across your
clusters (we do). Given that you probably won't know how many users you'll
get, your case is different from ours.

To even out the distribution of user IDs across clusters, you can shard on a
function of the user ID (e.g. md5(user_id)) instead of the user ID itself.
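
A rough sketch of what I mean, in Java (the helper below is made up for
illustration and not part of Solr; it just hashes the id and maps it onto a
fixed number of clusters):

    import java.security.MessageDigest;

    // Hash the user id first so the assignment is evenly spread across
    // clusters even when the raw ids are clustered or sequential.
    static int clusterFor(String userId, int numClusters) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                                .digest(userId.getBytes("UTF-8"));
        // fold the first four digest bytes into a non-negative int
        int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
              | ((d[2] & 0xFF) << 8)  |  (d[3] & 0xFF);
        return (h & 0x7fffffff) % numClusters;
    }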

HIH,
B
_________________________
{Beto|Norberto|Numard} Meijome

Percussive Maintenance - The art of tuning or repairing equipment by hitting it.

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.