You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Benjamin Reitzammer <br...@gmail.com> on 2005/08/04 10:54:01 UTC

Very large number of indices in distributed environment

Hi,
we are in the process of planning a search feature of a product and we
are having quite a hard time figuring out the "right" way to do it.

The requirements for our app are the following:
1) Large number of indices (at _least_ 10000)
2) The amount of data involved per index is not very high, but because
of the number of indices involved the data set will be something about
500 - 1000 GB
3) The searching capabilities must be fail safe, while it's acceptable
if deletes/updates can take some time.
4) The majority of operations will be searching the indices.

I've followed the mailing list intensively the last month and
especially the "Best Practices for Distributing Lucene Indexing and
Searching" (http://marc.theaimsgroup.com/?l=lucene-user&m=110971318020691&w=2)
and "Real time indexing and distribution to lucene on separate boxes (long)"
(http://marc.theaimsgroup.com/?l=lucene-user&m=107900097217474&w=2)
threads provided some interesting insight.

But still our requirements are a bit different.

My thoughts how the above could be handled, so far are:

1) Have one *really big* "master"  which handles all tasks related to
index manipulation. Sync the indices according to Doug's tips
http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2 out
to a cluster of slaves that are responsible for searching.
Problem: How to make sure that indices across  slaves are in sync. 
Big Problem: Syncing of this large number of indices will cause a lot
of traffic and cause already quite a load on the slaves (not to speak
of the master)

1.1) Is it safe to _search_ (only) an index mounted via NFS? If yes,
then the search boxes could mount the indices on the master box. But
this solution would probably lead to some serious perfomance issues
because of the needed disk I/O on the master.
Though I'd love to be proven wrong on this one.

2) Split up index collection into smaller portions and distribute a
certain number of indices (~ up to 1000 indices) into smaller
autonomous clusters, that are completely responsible for their
collection of indices.
Problem: How do I keep index distribution dynamic so I don't have to
hardcode where to look for a certain index (that's not a real lucene
issue, but more one of distributed computing, but nevertheless I
thought you guys might know a way to solve it).

Any ideas on this? 
Has anyone ever worked with such a large number of Lucene indices (and
the amount of data it involves)?

I appreciate your help very much.

Cheers

Benjamin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Very large number of indices in distributed environment

Posted by Rasik Pandey <rb...@gmail.com>.

> A big index is slow to merge, slow to search, and as you mentioned,
> it's slow to sync. An 1G index took me several hours to merge on a P3
> 1GHz 512MRam

Don't forget "slow to back-up".

-- 
Regards,
Rus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Very large number of indices in distributed environment

Posted by Ian Lea <ia...@gmail.com>.

In my experience, searching a read only index mounted via NFS is fine.
The NFS related issues are with locking.

I'd agree with Chris that you should try to avoid very big indexes.


--
Ian.


On 04/08/05, Chris Lu <ch...@gmail.com> wrote:
> A big index is slow to merge, slow to search, and as you mentioned,
> it's slow to sync. An 1G index took me several hours to merge on a P3
> 1GHz 512MRam.
> Mounting Lucene on NFS is also a "No GO".
> 
> I feel you choice 2 may be feasible, although complicated. That's your
> job, right. :)
> 
> BTW: I guess your app is a web hosted app for different users, is it?
> 
> --
> Chris Lu
> ------------
> Lucene Search RAD on Any Database
> http://www.dbsight.net
> 
> 
> On 8/4/05, Benjamin Reitzammer <br...@gmail.com> wrote:
> > Hi,
> > we are in the process of planning a search feature of a product and we
> > are having quite a hard time figuring out the "right" way to do it.
> >
> > The requirements for our app are the following:
> > 1) Large number of indices (at _least_ 10000)
> > 2) The amount of data involved per index is not very high, but because
> > of the number of indices involved the data set will be something about
> > 500 - 1000 GB
> > 3) The searching capabilities must be fail safe, while it's acceptable
> > if deletes/updates can take some time.
> > 4) The majority of operations will be searching the indices.
> >
> > I've followed the mailing list intensively the last month and
> > especially the "Best Practices for Distributing Lucene Indexing and
> > Searching" (http://marc.theaimsgroup.com/?l=lucene-user&m=110971318020691&w=2)
> > and "Real time indexing and distribution to lucene on separate boxes (long)"
> > (http://marc.theaimsgroup.com/?l=lucene-user&m=107900097217474&w=2)
> > threads provided some interesting insight.
> >
> > But still our requirements are a bit different.
> >
> > My thoughts how the above could be handled, so far are:
> >
> > 1) Have one *really big* "master"  which handles all tasks related to
> > index manipulation. Sync the indices according to Doug's tips
> > http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2 out
> > to a cluster of slaves that are responsible for searching.
> > Problem: How to make sure that indices across  slaves are in sync.
> > Big Problem: Syncing of this large number of indices will cause a lot
> > of traffic and cause already quite a load on the slaves (not to speak
> > of the master)
> >
> > 1.1) Is it safe to _search_ (only) an index mounted via NFS? If yes,
> > then the search boxes could mount the indices on the master box. But
> > this solution would probably lead to some serious perfomance issues
> > because of the needed disk I/O on the master.
> > Though I'd love to be proven wrong on this one.
> >
> > 2) Split up index collection into smaller portions and distribute a
> > certain number of indices (~ up to 1000 indices) into smaller
> > autonomous clusters, that are completely responsible for their
> > collection of indices.
> > Problem: How do I keep index distribution dynamic so I don't have to
> > hardcode where to look for a certain index (that's not a real lucene
> > issue, but more one of distributed computing, but nevertheless I
> > thought you guys might know a way to solve it).
> >
> > Any ideas on this?
> > Has anyone ever worked with such a large number of Lucene indices (and
> > the amount of data it involves)?
> >
> > I appreciate your help very much.
> >
> > Cheers
> >
> > Benjamin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Very large number of indices in distributed environment

Posted by Chris Lu <ch...@gmail.com>.

A big index is slow to merge, slow to search, and as you mentioned,
it's slow to sync. An 1G index took me several hours to merge on a P3
1GHz 512MRam.
Mounting Lucene on NFS is also a "No GO".

I feel you choice 2 may be feasible, although complicated. That's your
job, right. :)

BTW: I guess your app is a web hosted app for different users, is it?

-- 
Chris Lu
------------
Lucene Search RAD on Any Database
http://www.dbsight.net


On 8/4/05, Benjamin Reitzammer <br...@gmail.com> wrote:
> Hi,
> we are in the process of planning a search feature of a product and we
> are having quite a hard time figuring out the "right" way to do it.
> 
> The requirements for our app are the following:
> 1) Large number of indices (at _least_ 10000)
> 2) The amount of data involved per index is not very high, but because
> of the number of indices involved the data set will be something about
> 500 - 1000 GB
> 3) The searching capabilities must be fail safe, while it's acceptable
> if deletes/updates can take some time.
> 4) The majority of operations will be searching the indices.
> 
> I've followed the mailing list intensively the last month and
> especially the "Best Practices for Distributing Lucene Indexing and
> Searching" (http://marc.theaimsgroup.com/?l=lucene-user&m=110971318020691&w=2)
> and "Real time indexing and distribution to lucene on separate boxes (long)"
> (http://marc.theaimsgroup.com/?l=lucene-user&m=107900097217474&w=2)
> threads provided some interesting insight.
> 
> But still our requirements are a bit different.
> 
> My thoughts how the above could be handled, so far are:
> 
> 1) Have one *really big* "master"  which handles all tasks related to
> index manipulation. Sync the indices according to Doug's tips
> http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2 out
> to a cluster of slaves that are responsible for searching.
> Problem: How to make sure that indices across  slaves are in sync.
> Big Problem: Syncing of this large number of indices will cause a lot
> of traffic and cause already quite a load on the slaves (not to speak
> of the master)
> 
> 1.1) Is it safe to _search_ (only) an index mounted via NFS? If yes,
> then the search boxes could mount the indices on the master box. But
> this solution would probably lead to some serious perfomance issues
> because of the needed disk I/O on the master.
> Though I'd love to be proven wrong on this one.
> 
> 2) Split up index collection into smaller portions and distribute a
> certain number of indices (~ up to 1000 indices) into smaller
> autonomous clusters, that are completely responsible for their
> collection of indices.
> Problem: How do I keep index distribution dynamic so I don't have to
> hardcode where to look for a certain index (that's not a real lucene
> issue, but more one of distributed computing, but nevertheless I
> thought you guys might know a way to solve it).
> 
> Any ideas on this?
> Has anyone ever worked with such a large number of Lucene indices (and
> the amount of data it involves)?
> 
> I appreciate your help very much.
> 
> Cheers
> 
> Benjamin
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Very large number of indices in distributed environment

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Benjamin,

I don't know what exactly your application is doing, but option 2)
sounds the most manageable to me.

Otis

--- Benjamin Reitzammer <br...@gmail.com> wrote:

> Hi,
> we are in the process of planning a search feature of a product and
> we
> are having quite a hard time figuring out the "right" way to do it.
> 
> The requirements for our app are the following:
> 1) Large number of indices (at _least_ 10000)
> 2) The amount of data involved per index is not very high, but
> because
> of the number of indices involved the data set will be something
> about
> 500 - 1000 GB
> 3) The searching capabilities must be fail safe, while it's
> acceptable
> if deletes/updates can take some time.
> 4) The majority of operations will be searching the indices.
> 
> I've followed the mailing list intensively the last month and
> especially the "Best Practices for Distributing Lucene Indexing and
> Searching"
> (http://marc.theaimsgroup.com/?l=lucene-user&m=110971318020691&w=2)
> and "Real time indexing and distribution to lucene on separate boxes
> (long)"
> (http://marc.theaimsgroup.com/?l=lucene-user&m=107900097217474&w=2)
> threads provided some interesting insight.
> 
> But still our requirements are a bit different.
> 
> My thoughts how the above could be handled, so far are:
> 
> 1) Have one *really big* "master"  which handles all tasks related to
> index manipulation. Sync the indices according to Doug's tips
> http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2 out
> to a cluster of slaves that are responsible for searching.
> Problem: How to make sure that indices across  slaves are in sync. 
> Big Problem: Syncing of this large number of indices will cause a lot
> of traffic and cause already quite a load on the slaves (not to speak
> of the master)
> 
> 1.1) Is it safe to _search_ (only) an index mounted via NFS? If yes,
> then the search boxes could mount the indices on the master box. But
> this solution would probably lead to some serious perfomance issues
> because of the needed disk I/O on the master.
> Though I'd love to be proven wrong on this one.
> 
> 2) Split up index collection into smaller portions and distribute a
> certain number of indices (~ up to 1000 indices) into smaller
> autonomous clusters, that are completely responsible for their
> collection of indices.
> Problem: How do I keep index distribution dynamic so I don't have to
> hardcode where to look for a certain index (that's not a real lucene
> issue, but more one of distributed computing, but nevertheless I
> thought you guys might know a way to solve it).
> 
> Any ideas on this? 
> Has anyone ever worked with such a large number of Lucene indices
> (and
> the amount of data it involves)?
> 
> I appreciate your help very much.
> 
> Cheers
> 
> Benjamin
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org