Posted to java-user@lucene.apache.org by Ryan Aslett <Ry...@Qsent.com> on 2004/12/23 00:45:15 UTC

addIndexes() Question

 
Hi there, I'm about to embark on a Lucene project of massive scale
(between 500 million and 2 billion documents).  I am currently working
on parallelizing the construction of the index(es).

Rough summary of my plan:
I have many, many physical machines, each with multiple processors that
I wish to dedicate to the construction of a single index. 
I plan on having each machine gather its documents from a central
synchronized source (network, JMS, whatever).
Within each machine I will have multiple threads, each responsible for
constructing an index slice.
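
For illustration only, the per-machine part of this plan could be sketched
roughly as follows. Everything here is a hypothetical stand-in: SliceBuilder,
buildSlices and the POISON sentinel are made-up names, the shared queue stands
in for the central synchronized source, and each slice is just an in-memory
list where a real build would have one Lucene index writer per thread, each
writing to its own directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: N worker threads drain one shared source, each into its own slice.
class SliceBuilder {
    // Sentinel value telling one worker to stop (an assumption of this sketch).
    static final String POISON = "\u0000END";

    static List<List<String>> buildSlices(BlockingQueue<String> source, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    // Stand-in for a per-thread index slice.
                    List<String> slice = new ArrayList<>();
                    while (true) {
                        String doc = source.take();   // blocks until a document arrives
                        if (doc.equals(POISON)) break;
                        slice.add(doc);               // a real worker would index the document here
                    }
                    return slice;
                }));
            }
            // Wait for every worker and collect the finished slices.
            List<List<String>> slices = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                slices.add(f.get());
            }
            return slices;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("slice building failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = 4;
        BlockingQueue<String> source = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1000; i++) source.add("doc-" + i);
        for (int t = 0; t < threads; t++) source.add(POISON); // one stop signal per worker
        List<List<String>> slices = buildSlices(source, threads);
        int total = 0;
        for (List<String> s : slices) total += s.size();
        System.out.println(slices.size() + " slices, " + total + " documents");
    }
}
```

Because each thread owns its slice outright, the only shared state is the
queue, so no locking is needed around the slices themselves.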

When all machines and all threads are finished, I should have a slew of
index slices that I want to combine together to create one index.

My question is this:  Will it be more efficient to call
addIndexes(Directory[] dirs) on all the slices all at once? 

Or might it be better to continually merge small indexes into a larger
index, i.e. once an index slice reaches a particular size, merge it into
the main index and start building a new slice...

Any help would be appreciated.. 

Ryan Aslett


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: addIndexes() Question

Posted by Daniel Naber <da...@t-online.de>.

On Thursday 23 December 2004 00:45, Ryan Aslett wrote:

> When all machines and all threads are finished, I should have a slew of
> index slices that I want to combine together to create one index.

You should simply skip this step and instead search the small indices with 
a ParallelMultiSearcher. This should scale much better than one huge index 
(note that ranking is currently messed up with (Parallel)MultiSearcher, see
the bug reports for a proposed fix).
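
To make the idea concrete, here is a plain-Java sketch of the fan-out and
merge pattern, not of the Lucene class itself. ShardSearch, Hit and search
are hypothetical names, and each small index is modelled as a function from
a query string to its local hits. The merge just sorts on whatever score each
shard reports, so it does nothing about the ranking problem mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: query several small indices in parallel and merge the hits.
class ShardSearch {
    // A hit as reported by one shard: document id plus that shard's score.
    static final class Hit {
        final String docId;
        final double score;
        Hit(String docId, double score) { this.docId = docId; this.score = score; }
    }

    static List<Hit> search(List<Function<String, List<Hit>>> shards, String query, int topN) {
        // Start one asynchronous search per shard.
        List<CompletableFuture<List<Hit>>> pending = new ArrayList<>();
        for (Function<String, List<Hit>> shard : shards) {
            pending.add(CompletableFuture.supplyAsync(() -> shard.apply(query)));
        }
        // Wait for all shards and pool their hits.
        List<Hit> merged = new ArrayList<>();
        for (CompletableFuture<List<Hit>> f : pending) {
            merged.addAll(f.join());
        }
        // Highest score first, then keep the top N.
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
    }

    public static void main(String[] args) {
        List<Function<String, List<Hit>>> shards = new ArrayList<>();
        shards.add(q -> Arrays.asList(new Hit("slice0/doc7", 0.91), new Hit("slice0/doc2", 0.40)));
        shards.add(q -> Arrays.asList(new Hit("slice1/doc5", 0.75)));
        for (Hit h : search(shards, "lucene", 2)) {
            System.out.println(h.docId + " " + h.score);
        }
    }
}
```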

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: addIndexes() Question

Posted by Sergiu Gordea <gs...@ifit.uni-klu.ac.at>.

I think you should change your plans a little and keep in mind that
your goal is to
create a fast search engine, not a fast indexing engine.
When you index a lot of documents, you can end up with
a lot of segments (if you don't optimize the index),
and searching will be very slow compared with searching an
optimized index.
The problem is that optimizing big indexes is a time-consuming
operation, and I think
addIndexes(Directory[] dirs) is also a time-consuming operation.

 Therefore I suggest first thinking about how to design the indices for fast search,
and then designing an offline indexing process.

 That is my suggestion ... maybe it doesn't fit your requirements, maybe it does ...

  All the best,

  Sergiu



RE: addIndexes() Question

Posted by Garrett Heaver <ga...@researchandmarkets.com>.

Hi Ryan

I too am using addIndexes(), albeit for slightly different reasons.
However, I would recommend only calling addIndexes() for fairly sizable
slices and all slices at once. The reason I'm suggesting it is that optimize
is called automagically both before and after the addIndexes method, so if
you are only adding very small slices you're optimizing the main index more
times than necessary.
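
As a back-of-the-envelope illustration of that point, here is a toy cost
model. It is not a measurement: MergeCost and its methods are made-up names,
and the model simply assumes that every merge call rewrites the whole main
index once, counting abstract "units" of data written.

```java
// Toy model (an assumption, not measured Lucene behaviour): each merge call
// rewrites the entire main index once.
class MergeCost {
    // All slices merged in a single call: the final index is written once.
    static long allAtOnce(int slices, long sliceSize) {
        return slices * sliceSize;
    }

    // Slices merged one call at a time: the growing main index is rewritten on every call.
    static long oneByOne(int slices, long sliceSize) {
        long written = 0;
        long mainSize = 0;
        for (int i = 0; i < slices; i++) {
            mainSize += sliceSize;  // main index after absorbing this slice
            written += mainSize;    // the whole main index is written again
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println("all at once: " + allAtOnce(100, 1)); // prints "all at once: 100"
        System.out.println("one by one:  " + oneByOne(100, 1));  // prints "one by one:  5050"
    }
}
```

Under this assumption the one-by-one cost grows with the square of the number
of slices, which is why fewer, larger merges look attractive.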

There is of course the obvious trade-off of "spider --> live index" time being
shorter with one method than with the other.

The other thing that I found on my machines (I'm spidering on one machine
and storing the live index on another) is that network performance isn't so
hot when you are continually opening and closing connections on other
machines to do the merge (under NT this is, Linux may be much better :) so
it made more sense for me to create larger slices and only open the
connection to the live index machine when necessary

Hope this helps

Garrett




Re: addIndexes() Question

Posted by Otis Gospodnetic <ot...@yahoo.com>.

I _think_ you'd be better off doing it all at once, but I wouldn't
trust myself on this and would instead construct a small 3-index set
and test, looking at a) maximal disk usage, b) time, and c) RAM usage.
:)
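
A small harness along these lines might do for such a test. It is only a
sketch: MergeProbe, diskUsage and measure are hypothetical names, the task
passed in would be whatever merge variant is being tried, the heap figure is
a rough before/after difference, and the disk figure is the size after the
task finishes (catching the maximum during the merge would need sampling
while it runs).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical measuring harness for comparing merge strategies on a small test set.
class MergeProbe {
    // Total size in bytes of all regular files under dir.
    static long diskUsage(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile).mapToLong(p -> {
                try {
                    return Files.size(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the task and returns { elapsed ms, rough heap delta in bytes, disk bytes afterwards }.
    static long[] measure(Runnable task, Path indexDir) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        task.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        return new long[] { elapsedMs, heapAfter - heapBefore, diskUsage(indexDir) };
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("merge-probe");
        // Stand-in task: a real test would run the merge being compared.
        long[] r = measure(() -> {
            try {
                Files.write(dir.resolve("segment.bin"), new byte[4096]);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, dir);
        System.out.println("time ms: " + r[0] + ", heap delta: " + r[1] + ", disk bytes: " + r[2]);
    }
}
```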

Otis


