Posted to java-user@lucene.apache.org by ke...@ckhill.com on 2004/02/10 04:29:36 UTC

Index advice...

Hey Lucene-users,

I'm setting up a Lucene index on 5G of PDF files (full-text search).  I've 
been really happy with Lucene so far but I'm curious what tips and strategies 
I can use to optimize my performance at this large size.

So far I am using pretty much all of the defaults (I'm new to Lucene).

I am using PDFBox to add the documents to the index.
I can usually add about 800 or so PDF files and then the add loop:

for ( int i = 0; i < fileNames.length; i++ ) {
	Document doc = IndexFile.index(baseDirectory + documentRoot + fileNames[i]);
	writer.addDocument(doc);
}


really starts to slow down.  Doesn't seem to be memory related.
Thoughts anyone?

Thanks in advance,
CK Hill



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Index advice...

Posted by petite_abeille <pe...@mac.com>.
On Feb 10, 2004, at 14:03, Scott ganyo wrote:

> I have.  While the time for document.add() itself doesn't increase
> over time, the time spent merging does.  Ways of partially overcoming
> this include increasing the mergeFactor (but this will increase the
> number of file handles used), or building blocks of the index in
> memory and then merging them to disk.  This has been discussed before,
> so you should be able to find additional information on this fairly
> easily.

This is what I noticed also: adding documents by itself is a fairly 
benign operation, but anything that triggers an index merge in one form 
or another is a killer as an index grows in size.

So, overall, adding more documents does slow down the indexing.

At least this is the impression I get. But I would love to be proven 
wrong on this :)

Cheers,

PA.




Re: Index advice...

Posted by Scott ganyo <sc...@ganyo.com>.
I have.  While the time for document.add() itself doesn't increase over
time, the time spent merging does.  Ways of partially overcoming this
include increasing the mergeFactor (but this will increase the number of
file handles used), or building blocks of the index in memory and then
merging them to disk.  This has been discussed before, so you should be
able to find additional information on this fairly easily.
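The "build blocks of the index in memory, then merge to disk" idea can be sketched roughly as follows against the Lucene 1.x API of the time (RAMDirectory, FSDirectory, IndexWriter.addIndexes). The batch size, index path, and mergeFactor value are illustrative assumptions, and IndexFile.index() is the original poster's own helper, so treat this as a sketch rather than tested code:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BatchedIndexer {
    static final int BATCH = 500; // documents per in-memory block (tune this)

    public static void index(String[] fileNames, String base) throws Exception {
        Directory fsDir = FSDirectory.getDirectory("/path/to/index", true);
        IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), true);
        fsWriter.mergeFactor = 20; // higher = fewer merges, but more open files

        for (int start = 0; start < fileNames.length; start += BATCH) {
            // Build a small index entirely in RAM...
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
            int end = Math.min(start + BATCH, fileNames.length);
            for (int i = start; i < end; i++) {
                Document doc = IndexFile.index(base + fileNames[i]); // poster's helper
                ramWriter.addDocument(doc);
            }
            ramWriter.close();
            // ...then pay the disk-merge cost once per batch, not per document.
            fsWriter.addIndexes(new Directory[] { ramDir });
        }
        fsWriter.optimize();
        fsWriter.close();
    }
}
```

The point of the design is that RAM-resident merges are cheap, so the expensive on-disk merge work happens once per few hundred documents instead of cascading on individual adds.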

Scott

On Feb 10, 2004, at 7:55 AM, Otis Gospodnetic wrote:

> --- Leo Galambos <Le...@seznam.cz> wrote:
>> Otis, can you point me to some proofs that time of "insert" operation
>> does not depend on the index size, please? Amortized time of "insert"
>> is O(log(docsIndexed/mergeFac)), I think.
>
> This would imply that Lucene gets slower as it adds more documents to
> the index.  Have you observed this behaviour?  I haven't.
>
>> Thus I do not know how it could be O(1).
>
> ~ O(1) is what I have observed through experiments with indexing of
> several million documents.
>
> Otis

Re: Index advice...

Posted by Leo Galambos <Le...@seznam.cz>.
Otis Gospodnetic napsal(a):

> --- Leo Galambos <Le...@seznam.cz> wrote:
>> Otis Gospodnetic napsal(a):
>>
>>>> Thus I do not know how it could be O(1).
>>>
>>> ~ O(1) is what I have observed through experiments with indexing of
>>> several million documents.
>>
>> What exactly did you measure? Just the time of the insert operation
>> (incl. merge(), of course)? Was it a test on real documents?
>
> I didn't really measure anything; I only observed this, as my focus was
> on something else, not performance measurements.
> It is true that every time an insert/add triggers a merge operation,
> things will slow down, but from what I recall (and this was about a
> year ago), the overall performance was steady as the index grew.

Try the same test with mergeFactor=2, and you will see the difference.

Leo



Re: Index advice...

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Leo Galambos <Le...@seznam.cz> wrote:
> Otis Gospodnetic napsal(a):
>
>>> Thus I do not know how it could be O(1).
>>
>> ~ O(1) is what I have observed through experiments with indexing of
>> several million documents.
>
> What exactly did you measure? Just the time of the insert operation
> (incl. merge(), of course)? Was it a test on real documents?

I didn't really measure anything; I only observed this, as my focus was
on something else, not performance measurements.
It is true that every time an insert/add triggers a merge operation,
things will slow down, but from what I recall (and this was about a
year ago), the overall performance was steady as the index grew.

The documents were artificially created from random dictionary words.
Their size was variable, but not by a lot.

Otis




Re: Index advice...

Posted by Leo Galambos <Le...@seznam.cz>.
Otis Gospodnetic napsal(a):

>> Thus I do not know how it could be O(1).
>
> ~ O(1) is what I have observed through experiments with indexing of
> several million documents.

What exactly did you measure? Just the time of the insert operation
(incl. merge(), of course)? Was it a test on real documents?

THX
Leo



Re: Index advice...

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Leo Galambos <Le...@seznam.cz> wrote:
> Otis Gospodnetic napsal(a):
>
>> Without seeing more information/code, I can't tell which part of your
>> system slows down with time, but I can tell you that Lucene's 'add'
>> does not slow over time (i.e. as the index gets larger).  Therefore, I
>> would look elsewhere for causes of the slowdown.
>
> Otis, can you point me to some proofs that time of "insert" operation
> does not depend on the index size, please? Amortized time of "insert"
> is O(log(docsIndexed/mergeFac)), I think.

This would imply that Lucene gets slower as it adds more documents to
the index.  Have you observed this behaviour?  I haven't.

> Thus I do not know how it could be O(1).

~ O(1) is what I have observed through experiments with indexing of
several million documents.

Otis


> AFAIK the issue with PDF files can be based on the PDF parser (I
> already encountered this with PDFbox).




Re: Index advice...

Posted by Leo Galambos <Le...@seznam.cz>.
Otis Gospodnetic napsal(a):

> Without seeing more information/code, I can't tell which part of your
> system slows down with time, but I can tell you that Lucene's 'add'
> does not slow over time (i.e. as the index gets larger).  Therefore, I
> would look elsewhere for causes of the slowdown.

Otis, can you point me to some proof that the time of the "insert"
operation does not depend on the index size, please? Amortized time of
"insert" is O(log(docsIndexed/mergeFac)), I think. Thus I do not know
how it could be O(1).
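Leo's O(log(docsIndexed/mergeFac)) estimate can be sanity-checked with a toy model of Lucene-style segment merging. This is an idealized sketch, not Lucene's actual merge policy: assume every mergeFactor same-sized segments get merged into one segment of the next size up, so each document is re-copied once per level it climbs, and count the copies:

```java
// Toy model of cascading segment merges: with mergeFactor m, a document
// is copied once each time its segment is merged up a level, and roughly
// n / m^k documents reach level k.  Total copies then approximate the
// amortized per-insert merge cost.
public class MergeCostSim {
    /** Approximate total document copies performed by merges while
     *  indexing docsIndexed documents with the given mergeFactor. */
    static long totalDocsCopied(long docsIndexed, int mergeFactor) {
        long copied = 0;
        long segSize = mergeFactor; // size of a freshly merged level-1 segment
        while (segSize <= docsIndexed) {
            // Each full segment of this size was produced by one merge
            // that copied segSize documents.
            copied += (docsIndexed / segSize) * segSize;
            segSize *= mergeFactor;
        }
        return copied;
    }

    public static void main(String[] args) {
        for (long n : new long[] { 1_000L, 1_000_000L }) {
            double perDoc = (double) totalDocsCopied(n, 10) / n;
            // per-document cost tracks log10(n): ~3 copies at 1k docs,
            // ~6 copies at 1M docs -- logarithmic, not O(1)
            System.out.println(n + " docs, mergeFactor=10: ~" + perDoc + " copies/doc");
        }
    }
}
```

In this model the per-document merge cost grows like log_m(n), consistent with Leo's formula; dropping mergeFactor to 2 raises it to roughly log2(n), which matches his suggestion to rerun the test with mergeFactor=2. It also squares with Otis's observation: the growth is slow enough that over a few million documents it can look approximately constant.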

Thank you.
Leo

AFAIK the issue with PDF files can be based on the PDF parser (I already 
encountered this with PDFbox).





Re: Index advice...

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Without seeing more information/code, I can't tell which part of your
system slows down with time, but I can tell you that Lucene's 'add'
does not slow over time (i.e. as the index gets larger).  Therefore, I
would look elsewhere for causes of the slowdown.
The easiest thing to do is add logging to suspicious portions of the
code.  That will narrow the scope of the code you need to analyze.
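One way to act on this is a small timing harness that measures the PDFBox parse and the addDocument() call separately, so the logs show which phase degrades. The harness below is generic and runnable; the commented loop uses the original poster's names (IndexFile, writer) and is only a sketch of how it would be wired in:

```java
public class PhaseTimer {
    /** Runs r and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Runnable r) {
        long start = System.currentTimeMillis();
        r.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        // In the original indexing loop, accumulate the two phases
        // separately, e.g.:
        //
        //   final Document[] holder = new Document[1];
        //   parseMs += timeMillis(() -> holder[0] = IndexFile.index(path));
        //   addMs   += timeMillis(() -> writer.addDocument(holder[0]));
        //
        // and print the running totals every 100 files.  If parseMs climbs
        // while addMs stays flat, the slowdown is in PDFBox, not Lucene.
        long elapsed = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i; // stand-in workload
        });
        System.out.println("busy loop took " + elapsed + " ms");
    }
}
```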

Otis


--- kevin@ckhill.com wrote:
> Hey Lucene-users,
> 
> I'm setting up a Lucene index on 5G of PDF files (full-text search). 
> I've 
> been really happy with Lucene so far but I'm curious what tips and
> strategies 
> I can use to optimize my performance at this large size.
> 
> So far I am using pretty much all of the defaults (I'm new to
> Lucene).
> 
> I am using PDFBox to add the documents to the index.
> I can usually add about 800 or so PDF files and then the add loop:
> 
> for ( int i = 0; i < fileNames.length; i++ ) {
> 	Document doc = IndexFile.index(baseDirectory + documentRoot + fileNames[i]);
> 	writer.addDocument(doc);
> }
> 
> 
> really starts to slow down.  Doesn't seem to be memory related.
> Thoughts anyone?
> 
> Thanks in advance,
> CK Hill
> 

