Posted to java-user@lucene.apache.org by Andreas Guther <An...@markettools.com> on 2008/03/01 02:02:29 UTC

RE: Lucene Search Performance

Just a comment; I understand that you cannot change your index:

What we did was organize our index by the creation date of the entries.
We limit our search to a given number of years, starting from the current
year.  Organizing the index that way allows us to drop outdated
information.  We can also give the user an option to search across older
indexes, if wanted.
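
A minimal sketch of that selection step, assuming one index per creation
year. The class name, the "index-<year>" naming scheme and the method names
are hypothetical and only illustrate the idea; they are not taken from our
actual setup.

```java
import java.util.ArrayList;
import java.util.List;

class YearlyIndexSelector {
    // Hypothetical naming scheme: one index directory per creation year.
    static String indexNameFor(int year) {
        return "index-" + year;
    }

    // Returns the index names to search, newest first, covering
    // `years` calendar years ending at `currentYear`.
    static List<String> indexesToSearch(int currentYear, int years) {
        if (years < 1) {
            throw new IllegalArgumentException("years must be >= 1");
        }
        List<String> names = new ArrayList<>();
        for (int y = currentYear; y > currentYear - years; y--) {
            names.add(indexNameFor(y));
        }
        return names;
    }

    public static void main(String[] args) {
        // Search the current year plus the two years before it.
        System.out.println(indexesToSearch(2008, 3));
    }
}
```

Each selected name could then be opened with its own searcher and the
searchers combined, for instance with Lucene's MultiSearcher. Dropping
outdated information amounts to deleting the oldest directory.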

Andreas

-----Original Message-----
From: Jamie [mailto:jamie@stimulussoft.com] 
Sent: Wednesday, February 27, 2008 10:17 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene Search Performance

Hi

Thanks for the suggestions. This would require us to change the index 
and right now we literally have millions of documents stored in current 
index format. I'll bear it in mind, but I am not entirely sure how I 
would go about implementing the change at this point.

Much appreciated

Jamie


h t wrote:
> 1. redefine the archivedate field as YYmmDD format,
> 2. add another field using timestamp for sort use.
> 3. use RangeFilter to get result and then sort by timestamp.
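
The three steps above could start from something like the following sketch,
which derives both field values from the existing archivedate string. It
assumes the current value is 'd' followed by yyyyMMddHHmmss, as in the query
quoted below. It uses a four-digit year (yyyyMMdd) instead of the suggested
YYmmDD so that string order matches date order, and it interprets the time
as UTC. The class and method names are hypothetical. The likely benefit is
that a range over day-resolution keys has to visit at most one term per day
rather than one per distinct second.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class ArchiveDateKeys {
    // Assumed layout of the existing field value: 'd' + yyyyMMddHHmmss.
    private static final DateTimeFormatter FULL =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    static LocalDateTime parse(String archiveDate) {
        if (archiveDate.length() != 15 || archiveDate.charAt(0) != 'd') {
            throw new IllegalArgumentException("unexpected value: " + archiveDate);
        }
        return LocalDateTime.parse(archiveDate.substring(1), FULL);
    }

    // Step 1: coarse, day-resolution key for the range filter.
    static String dayKey(String archiveDate) {
        return parse(archiveDate).format(DAY);
    }

    // Step 2: full-resolution value (epoch seconds, UTC assumed) for sorting.
    static long sortTimestamp(String archiveDate) {
        return parse(archiveDate).toEpochSecond(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(dayKey("d20071229010000"));        // prints 20071229
        System.out.println(sortTimestamp("d20071229010000"));
    }
}
```

Step 3 would then filter on the day key and sort on the timestamp field.
Existing documents would still need to be reindexed to carry the two new
fields.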
>
> 2008/2/27, Jamie <ja...@stimulussoft.com>:
>   
>> Hi Michael & Others
>>
>> Ok. I've gathered some more statistics from a different machine for your
>> analysis.
>> (I had to switch machines because the original one was in production and
>> my tests were interfering).
>>
>> Here are the statistics from the new machine:
>>
>> Total Documents: 1.2 million
>> Results Returned:  900k
>> Store Size 238G (size of original documents)
>> Index Size 1.2G (lucene index size)
>> Index / Store Ratio 0.5%
>>
>> The search query is as follows:
>>
>> archivedate:[d20071229010000 TO d20080228235900]
>> ~~~ why is there an extra 'd'? ~~~
>> As you can see, I am using a range query to search between specific dates.
>> Question: should this query be moved to a filter instead? I did not do
>> this as I needed to have the option to sort on date.
>>
>> There are no other specific filters applied and in this example sorting
>> is turned off.
>>
>> On this particular machine the search time varies between 2.64 seconds
>> and about 5 seconds.
>>
>> The limitation of this machine is that it uses a normal IDE drive
>> to house the index, not a SATA drive.
>>
>> IOStat Statistics
>>
>> Linux 2.6.20-15-server 27/02/2008
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>          20.25    0.00    3.23    0.34    0.00   76.19
>>
>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
>> sda               7.12        50.67       186.41   38936841   143240688
>>
>> See attached for hardware info and the CPU call tree (taken from YourKit).
>>
>> I would appreciate your recommendations.
>>
>>
>> Jamie
>>
>>
>> h t wrote:
>> Hi Michael,
>> I guess the hotspot of lucene is
>> org.apache.lucene.search.IndexSearcher.search()
>>
>> Hi Jamie,
>> What's the original text size of a million emails?
>> I estimate the size of an email is around 100k, is this true?
>> When you do a search, what kind of keywords do you enter: words or short
>> sentences?
>> How many results are returned?
>> Did you use a filter to shrink the result set?
>>
>> 2008/2/27, Michael Stoppelman <st...@gmail.com>:
>>   So you're saying searches are taking 10 seconds on a 5G index? If so,
>> that seems ungodly slow.
>> If you're on *nix, have you watched your iostat statistics? Maybe something
>> is hammering your hds.
>> Something seems amiss.
>>
>> What lucene methods were pointed to as hotspots by YourKit?
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   


-- 
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-366-8072 ext 3
UK Tel: +44-20-80991035 ext 3
Email: jamie@stimulussoft.com
Web: http://www.mailarchiva.com

To receive MailArchiva Enterprise Edition product announcements, send a
message to: <ma...@stimulussoft.com> 



