Posted to java-user@lucene.apache.org by Jamie <ja...@stimulussoft.com> on 2008/02/26 19:18:13 UTC

Lucene Search Performance

Hi

I am looking for a way to improve the search performance of my 
application. I've followed every suggestion in the Lucene Wiki but the 
search is still too slow with large indexes. I was wondering whether 
there was a way to restrict a search to a specific time period and in 
doing so sacrifice the quality of search results? Any other suggestions 
on how to improve search performance?

Much appreciated

Jamie


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene Search Performance

Posted by Anshum <an...@naukri.com>.
Hi Jamie,

Are you running concurrent searches on the index, i.e. spawning multiple
threads without managing them?
I have been having similar issues, and I am planning to try a
workaround using Java's Executor interface:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Executor.html
This might help because it would control resource utilization. From what
I know, it pools threads to handle concurrency, which should reduce
search time.
Let me know if you come across something better.
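
A minimal sketch of the thread-pool idea, assuming each search is wrapped in
a Callable. The class name BoundedSearchRunner and the pool size are
hypothetical, not something from this thread. A pool like this caps how many
searches run at once; it does not by itself make a single query faster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs search tasks on a fixed-size thread pool. */
final class BoundedSearchRunner {
    private final ExecutorService pool;

    BoundedSearchRunner(int maxConcurrentSearches) {
        // At most maxConcurrentSearches tasks run at once; the rest wait in the pool's queue.
        this.pool = Executors.newFixedThreadPool(maxConcurrentSearches);
    }

    /** Submits every task, waits for all of them, and returns results in submission order. */
    <T> List<T> runAll(List<Callable<T>> searches) {
        List<Future<T>> futures = new ArrayList<Future<T>>();
        for (Callable<T> search : searches) {
            futures.add(pool.submit(search));
        }
        List<T> results = new ArrayList<T>();
        for (Future<T> future : futures) {
            try {
                results.add(future.get()); // blocks until that search has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a search", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("search task failed", e.getCause());
            }
        }
        return results;
    }

    /** Stops accepting new tasks; already submitted tasks still complete. */
    void shutdown() {
        pool.shutdown();
    }
}
```

In a real application each Callable would call into a single shared
IndexSearcher rather than opening a new one per thread.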

--
Anshum





Re: Lucene Search Performance

Posted by Michael Prichard <mi...@mac.com>.
I'm wondering if your date field's precision may be a little too
fine. What I mean is that you are going all the way down to
seconds. Whenever you do a range query you are essentially spawning
a BooleanQuery with a representation of that range, so the more distinct
date values the range covers, the bigger that query gets. Do you really
need to be that precise? I usually stick with YYYYMMDD for search
date fields and it works pretty well. For comparison, I have a 13 GB
index with 3 million records and my search time is very low,
definitely under 1 second.

Just a thought...
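
To illustrate the precision point, here is a small sketch that counts how
many distinct field values the same set of timestamps produces at second
resolution versus day resolution. The class name, the sample data (one
message every 10 minutes) and the UTC time zone are assumptions made for the
example; it does not touch Lucene at all.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

/** Hypothetical demo: how date precision changes the number of distinct terms. */
final class DatePrecisionDemo {

    /** Formats each timestamp with the given pattern and counts the distinct strings produced. */
    static int distinctTerms(long[] epochMillis, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        Set<String> terms = new HashSet<String>();
        for (long millis : epochMillis) {
            terms.add(format.format(new Date(millis)));
        }
        return terms.size();
    }

    /** Made-up data: one message every 10 minutes for the given number of days. */
    static long[] sampleTimestamps(int days) {
        int perDay = 24 * 6;
        long tenMinutes = 10L * 60L * 1000L;
        long[] stamps = new long[days * perDay];
        for (int i = 0; i < stamps.length; i++) {
            stamps[i] = i * tenMinutes; // starts at the epoch, purely for illustration
        }
        return stamps;
    }
}
```

With 60 days of this sample data, the pattern "yyyyMMddHHmmss" yields one
distinct value per message, while "yyyyMMdd" yields one per day, so a range
over the day-resolution field has far fewer values to enumerate.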



RE: Lucene Search Performance

Posted by Andreas Guther <An...@markettools.com>.
Just a comment, and I understand that you cannot change your index:

We organize our index by the creation date of the entries.
We limit our search to a given number of years, counting back from the
current year. Organizing the index that way also allows us to remove
outdated information. We can give the user an option to search older
indexes as well, if wanted.
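
A rough sketch of how such a per-year layout might be queried, assuming one
index directory per year. The "index-<year>" naming scheme and the class
name are invented for the example and are not Andreas's actual setup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a one-index-per-year layout. */
final class YearlyIndexSelector {

    /** Returns the index directory names to search, newest first, for the last yearsBack years. */
    static List<String> indexesToSearch(int currentYear, int yearsBack) {
        if (yearsBack < 1) {
            throw new IllegalArgumentException("yearsBack must be at least 1");
        }
        List<String> names = new ArrayList<String>();
        for (int year = currentYear; year > currentYear - yearsBack; year--) {
            names.add("index-" + year); // assumed naming scheme
        }
        return names;
    }
}
```

One searcher per selected directory could then be combined, for example
with Lucene's MultiSearcher, and the "search older indexes" option simply
raises yearsBack.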

Andreas



Re: Lucene Search Performance

Posted by Jamie <ja...@stimulussoft.com>.
Hi

Thanks for the suggestions. This would require us to change the index,
and right now we literally have millions of documents stored in the current
index format. I'll bear it in mind, but I am not entirely sure how I
would go about implementing the change at this point.

Much appreciated

Jamie



-- 
Stimulus Software - MailArchiva
Email Archiving And Compliance
USA Tel: +1-713-366-8072 ext 3
UK Tel: +44-20-80991035 ext 3
Email: jamie@stimulussoft.com
Web: http://www.mailarchiva.com

To receive MailArchiva Enterprise Edition product announcements, send a message to: <ma...@stimulussoft.com> 




Re: Lucene Search Performance

Posted by h t <bl...@gmail.com>.
1. Redefine the archivedate field in YYmmDD format (day resolution).
2. Add another field holding the full timestamp, for sorting.
3. Use a RangeFilter to select the results, then sort by the timestamp field.
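
The first two steps could be sketched like this on the indexing side. The
class and method names are hypothetical, and the four-digit-year patterns
and UTC time zone are assumptions; only the idea of one coarse field for
filtering and one fine field for sorting comes from the list above.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper: derives the two field values from a message's archive date. */
final class ArchiveDateFields {

    /** Day-resolution value for the field that the range filter runs against. */
    static String dayTerm(Date archived) {
        return format("yyyyMMdd", archived);
    }

    /** Second-resolution value for the sort-only field; string order matches time order. */
    static String sortKey(Date archived) {
        return format("yyyyMMddHHmmss", archived);
    }

    private static String format(String pattern, Date date) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat formatter = new SimpleDateFormat(pattern);
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date);
    }
}
```

Both values would be stored as untokenized fields on each document, with
the filter applied to the first and the sort to the second.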

2008/2/27, Jamie <ja...@stimulussoft.com>:
>
> Hi Michael & Others
>
> Ok. I've gathered some more statistics from a different machine for your
> analysis.
> (I had to switch machines because the original one was in production and
> my tests were interfering).
>
> Here are the statistics from the new machine:
>
> Total Documents: 1.2 million
> Results Returned:  900k
> Store Size 238G (size of original documents)
> Index Size 1.2G (lucene index size)
> Index / Store Ratio 0.5%
>
> The search query is as follows:
>
> archivedate:[d20071229010000 TO d20080228235900]

Why is there an extra 'd'?


Re: Lucene Search Performance

Posted by Jamie <ja...@stimulussoft.com>.
Hi Michael & Others

Ok. I've gathered some more statistics from a different machine for your 
analysis.
(I had to switch machines because the original one was in production and 
my tests were interfering).

Here are the statistics from the new machine:

Total Documents: 1.2 million
Results Returned: 900k
Store Size: 238G (size of original documents)
Index Size: 1.2G (Lucene index size)
Index / Store Ratio: 0.5%

The search query is as follows:

archivedate:[d20071229010000 TO d20080228235900]

As you can see, I am using a range query to search between specific dates.
Question: should this query be moved to a filter instead? I did not do
this because I need to have the option to sort on date.

There are no other filters applied, and in this example sorting
is turned off.

On this particular machine the search time varies between 2.64 seconds
and about 5 seconds.

One limitation of this machine is that it uses a normal IDE drive
to house the index, not a SATA drive.

IOStat Statistics

Linux 2.6.20-15-server 27/02/2008

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         20.25    0.00    3.23    0.34    0.00   76.19

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               7.12        50.67       186.41   38936841  143240688

See attached for hardware info and the CPU call tree (taken from YourKit).

I would appreciate your recommendations.

Jamie





Re: Lucene Search Performance

Posted by h t <bl...@gmail.com>.
Hi Michael,
I guess the Lucene hotspot is
org.apache.lucene.search.IndexSearcher.search()

Hi Jamie,
What's the original text size of a million emails?
I estimate the size of an email at around 100k; is that right?
When you search, what kind of keywords do you enter: single words or short
sentences?
How many results are returned?
Did you use a filter to shrink the result set?


Re: Lucene Search Performance

Posted by Michael Stoppelman <st...@gmail.com>.
So you're saying searches are taking 10 seconds on a 5 GB index? If so, that
seems ungodly slow.
If you're on *nix, have you watched your iostat statistics? Maybe something
is hammering your hard drives.
Something seems amiss.

Which Lucene methods did YourKit point to as hotspots?

-M


Re: Lucene Search Performance

Posted by Jamie <ja...@stimulussoft.com>.
Hi Michael

Perhaps this will help. We are using Lucene to index emails and provide
a search interface over those emails. Many of our customers
have 3-5 TB or more of email data. The index size tends to be around 5
GB per million messages. On a 3 GHz Intel Core Duo with a standard 7200 rpm
drive, it takes approx. 10 seconds to search across a million emails. We
need sub-second search times, especially since, as time progresses, some
of our archives are expected to reach 10-20 TB of data. In future, we
will be recommending the use of SSD drives, but I'd like to know if there
are any other strategies that can be pursued. One such strategy is to
automatically create a new index once the current index reaches a certain size.
Then, when a search is conducted, based on date, search only those
indexes that fall between the specified dates. I've run my code through the
YourKit profiler. The time appears to be consumed by Lucene itself and
not by my code.
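
The date-based selection in that strategy might look roughly like the
sketch below. It assumes that, whenever an index is closed at the size
limit, the application records the earliest and latest message dates it
contains; the IndexCatalog class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical catalog of size-capped indexes and the date span each one covers. */
final class IndexCatalog {

    private static final class Entry {
        final String name;
        final long firstMillis; // earliest message date in this index
        final long lastMillis;  // latest message date in this index

        Entry(String name, long firstMillis, long lastMillis) {
            this.name = name;
            this.firstMillis = firstMillis;
            this.lastMillis = lastMillis;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    /** Records an index and the date span of the messages it holds. */
    void register(String name, long firstMillis, long lastMillis) {
        if (firstMillis > lastMillis) {
            throw new IllegalArgumentException("firstMillis must not be after lastMillis");
        }
        entries.add(new Entry(name, firstMillis, lastMillis));
    }

    /** Returns the names of all indexes whose span overlaps [fromMillis, toMillis]. */
    List<String> select(long fromMillis, long toMillis) {
        List<String> selected = new ArrayList<String>();
        for (Entry entry : entries) {
            if (entry.firstMillis <= toMillis && entry.lastMillis >= fromMillis) {
                selected.add(entry.name);
            }
        }
        return selected;
    }
}
```

Only the selected indexes would then be opened and searched, so a query
over a short date range never touches the older indexes.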

Any other ideas?






Re: Lucene Search Performance

Posted by Michael Stoppelman <st...@gmail.com>.
On Tue, Feb 26, 2008 at 10:18 AM, Jamie <ja...@stimulussoft.com> wrote:

> Hi
>
> I am looking for a way to improve the search performance of my
> application. I've followed every suggestion in the Lucene Wiki but the
> search is still too slow with large indexes. I was wondering whether


Did you optimize your index yet? That gave me a 2x speedup.

Have you put timers around parts of your code? Maybe it's something
unrelated to Lucene.
You should probably give more details on your setup if you want more helpful
advice.
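
One simple way to put timers around individual steps, as a sketch. The
Timed class is made up for this example, and the steps you wrap (query
parsing, the search call, fetching stored fields) are up to you.

```java
import java.util.function.Supplier;

/** Hypothetical helper: measures the wall-clock time of a single step. */
final class Timed {

    /** Holds a step's return value together with how long the step took. */
    static final class Result<T> {
        final T value;
        final long elapsedNanos;

        Result(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        double elapsedMillis() {
            return elapsedNanos / 1000000.0;
        }
    }

    /** Runs the step once and records the elapsed time using System.nanoTime(). */
    static <T> Result<T> time(Supplier<T> step) {
        long start = System.nanoTime();
        T value = step.get();
        return new Result<T>(value, System.nanoTime() - start);
    }
}
```

Wrapping each stage separately and logging elapsedMillis() shows whether
the time goes into the search call itself or into something around it,
such as loading documents for display.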


>
> there was a way to restrict a search to a specific time period and in
> doing so sacrifice the quality of search results? Any other suggestions
> on how to improve search performance?
>
> Much appreciate
>
> Jamie