Posted to java-user@lucene.apache.org by Daniel Shane <sh...@LEXUM.UMontreal.CA> on 2010/03/19 21:56:27 UTC

PhraseQuery Performance Issues [Lucene 2.9.0]

I'm running a medium-sized web search with an index just shy of 9GB containing 800,000 docs.

We are using Lucene version 2.9.0 (we have not yet checked whether this applies to older versions as well).

Looking at my logs, I'm finding that phrase queries take especially long to run. In our index, we do not remove stopwords, so things like "the" and "is" are indexed on purpose.

If I try a phrase search like "The The", it takes about 10 seconds in Luke to get results back, and a bit less on later runs (7 sec).

More complete phrases that match maybe only 1 document can also take >10 secs if they have many stopwords in them.

I was wondering if this is normal behavior, considering that we do not remove stopwords?

Also, on some phrase queries (not all), the difference between the first call and any subsequent calls can be very big. For example, it could take 5 seconds to do one query and then less than 1 second to perform it again. 

Does Lucene, by default, cache anything when a (phrase) query is made or is this simply file system caching at work?

If this is normal behavior, I assume the solution is either to remove stopwords from the index or to shard it and search the shards with ParallelMultiSearcher.

What do you think?

Daniel Shane




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: PhraseQuery Performance Issues [Lucene 2.9.0]

Posted by Rafael Turk <ra...@gmail.com>.
unsubscribe



Re: PhraseQuery Performance Issues [Lucene 2.9.0]

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Indeed!

I found a very good article on this as well at:

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1

It really sums up what you are saying.

Thanks for the help!
Daniel Shane

----- Original Message -----
From: "Michael McCandless" <lu...@mikemccandless.com>
To: java-user@lucene.apache.org
Sent: Friday, March 19, 2010 7:14:06 PM
Subject: Re: PhraseQuery Performance Issues [Lucene 2.9.0]

Nutch/Solr's CommonGrams is the right way to solve this.  It combines
frequent terms (e.g. stopwords) with adjacent terms.  So "the wizard of
oz" will be indexed e.g. as the_wizard wizard_of of_oz.  It'll require a
full re-index though, and you have to fix up searching so that the same
term expansion is applied to queries.

Lucene does no caching that'd speed up PhraseQuery (unless that was
the very first query against the field since you opened the reader) --
this is probably the OS's IO cache.

Mike



Re: PhraseQuery Performance Issues [Lucene 2.9.0]

Posted by Michael McCandless <lu...@mikemccandless.com>.
Nutch/Solr's CommonGrams is the right way to solve this.  It combines
frequent terms (e.g. stopwords) with adjacent terms.  So "the wizard of
oz" will be indexed e.g. as the_wizard wizard_of of_oz.  It'll require a
full re-index though, and you have to fix up searching so that the same
term expansion is applied to queries.
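
To make the term combination concrete, here is a minimal sketch in plain
Java of the idea described above.  It is not the Solr/Nutch
implementation: the class name CommonGramsSketch, the method commonGrams
and the "_" separator handling are assumptions made for illustration, and
the real filter's exact output (for example, whether the single terms are
also kept in the index) may differ.  Tokens are assumed to be lowercased
already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Solr's CommonGramsFilter.
final class CommonGramsSketch {

    // Joins each adjacent pair of tokens into "a_b" when at least one of
    // the two is a common word. A token that ends up in no pair is
    // emitted unchanged.
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        boolean coveredByPrevious = false; // current token already part of the previous pair?
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean pairsWithNext = i + 1 < tokens.size()
                    && (commonWords.contains(current)
                        || commonWords.contains(tokens.get(i + 1)));
            if (pairsWithNext) {
                out.add(current + "_" + tokens.get(i + 1));
            } else if (!coveredByPrevious) {
                out.add(current);
            }
            coveredByPrevious = pairsWithNext;
        }
        return out;
    }
}
```

With the common words "the" and "of", the tokens the, wizard, of, oz
come out as the_wizard, wizard_of, of_oz.  The same function has to be
applied to the query phrase, so that the phrase search is made of the
much rarer combined terms instead of the very frequent stopwords.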

Lucene does no caching that'd speed up PhraseQuery (unless that was
the very first query against the field since you opened the reader) --
this is probably the OS's IO cache.
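
A rough way to look at the first-call versus later-call gap is to time
the same query several times against one open reader.  The sketch below
is a generic helper under stated assumptions: ColdWarmTimer and timeRuns
are hypothetical names, and the Runnable stands in for whatever code
runs the search.  It only records the timings; it cannot by itself
prove where the speedup comes from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the Runnable wraps the search call being measured.
final class ColdWarmTimer {

    // Runs the query the given number of times and returns the elapsed
    // wall-clock time of each run in milliseconds, in run order.
    static List<Long> timeRuns(Runnable query, int runs) {
        List<Long> elapsedMillis = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            elapsedMillis.add((System.nanoTime() - start) / 1_000_000L);
        }
        return elapsedMillis;
    }
}
```

If only the first entry is large and the rest are small, that matches
the pattern described above, where the data is served from memory on
repeat runs.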

Mike
