Posted to java-user@lucene.apache.org by Rob Young <bu...@gmail.com> on 2006/09/27 18:51:11 UTC

Splitting the index

Hi,

I'm using Lucene to search a product database (CDs, DVDs, games and now 
books). Recently that index has grown to over a million items (with the 
added books). I have been performance testing our search server and the 
throughput of requests has dropped significantly; profiling the server, 
nearly all of the time seems to be spent in the Lucene searching.

So, now that I've narrowed it down to the searching itself rather than the 
rest of the application, what can I do about it? I am running a TermQuery, 
falling back to a FuzzyQuery when no results are found (each combined in a 
Boolean query with the product type restrictions).
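
Stripped down, the logic is something like this (field names and the "cd" 
type value are simplified for illustration; this assumes the 
BooleanClause.Occur style of the Lucene 1.9/2.0 API):

// Try the exact term first, restricted to the product type.
BooleanQuery exact = new BooleanQuery();
exact.add( new TermQuery( new Term( "title", text ) ), BooleanClause.Occur.MUST );
exact.add( new TermQuery( new Term( "type", "cd" ) ), BooleanClause.Occur.MUST );
Hits hits = searcher.search( exact );

// Only fall back to the much more expensive FuzzyQuery when nothing matched.
if( hits.length() == 0 ) {
  BooleanQuery fuzzy = new BooleanQuery();
  fuzzy.add( new FuzzyQuery( new Term( "title", text ) ), BooleanClause.Occur.MUST );
  fuzzy.add( new TermQuery( new Term( "type", "cd" ) ), BooleanClause.Occur.MUST );
  hits = searcher.search( fuzzy );
}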

One solution I had in mind was to split the index into four, one per product 
type; would this provide any gains? It would require a lot of refactoring, 
so I don't want to commit myself if there's no chance it will help.
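
Concretely, I imagine the four indexes would still be searched as one 
through a MultiSearcher, something like this (paths invented for 
illustration):

// One searcher per product-type index, combined into a single view.
Searchable[] parts = new Searchable[] {
  new IndexSearcher( "/indexes/cds" ),
  new IndexSearcher( "/indexes/dvds" ),
  new IndexSearcher( "/indexes/games" ),
  new IndexSearcher( "/indexes/books" )
};
Searcher searcher = new MultiSearcher( parts );
Hits hits = searcher.search( query );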

Another solution along the same train of thought was to use a caching filter 
to cut the index into parts per product type. How would this compare to the 
previous idea?
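
By that I mean something along these lines, keeping one cached filter per 
product type (a sketch only, using QueryFilter wrapped in 
CachingWrapperFilter):

// The cached filter remembers which documents are, say, books, so the
// type restriction costs almost nothing after the first search against
// a given IndexReader.
Filter booksFilter = new CachingWrapperFilter(
    new QueryFilter( new TermQuery( new Term( "type", "book" ) ) ) );
Hits hits = searcher.search( query, booksFilter );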

Does anyone have any other ideas / suggestions?

Thanks
Rob



Re: Splitting the index

Posted by karl wettin <ka...@gmail.com>.
On Fri, 2006-09-29 at 11:50 +0200, karl wettin wrote:
> I don't consider a 300 MB index to be fairly small.

Oops. I /do/ think it is.




Re: Splitting the index

Posted by karl wettin <ka...@gmail.com>.
On Thu, 2006-09-28 at 10:05 +0100, Rob Young wrote:
> 
> > Total file system size of the index?
> segments    31 B
> deletable    4 B
> index      286 MB

If you experience that a 300 MB index is much slower than, say, a 30 MB one,
then something is probably rotten. I don't consider a 300 MB index to be
fairly small. Perhaps it is your application that keeps falling
back on fuzzy? That could explain it.

And if it is a commercial project where time is expensive, go buy a
bunch of RAM and run from a RAMDirectory. It might make no difference, as
an FSDirectory ends up in memory as file cache anyway. My index
implementation (Jira issue LUCENE-550) might be able to help you then. It
consumes twice the memory but is something like 10-200 times faster than
RAMDirectory, depending on your query and the number of resulting hits.




Re: Splitting the index

Posted by Rob Young <bu...@gmail.com>.
On Wednesday 27 September 2006 18:51, Erik Hatcher wrote:
> Lots of possible issues, but we need more information to troubleshoot
> this properly. How big is your index (number of documents)?
CDs       137,390
DVDs       41,049
Games       3,360
Books     648,941
Total     830,740

> Total file system size of the index?
segments    31 B
deletable    4 B
index      286 MB

> Is your index optimized?
Yes, after every 1000 adds.

> How often do you update the index?
Continuously. We have document builders running for each product type. They 
pull all products of their type which have changed since their last index 
time, build the documents and stick them in a queue for the indexer (this 
includes deletes), sleep for an hour and repeat. The indexer processes the 
queue, optimising the index every 1000 documents.
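
In outline, the indexer loop amounts to something like this (names 
simplified; the path, queue and analyzer here stand in for the real ones):

// Drain the queue, optimising after every 1000 added documents.
IndexWriter writer = new IndexWriter( "/index/products", new StandardAnalyzer(), false );
int added = 0;
while( !queue.isEmpty() ) {
  writer.addDocument( (Document) queue.remove() );
  if( ++added % 1000 == 0 ) {
    writer.optimize();
  }
}
writer.close();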

> How are you managing IndexSearcher instances after the index is updated?
My main search class is a Runnable; in its run method it does the following:
while( !this.stop ) {
  try {
    // Reload the searcher every five minutes, copying the on-disk
    // index into RAM before opening it.
    Thread.sleep( 1000 * 60 * 5 );
    this.indexSearcher =
      new IndexSearcher(
        new RAMDirectory(
          FSDirectory.getDirectory( this.index_directory, true )
        )
      );
  } catch( IOException e ) {
    log.severe( "Failed to reload the searcher" );
  } catch( InterruptedException e ) {
    log.notice( "Index reloading interrupted" );
  }
}


I have replied to the other message in this thread with the questions asked 
there as well.

Many Thanks
Rob



Re: Splitting the index

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Lots of possible issues, but we need more information to troubleshoot  
this properly. How big is your index (number of documents)? Total  
file system size of the index? Is your index optimized? How often  
do you update the index? How are you managing IndexSearcher  
instances after the index is updated?




Re: Splitting the index

Posted by Erick Erickson <er...@gmail.com>.
I'd ask for more details. You say that you've narrowed it down to Lucene
doing the searching... but which part of the search? Here are two places
people have run into problems before (sorry if you already know this...):

1> Iterating through the entire returned set with Hits.doc(#).
2> Opening and closing your IndexReader between queries.

The first thing I'd do is insert some timing logging into your search code.
For instance, log the time after you've assembled your query and before you
execute the search. Log the time it takes to do the raw search. Log the time
you spend spinning through the returned hits preparing to return the
results. I'm not talking anything fancy here, just
System.currentTimeMillis().
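
For example (an untested sketch; searcher and query stand for whatever your
code already has, and it only fetches the hits you would actually display,
which also sidesteps pitfall 1> above):

long start = System.currentTimeMillis();
Hits hits = searcher.search( query );
long searched = System.currentTimeMillis();

// Load only the documents you will actually show; Hits.doc(i) has to
// materialize each document it touches.
int shown = Math.min( hits.length(), 10 );
for( int i = 0; i < shown; i++ ) {
  Document doc = hits.doc( i );
  // ... build your result object from doc ...
}
long fetched = System.currentTimeMillis();

System.out.println( "search took " + ( searched - start ) + "ms, "
    + "fetching " + shown + " docs took " + ( fetched - searched ) + "ms" );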

I can't emphasize strongly enough that you simply *cannot* jump to a
solution before you *know* where you're spending your time. I've spent
waaaaay more time than I want to admit fixing code that I was *sure* was
slow, only to find out that the *real* problem was somewhere else.

Finally, what times are you seeing? And what was the index size before and
after? Without some numbers, nobody else can guess at any solutions.

Best
Erick
