Posted to java-user@lucene.apache.org by Ori Schnaps <os...@gmail.com> on 2006/01/24 01:25:06 UTC

performance implications for an index with large number of documents.

Hi,

Apologies if this question has been asked before on this list.

I am working on an application with a Lucene index whose performance
(response time for a query) has started degrading as its size has
increased.

The index is made up of approximately 10 million documents that have
11 fields.  The average document size is less than 1k.  The index has
a total of 13 million terms.  The total index size is about 2.2 gig.
The index is being updated relatively aggressively.  In a 24hr period
there may be anywhere from 500k to 3 million updates.

What I have noticed is that as the document count increased from 6
million to 10 million, the response time for a query has steadily
increased from 0.5 seconds to ~2+ seconds.

I am using Java j2sdk1.4.2_08 and Lucene 1.4.3.  The container is
Tomcat and the Java process is allocated a 2 gig heap.  The heap is
shared between the Lucene index and the end-user application.

Our initial inclination is to pull the index out of the application
and load it fully into memory on a separate box.  Does anyone have
experience with an index of this nature, and what kind of response
time should be expected from Lucene?

thanks
ori

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: performance implications for an index with large number of documents.

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,
Quick reactions:
- Do use the -server option; it makes a difference, and I don't think there is much to test there (I've never run a daemon-like service without -server, and I've seen the HotSpot performance improvement with my own eyes)
- Optimizing every hour sounds like a bad idea.  Instead of re-optimizing so often and rewriting the whole index to disk (slow), consider changing your mergeFactor.
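
The mergeFactor tradeoff can be illustrated with a toy model (this is a
simplification of Lucene 1.4's logarithmic merge policy, not the real
IndexWriter; the buffer size and document counts below are hypothetical):
whenever mergeFactor same-sized segments accumulate, they are merged into
one segment at the next level up, so the live segment count is roughly the
sum of the base-mergeFactor digits of the number of flushed buffers.

```java
// Toy model of Lucene 1.4-style logarithmic merging (a simplification,
// not the real merge policy): documents are flushed in fixed-size
// buffers, and whenever `mergeFactor` same-sized segments accumulate
// they are merged into one larger segment.
public class MergeSim {
    // Live segment count = sum of the base-`mergeFactor` digits of the
    // number of flushed buffers.
    static int segments(long docs, int docsPerBuffer, int mergeFactor) {
        long flushed = docs / docsPerBuffer;
        int count = 0;
        while (flushed > 0) {
            count += flushed % mergeFactor;
            flushed /= mergeFactor;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: ~10M docs, 10 docs buffered per flush.
        System.out.println(segments(9_999_990L, 10, 10)); // -> 54
        System.out.println(segments(9_999_990L, 10, 50)); // -> 154
    }
}
```

A smaller mergeFactor keeps fewer segments on disk (faster searches,
slower indexing); a larger one does the opposite.  Tuning it can buy much
of what an hourly optimize() buys, without rewriting the whole index.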

Otis



Re: performance implications for an index with large number of documents.

Posted by Ori Schnaps <os...@gmail.com>.
hi,

Thank you for all the quick and pertinent responses.

The index is being optimized every hour due to the number of updates.
The JVM has a heap of 2 gig and the machine has a total of 4.
Currently the JVM is not configured with the -server parameter or
parallel garbage collection (we are testing that configuration).

The high ratio of unique terms to documents is mainly due to two
sets of unique identifiers.  The larger set does not need to be indexed
since that key is not utilized in any query, and as such we are going
to change that field from Keyword to UnIndexed.

The queries are ad hoc, i.e. from users.  There is one primary field
that is used for the initial query.  The other fields are used as
filters on the data.

The initial query can return several thousand results.  Out of the
total hits we typically use the top 100 - 200.

As for the need for the aggressive updates, shrug, a business decision.

thanks much,
ori



Re: performance implications for an index with large number of documents.

Posted by "Michael D. Curtin" <mi...@curtin.com>.
Hi Ori,

Before taking drastic rehosting measures, and introducing the associated 
software complexity of splitting your application into pieces running 
on separate machines, I'd recommend looking at the way your document 
data is distributed and the way you're searching it.  Here are some 
questions that may help you find a less-complex solution:

-   Is your high ratio of unique terms to documents due to a unique 
identifier in the documents?  If so, are you performing wildcard or 
range searches on that field?

-   Are your queries "canned", i.e. hard-coded in form, or are they "ad 
hoc", coming from users?

-   Do your queries refer to every field you've indexed?  On a similar 
note, does your application use every field you've indexed or stored in 
Lucene?

-   How many documents do your queries hit typically?  How many of those 
hits do you typically use?

-   How important is it that queries are run on up-to-the-second data? 
In other words, would the hits be pretty much as useful if the updates 
were batched up for a few runs per day, instead of continuous?


One of the things I really like about Lucene is that one can quickly 
whip up an application and it basically works.  But, like most 
databases, small differences in organization can produce 
disproportionately large differences in performance when there are 
millions of rows/records/entries.  A little time spent examining data 
distribution and access patterns can go a long way.

Good luck!

--MDC



Re: performance implications for an index with large number of documents.

Posted by Chris Lamprecht <cl...@gmail.com>.
How much RAM do you have?  If you're under linux, can you run
something like "iostat -x -d -t 60" and watch your disk usage during
searching?  If your disk utilization is high, add more RAM (enough to
hold your index in RAM) and see if the OS cache solves the problem.  I
would try this before the common approach of loading the index into a
RAMDirectory.  You might even try lowering your java heap to give more
to the OS.

Watch your CPU usage (using tools like 'top' or 'sar'): are you CPU-bound?
With more info, we can try to pinpoint the bottleneck.

-c



Re: performance implications for an index with large number of documents.

Posted by Dave Kor <da...@gmail.com>.
Lucene scales with the number of unique terms in the index and not the
number of documents nor the size of the documents. Typically, you
should have at most 1 million unique terms for a set of 10 million
documents.

So the fact that you have 13 million unique terms in 10 million
documents tells me that the characteristics of your document set don't
follow the typical term growth rate that Lucene is designed for. You
might actually be better off using a database for storing and
searching these documents.
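
Dave's rule of thumb can be sanity-checked with Heaps' law, V ≈ K·n^b, a
standard empirical model of vocabulary growth.  The constants below (K = 44,
b = 0.49) are commonly cited English-text values, and the ~100 tokens per
document figure is an assumption for ~1k documents; neither is measured on
Ori's index.

```java
// Heaps' law estimate of unique-term count for a collection.  The
// constants and the tokens-per-document figure are illustrative
// assumptions, not measurements from the index discussed in the thread.
public class TermGrowth {
    // V = K * n^b, where n is the total number of tokens indexed.
    static long heapsEstimate(double k, double b, long totalTokens) {
        return Math.round(k * Math.pow(totalTokens, b));
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        long tokensPerDoc = 100;  // hypothetical for ~1k documents
        long estimate = heapsEstimate(44, 0.49, docs * tokensPerDoc);
        // Roughly 1.1 million unique terms -- the same order as Dave's
        // "at most 1 million", and an order of magnitude below the
        // observed 13 million, pointing at the unique-identifier fields.
        System.out.println(estimate);
    }
}
```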



--
Dave Kor, Research Assistant
Center for Information Mining and Extraction
School of Computing
National University of Singapore.



Re: performance implications for an index with large number of documents.

Posted by Chris Hostetter <ho...@fucit.org>.
:
: The index is made up of approximately 10 million documents that have
: 11 fields.  The average document size is less then 1k.  The index has
: a total of 13 million terms.  The total index size is about 2.2 gig.
: The index is being updated relatively aggressively.  In a 24hr period
: there may be any where from 500k to 3 million updates.

I'm interpreting "update" to mean a deletion followed by an add (unless
you mean your index is growing by 0.5-3 million docs a day).

Which raises the question: how often are you optimizing?

Deleting documents doesn't free up all of the space used to store
information about which terms map to those documents, which could explain
why your total number of terms seems high to Dave Kor -- a lot of those
terms may only be mapped to deleted documents.


(Of course, I could be wrong.  Maybe you are optimizing regularly, and
this is an unrelated issue ... but it's the first thing I'd double check
-- how does maxDoc compare with numDocs?  What is the relative size of the
index before/after an optimize?  What is the relative number of terms
before/after an optimize?)
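
The maxDoc/numDocs comparison is easy to script.  The arithmetic below is
self-contained and the example values are made up; in practice the two
counts come from Lucene's IndexReader.maxDoc() and IndexReader.numDocs().

```java
// Deleted documents keep their term postings until their segments are
// merged or the index is optimized, so a large gap between maxDoc()
// and numDocs() inflates the term dictionary.  The numbers below are
// illustrative only; in practice read them from
// org.apache.lucene.index.IndexReader.
public class DeletedDocCheck {
    // Fraction of document slots held by deleted-but-unreclaimed docs.
    static double deletedFraction(int maxDoc, int numDocs) {
        return maxDoc == 0 ? 0.0 : (maxDoc - numDocs) / (double) maxDoc;
    }

    public static void main(String[] args) {
        int maxDoc = 13_000_000;   // hypothetical: slots incl. deletions
        int numDocs = 10_000_000;  // hypothetical: live documents
        System.out.printf("deleted fraction: %.2f%n",
                deletedFraction(maxDoc, numDocs)); // ~0.23
    }
}
```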


-Hoss

