You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Peter Keegan <pe...@gmail.com> on 2007/03/21 19:38:52 UTC

Re: Lucene search performance: linear?

On a similar topic, has anybody measured query performance as a function of
index size?
Well, I did and the results surprised me. I measured query throughput on 8
indexes that varied in size from 55,000 to 4.4 million documents. When
plotted on a graph, there is a distinct hyperbolic curve (1/x). I expected
to see more of a linear curve with a sharp drop-off at some point.
Interesting

Peter

On 12/5/06, Zhang, Lisheng <Li...@broadvision.com> wrote:
>
> Hi Soeren,
>
> Thanks very much for explanations, yes, there
> is no linear relation when searching a keyword
> which is only in a few docs.
>
> Best regards, Lisheng
>
> -----Original Message-----
> From: Soeren Pekrul [mailto:soeren.pekrul@gmx.de]
> Sent: Tuesday, December 05, 2006 10:37 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene search performance: linear?
>
>
> Hello Lisheng,
>
> a search process has to do usually two thinks. First it has to find the
> term in the index. I don't know the implementation of finding a term in
> Lucene. I hope that the index is at least a sorted list or a binary
> tree, so it can search binary. The time finding a term depends of the
> term's number n_t. If it searches binary the complexity is approximately
> log(n_t). The search time should be better then linear.
>
> Second it has to collect the documents for a term. This depends of the
> documents number n_d for a term. It has to go thru the list of documents
> for a term. The time should be proportional to the number of documents
> for a term even if it doesn't calculate the similarity. Usually the
> number of documents for a single term is less than the total number of
> documents in the collection and less than the total number of terms in
> the index.
>
> If the number of documents for a single term is less than the total
> number of documents the search process for a single term including
> process one (finding the term) and process two (collecting the documents
> and calculating the score) should be better the linear to the number of
> documents.
>
> > I indexed first 220,000, all with a special keyword, I did a simple
> > query and only fetched 5 docs, with Hits.length()=220,000.
> >
> > Then I indexed 440,000 docs, with the same keyword, query it
> > again and fetched a few docs, with Hits.length(0=440,000.
>
> In your case the query term is contained in all documents. The number of
> documents for a single term is equals the total number of documents in
> your collection. The hit collector has to collect all documents. The
> collecting process is proportional to the number of documents to
> collect. So the search for all documents should be at least linear to
> the total number of documents.
>
> Sören
>
> Zhang, Lisheng schrieb:
> > Hi,
> >
> > I indexed first 220,000, all with a special keyword, I did a simple
> > query and only fetched 5 docs, with Hits.length()=220,000.
> >
> > Then I indexed 440,000 docs, with the same keyword, query it
> > again and fetched a few docs, with Hits.length(0=440,000.
> >
> > I found that search time is about linear: 2nd time is about 2 times
> > longer than 1st query. I would like to understand:
> >
> > Does the linear relation come from score calculation, since we
> > have to calculate score one by one? Or other reason?
> >
> > If we have B-tree index I would naively expect a better scalibility?
> >
> > Thanks very much for your helps,
> >
> > Lisheng
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Lucene search performance: linear?

Posted by Yonik Seeley <yo...@apache.org>.
On 3/21/07, Peter Keegan <pe...@gmail.com> wrote:
> On a similar topic, has anybody measured query performance as a function of
> index size?
> Well, I did and the results surprised me. I measured query throughput on 8
> indexes that varied in size from 55,000 to 4.4 million documents. When
> plotted on a graph, there is a distinct hyperbolic curve (1/x). I expected
> to see more of a linear curve with a sharp drop-off at some point.
> Interesting

The size of the index should also grow sub-linearly.
I think some of what you are seeing is due to index compression.
As you add more and more documents, there are fewer new terms added to
the index.  If you searched for a term in most of the documents, you
should see linear scaling with number of documents.  But those are
cases when you want to separate out that component and cache it :-)

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org