Posted to java-user@lucene.apache.org by rama44ster <ra...@gmail.com> on 2015/01/07 16:54:58 UTC

A question on performance

Hi,
I have a Lucene index with close to 480M documents, and I ran around
1000 queries against it. Each query is a boolean query with 3
different tokens; that is, the query has 3 operands, all of which MUST
occur. Executing such 3-token queries gives the following latency
percentiles.

50 = 16 ms
75 = 52 ms
90 = 121 ms
95 = 262 ms
99 = 76010 ms
99.9 = 76037 ms

Is the latency expected to degrade when the number of docs is as high as
480M? The size of the index is 36 GB. All the segments in the index are
merged into one segment. Even when the segments are not merged, the
latencies are not very different. Each document has 5-6 stored fields, but
the latencies above are for boolean queries that don't access any stored
fields and just do a posting list lookup on 3 tokens.

Any ideas on what could be wrong here?

Thanks in advance,
Prasad.

RE: A question on performance

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
rama44ster [rama44ster@gmail.com] wrote:

[3 MUST clauses]

> 50 = 16 ms
> 75 = 52 ms
> 90 = 121 ms
> 95 = 262 ms
> 99 = 76010 ms
> 99.9 = 76037 ms

> Is the latency expected to degrade when the number of docs is as high as
> 480M?

Try plotting response times as a function of hit count. My guess is that your 99 and 99.9 percentiles are for really high hit counts, which will take a long time as all the hits need to be scored. Alternatively, if it is easier for you, just check the queries in the 99+ percentiles manually and see if they hit a lot of documents.

If your response times grow roughly linearly (with a bump at one point, due to the switch from sparse to non-sparse docset) as a function of hit count, there is not much to do about it besides sharding, given the current single-threaded processing of Lucene queries.
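
The check suggested above could be sketched roughly as follows. This is only an illustration in plain Java, not Lucene code: the class HitCountLatencyReport, its Sample type, and the numbers in main are all hypothetical, and it assumes you have already recorded, for each query, the hit count reported by the search and the elapsed time. It lists the queries at or above a latency percentile, highest hit count first, and computes a simple Pearson correlation as a stand-in for eyeballing a plot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): relates per-query latency to hit count.
final class HitCountLatencyReport {

    // One measured query: number of matching documents and search time.
    static final class Sample {
        final long hitCount;
        final double latencyMs;

        Sample(long hitCount, double latencyMs) {
            this.hitCount = hitCount;
            this.latencyMs = latencyMs;
        }
    }

    // Nearest-rank percentile, p in (0, 100], computed on a sorted copy.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Samples at or above the p-th latency percentile, highest hit count first.
    static List<Sample> slowest(List<Sample> samples, double p) {
        double[] latencies = new double[samples.size()];
        for (int i = 0; i < latencies.length; i++) {
            latencies[i] = samples.get(i).latencyMs;
        }
        double threshold = percentile(latencies, p);
        List<Sample> slow = new ArrayList<>();
        for (Sample s : samples) {
            if (s.latencyMs >= threshold) {
                slow.add(s);
            }
        }
        slow.sort((a, b) -> Long.compare(b.hitCount, a.hitCount));
        return slow;
    }

    // Pearson correlation of hit count vs. latency; a value near 1.0
    // is consistent with roughly linear growth. Needs at least two
    // samples with differing values.
    static double correlation(List<Sample> samples) {
        int n = samples.size();
        double meanX = 0, meanY = 0;
        for (Sample s : samples) {
            meanX += s.hitCount;
            meanY += s.latencyMs;
        }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (Sample s : samples) {
            double dx = s.hitCount - meanX;
            double dy = s.latencyMs - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up numbers purely for illustration, not measurements.
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample(1_000, 12.0));
        samples.add(new Sample(50_000, 40.0));
        samples.add(new Sample(2_000_000, 900.0));
        samples.add(new Sample(90_000_000, 70_000.0));
        for (Sample s : slowest(samples, 75)) {
            System.out.println(s.hitCount + " hits, " + s.latencyMs + " ms");
        }
        System.out.println("correlation = " + correlation(samples));
    }
}
```

If the slowest queries turn out not to be the ones with the most hits, the hit-count explanation is probably not the whole story and other causes (disk access, GC) are worth checking.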

- Toke Eskildsen


Re: A question on performance

Posted by Arvind Kalyan <ba...@gmail.com>.
Performance measurements must be made carefully. Have you performed any
warmup?

I recommend doing 10k calls just to let the dust settle, including things
like JIT compilation, before taking any kind of measurements. Also use
MMapDirectory, if you are not already, to help with spikes in disk accesses.
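
A warm-up-then-measure loop along those lines might look like the sketch below. It is plain Java with no Lucene dependency; the class WarmedUpBenchmark and the stand-in workload in main are hypothetical, and in a real test the Runnable would wrap the actual search call. The 10k warm-up count is just the figure suggested above.

```java
import java.util.Arrays;

// Hypothetical harness: run untimed warm-up calls first, then time the rest.
final class WarmedUpBenchmark {

    // Runs the task warmupRuns times untimed, then measuredRuns times timed.
    // Returns the measured latencies in milliseconds, in execution order.
    static double[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // lets JIT compilation and caches settle
        }
        double[] latenciesMs = new double[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            latenciesMs[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return latenciesMs;
    }

    // Nearest-rank percentile, p in (0, 100], computed on a sorted copy.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the real query execution.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 100_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                System.out.println(sum); // keeps the loop from being discarded
            }
        };
        double[] latencies = measure(task, 10_000, 1_000);
        for (double p : new double[] {50, 75, 90, 95, 99, 99.9}) {
            System.out.println(p + " = " + percentile(latencies, p) + " ms");
        }
    }
}
```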

Also keep track of garbage collections that happened during your profiling.
That is a different problem to solve and has different solutions. But most
importantly, if you are using MMapDirectory, make sure you don't configure
a big heap just because the index is big: the mapped index is served from
the OS page cache, not from the Java heap.
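
One way to keep track of collections during a run, sketched with the standard java.lang.management API. The class GcDelta and the allocation loop in main are hypothetical; the idea is simply to snapshot the collectors' counters before and after the measured work and report the difference.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: GC activity that occurred while a task was running.
final class GcDelta {
    final long collections; // number of collections during the task
    final long timeMs;      // approximate accumulated collection time

    private GcDelta(long collections, long timeMs) {
        this.collections = collections;
        this.timeMs = timeMs;
    }

    // Sums count and accumulated time over all collectors.
    // The beans return -1 when a value is undefined; that is treated as 0.
    private static long[] snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMs};
    }

    // Runs the task and returns the GC counters' increase across it.
    static GcDelta during(Runnable task) {
        long[] before = snapshot();
        task.run();
        long[] after = snapshot();
        return new GcDelta(after[0] - before[0], after[1] - before[1]);
    }

    public static void main(String[] args) {
        // Stand-in workload that allocates garbage; replace with the query run.
        GcDelta delta = during(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                byte[] junk = new byte[1024];
            }
        });
        System.out.println(delta.collections + " collections, about "
                + delta.timeMs + " ms in GC");
    }
}
```

The counters are JVM-wide, so collections triggered by other threads during the task are included in the delta.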

There are probably a few more things I'd do given various other
requirements (like disabling swap) and constraints.

-- 
Arvind Kalyan