Posted to java-user@lucene.apache.org by Rajnish kamboj <ra...@gmail.com> on 2017/01/02 18:23:53 UTC

Lucene performance benchmark | search throughput

Hi

Is there any Lucene performance benchmark for a given set of data?
[i.e., are there any search throughput figures that Lucene can provide for
a given data set?]

Search throughput example:
Max. 200 TPS for 50K data on Lucene 5.3.1 on RHEL version x (with SSD)
Max. 150 TPS for 100K data on Lucene 5.3.1 on RHEL version x (with SSD)
Max. 300 TPS for 50K data on Lucene 6.0.0 on RHEL version x (with SSD)
etc.

Also, does the index size matter for search throughput?

Our observations:
When we increase the data size (and hence the index size), the search
throughput decreases.
When we add more AND conditions, the search throughput increases. Why?
Ideally, if we add more conditions, Lucene should have more work to do
(including merging) and the throughput should decrease, yet the
throughput increases.


Regards
Rajnish

Re: Lucene performance benchmark | search throughput

Posted by Michael McCandless <lu...@mikemccandless.com>.

The cost() method on DocIdSetIterator is responsible for telling
BooleanQuery how costly that clause is, and how cost() is implemented
varies by query.

For the multi-term queries, like WildcardQuery, Lucene will first
visit all matched terms (during the Query.rewrite phase), and rewrite
the query either into a disjunction (SHOULD of the N terms), or it
will, per segment, visit all docs for all matching terms, setting them
in a sparse or dense bitset, recording the cost as the number of
documents.
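
The bitset path described above can be sketched roughly as follows. This is a simplified, self-contained illustration, not Lucene's actual code: the class MultiTermSketch, the methods collectPrefix and cost, and the in-memory TreeMap standing in for one segment's term dictionary are all hypothetical, and a plain prefix match stands in for a general wildcard pattern.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Simplified illustration only; these are not Lucene classes. */
class MultiTermSketch {

    /**
     * Visits every term that starts with the given prefix and sets the
     * doc IDs of each matching term in one bitset.
     * termToDocs is a hypothetical sorted term dictionary for one segment.
     */
    static BitSet collectPrefix(TreeMap<String, int[]> termToDocs, String prefix) {
        BitSet bits = new BitSet();
        // Terms are sorted, so all terms sharing the prefix are contiguous,
        // starting at the first term >= prefix.
        for (Map.Entry<String, int[]> e : termToDocs.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // past the last term with this prefix
            }
            for (int doc : e.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Cost recorded as the number of documents set in the bitset. */
    static long cost(BitSet bits) {
        return bits.cardinality();
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> terms = new TreeMap<>();
        terms.put("lucene", new int[] {1, 4});
        terms.put("lucid", new int[] {2, 4});
        terms.put("luck", new int[] {9});
        terms.put("solr", new int[] {3});

        BitSet bits = collectPrefix(terms, "luc");
        System.out.println(bits + " cost=" + cost(bits));
    }
}
```

Note that all of this work happens up front, before the bitset can report its cost, which is the overhead the issue linked below tries to avoid when another clause is more restrictive.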

But there is work underway now to try to improve the multi-term query
cases so that we don't go and do all that up-front work (visiting all
terms, and all docs matching each term) when another clause in the
boolean query is more restrictive:
https://issues.apache.org/jira/browse/LUCENE-7055

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jan 6, 2017 at 2:28 AM, Rajnish kamboj <ra...@gmail.com> wrote:
> OK, got it
>
> One thing still I need to know (which is not clear to me)....
> How does Lucene calculates the most restrictive clause?
>
> Correct me, if I am wrong in my understanding (in abstract):
> 1. During indexing, Lucene keeps information of documents count against
> every indexed items.
> 2. During search, it first checks, which condition has less number of
> documents count before actually iterating.
> 3. Then, it iterates that restricted set against other set of conditions.
>
> If the above is correct then how does Lucene calculates most restrictive
> clause in case of Wildcard conditions?
> Also, if Lucene first check for most restrictive clause, and then iterate to
> match documents to the other clauses,
>         Then when will the merging of documents happen?
>
> Coming on to my main query for which I ask question in Lucene community:
> What is the search performance benchmark against Lucene version, so that I
> can benchmark my application throughput?
>
>
> Regards
> Rajnish

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene performance benchmark | search throughput

Posted by Rajnish kamboj <ra...@gmail.com>.

OK, got it.

One thing is still not clear to me:
How does Lucene determine the most restrictive clause?

Correct me if my understanding (in the abstract) is wrong:
1. During indexing, Lucene keeps a document count for every indexed
item.
2. During search, it first checks which condition has the lowest
document count, before actually iterating.
3. Then it iterates that restricted set against the other conditions.

If the above is correct, how does Lucene determine the most restrictive
clause in the case of wildcard conditions?
Also, if Lucene first finds the most restrictive clause and then iterates
to match documents against the other clauses, when does the merging of
documents happen?

Coming back to the main question I put to the Lucene community:
Is there a search performance benchmark per Lucene version, so that I
can benchmark my application's throughput against it?


Regards
Rajnish

On Tue, Jan 3, 2017 at 6:09 PM, Rajnish kamboj <ra...@gmail.com>
wrote:

> OK, got it
>
> One thing still I need to know (which is not clear to me)....
> How does Lucene calculates the most restrictive clause?
>
> Correct me, if I am wrong in my understanding (in abstract):
> 1. During indexing, Lucene keeps information of documents count against
> every indexed items.
> 2. During search, it first checks, which condition has less number of
> documents count before actually iterating.
> 3. Then, it iterates that restricted set against other set of conditions.
>
> If the above is correct then how does Lucene calculates most restrictive
> clause in case of Wildcard conditions?
> Also, if Lucene first check for most restrictive clause, and then iterate
> to match documents to the other clauses,
>         Then when will the merging of documents happen?
>
> Coming on to my main query for which I ask question in Lucene community:
> What is the search performance benchmark against Lucene version, so that I
> can benchmark my application throughput?
>
>
>
> On Tue, Jan 3, 2017 at 5:12 PM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> When you add MUST sub-clauses to a BooleanQuery  (AND to the query
>> parsers) it can make the search run faster because Lucene will take
>> the most restrictive clause and use that to "drive" the iteration of
>> matching documents to the other clauses, allowing those other clauses
>> to iterate much faster than they would otherwise require if they were
>> not AND'd.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com

Re: Lucene performance benchmark | search throughput

Posted by Michael McCandless <lu...@mikemccandless.com>.

When you add MUST sub-clauses to a BooleanQuery (AND in the query
parsers), it can make the search run faster, because Lucene will take
the most restrictive clause and use that to "drive" the iteration of
matching documents to the other clauses, allowing those other clauses
to iterate much faster than they would if they were not AND'd.
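
A minimal sketch of that "driving" idea, under stated assumptions: it is not Lucene's actual implementation, and the names ConjunctionSketch, Postings, intersect, range and advanceCalls are hypothetical. Each clause is modeled as a sorted array of matching doc IDs, the clause with the fewest matches leads, and the other clauses only advance() to the candidates the lead proposes (a binary search stands in here for the skipping a real index would do).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simplified illustration of AND evaluation; these are not Lucene classes. */
class ConjunctionSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical iterator over one clause's sorted matching doc IDs. */
    static final class Postings {
        private final int[] docs; // sorted ascending, no duplicates
        private int pos = -1;     // -1 means "not started"
        int advanceCalls = 0;     // counts work done on this clause

        Postings(int... docs) {
            this.docs = docs;
        }

        /** Estimated number of matches for this clause. */
        long cost() {
            return docs.length;
        }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves to the first doc ID >= target that lies after the current position. */
        int advance(int target) {
            advanceCalls++;
            if (pos >= docs.length) return NO_MORE_DOCS;
            int idx = Arrays.binarySearch(docs, pos + 1, docs.length, target);
            pos = idx >= 0 ? idx : -idx - 1;
            return docID();
        }
    }

    /** Returns the doc IDs matched by every clause. */
    static List<Integer> intersect(List<Postings> clauses) {
        List<Postings> sorted = new ArrayList<>(clauses);
        sorted.sort(Comparator.comparingLong(Postings::cost));
        Postings lead = sorted.get(0); // most restrictive clause drives
        List<Integer> hits = new ArrayList<>();

        int doc = lead.advance(0);
        while (doc != NO_MORE_DOCS) {
            int candidate = doc;
            boolean match = true;
            for (int i = 1; i < sorted.size(); i++) {
                Postings other = sorted.get(i);
                int otherDoc = other.docID() < candidate ? other.advance(candidate) : other.docID();
                if (otherDoc != candidate) {
                    // This clause skipped past the candidate; let the lead catch up.
                    doc = otherDoc == NO_MORE_DOCS ? NO_MORE_DOCS : lead.advance(otherDoc);
                    match = false;
                    break;
                }
            }
            if (match) {
                hits.add(candidate);
                doc = lead.advance(candidate + 1);
            }
        }
        return hits;
    }

    /** Helper: doc IDs start, start+step, ... below end. */
    static int[] range(int start, int end, int step) {
        int n = (end - start + step - 1) / step;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = start + i * step;
        return out;
    }

    public static void main(String[] args) {
        Postings common = new Postings(range(0, 100, 1)); // matches 100 docs
        Postings even = new Postings(range(0, 100, 2));   // matches 50 docs
        Postings rare = new Postings(7, 42, 90);          // matches 3 docs
        System.out.println(intersect(Arrays.asList(common, even, rare)));
        System.out.println("advance() calls on the common clause: " + common.advanceCalls);
    }
}
```

In this sketch the clause matching 100 documents is only asked about the few candidates that survive the more restrictive clauses, rather than being iterated in full, so no separate "evaluate each clause, then merge" step takes place.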

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jan 3, 2017 at 6:33 AM, Rajnish kamboj <ra...@gmail.com> wrote:
> The answer is not clear.
>
> Suppose I have following query and I want 10 records.
> Condition1 AND Condition2 AND Condition3
>
> As per my understanding Lucene will first evaluate all conditions
> separately and then merge the Documents as per AND/OR clauses.
> At last it will return me 10 records.
>
> So, if I add one more condition, then it will add to search time and merge
> time and hence increase latency, which results in decreased throughput.
>
>
> Also, what is the search performance benchmark against Lucene version?
>
>
> Regards
> Rajnish


Re: Lucene performance benchmark | search throughput

Posted by Rajnish kamboj <ra...@gmail.com>.

The answer is not clear.

Suppose I have the following query and I want 10 records:
Condition1 AND Condition2 AND Condition3

As per my understanding, Lucene will first evaluate all conditions
separately and then merge the documents as per the AND/OR clauses.
Finally, it will return 10 records to me.

So, if I add one more condition, it will add to the search time and the
merge time and hence increase latency, which results in decreased
throughput.

Also, what is the search performance benchmark per Lucene version?

Regards
Rajnish

On Tuesday 3 January 2017, Michael Wilkowski <mw...@silenteight.com> wrote:

> My guess: more conditions = less documents to score and sort to return.

Re: Lucene performance benchmark | search throughput

Posted by Michael Wilkowski <mw...@silenteight.com>.

My guess: more conditions = fewer documents to score and sort before
returning results.

On Mon, Jan 2, 2017 at 7:23 PM, Rajnish kamboj <ra...@gmail.com>
wrote:

> Hi
>
> Is there any Lucene performance benchmark against certain set of data?
> [i.e Is there any stats for search throughput which Lucene can provide for
> a certain data?]
>
> Search throughput Example:
> Max. 200 TPS for 50K data on Lucene 5.3.1 on RHEL version x (with SSD)
> Max. 150 TPS for 100K data on Lucene 5.3.1 on RHEL version x (with SSD)
> Max. 300 TPS for 50K data on Lucene 6.0.0 on RHEL version x (with SSD)
> etc.
>
> Also, does the index size matters for search throughput?
>
> Our observation:
> When we increase the data size (hence index size) the search throughput
> decreases.
> When we add more AND conditions, the search throughput increases. Why?
> Ideally if we add more conditions then the Lucene should have more work to
> do (including merging) and the throughput should decrease but the
> throughput increases?
>
>
> Regards
> Rajnish
>