
Posted to java-user@lucene.apache.org by Florian Hopf <ma...@florian-hopf.de> on 2016/11/02 18:09:41 UTC

Understanding performance characteristics of the new point types

Hi,

we are indexing different types of documents in one Lucene index. They
have most fields in common but we need to filter some types for certain
queries. We are using numeric values to determine the types of documents
(1-4). Now, when querying these documents we see that the performance
degrades the more documents of a type are in the index.

Using a simple test that indexes 10 million documents, I can see the
following when filtering on everything but 100,000 documents:

* When issuing the query alone, the new PointRangeQuery
(IntPoint.newExactQuery) is a lot faster than term and legacy numeric
queries (in my case around 2x the speed of the others).
* When issuing a bool query that combines a term query selecting 5
documents with a MUST clause that filters on the numeric value, the
points are 5x slower than legacy numeric
(LegacyNumericRangeQuery.newIntRange) and terms (TermQuery).
* When doing the same thing with SHOULD instead of MUST for the
additional term query, the PointRangeQuery is the fastest as well.

I suspect this to be related to the discussion in
https://issues.apache.org/jira/browse/LUCENE-7254

Of course there could be something wrong with the way I am measuring the
performance; I'd be happy to share the code. But what I read in the
ticket above seems to hint that points are not suited for every use
case. Is it recommended to use StringField in a case like this instead?

Regards
Florian

-- 
Florian Hopf
Freelance Software Developer

http://blog.florian-hopf.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Understanding performance characteristics of the new point types

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

FYI, the old NumericRangeQuery is fast here, because it rewrites to a constant score BooleanQuery for this low-cardinality case! If you have no real range, then it rewrites to a TermQuery!

Points are different, they are not so good for simple term-based lookups.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Florian Hopf [mailto:mailinglists@florian-hopf.de]
> Sent: Wednesday, November 2, 2016 8:19 PM
> To: Lucene Users <ja...@lucene.apache.org>
> Subject: Re: Understanding performance characteristics of the new point
> types
> 
> Thank you both for the explanation, we will switch to StringField with a
> TermQuery instead.

Re: Understanding performance characteristics of the new point types

Posted by Florian Hopf <ma...@florian-hopf.de>.

Thank you both for the explanation, we will switch to StringField with a
TermQuery instead.
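
For readers wondering what such a switch might look like, here is a
minimal sketch against the Lucene 6.x API. The field name "type", the
class name DocTypeFilterSketch and both helper methods are made up for
illustration; they are not taken from the code discussed in this thread.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper, for illustration only.
final class DocTypeFilterSketch {

    // Assumed field name; use whatever your schema calls it.
    static final String TYPE_FIELD = "type";

    // Index time: store the document type (1-4) as a single untokenized term
    // instead of an IntPoint.
    static Document withType(Document doc, int type) {
        doc.add(new StringField(TYPE_FIELD, Integer.toString(type), Field.Store.NO));
        return doc;
    }

    // Query time: require both the main query and the type term.
    static Query restrictToType(Query main, int type) {
        Query typeQuery = new TermQuery(new Term(TYPE_FIELD, Integer.toString(type)));
        return new BooleanQuery.Builder()
                .add(main, BooleanClause.Occur.MUST)
                .add(typeQuery, BooleanClause.Occur.MUST)
                .build();
    }
}
```

The type value has to be converted to a string the same way at index
time and at query time, otherwise the term will not match.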

On 02.11.2016 20:09, Michael McCandless wrote:
> Yeah it's best to use StringField for low-cardinality use cases.
> 
> When cardinality is low (4 unique values in your case), legacy
> numerics would rewrite to a BooleanQuery, which is much more
> performant for MUST clauses, vs dimensional points which will always
> need to construct an up front bitset for all documents with that
> value.  Using StringField instead will ensure you always get a
> BooleanQuery...

Re: Understanding performance characteristics of the new point types

Posted by Michael McCandless <lu...@mikemccandless.com>.

Yeah it's best to use StringField for low-cardinality use cases.

When cardinality is low (4 unique values in your case), legacy
numerics would rewrite to a BooleanQuery, which is much more
performant for MUST clauses, vs dimensional points which will always
need to construct an up front bitset for all documents with that
value.  Using StringField instead will ensure you always get a
BooleanQuery...
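
The difference can be illustrated with a toy model in plain Java (no
Lucene involved). This is only a sketch under simplifying assumptions:
sorted int arrays stand in for postings, a binary search stands in for
the skip-based advance() that a term clause can use inside a
conjunction, and all class and method names are hypothetical. It is not
how Lucene implements either query, and the numbers are scaled down to
1 million documents.

```java
import java.util.BitSet;
import java.util.stream.IntStream;

// Toy model only; names and data layout are assumptions for illustration.
final class ConjunctionSketch {

    // Sorted doc ids carrying the type: every doc except multiples of 100.
    static int[] docsWithType(int maxDoc) {
        return IntStream.range(0, maxDoc).filter(d -> d % 100 != 0).toArray();
    }

    // Up-front bitset: every matching doc is visited before any hit is known.
    // Returns {hits, docs visited}.
    static int[] viaUpFrontBitSet(int[] typeDocs, int[] selectiveDocs, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        int visited = 0;
        for (int doc : typeDocs) {
            bits.set(doc);
            visited++;
        }
        int hits = 0;
        for (int doc : selectiveDocs) {
            if (bits.get(doc)) hits++;
        }
        return new int[] {hits, visited};
    }

    // Leapfrog: the selective clause leads and the type postings are only
    // advanced to its candidates. Returns {hits, comparisons}.
    static int[] viaLeapfrog(int[] typeDocs, int[] selectiveDocs) {
        int hits = 0, comparisons = 0, from = 0;
        for (int target : selectiveDocs) {
            int lo = from, hi = typeDocs.length;
            while (lo < hi) { // binary search as a stand-in for advance(target)
                int mid = (lo + hi) >>> 1;
                comparisons++;
                if (typeDocs[mid] < target) lo = mid + 1; else hi = mid;
            }
            from = lo;
            if (lo < typeDocs.length && typeDocs[lo] == target) hits++;
        }
        return new int[] {hits, comparisons};
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        int[] typeDocs = docsWithType(maxDoc);
        int[] selective = {7, 100, 250_000, 500_001, 999_999}; // 5 candidate docs
        int[] bitset = viaUpFrontBitSet(typeDocs, selective, maxDoc);
        int[] leapfrog = viaLeapfrog(typeDocs, selective);
        System.out.println("bitset:   hits=" + bitset[0] + " docs visited=" + bitset[1]);
        System.out.println("leapfrog: hits=" + leapfrog[0] + " comparisons=" + leapfrog[1]);
    }
}
```

With 5 candidates the leapfrog variant needs roughly a hundred
comparisons or fewer, while the bitset variant touches each of the
990,000 documents carrying the type before the first hit can be
returned.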

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 2, 2016 at 2:43 PM, Fuad Efendi <fu...@efendi.ca> wrote:
> Hi Florian,
>
> If my understanding is correct, you are using IntPoint to index 4 different
> document types, which is overkill; why not try a classic “non-tokenized”
> keyword field (a.k.a. “legacy string”) for document types? Cardinality is
> only four for document types.

Re: Understanding performance characteristics of the new point types

Posted by Fuad Efendi <fu...@efendi.ca>.

Hi Florian,

If my understanding is correct, you are using IntPoint to index 4 different
document types, which is overkill; why not try a classic “non-tokenized”
keyword field (a.k.a. “legacy string”) for document types? Cardinality is
only four for document types.


--

Fuad Efendi

(416) 993-2060

http://www.tokenizer.ca
Recommender Systems

