Posted to solr-user@lucene.apache.org by Raj Yadav <ra...@cse.ism.ac.in> on 2020/07/22 11:18:06 UTC

Best field definition which is only use for filter query.

Below is a sample document:

*{"filedA": 1,"filedB": "","filedC": "Sher","filedD":
"random","rules":[203,7843,43,283,6603,83,513,5303,243,103,323,163,403,363,5333,2483,313,703,523,503,563,8543,1003,483,1083,2043,6523,603,963,683,5353,763,443,643,743,723,1123,843,1243,1663,1803,1403,1783,7563,3843,1843,1523,1203,1563,1703,1883,8913,1923,1323,5313,1623,1963,2033,2763,2623,2083,2123,2143,123,2183,2333,8183,7323,2323,7243,2313,2463,2423,2383,5833,2343,2503,2663,8263,3083,2683,2543,8313,2883,2923,3043,2703,3243,3123,2263,3003,2393,3203,3163,6243,3283,3443,3343,3403,1913,3323,3483,3603,3723,3763,8333,3563,863,3683,3643,3523,3803,8323,3883,4003,3923,4043,4173,1163,2963,1743,6593,4083,4103,4143,1363,3983,4183,4223,6623,4383,1443,4303,4263,4403,4423,4283,4343,5043,4923,4983,4993,6633,4503,5843,8073,4663]}*
As you can see, we have 5 fields, one of which is named "rules".
Field definition:
<field name="geo_rules" type="pint" indexed="true" stored="false"
multiValued="true"/>

The only operation we perform on this field is filtering.
Example: fq=rules:203

*Problems:*
1. The `rules` field is marked indexed="true", and it accounts for a large
percentage of the total index size.
2. A large share of our document update requests are mainly for this
(`rules`) field.

If I set indexed="false" for this field (the pint field type has
docValues="true" by default):
*<field name="geo_rules" type="pint" indexed="false" stored="false"
multiValued="true"/>*
then the following thread suggests that filtering (which is also a kind of
search operation) will be very slow:
https://lucene.472066.n3.nabble.com/Facet-performance-problem-td4375925.html

Is there a way to avoid indexed="true" for the `rules` field without hurting
our search (filtering) performance? Or is there any other solution that
reduces our total index size without increasing search (filter) latency?

Regards,
Raj

Re: Best field definition which is only use for filter query.

Posted by Erik Hatcher <er...@gmail.com>.


> On Jul 22, 2020, at 08:52, raj.yadav <ra...@cse.ism.ac.in> wrote:
> 
> Erik Hatcher-4 wrote
>> Wouldn’t a “string” field be as good, if not better, for this use case?
> 
> What is the rationale behind changing the type to 'string'? How will it speed
> up search/filtering? Won't it increase the index size, since in general a
> string type takes more storage space than an int (I'm not sure what the case
> is in Lucene)?

You tell me? ;) Easy enough to try in your environment, I imagine, in parallel in the same collection/index.
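
One way to run such a side-by-side test, as a rough sketch: keep the existing field and copy its values into a second, string-typed field, then compare index size and fq latency for the two. The field name rules_s is hypothetical, and the snippet assumes the schema already defines a "string" type backed by solr.StrField (as the stock configsets do) and a source field named rules.

```xml
<!-- Sketch for the schema file. rules_s is a hypothetical name;
     "string" is assumed to be the usual solr.StrField type. -->
<field name="rules_s" type="string" indexed="true" stored="false"
       docValues="false" multiValued="true"/>

<!-- Copy every incoming value of rules into the string field at index time. -->
<copyField source="rules" dest="rules_s"/>
```

After a full reindex, fq=rules:203 and fq=rules_s:203 should select the same documents, so the two variants can be timed against each other.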

As I understand it (in regards to Erick’s points), range queries aren’t being used here.  

    Erik


Re: Best field definition which is only use for filter query.

Posted by Erick Erickson <er...@gmail.com>.

pints 
1> take up less space (IIRC)
2> are better for range queries.

Best,
Erick


Re: Best field definition which is only use for filter query.

Posted by "raj.yadav" <ra...@cse.ism.ac.in>.

Erik Hatcher-4 wrote
> Wouldn’t a “string” field be as good, if not better, for this use case?

What is the rationale behind changing the type to 'string'? How will it speed
up search/filtering? Won't it increase the index size, since in general a
string type takes more storage space than an int (I'm not sure what the case
is in Lucene)?

Regards,
Raj



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Best field definition which is only use for filter query.

Posted by Erik Hatcher <er...@gmail.com>.

Wouldn’t a “string” field be as good, if not better, for this use case?


Re: Best field definition which is only use for filter query.

Posted by "raj.yadav" <ra...@cse.ism.ac.in>.

Erick Erickson wrote
> Also, the default pint type is not as efficient for single-value searches
> like this; the trie fields are better. Trie support will be kept until
> there’s a good alternative for the single-value lookup with pint.
> 
> So for what you’re doing, I’d change to TrieInt, docValues=false,
> indexed=true.

 
So we should use the TrieInt type for single-value searches (on both
single-valued and multivalued fields)? Please correct me if I'm wrong.

Also, in what scenarios should we prefer pint over TrieInt from the point of
view of document search (indexed) and retrieval (stored) latency, setting
aside sorting and faceting? Is there any documentation that compares these
two field types?

Regards,
Raj




Re: Best field definition which is only use for filter query.

Posted by Erick Erickson <er...@gmail.com>.

fq clauses are just like the q clause except for two things:
1> no scoring is done
2> the entire result set _can_ be stored in the filterCache.

So if a field isn’t indexed, it can’t be used efficiently in either an fq or a q clause.
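
For reference, the filterCache mentioned in point 2 is configured in the <query> section of solrconfig.xml. A minimal sketch follows; the sizes are illustrative rather than recommendations, and the cache class depends on the Solr version (solr.CaffeineCache in recent releases, solr.FastLRUCache in older ones).

```xml
<!-- Sketch of a filterCache entry in solrconfig.xml; sizes are illustrative. -->
<query>
  <filterCache class="solr.CaffeineCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>
</query>
```

Each distinct fq clause (for example fq=rules:203) gets its own cache entry, so repeated filters on the same rule value can be answered from the cache.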

The thread you reference assumes docValues=true (the default in some versions of Solr) with indexed=false. And yes, searching on docValues alone will be very, very slow. Think “table scan”.

Also, the default pint type is not as efficient for single-value searches like this; the trie fields are better. Trie support will be kept until there’s a good alternative for the single-value lookup with pint.

So for what you’re doing, I’d change to TrieInt, docValues=false, indexed=true. If you have neither docValues=true nor indexed=true, the query won’t work at all. You’ll have to adequately size your hardware if index size is a concern.
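
In schema terms, the suggestion above might look roughly like the following. This is a sketch, not a tested configuration: the type name tint_exact is made up for illustration, the field is called rules here to match the fq example, and precisionStep="0" is an assumption not stated in the thread (it indexes one term per value, which suits exact-match lookups but not range queries). Note that Trie fields have been deprecated since Solr 7.

```xml
<!-- Hypothetical Trie-based int type for exact-match filtering only. -->
<fieldType name="tint_exact" class="solr.TrieIntField"
           precisionStep="0" docValues="false"/>

<!-- Indexed for fq lookups; no stored value and no docValues. -->
<field name="rules" type="tint_exact" indexed="true" stored="false"
       docValues="false" multiValued="true"/>
```

Changing the type of an existing field requires a full reindex.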

Best,
Erick
