Posted to solr-user@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2014/05/19 13:33:01 UTC

How does query on AND work

Hi

Let's say I have a Solr collection (running across several servers)
containing 5 billion documents. Among other fields, each document has a
value for field "no_dlng_doc_ind_sto" (a long) and field
"timestamp_dlng_doc_ind_sto" (also a long). Both "no_dlng_doc_ind_sto"
and "timestamp_dlng_doc_ind_sto" are doc-value, indexed and stored, like
this in schema.xml:
<dynamicField name="*_dlng_doc_ind_sto" type="dlng" indexed="true" 
stored="true" required="true" docValues="true"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" 
positionIncrementGap="0" docValuesFormat="Disk"/>

I make queries like this: no_dlng_doc_ind_sto:(<NO>) AND 
timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
* The "no_dlng_doc_ind_sto:(<NO>)"-part of a typical query will hit 
between 500 and 1000 documents out of the total 5 billion
* The "timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])"-part 
of a typical query will hit between 3-4 billion documents out of the 
total 5 billion

The question is how Solr/Lucene deals with such requests.
I am thinking that using the indices on both "no_dlng_doc_ind_sto" and
"timestamp_dlng_doc_ind_sto" to get two sets of doc-ids and then
intersecting them might not be the most efficient approach. You are
intersecting two doc-id-sets of size 500-1000 and 3-4 billion. It
might be faster to just use the index for "no_dlng_doc_ind_sto" to get
the doc-ids for the 500-1000 documents, then for each of those fetch
its "timestamp_dlng_doc_ind_sto"-value (using doc-value) to filter out
the ones among the 500-1000 that do not match the timestamp-part of
the query.
But what does Solr/Lucene actually do? Is it Solr- or Lucene-code that
makes the decision on what to do? Can you somehow "hint" to the
search-engine that you want one or the other method used?
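
To make the two strategies concrete, here is a minimal Python sketch over a toy in-memory index. It is not how Lucene is implemented; the names `build_toy_index`, `intersect_strategy` and `value_filter_strategy` are made up for illustration, and the data is tiny compared to the 5 billion documents discussed here.

```python
# Toy model only: the "postings" dict plays the role of the inverted index on
# the "no" field, and the "doc_values" list plays the role of per-document
# timestamp doc values. All names here are hypothetical.

def build_toy_index(docs):
    """docs: list of (no, timestamp) tuples; the list position is the doc-id."""
    postings = {}
    doc_values = []
    for doc_id, (no, timestamp) in enumerate(docs):
        postings.setdefault(no, []).append(doc_id)
        doc_values.append(timestamp)
    return postings, doc_values


def intersect_strategy(postings, doc_values, no, t_start, t_end):
    """Build both doc-id sets, then intersect them."""
    no_ids = set(postings.get(no, []))
    # Stand-in for resolving the range through the index: this touches every
    # document matching the range, however few survive the intersection.
    range_ids = {d for d, ts in enumerate(doc_values) if t_start <= ts <= t_end}
    return sorted(no_ids & range_ids)


def value_filter_strategy(postings, doc_values, no, t_start, t_end):
    """Use the index for the selective clause only, then check doc values."""
    return [d for d in postings.get(no, []) if t_start <= doc_values[d] <= t_end]


docs = [(7, 100), (3, 150), (7, 250), (7, 900), (5, 300), (7, 300)]
postings, doc_values = build_toy_index(docs)
a = intersect_strategy(postings, doc_values, 7, 200, 400)
b = value_filter_strategy(postings, doc_values, 7, 200, 400)
print(a, b)
```

Both functions return the same doc-ids. The difference is that the second one does work proportional to the small side (the 500-1000 hits) instead of the large side (the 3-4 billion hits).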

Solr 4.4 (and corresponding Lucene), BTW, if that makes a difference

Regards, Per Steffensen

Re: How does query on AND work

Posted by Per Steffensen <st...@designware.dk>.
Thanks for responding, Yonik. I tried out your suggestion, and it seems 
to work as it is supposed to, and it performs at least as well as the 
"hacky implementation I did myself". Wish you had responded earlier. Or 
maybe not, then I wouldn't have dived into it myself making an 
implementation that does (almost) exactly what seems to be done when 
using your approach, and then I wouldn't have learned so much. But the 
great thing is that now I do not have to go suggest (or implement 
myself) this idea as a new Solr/Lucene feature - it is already there!

See 
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html. 
Hope you do not mind that I reference you and the link you pointed out.

Thanks a lot!

Regards, Per Steffensen

On 23/05/14 18:13, Yonik Seeley wrote:
> On Fri, May 23, 2014 at 11:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
>> Per Steffensen [steff@designware.dk] wrote:
>>> * It IS more efficient to just use the index for the
>>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>>> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
>>> the query.
>> Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?
> Maybe it already is?
> http://heliosearch.org/advanced-filter-caching-in-solr/
>
> Something like this:
>   &fq={!frange cache=false cost=150 v=timestampField l=beginTime u=endTime}
>
>
> -Yonik
> http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache
>


Re: How does query on AND work

Posted by Yonik Seeley <yo...@heliosearch.com>.
On Fri, May 23, 2014 at 11:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
> Per Steffensen [steff@designware.dk] wrote:
>> * It IS more efficient to just use the index for the
>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
>> the query.
>
> Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?

Maybe it already is?
http://heliosearch.org/advanced-filter-caching-in-solr/

Something like this:
 &fq={!frange cache=false cost=150 v=timestampField l=beginTime u=endTime}
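
As a rough illustration of how such a request could be assembled for the fields in this thread, here is a small Python sketch. The helper `build_params` and the numeric values are assumptions made up for the example; host, collection and request handler are left out on purpose. It relies on the documented Solr behaviour that a filter with cache=false and a cost of 100 or more is run as a post filter when the query type supports it, which frange does.

```python
from urllib.parse import urlencode, parse_qs


def build_params(no, t_start, t_end,
                 no_field="no_dlng_doc_ind_sto",
                 ts_field="timestamp_dlng_doc_ind_sto"):
    """Hypothetical helper: returns the q and fq parameters as a dict."""
    # The selective clause stays in q and is answered from the index.
    q = f"{no_field}:({no})"
    # cache=false together with cost >= 100 asks Solr to run frange as a
    # post filter, i.e. only on documents that already matched q.
    fq = f"{{!frange cache=false cost=150 l={t_start} u={t_end}}}{ts_field}"
    return {"q": q, "fq": fq}


# Placeholder values, not taken from the thread.
params = build_params(42, 1400000000000, 1400500000000)
query_string = urlencode(params)  # URL-encodes braces, spaces, etc.
print(query_string)
```

Here the function to evaluate is given after the closing brace instead of through v=, which is an equivalent way of writing the same local-params query.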


-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache

Re: How does query on AND work

Posted by Per Steffensen <st...@designware.dk>.
Well, the only "search" I did was to ask this question on this
mailing-list :-)

On 26/05/14 17:05, Alexandre Rafalovitch wrote:
> Did not follow the whole story but "post-query-value-filter" does exist in
> Solr. Have you tried searching for pretty much that expression, and maybe
> something about cost-based filters?
>
> Regards,
>      Alex


Re: How does query on AND work

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Did not follow the whole story but "post-query-value-filter" does exist in
Solr. Have you tried searching for pretty much that expression, and maybe
something about cost-based filters?

Regards,
    Alex
On 26/05/2014 6:49 pm, "Per Steffensen" <st...@designware.dk> wrote:

> I do not know if this is a special case. I guess an AND-query where one side
> hits 500-1000 and the other side hits billions is a special case. But this
> way of carrying out the query might also be an optimization in less uneven
> cases.
> It does not require that the "lots of hits"-part of the query is a
> range-query, and it does not necessarily require that the field used in
> this part is a DocValue field (you can go fetch the values from the "slow"
> store). But I guess it has to be a very uneven case if this approach should
> be faster on a non-DocValue field.
>
> I think this can be generalized. I think of it as something similar to
> being able to "hint" relational databases not to use a specific index. I
> do not know that much about Solr/Lucene query-syntax, but I believe
> "filter-queries" (fq) are kinda queries that will be AND'ed onto the real
> query (q), and in order not to have to change the query-syntax too much
> (adding hints or something), I guess a first step for a feature doing what I
> am doing here could be to introduce something similar to "filter-queries":
> queries that will be carried out on the result of (q + fqs), but looking at
> the values of the documents in that result instead of intersecting with
> doc-sets found from the index. Let's call them "post-query-value-filters"
> (yes, we can definitely come up with a better/shorter name).
>
> 1) q=no_dlng_doc_ind_sto:(<NO>) AND timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
> 2) q=no_dlng_doc_ind_sto:(<NO>), fq=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
> 3) q=no_dlng_doc_ind_sto:(<NO>), post-query-value-filter=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
>
> 1) and 2) both use the index on both no_dlng_doc_ind_sto and
> timestamp_dlng_doc_ind_sto. 3) uses only the index on no_dlng_doc_ind_sto and
> does the time-interval filter part by fetching values (using DocValue if
> possible) for timestamp_dlng_doc_ind_sto for each of the docs found through
> the no_dlng_doc_ind_sto-index to see if this doc should really be included.
>
> There are some things I did not mention initially, such as actually wanting
> to do a facet search. Here is the full story:
> http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html
>
> Regards, Per Steffensen
>
> On 23/05/14 17:37, Toke Eskildsen wrote:
>
>> Per Steffensen [steff@designware.dk] wrote:
>>
>>> * It IS more efficient to just use the index for the
>>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>>> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
>>> the query.
>>>
>> Thank you for the follow up. It sounds rather special-case though, with
>> requirement of DocValues for the range-field. Do you think this can be
>> generalized?
>>
>> - Toke Eskildsen
>>
>>
>

Re: How does query on AND work

Posted by Per Steffensen <st...@designware.dk>.
I do not know if this is a special case. I guess an AND-query where one
side hits 500-1000 and the other side hits billions is a special case.
But this way of carrying out the query might also be an optimization in
less uneven cases.
It does not require that the "lots of hits"-part of the query is a
range-query, and it does not necessarily require that the field used in
this part is a DocValue field (you can go fetch the values from the
"slow" store). But I guess it has to be a very uneven case if this
approach should be faster on a non-DocValue field.

I think this can be generalized. I think of it as something similar to
being able to "hint" relational databases not to use a specific index.
I do not know that much about Solr/Lucene query-syntax, but I believe
"filter-queries" (fq) are kinda queries that will be AND'ed onto the
real query (q), and in order not to have to change the query-syntax too
much (adding hints or something), I guess a first step for a feature
doing what I am doing here could be to introduce something similar to
"filter-queries": queries that will be carried out on the result of (q
+ fqs), but looking at the values of the documents in that result instead
of intersecting with doc-sets found from the index. Let's call them
"post-query-value-filters" (yes, we can definitely come up with a
better/shorter name).

1) q=no_dlng_doc_ind_sto:(<NO>) AND timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
2) q=no_dlng_doc_ind_sto:(<NO>), fq=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
3) q=no_dlng_doc_ind_sto:(<NO>), post-query-value-filter=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])

1) and 2) both use the index on both no_dlng_doc_ind_sto and
timestamp_dlng_doc_ind_sto. 3) uses only the index on no_dlng_doc_ind_sto
and does the time-interval filter part by fetching values (using
DocValue if possible) for timestamp_dlng_doc_ind_sto for each of the
docs found through the no_dlng_doc_ind_sto-index to see if this doc
should really be included.
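
A rough Python sketch of what variant 3) could look like in general, not limited to range conditions. This is only an illustration of the idea, not Solr or Lucene code; `run_query`, the `store` dict and the "type" field are invented for the example.

```python
# Hypothetical sketch of the "post-query-value-filter" idea: candidates come
# from the index for q (plus ordinary fqs), then each value filter inspects
# per-document values. Filters are tried cheapest first, and a document is
# dropped as soon as one filter rejects it.

def run_query(candidate_ids, value_filters, fetch_value):
    """candidate_ids: doc-ids matching q (+ ordinary fqs).
    value_filters: list of (cost, field, predicate) tuples.
    fetch_value: callable (doc_id, field) -> value, e.g. backed by doc values
    or, more slowly, by stored fields."""
    ordered = sorted(value_filters, key=lambda f: f[0])
    hits = []
    for doc_id in candidate_ids:
        # all() over a generator short-circuits on the first failing filter.
        if all(pred(fetch_value(doc_id, field)) for _, field, pred in ordered):
            hits.append(doc_id)
    return hits


# Made-up per-document values for the candidates returned by q.
store = {
    0: {"timestamp": 100, "type": "a"},
    2: {"timestamp": 250, "type": "b"},
    3: {"timestamp": 900, "type": "a"},
    5: {"timestamp": 300, "type": "a"},
}
filters = [
    (200, "type", lambda v: v == "a"),               # not a range condition
    (150, "timestamp", lambda v: 200 <= v <= 400),   # the time interval
]
hits = run_query([0, 2, 3, 5], filters, lambda d, f: store[d][f])
print(hits)
```

The cost number plays the same role as a hint: it only decides the order in which the value checks are tried, so the result is the same for any ordering.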

There are some things I did not mention initially, such as actually
wanting to do a facet search. Here is the full story:
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html

Regards, Per Steffensen

On 23/05/14 17:37, Toke Eskildsen wrote:
> Per Steffensen [steff@designware.dk] wrote:
>> * It IS more efficient to just use the index for the
>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
>> the query.
> Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?
>
> - Toke Eskildsen
>


RE: How does query on AND work

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Per Steffensen [steff@designware.dk] wrote:
> * It IS more efficient to just use the index for the
> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
> part and then fetch timestamp-doc-values for those doc-ids to filter out
> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
> the query.

Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?

- Toke Eskildsen

Re: How does query on AND work

Posted by Per Steffensen <st...@designware.dk>.
I can answer some of this myself now that I have dived into it to
understand what Solr/Lucene does and to see if it can be done better:
* In current Solr/Lucene (or at least in 4.4), the indices on both
"no_dlng_doc_ind_sto" and "timestamp_dlng_doc_ind_sto" are used, and the
doc-id-sets found are intersected to get the final set of doc-ids.
* It IS more efficient to just use the index for the
"no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
part and then fetch timestamp-doc-values for those doc-ids to filter out
the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
the query. I have made changes to our version of Solr (and Lucene) to do
that, and response times go from about 10 secs to about 1 sec (of course
dependent on what's in the file-cache etc.) in cases where
"no_dlng_doc_ind_sto" hits about 500-1000 docs and
"timestamp_dlng_doc_ind_sto" hits about 3-4 billion.

Regards, Per Steffensen
