Posted to solr-user@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2014/05/19 13:33:01 UTC
How does query on AND work
Hi
Let's say I have a Solr collection (running across several servers)
containing 5 billion documents. Among other things, each document has a
value for field "no_dlng_doc_ind_sto" (a long) and field
"timestamp_dlng_doc_ind_sto" (also a long). Both "no_dlng_doc_ind_sto"
and "timestamp_dlng_doc_ind_sto" have doc-values and are indexed and
stored. Like this in schema.xml:
<dynamicField name="*_dlng_doc_ind_sto" type="dlng" indexed="true"
stored="true" required="true" docValues="true"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0" docValuesFormat="Disk"/>
I make queries like this: no_dlng_doc_ind_sto:(<NO>) AND
timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
* The "no_dlng_doc_ind_sto:(<NO>)"-part of a typical query will hit
between 500 and 1000 documents out of the total 5 billion
* The "timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])"-part
of a typical query will hit between 3 and 4 billion documents out of the
total 5 billion
The question is how Solr/Lucene deals with such requests.
I am thinking that using the indices on both "no_dlng_doc_ind_sto" and
"timestamp_dlng_doc_ind_sto" to get two sets of doc-ids and then
intersecting them might not be the most efficient approach: you would be
intersecting two doc-id-sets of size 500-1000 and 3-4 billion. It
might be faster to use only the index for "no_dlng_doc_ind_sto" to get
the doc-ids of the 500-1000 documents, and then for each of those fetch
its "timestamp_dlng_doc_ind_sto"-value (using doc-values) to filter out
the ones among the 500-1000 that do not match the timestamp-part of
the query.
But what does Solr/Lucene actually do? Is it Solr- or Lucene-code that
makes the decision on what to do? Can you somehow "hint" the
search-engine that you want one or the other method used?
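To make the two strategies in the question concrete, here is a toy sketch in Python. It is not Solr or Lucene code: the in-memory "index", the function names and the data sizes are all made up for illustration, and the "touched" counter is only a rough stand-in for the work each strategy does.

```python
import random


def build_toy_index(num_docs, seed=0):
    """Return (postings, doc_values) for a toy collection.

    postings maps each "no" value to a sorted list of doc-ids (inverted index).
    doc_values[doc_id] is the timestamp of that doc (per-document storage).
    """
    rng = random.Random(seed)
    postings = {}
    doc_values = []
    for doc_id in range(num_docs):
        no = rng.randrange(100)
        postings.setdefault(no, []).append(doc_id)
        doc_values.append(rng.randrange(1_000))
    return postings, doc_values


def by_intersection(postings, doc_values, no, t_start, t_end):
    """Strategy A: materialize both doc-id sets, then intersect them."""
    small = set(postings.get(no, []))
    # Stand-in for resolving the range through the index: the full (large)
    # set of matching doc-ids is produced before any intersection happens.
    large = {d for d, ts in enumerate(doc_values) if t_start <= ts <= t_end}
    return sorted(small & large), len(small) + len(large)


def by_value_lookup(postings, doc_values, no, t_start, t_end):
    """Strategy B: take the small set and check each doc's timestamp value."""
    small = postings.get(no, [])
    hits = [d for d in small if t_start <= doc_values[d] <= t_end]
    return hits, len(small)


postings, doc_values = build_toy_index(10_000)
hits_a, touched_a = by_intersection(postings, doc_values, 7, 100, 900)
hits_b, touched_b = by_value_lookup(postings, doc_values, 7, 100, 900)
# Both strategies return the same hits; they differ in how many doc-ids they touch.
print(hits_a == hits_b, touched_a, touched_b)
```

Strategy B only ever looks at the documents matched by the selective clause, which is the intuition behind the question above.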
Solr 4.4 (and corresponding Lucene), BTW, if that makes a difference
Regards, Per Steffensen
Re: How does query on AND work
Posted by Per Steffensen <st...@designware.dk>.
Thanks for responding, Yonik. I tried out your suggestion, and it seems
to work as it is supposed to, and it performs at least as well as the
"hacky implementation I did myself". Wish you had responded earlier. Or
maybe not, then I wouldn't have dived into it myself making an
implementation that does (almost) exactly what seems to be done when
using your approach, and then I wouldn't have learned so much. But the
great thing is that now I do not have to go suggest (or implement
myself) this idea as a new Solr/Lucene feature - it is already there!
See
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html.
Hope you do not mind that I reference you and the link you pointed out.
Thanks a lot!
Regards, Per Steffensen
On 23/05/14 18:13, Yonik Seeley wrote:
> Maybe it already is?
> http://heliosearch.org/advanced-filter-caching-in-solr/
>
> Something like this:
> &fq={!frange cache=false cost=150 v=timestampField l=beginTime u=endTime}
Re: How does query on AND work
Posted by Yonik Seeley <yo...@heliosearch.com>.
On Fri, May 23, 2014 at 11:37 AM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:
> Per Steffensen [steff@designware.dk] wrote:
>> * It IS more efficient to just use the index for the
>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
>> the query.
>
> Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?
Maybe it already is?
http://heliosearch.org/advanced-filter-caching-in-solr/
Something like this:
&fq={!frange cache=false cost=150 v=timestampField l=beginTime u=endTime}
-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache
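As a rough illustration of the suggestion above, the following Python sketch builds a request URL that puts the selective clause in q and the wide time range in an frange filter with cache=false and cost=150. The field names come from this thread; the base URL, the sample values and the helper name build_select_url are assumptions made up for the example. In Solr, a non-cached filter with a cost of 100 or more is run as a post filter when its query type supports that, which frange does.

```python
from urllib.parse import parse_qs, urlencode


def build_select_url(base_url, no, t_start, t_end):
    """Build a /select URL: selective term in q, time range as an frange fq."""
    # cache=false plus cost >= 100 asks Solr to evaluate the filter only for
    # documents that already matched q (and any cheaper filters).
    frange = (
        "{!frange cache=false cost=150 "
        f"v=timestamp_dlng_doc_ind_sto l={t_start} u={t_end}}}"
    )
    params = [
        ("q", f"no_dlng_doc_ind_sto:({no})"),
        ("fq", frange),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, collection and values.
url = build_select_url(
    "http://localhost:8983/solr/collection1", 12345, 1400000000000, 1400086400000
)
print(url)
# Decode the query string again to inspect the filter as Solr would receive it.
print(parse_qs(url.split("?", 1)[1])["fq"][0])
```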
Re: How does query on AND work
Posted by Per Steffensen <st...@designware.dk>.
Well, the only "search" I did was to ask this question on this
mailing-list :-)
On 26/05/14 17:05, Alexandre Rafalovitch wrote:
> I did not follow the whole story, but a "post-query-value-filter" does exist in
> Solr. Have you tried searching for pretty much that expression, and maybe
> something about cost-based filters?
>
> Regards,
> Alex
Re: How does query on AND work
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I did not follow the whole story, but a "post-query-value-filter" does exist in
Solr. Have you tried searching for pretty much that expression, and maybe
something about cost-based filters?
Regards,
Alex
On 26/05/2014 6:49 pm, "Per Steffensen" <st...@designware.dk> wrote:
> [...] I guess a first step for a feature doing what I am doing here could be
> to introduce something similar to "filter-queries" - queries that will be
> carried out on the result of (q + fqs) but looking at the values of the
> documents in that result instead of intersecting with doc-sets found from
> the index. Let's call it "post-query-value-filter"s [...]
Re: How does query on AND work
Posted by Per Steffensen <st...@designware.dk>.
I do not know whether this is a special case. I guess an AND-query where
one side hits 500-1000 and the other side hits billions is a special case.
But this way of carrying out the query might also be an optimization in
less uneven cases.
It does not require that the "lots of hits"-part of the query is a
range-query, and it does not necessarily require that the field used in
this part has DocValues (you can fetch the values from the "slow" store).
But I guess the case has to be very uneven for this approach to be
faster on a non-DocValues field.
I think this can be generalized. I think of it as something similar to
being able to "hint" a relational database not to use a specific index.
I do not know that much about Solr/Lucene query-syntax, but I believe
"filter-queries" (fq) are queries that are AND'ed onto the real query
(q). In order not to have to change the query-syntax too much (adding
hints or something), I guess a first step for a feature doing what I am
doing here could be to introduce something similar to "filter-queries":
queries that are carried out on the result of (q + fqs), but by looking
at the values of the documents in that result instead of intersecting
with doc-sets found from the index. Let's call them
"post-query-value-filter"s (yes, we can definitely come up with a
better/shorter name).
1) q=no_dlng_doc_ind_sto:(<NO>) AND timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
2) q=no_dlng_doc_ind_sto:(<NO>), fq=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
3) q=no_dlng_doc_ind_sto:(<NO>), post-query-value-filter=timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
1) and 2) both use the index on both no_dlng_doc_ind_sto and
timestamp_dlng_doc_ind_sto. 3) uses only the index on no_dlng_doc_ind_sto
and does the time-interval filtering by fetching the
timestamp_dlng_doc_ind_sto value (using DocValues if possible) for each
of the docs found through the no_dlng_doc_ind_sto-index, to see whether
the doc should really be included.
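The generalized idea can be sketched in a few lines of Python. This is only an illustration of the proposed "post-query-value-filter", not how Solr implements anything: the function name, the candidate list and the timestamp map are hypothetical. It shows that the check applied to each fetched value can be any predicate, not just a range.

```python
from typing import Callable, Dict, Iterable, List


def post_query_value_filter(
    candidate_doc_ids: Iterable[int],
    values: Dict[int, int],
    predicate: Callable[[int], bool],
) -> List[int]:
    """Keep the candidates whose per-document value satisfies the predicate.

    Documents without a value for the field are dropped.
    """
    kept = []
    for doc_id in candidate_doc_ids:
        value = values.get(doc_id)
        if value is not None and predicate(value):
            kept.append(doc_id)
    return kept


# Hypothetical data: doc-ids produced by (q + fqs) and their timestamp values.
candidates = [3, 8, 15, 42]
timestamps = {3: 1000, 8: 5000, 15: 2500, 42: 9000}

# Range check, as in variant 3) above.
in_range = post_query_value_filter(
    candidates, timestamps, lambda ts: 2000 <= ts <= 6000
)
# A non-range check works the same way.
in_set = post_query_value_filter(
    candidates, timestamps, lambda ts: ts in {1000, 9000}
)
print(in_range, in_set)
```

The cost of this step grows with the number of candidates rather than with the number of documents matching the predicate, which is why it pays off when (q + fqs) is selective.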
There are some things that I did not mention initially, such as actually
wanting to do a facet search. Well, here is the full story:
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html
Regards, Per Steffensen
On 23/05/14 17:37, Toke Eskildsen wrote:
> Per Steffensen [steff@designware.dk] wrote:
>> * It IS more efficient to just use the index for the
>> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
>> part and then fetch timestamp-doc-values for those doc-ids to filter out
>> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
>> the query.
> Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?
>
> - Toke Eskildsen
>
RE: How does query on AND work
Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Per Steffensen [steff@designware.dk] wrote:
> * It IS more efficient to just use the index for the
> "no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
> part and then fetch timestamp-doc-values for those doc-ids to filter out
> the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
> the query.
Thank you for the follow up. It sounds rather special-case though, with requirement of DocValues for the range-field. Do you think this can be generalized?
- Toke Eskildsen
Re: How does query on AND work
Posted by Per Steffensen <st...@designware.dk>.
I can answer some of this myself now that I have dived into it to
understand what Solr/Lucene does and to see if it can be done better.
* In current Solr/Lucene (or at least in 4.4), the indices on both
"no_dlng_doc_ind_sto" and "timestamp_dlng_doc_ind_sto" are used, and the
doc-id-sets found are intersected to get the final set of doc-ids.
* It IS more efficient to just use the index for the
"no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
part and then fetch timestamp-doc-values for those doc-ids to filter out
the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of
the query. I have made changes to our version of Solr (and Lucene) to do
that, and response times go from about 10 secs to about 1 sec (of course
dependent on what is in the file-cache etc.) in cases where
"no_dlng_doc_ind_sto" hits about 500-1000 docs and
"timestamp_dlng_doc_ind_sto" hits about 3-4 billion.
Regards, Per Steffensen