Posted to solr-user@lucene.apache.org by Faraz Fallahi <fa...@googlemail.com> on 2017/12/04 13:54:35 UTC

Re: Huge Query execution time for multiple ORs

Hi guys,

Sorry to bother you again, but I am really confused:

I've used the Solr admin UI to create a query with lots of ORs, using Solr
4.7.

When I execute the query without a sort, it takes roughly 3.5 - 4
seconds.
When I execute it with a sort on a field called pubdate, it takes about
4 - 4.5 seconds.
When I execute it with a sort on the guid field, it takes about 7 - 8
seconds!

After your explanations I was expecting the query without a sort to be the
slowest. What am I missing here?

Best regards
Faraz

On 30.11.2017 at 09:29, "Faraz Fallahi" <fa...@googlemail.com> wrote:

> Uff... I see. Thanks for the explanation :)
>
> On 30.11.2017 at 3:13 PM, "Emir Arnautović" <
> emir.arnautovic@sematext.com> wrote:
>
>> Hi Faraz,
>> It is a bit worse than that - it also needs to calculate the score, so for
>> each matching doc of one query part it has to check whether it appears in
>> the results of the other query parts. If you use the term query parser, you
>> avoid calculating the score - all docs will have a score of 1.
>> Solr is based on Lucene, which is mainly an inverted index
>> (https://en.wikipedia.org/wiki/Inverted_index), so knowing that helps you
>> understand how expensive some queries are. It is relatively easy to figure
>> out what steps are needed for different query types. Of course, Lucene
>> includes a lot of smartness, and it is probably not using the naive
>> approach, but it cannot avoid the limitations of an inverted index.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 30 Nov 2017, at 02:39, Faraz Fallahi <fa...@googlemail.com> wrote:
>> >
>> > Hi Toke,
>> >
>> > Just to be clear and to make sure I understand: does this mean that a
>> > query of the form
>> > author:name1 OR author:name2 OR author:name3
>> >
>> > is processed like this, for example:
>> >
>> > 1 query against the index with author:name1, getting 4 results
>> > Then 1 query against the index with author:name2, getting 3 results
>> > Then 1 query against the index with author:name3, getting 1 result
>> >
>> > And in the end all results are merged and I get a result of 8?
>> >
>> > So a query with a thousand authors will be split into a thousand single
>> > queries against the index?
>> >
>> > Do I understand this correctly?
>> >
>> > Thanks for the help
>> > Faraz
>> >
>> >
>> > On 28.11.2017 at 15:39, "Toke Eskildsen" <to...@kb.dk> wrote:
>> >
>> > On Tue, 2017-11-28 at 11:07 +0100, Faraz Fallahi wrote:
>> >> I have a question regarding Solr queries.
>> >> My query basically contains thousands of OR conditions for authors
>> >> (author:name1 OR author:name2 OR author:name3 OR author:name4 ...)
>> >> The execution time on my index is huge (around 15 sec). When I tag
>> >> all the associated documents with a custom field and value like
>> >> authorlist:1 and then change my query to just search for
>> >> authorlist:1, it executes in 78 ms. How come there is such a big
>> >> difference in execution time?
>> >
>> > Due to the nature of inverted indexes (which lie at the heart of
>> > Solr), your thousands of OR-queries mean thousands of lookups, whereas
>> > your authorlist means a single lookup. In addition, the results for
>> > each author need to be merged with the other author results; for
>> > authorlist the results are there directly.
>> >
>> > If your author lists are static, indexing them as you did in your test
>> > is the best solution.
>> >
>> > If they are not static, using a filter query will ensure that they are
>> > at least cached for subsequent calls, so that only the first call will
>> > be slow.
>> >
>> > If they are semi-static and there are not too many of them, you could
>> > run warm-up filter queries for all the different groups so that the
>> > users do not pay the first-call penalty. This requires your filter
>> > cache to be large enough to hold all the author lists.
>> >
>> > - Toke Eskildsen, Royal Danish Library
>>
>>
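
The lookup-and-merge cost described in the quoted explanation can be
illustrated with a toy inverted index. This is only a sketch under stated
assumptions, not how Lucene is implemented: the documents, the helper names
build_index and or_query, and the lookup counter are all made up for
illustration. The author counts (4, 3 and 1) follow the example in the
quoted question.

```python
from collections import defaultdict

# Toy corpus (hypothetical): doc id -> fields. name1 has 4 docs, name2 has 3,
# name3 has 1, mirroring the numbers used in the thread.
docs = {
    0: {"author": "name1"}, 1: {"author": "name1"},
    2: {"author": "name1"}, 3: {"author": "name1"},
    4: {"author": "name2"}, 5: {"author": "name2"}, 6: {"author": "name2"},
    7: {"author": "name3"},
    8: {"author": "name4"},
}
wanted = ["name1", "name2", "name3"]


def build_index(docs):
    """Map each (field, term) pair to the list of doc ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for field, term in docs[doc_id].items():
            index[(field, term)].append(doc_id)
    return index


def or_query(index, field, terms):
    """One postings lookup per term, then a union of all postings lists."""
    lookups = 0
    hits = set()
    for term in terms:
        lookups += 1
        hits.update(index.get((field, term), []))
    return sorted(hits), lookups


# Index-time tagging: every doc of a wanted author also gets authorlist=1.
tagged = {
    doc_id: dict(fields, authorlist="1") if fields["author"] in wanted else fields
    for doc_id, fields in docs.items()
}

or_hits, or_lookups = or_query(build_index(docs), "author", wanted)
tag_hits, tag_lookups = or_query(build_index(tagged), "authorlist", ["1"])

print(len(or_hits), "hits via", or_lookups, "lookups")
print(len(tag_hits), "hits via", tag_lookups, "lookup")
```

Both queries return the same 8 documents, but the OR form needs one lookup
per author plus the merge, while the tagged form needs a single lookup. With
thousands of authors the first number grows accordingly.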

Re: Huge Query execution time for multiple ORs

Posted by Faraz Fallahi <fa...@googlemail.com>.

Will do, thanks.

On 04.12.2017 at 9:27 PM, "Emir Arnautović" <
emir.arnautovic@sematext.com> wrote:

> Hi Faraz,
> When you say "query without sort", I assume you mean that you omit the sort
> parameter, so you expect results to be sorted by score. That is expected to
> be slower than the equivalent query that does not calculate a score - e.g.
> the same query run as an fq.
> What you observe can be explained by:
> * Solr calculating the score even when not sorting by score and not
> returning it (do you return the score? I am not sure about this one - I did
> not check the code)
> * The field that you are using for sorting not having doc values, so it has
> to be uninverted
> * The field that you are using for sorting not being in the OS cache, so it
> is read from disk.
>
> Try comparing the same query run as q=... and as fq=... Make sure that your
> filter cache is disabled if you are repeating the same queries and
> averaging.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>

Re: Huge Query execution time for multiple ORs

Posted by Emir Arnautović <em...@sematext.com>.

Hi Faraz,
When you say "query without sort", I assume you mean that you omit the sort parameter, so you expect results to be sorted by score. That is expected to be slower than the equivalent query that does not calculate a score - e.g. the same query run as an fq.
What you observe can be explained by:
* Solr calculating the score even when not sorting by score and not returning it (do you return the score? I am not sure about this one - I did not check the code)
* The field that you are using for sorting not having doc values, so it has to be uninverted
* The field that you are using for sorting not being in the OS cache, so it is read from disk.

Try comparing the same query run as q=... and as fq=... Make sure that your filter cache is disabled if you are repeating the same queries and averaging.
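
A minimal sketch of how the suggested comparison could be set up, assuming a Solr instance at a hypothetical URL and a three-name stand-in for the real author list. The host, core name and the helper select_url are made up for illustration; only the parameters q, fq, sort, rows and fl and the {!cache=false} local parameter are standard Solr request syntax. The script only builds the request URLs; you would then request each one and compare the QTime values in the response headers.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical endpoint: replace host, port and core name with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"

authors = ["name1", "name2", "name3"]  # stand-in for the real author list
clause = " OR ".join(f"author:{a}" for a in authors)


def select_url(params):
    """Build a select URL with properly encoded query parameters."""
    return SOLR_SELECT + "?" + urlencode(params)


# Variant 1: the OR clause as the main query, so matches are scored.
scored = select_url({"q": clause, "rows": 10, "fl": "guid"})

# Variant 2: match everything in q and restrict through fq, so the OR clause
# contributes no score. {!cache=false} keeps this filter out of the filter
# cache, so repeated runs are not simply answered from the cache.
uncached_filter = "{!cache=false}" + clause
filtered = select_url({"q": "*:*", "fq": uncached_filter,
                       "rows": 10, "fl": "guid"})

# Variant 3: same as variant 2, plus an explicit sort on the guid field.
sorted_by_guid = select_url({"q": "*:*", "fq": uncached_filter,
                             "rows": 10, "fl": "guid", "sort": "guid asc"})

for url in (scored, filtered, sorted_by_guid):
    print(url)
```

Running each variant several times and discarding the first run helps separate the cost of reading the sort field from disk from the cost of the query itself.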

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


