You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Carlos Gonzalez-Cadenas <cg...@experienceon.com> on 2012/03/14 21:01:05 UTC

problems with DisjunctionMaxQuery and early-termination

Hello all,

We have a SOLR index filled with user queries and we want to retrieve the
ones that are more similar to a given query entered by an end-user. It is
kind of a "related queries" system.

The index is pretty big and we're using early-termination of queries (with
the index sorted so that the "more popular" queries have lower docids and
therefore the termination yields higher-quality results)

Clearly, when the user enters a user-level query into the search box, i.e.
"cheap hotels barcelona offers", we don't know whether there exists a
document (query) in the index that contains these four words or not.
 Therefore, when we're building the SOLR query, the first intuition would
be to do a query like this "cheap OR hotels OR barcelona OR offers".

If all the documents in the index where evaluated, the results of this
query would be good. For example, if there is no query in the index with
these four words but there's a query in the index with the text "cheap
hotels barcelona", it will probably be one of the top results, which is
precisely what we want.

The problem is that we're doing early termination and therefore this query
will exhaust very fast the top-K result limit (our custom collector limits
on the number of evaluated documents), given that queries like "hotels in
madrid" or "hotels in NYC" will match the OR expression described above
(because they all match "hotels").

Our next step was to think in a DisjunctionMaxQuery, trying to write a
query like this:

DisjunctionMaxQuery:
 1) +cheap +hotels +barcelona +offers
 2) +cheap +hotels +barcelona
 3) +cheap +hotels
 4) +hotels

We were thinking that perhaps the sub-queries within the
DisjunctionMaxQuery were going to get evaluated in "parallel" given that
they're separated queries, but in fact from a runtime perspective it does
behave in a similar way than the OR query that we described above.

Our desired behavior is to try match documents with each subquery within
the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then
score them and return them all together (therefore we don't want all the
matches being done by a single sub-query, like it's happening now).

Clearly, we could create a script external to SOLR that just runs the
several sub-queries as standalone queries and then joins all the results
together, but before going for this we'd like to know if you have any ideas
on how to solve this problem within SOLR. We do have our own QParser, and
therefore we'd be able to implement any arbitrary query construction that
you can come up with, or even create a new Query type if it's needed.

Thanks a lot for your help,
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas

problems with DisjunctionMaxQuery and early-termination

Posted by Carlos Gonzalez-Cadenas <cg...@experienceon.com>.
Hello all,

We have a SOLR index filled with user queries and we want to retrieve the
ones that are more similar to a given query entered by an end-user. It is
kind of a "related queries" system.

The index is pretty big and we're using early-termination of queries (with
the index sorted so that the "more popular" queries have lower docids and
therefore the termination yields higher-quality results)

Clearly, when the user enters a user-level query into the search box, i.e.
"cheap hotels barcelona offers", we don't know whether there exists a
document (query) in the index that contains these four words or not.
 Therefore, when we're building the SOLR query, the first intuition would
be to do a query like this "cheap OR hotels OR barcelona OR offers".

If all the documents in the index where evaluated, the results of this
query would be good. For example, if there is no query in the index with
these four words but there's a query in the index with the text "cheap
hotels barcelona", it will probably be one of the top results, which is
precisely what we want.

The problem is that we're doing early termination and therefore this query
will exhaust very fast the top-K result limit (our custom collector limits
on the number of evaluated documents), given that queries like "hotels in
madrid" or "hotels in NYC" will match the OR expression described above
(because they all match "hotels").

Our next step was to think in a DisjunctionMaxQuery, trying to write a
query like this:

DisjunctionMaxQuery:
 1) +cheap +hotels +barcelona +offers
 2) +cheap +hotels +barcelona
 3) +cheap +hotels
 4) +hotels

We were thinking that perhaps the sub-queries within the
DisjunctionMaxQuery were going to get evaluated in "parallel" given that
they're separated queries, but in fact from a runtime perspective it does
behave in a similar way than the OR query that we described above.

Our desired behavior is to try match documents with each subquery within
the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then
score them and return them all together (therefore we don't want all the
matches being done by a single sub-query, like it's happening now).

Clearly, we could create a script external to SOLR that just runs the
several sub-queries as standalone queries and then joins all the results
together, but before going for this we'd like to know if you have any ideas
on how to solve this problem within SOLR. We do have our own QParser, and
therefore we'd be able to implement any arbitrary query construction that
you can come up with, or even create a new Query type if it's needed.

Thanks a lot for your help,
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas

Re: problems with DisjunctionMaxQuery and early-termination

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Fri, Mar 16, 2012 at 8:38 PM, Carlos Gonzalez-Cadenas <
cgc@experienceon.com> wrote:

> On Fri, Mar 16, 2012 at 9:26 AM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> Hello Carlos,
>>
>
>> so, search all terms with MUST first, you've got the best result in terms
>> of precision and recall if you've got something. Otherwise you still have a
>> lot of time. You need to drop one of the words or switch ones of them into
>> SHOULD.
>
>
> Agree, this is precisely what we're trying to do (the idea of having
> multiple queries, from narrow to broad). My question was more of a
> practical nature, that is, how can we do these queries without really
> having to do independent SOLR queries.
>

Sorry I forget this question. I did a tiny DelegateRequestHandler
which sequentially iterates through list of other request handlers until
the first not empty results has been found. Every of slave RequestHandlers
has own QParser params e.g. I search for conjunction, then apply spelling
correction, and after that I search for disjunction. You can see from my
"theory" that the total time is equal to the time of the last successful
search.


> Now we use DisjunctionMaxQuery, but it has the problems that I described
> in my former email w.r.t. early-termination.
>
> This morning we found two potential directions that might work (we're
> testing them as of now):
>
>    1. Implement a custom RequestHandler and execute several queries
>    within SOLR (https://issues.apache.org/jira/browse/SOLR-1093). This is
>    better than executing them from outside and having all the network / HTTP /
>    ... overhead, but still not very good.
>    2. Modify DisjunctionMaxQuery. In particular, modifying DisjunctionMaxScorer
>    so that it doesn't use a min heap for the subscorers. We'll try several
>    strategies to collect documents from the child subscorers, like round-robin
>    or collecting the narrower subscorers first and then go broader until the
>    upstream collector stops the collection. This looks like the most
>    interesting option.
>
>
> Enumerating all combinations is NPcomplete task I believe. But you have a
>> good heuristics:
>> * zero docFreq means that you can drop this term off or pass it through
>> spell correction
>> * if you have a instant suggest like app and has zero result for some
>> phrase, maybe dropping the last word gives you the phrase which had some
>> results before, and present in cache.
>> * otherwise excluding less frequent term from conjunction probably gives
>> non-zero results
>>
>
> This is not a problem in practice. We're using a bunch of heuristics in
> our QueryParser (including a lot of info extracted from the TermsEnum,
> stopword lists, etc ...) to severely cut the space.
>
> Thanks
> Carlos
>
>
>
>>
>> Regards
>>
>>
>> On Thu, Mar 15, 2012 at 12:01 AM, Carlos Gonzalez-Cadenas <
>> cgc@experienceon.com> wrote:
>>
>>> Hello all,
>>>
>>> We have a SOLR index filled with user queries and we want to retrieve the
>>> ones that are more similar to a given query entered by an end-user. It is
>>> kind of a "related queries" system.
>>>
>>> The index is pretty big and we're using early-termination of queries
>>> (with
>>> the index sorted so that the "more popular" queries have lower docids and
>>> therefore the termination yields higher-quality results)
>>>
>>> Clearly, when the user enters a user-level query into the search box,
>>> i.e.
>>> "cheap hotels barcelona offers", we don't know whether there exists a
>>> document (query) in the index that contains these four words or not.
>>>  Therefore, when we're building the SOLR query, the first intuition would
>>> be to do a query like this "cheap OR hotels OR barcelona OR offers".
>>>
>>> If all the documents in the index where evaluated, the results of this
>>> query would be good. For example, if there is no query in the index with
>>> these four words but there's a query in the index with the text "cheap
>>> hotels barcelona", it will probably be one of the top results, which is
>>> precisely what we want.
>>>
>>> The problem is that we're doing early termination and therefore this
>>> query
>>> will exhaust very fast the top-K result limit (our custom collector
>>> limits
>>> on the number of evaluated documents), given that queries like "hotels in
>>> madrid" or "hotels in NYC" will match the OR expression described above
>>> (because they all match "hotels").
>>>
>>> Our next step was to think in a DisjunctionMaxQuery, trying to write a
>>> query like this:
>>>
>>> DisjunctionMaxQuery:
>>>  1) +cheap +hotels +barcelona +offers
>>>  2) +cheap +hotels +barcelona
>>>  3) +cheap +hotels
>>>  4) +hotels
>>>
>>> We were thinking that perhaps the sub-queries within the
>>> DisjunctionMaxQuery were going to get evaluated in "parallel" given that
>>> they're separated queries, but in fact from a runtime perspective it does
>>> behave in a similar way than the OR query that we described above.
>>>
>>> Our desired behavior is to try match documents with each subquery within
>>> the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then
>>> score them and return them all together (therefore we don't want all the
>>> matches being done by a single sub-query, like it's happening now).
>>>
>>> Clearly, we could create a script external to SOLR that just runs the
>>> several sub-queries as standalone queries and then joins all the results
>>> together, but before going for this we'd like to know if you have any
>>> ideas
>>> on how to solve this problem within SOLR. We do have our own QParser, and
>>> therefore we'd be able to implement any arbitrary query construction that
>>> you can come up with, or even create a new Query type if it's needed.
>>>
>>> Thanks a lot for your help,
>>> Carlos
>>>
>>>
>>> Carlos Gonzalez-Cadenas
>>> CEO, ExperienceOn - New generation search
>>> http://www.experienceon.com
>>>
>>> Mobile: +34 652 911 201
>>> Skype: carlosgonzalezcadenas
>>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Lucid Certified
>> Apache Lucene/Solr Developer
>> Grid Dynamics
>>
>> <http://www.griddynamics.com>
>>  <mk...@griddynamics.com>
>>
>>
>


-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: problems with DisjunctionMaxQuery and early-termination

Posted by Carlos Gonzalez-Cadenas <cg...@experienceon.com>.
On Fri, Mar 16, 2012 at 9:26 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello Carlos,
>

Hello Mikhail:

Thanks for your answer.


>
> I have two concerns about your approach. First-K (not top-K honestly)
> collector approach impacts recall of your search and using disjunctive
> queries impacts precision e.g. I want to find some fairly small and quiet,
> and therefore unpopular "Lemond Hotel" you parse my phrase into Lemond OR
> Hotel and return 1K of popular hotels but not Lemond one because it's
> nearly a hapax. So, I don't believe that it's a great search.
>

Yes, I agree that OR queries combined with top-K (or first-K as you say)
doesn't work very well (your results will be full of very popular yet not
very precise matches) and this is also what I tried to explain in my email.


> And the also concern from the end of your letter is about joining separate
> query result. I'd like to remind that absolute scores from the different
> queries are not comparable at all, and maybe, but I'm not sure the relative
> ones scaled by max score are comparable.

 I suppose you need conjunctive queries instead. And the great stuff about
> them is "not-found for free" getting the zero result found cost is
> proportional to number of query terms i.e. miserable.
> so, search all terms with MUST first, you've got the best result in terms
> of precision and recall if you've got something. Otherwise you still have a
> lot of time. You need to drop one of the words or switch ones of them into
> SHOULD.


Agree, this is precisely what we're trying to do (the idea of having
multiple queries, from narrow to broad). My question was more of a
practical nature, that is, how can we do these queries without really
having to do independent SOLR queries. Now we use DisjunctionMaxQuery, but
it has the problems that I described in my former email w.r.t.
early-termination.

This morning we found two potential directions that might work (we're
testing them as of now):

   1. Implement a custom RequestHandler and execute several queries within
   SOLR (https://issues.apache.org/jira/browse/SOLR-1093). This is better
   than executing them from outside and having all the network / HTTP / ...
   overhead, but still not very good.
   2. Modify DisjunctionMaxQuery. In particular, modifying DisjunctionMaxScorer
   so that it doesn't use a min heap for the subscorers. We'll try several
   strategies to collect documents from the child subscorers, like round-robin
   or collecting the narrower subscorers first and then go broader until the
   upstream collector stops the collection. This looks like the most
   interesting option.


Enumerating all combinations is NPcomplete task I believe. But you have a
> good heuristics:
> * zero docFreq means that you can drop this term off or pass it through
> spell correction
> * if you have a instant suggest like app and has zero result for some
> phrase, maybe dropping the last word gives you the phrase which had some
> results before, and present in cache.
> * otherwise excluding less frequent term from conjunction probably gives
> non-zero results
>

This is not a problem in practice. We're using a bunch of heuristics in our
QueryParser (including a lot of info extracted from the TermsEnum, stopword
lists, etc ...) to severely cut the space.

Thanks
Carlos



>
> Regards
>
>
> On Thu, Mar 15, 2012 at 12:01 AM, Carlos Gonzalez-Cadenas <
> cgc@experienceon.com> wrote:
>
>> Hello all,
>>
>> We have a SOLR index filled with user queries and we want to retrieve the
>> ones that are more similar to a given query entered by an end-user. It is
>> kind of a "related queries" system.
>>
>> The index is pretty big and we're using early-termination of queries (with
>> the index sorted so that the "more popular" queries have lower docids and
>> therefore the termination yields higher-quality results)
>>
>> Clearly, when the user enters a user-level query into the search box, i.e.
>> "cheap hotels barcelona offers", we don't know whether there exists a
>> document (query) in the index that contains these four words or not.
>>  Therefore, when we're building the SOLR query, the first intuition would
>> be to do a query like this "cheap OR hotels OR barcelona OR offers".
>>
>> If all the documents in the index where evaluated, the results of this
>> query would be good. For example, if there is no query in the index with
>> these four words but there's a query in the index with the text "cheap
>> hotels barcelona", it will probably be one of the top results, which is
>> precisely what we want.
>>
>> The problem is that we're doing early termination and therefore this query
>> will exhaust very fast the top-K result limit (our custom collector limits
>> on the number of evaluated documents), given that queries like "hotels in
>> madrid" or "hotels in NYC" will match the OR expression described above
>> (because they all match "hotels").
>>
>> Our next step was to think in a DisjunctionMaxQuery, trying to write a
>> query like this:
>>
>> DisjunctionMaxQuery:
>>  1) +cheap +hotels +barcelona +offers
>>  2) +cheap +hotels +barcelona
>>  3) +cheap +hotels
>>  4) +hotels
>>
>> We were thinking that perhaps the sub-queries within the
>> DisjunctionMaxQuery were going to get evaluated in "parallel" given that
>> they're separated queries, but in fact from a runtime perspective it does
>> behave in a similar way than the OR query that we described above.
>>
>> Our desired behavior is to try match documents with each subquery within
>> the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then
>> score them and return them all together (therefore we don't want all the
>> matches being done by a single sub-query, like it's happening now).
>>
>> Clearly, we could create a script external to SOLR that just runs the
>> several sub-queries as standalone queries and then joins all the results
>> together, but before going for this we'd like to know if you have any
>> ideas
>> on how to solve this problem within SOLR. We do have our own QParser, and
>> therefore we'd be able to implement any arbitrary query construction that
>> you can come up with, or even create a new Query type if it's needed.
>>
>> Thanks a lot for your help,
>> Carlos
>>
>>
>> Carlos Gonzalez-Cadenas
>> CEO, ExperienceOn - New generation search
>> http://www.experienceon.com
>>
>> Mobile: +34 652 911 201
>> Skype: carlosgonzalezcadenas
>> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>

Re: problems with DisjunctionMaxQuery and early-termination

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Carlos,

I have two concerns about your approach. First-K (not top-K honestly)
collector approach impacts recall of your search and using disjunctive
queries impacts precision e.g. I want to find some fairly small and quiet,
and therefore unpopular "Lemond Hotel" you parse my phrase into Lemond OR
Hotel and return 1K of popular hotels but not Lemond one because it's
nearly a hapax. So, I don't believe that it's a great search.
And the also concern from the end of your letter is about joining separate
query result. I'd like to remind that absolute scores from the different
queries are not comparable at all, and maybe, but I'm not sure the relative
ones scaled by max score are comparable.
I suppose you need conjunctive queries instead. And the great stuff about
them is "not-found for free" getting the zero result found cost is
proportional to number of query terms i.e. miserable.
so, search all terms with MUST first, you've got the best result in terms
of precision and recall if you've got something. Otherwise you still have a
lot of time. You need to drop one of the words or switch ones of them into
SHOULD. Enumerating all combinations is NPcomplete task I believe. But you
have a good heuristics:
* zero docFreq means that you can drop this term off or pass it through
spell correction
* if you have a instant suggest like app and has zero result for some
phrase, maybe dropping the last word gives you the phrase which had some
results before, and present in cache.
* otherwise excluding less frequent term from conjunction probably gives
non-zero results

Regards

On Thu, Mar 15, 2012 at 12:01 AM, Carlos Gonzalez-Cadenas <
cgc@experienceon.com> wrote:

> Hello all,
>
> We have a SOLR index filled with user queries and we want to retrieve the
> ones that are more similar to a given query entered by an end-user. It is
> kind of a "related queries" system.
>
> The index is pretty big and we're using early-termination of queries (with
> the index sorted so that the "more popular" queries have lower docids and
> therefore the termination yields higher-quality results)
>
> Clearly, when the user enters a user-level query into the search box, i.e.
> "cheap hotels barcelona offers", we don't know whether there exists a
> document (query) in the index that contains these four words or not.
>  Therefore, when we're building the SOLR query, the first intuition would
> be to do a query like this "cheap OR hotels OR barcelona OR offers".
>
> If all the documents in the index where evaluated, the results of this
> query would be good. For example, if there is no query in the index with
> these four words but there's a query in the index with the text "cheap
> hotels barcelona", it will probably be one of the top results, which is
> precisely what we want.
>
> The problem is that we're doing early termination and therefore this query
> will exhaust very fast the top-K result limit (our custom collector limits
> on the number of evaluated documents), given that queries like "hotels in
> madrid" or "hotels in NYC" will match the OR expression described above
> (because they all match "hotels").
>
> Our next step was to think in a DisjunctionMaxQuery, trying to write a
> query like this:
>
> DisjunctionMaxQuery:
>  1) +cheap +hotels +barcelona +offers
>  2) +cheap +hotels +barcelona
>  3) +cheap +hotels
>  4) +hotels
>
> We were thinking that perhaps the sub-queries within the
> DisjunctionMaxQuery were going to get evaluated in "parallel" given that
> they're separated queries, but in fact from a runtime perspective it does
> behave in a similar way than the OR query that we described above.
>
> Our desired behavior is to try match documents with each subquery within
> the DisjunctionMaxQuery (up to a per-subquery limit that we put) and then
> score them and return them all together (therefore we don't want all the
> matches being done by a single sub-query, like it's happening now).
>
> Clearly, we could create a script external to SOLR that just runs the
> several sub-queries as standalone queries and then joins all the results
> together, but before going for this we'd like to know if you have any ideas
> on how to solve this problem within SOLR. We do have our own QParser, and
> therefore we'd be able to implement any arbitrary query construction that
> you can come up with, or even create a new Query type if it's needed.
>
> Thanks a lot for your help,
> Carlos
>
>
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
>
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>