Posted to solr-user@lucene.apache.org by michael dürr <du...@gmail.com> on 2020/10/28 07:48:25 UTC

Simulate facet.exists for json query facets

Hi,

I use JSON facets of type 'query'. As these queries are pretty slow and I'm
only interested in whether there is a match or not, I'd like to restrict
the query execution similar to the standard faceting (like with the
facet.exists parameter). My simplified query looks something like this (in
reality *:* may be replaced by a complex edismax query and multiple
subfacets similar to "tour" occur):

curl http://localhost:8983/solr/portal/select -d \
"q=*:*\
&json.facet={
  tour:{
    type : query,
     q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
  }
}\
&rows=0"

Is there any possibility to modify my request to ensure that the facet
query stops as soon as it matches a hit for the first time?
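
For reference, here is a minimal client-side sketch of what "only interested in whether there is a match" amounts to with the counts Solr returns today. It does not make the facet queries any cheaper; it only reduces each facet count to a boolean. The helper name facet_exists and the sample response are made up for illustration, and the response layout (a top-level "facets" object with a "count" per query facet) is assumed from typical JSON Facet API output.

```python
def facet_exists(response, names):
    """Map each facet name to True if its query facet matched at least one document.

    A facet that is missing from the response is treated as "no match".
    """
    facets = response.get("facets", {})
    return {name: facets.get(name, {}).get("count", 0) > 0 for name in names}


# Hand-written sample response, not real Solr output.
sample_response = {
    "facets": {
        "count": 1234,
        "tour": {"count": 57},
        "hut": {"count": 0},
    }
}

print(facet_exists(sample_response, ["tour", "hut", "event"]))
```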

Thanks!
Michael

Re: Simulate facet.exists for json query facets

Posted by Michael Gibney <mi...@michaelgibney.net>.

> If all of those facet queries are _known_ to be a performance hit,
> you might be able to do something custom. That would require
> custom code though and I wouldn’t go there unless you can
> demonstrate need.

Yeah ... indeed if those facet queries are relatively static (and thus
cacheable ... even if there are a lot of them), an appropriately-sized
filterCache would allow them to be cached to good effect and then the
performance hit should be negligible. Knowing what the queries are up
front, you could even add them to your warming queries.

It'd also be unusual (though possible, sure?) to run these kinds of
facet queries with no intention of ever conditionally following up in
a way that would want the actual results/docSet -- even if the
initial/more common query only cares about boolean existence.

The case in which this type of functionality really might be indicated is:
1. only care about boolean result (obvious, ok)
2. dynamic (i.e., not-particularly-cacheable) queries
3. never intend to follow up with a request that calls for full results

If both of the first two conditions hold, and especially if the third
also holds, there would in principle definitely be efficiency to be
gained by early termination (and avoiding the creation of a DocSet,
which at the moment happens unconditionally for every facet query).
I'm also thinking about this through the lens of bringing the JSON
Facet API to parity with the legacy facet API, fwiw ...
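
To make the potential gain from early termination concrete, here is a toy sketch in plain Python. It is not Solr code and says nothing about how Solr evaluates queries internally; the document list and the predicate are invented for illustration. It only contrasts counting every match with stopping at the first one.

```python
def make_counting_predicate(predicate):
    """Wrap a predicate so we can see how many documents it was evaluated on."""
    calls = {"n": 0}

    def wrapped(doc):
        calls["n"] += 1
        return predicate(doc)

    return wrapped, calls


# Toy "index": every third document has categoryId 6000.
docs = [{"categoryId": 6000 if i % 3 == 0 else 21450} for i in range(1000)]


def is_poi(doc):
    return doc["categoryId"] == 6000


# Counting visits every document.
counted, count_calls = make_counting_predicate(is_poi)
total = sum(1 for d in docs if counted(d))

# An existence check can stop at the first match.
checked, exists_calls = make_counting_predicate(is_poi)
found = any(checked(d) for d in docs)

print(total, count_calls["n"])   # number of matches, documents examined
print(found, exists_calls["n"])  # match found?, documents examined
```

How much a real implementation would save depends on the query and on where the first match sits in the index.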

On Fri, Oct 30, 2020 at 9:02 AM Erick Erickson <er...@gmail.com> wrote:
>
> I don’t think there’s anything to do what you’re asking OOB.
>
> If all of those facet queries are _known_ to be a performance hit,
> you might be able to do something custom.That would require
> custom code though and I wouldn’t go there unless you can
> demonstrate need.
>
> If you issue a debug=timing you’ll see the time each component
> takes,  and there’s a separate entry for faceting so that’ll give you
> a clue whether it’s worth the effort.
>
> Best,
> Erick
>
> > On Oct 30, 2020, at 8:10 AM, Michael Gibney <mi...@michaelgibney.net> wrote:
> >
> > Michael, sorry for the confusion; I was positing a *hypothetical*
> > "exists()" function that doesn't currently exist, that *is* an
> > aggregate function, and the *does* stop early. I didn't account for
> > the fact that there's already an "exists()" function *query* that
> > behaves very differently. So yes, definitely confusing :-). I guess
> > choosing a different name for the proposed aggregate function would
> > make sense. I was suggesting it mostly as an alternative to extending
> > the syntax of JSON Facet "query" facet type, and to say that I think
> > the implementation of such an aggregate function would be pretty
> > straightforward.
> >
> > On Fri, Oct 30, 2020 at 3:44 AM michael dürr <du...@gmail.com> wrote:
> >>
> >> @Erick
> >>
> >> Sorry! I chose a simple example as I wanted to reduce complexity.
> >> In detail:
> >> * We have distinct contents like tours, offers, events, etc which
> >> themselves may be categorized: A tour may be a hiking tour, a
> >> mountaineering tour, ...
> >> * We have hundreds of customers that want to facet their searches to that
> >> content types but often with distinct combinations of categories, i.e.
> >> customer A wants his facet "tours" to only count hiking tours, customer B
> >> only mountaineering tours, customer C a combination of both, etc
> >> * We use "query" facets as each facet request will be build dynamically (it
> >> is not feasible to aggregate certain categories and add them as an
> >> additional solr schema field as we have hundreds of different combinations).
> >> * Anyways, our ui only requires adding a toggle to filter for (for example)
> >> "tours" in case a facet result is present. We do not care about the number
> >> of tours.
> >> * As we have millions of contents and dozens of content types (and dozens
> >> of categories per content type) such queries may take a very long time.
> >>
> > >> A complex example may look like this:
> > >>
> > >> q=*:*&json.facet={
> > >>   tour:{ type : query, q: \"+categoryId:(21450 21453)\" },
> > >>   guide:{ type : query, q: \"+categoryId:(21105 21401 21301 21302 21303 21304 21305 21403 21404)\" },
> > >>   story:{ type : query, q: \"+categoryId:21515\" },
> > >>   condition:{ type : query, q: \"+categoryId:21514\" },
> > >>   hut:{ type : query, q: \"+categoryId:8510\" },
> > >>   skiresort:{ type : query, q: \"+categoryId:21493\" },
> > >>   offer:{ type : query, q: \"+categoryId:21462\" },
> > >>   lodging:{ type : query, q: \"+categoryId:6061\" },
> > >>   event:{ type : query, q: \"+categoryId:21465\" },
> > >>   poi:{ type : query, q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\" },
> > >>   authors:{ type : query, q: \"+categoryId:(21205 21206)\" },
> > >>   partners:{ type : query, q: \"+categoryId:21200\" },
> > >>   list:{ type : query, q: \"+categoryId:21481\" }
> > >> }\
> > >> &rows=0"
> >>
> >> @Michael
> >>
> >> Thanks for your suggestion but this does not work as
> >> * the facet module expects an aggregate function (which i simply added by
> >> embracing your call with sum(...))
> >> * and (please correct me if I am wrong) the exists() function not stops on
> >> the first match, but counts the number of results for which the query
> >> matches a document.
>

Re: Simulate facet.exists for json query facets

Posted by Erick Erickson <er...@gmail.com>.

I don’t think there’s anything to do what you’re asking OOB.

If all of those facet queries are _known_ to be a performance hit,
you might be able to do something custom. That would require
custom code though, and I wouldn’t go there unless you can
demonstrate need.

If you issue a request with debug=timing you’ll see the time each component
takes, and there’s a separate entry for faceting, so that’ll give you
a clue whether it’s worth the effort.
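
As a rough sketch of reading that timing output from a client: the snippet below pulls the faceting entries out of the "process" phase of a debug=timing response. The sample response is hand-written, and the component name "facet_module" (for JSON facets) and the nesting under debug/timing/process are assumptions about the response layout, so check them against what your Solr version actually returns.

```python
def facet_timings(response):
    """Return {component_name: milliseconds} for faceting components in the 'process' phase."""
    process = response.get("debug", {}).get("timing", {}).get("process", {})
    return {
        name: entry["time"]
        for name, entry in process.items()
        # Skip the phase total ("time") and keep only components whose name starts with "facet".
        if isinstance(entry, dict) and name.startswith("facet")
    }


# Hand-written sample, not real Solr output.
sample_response = {
    "debug": {
        "timing": {
            "time": 130.0,
            "prepare": {"time": 1.0, "query": {"time": 1.0}},
            "process": {
                "time": 129.0,
                "query": {"time": 9.0},
                "facet_module": {"time": 120.0},
            },
        }
    }
}

print(facet_timings(sample_response))
```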

Best,
Erick

> On Oct 30, 2020, at 8:10 AM, Michael Gibney <mi...@michaelgibney.net> wrote:
> 
> Michael, sorry for the confusion; I was positing a *hypothetical*
> "exists()" function that doesn't currently exist, that *is* an
> aggregate function, and the *does* stop early. I didn't account for
> the fact that there's already an "exists()" function *query* that
> behaves very differently. So yes, definitely confusing :-). I guess
> choosing a different name for the proposed aggregate function would
> make sense. I was suggesting it mostly as an alternative to extending
> the syntax of JSON Facet "query" facet type, and to say that I think
> the implementation of such an aggregate function would be pretty
> straightforward.
> 
> On Fri, Oct 30, 2020 at 3:44 AM michael dürr <du...@gmail.com> wrote:
>> 
>> @Erick
>> 
>> Sorry! I chose a simple example as I wanted to reduce complexity.
>> In detail:
>> * We have distinct contents like tours, offers, events, etc which
>> themselves may be categorized: A tour may be a hiking tour, a
>> mountaineering tour, ...
>> * We have hundreds of customers that want to facet their searches to that
>> content types but often with distinct combinations of categories, i.e.
>> customer A wants his facet "tours" to only count hiking tours, customer B
>> only mountaineering tours, customer C a combination of both, etc
>> * We use "query" facets as each facet request will be build dynamically (it
>> is not feasible to aggregate certain categories and add them as an
>> additional solr schema field as we have hundreds of different combinations).
>> * Anyways, our ui only requires adding a toggle to filter for (for example)
>> "tours" in case a facet result is present. We do not care about the number
>> of tours.
>> * As we have millions of contents and dozens of content types (and dozens
>> of categories per content type) such queries may take a very long time.
>> 
>> A complex example may look like this:
>>
>> q=*:*&json.facet={
>>   tour:{ type : query, q: \"+categoryId:(21450 21453)\" },
>>   guide:{ type : query, q: \"+categoryId:(21105 21401 21301 21302 21303 21304 21305 21403 21404)\" },
>>   story:{ type : query, q: \"+categoryId:21515\" },
>>   condition:{ type : query, q: \"+categoryId:21514\" },
>>   hut:{ type : query, q: \"+categoryId:8510\" },
>>   skiresort:{ type : query, q: \"+categoryId:21493\" },
>>   offer:{ type : query, q: \"+categoryId:21462\" },
>>   lodging:{ type : query, q: \"+categoryId:6061\" },
>>   event:{ type : query, q: \"+categoryId:21465\" },
>>   poi:{ type : query, q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\" },
>>   authors:{ type : query, q: \"+categoryId:(21205 21206)\" },
>>   partners:{ type : query, q: \"+categoryId:21200\" },
>>   list:{ type : query, q: \"+categoryId:21481\" }
>> }\
>> &rows=0"
>> 
>> @Michael
>> 
>> Thanks for your suggestion but this does not work as
>> * the facet module expects an aggregate function (which i simply added by
>> embracing your call with sum(...))
>> * and (please correct me if I am wrong) the exists() function not stops on
>> the first match, but counts the number of results for which the query
>> matches a document.

Re: Simulate facet.exists for json query facets

Posted by Michael Gibney <mi...@michaelgibney.net>.

Michael, sorry for the confusion; I was positing a *hypothetical*
"exists()" function that doesn't currently exist, that *is* an
aggregate function, and that *does* stop early. I didn't account for
the fact that there's already an "exists()" function *query* that
behaves very differently. So yes, definitely confusing :-). I guess
choosing a different name for the proposed aggregate function would
make sense. I was suggesting it mostly as an alternative to extending
the syntax of JSON Facet "query" facet type, and to say that I think
the implementation of such an aggregate function would be pretty
straightforward.

On Fri, Oct 30, 2020 at 3:44 AM michael dürr <du...@gmail.com> wrote:
>
> @Erick
>
> Sorry! I chose a simple example as I wanted to reduce complexity.
> In detail:
> * We have distinct contents like tours, offers, events, etc which
> themselves may be categorized: A tour may be a hiking tour, a
> mountaineering tour, ...
> * We have hundreds of customers that want to facet their searches to that
> content types but often with distinct combinations of categories, i.e.
> customer A wants his facet "tours" to only count hiking tours, customer B
> only mountaineering tours, customer C a combination of both, etc
> * We use "query" facets as each facet request will be build dynamically (it
> is not feasible to aggregate certain categories and add them as an
> additional solr schema field as we have hundreds of different combinations).
> * Anyways, our ui only requires adding a toggle to filter for (for example)
> "tours" in case a facet result is present. We do not care about the number
> of tours.
> * As we have millions of contents and dozens of content types (and dozens
> of categories per content type) such queries may take a very long time.
>
> A complex example may look like this:
>
> q=*:*&json.facet={
>   tour:{ type : query, q: \"+categoryId:(21450 21453)\" },
>   guide:{ type : query, q: \"+categoryId:(21105 21401 21301 21302 21303 21304 21305 21403 21404)\" },
>   story:{ type : query, q: \"+categoryId:21515\" },
>   condition:{ type : query, q: \"+categoryId:21514\" },
>   hut:{ type : query, q: \"+categoryId:8510\" },
>   skiresort:{ type : query, q: \"+categoryId:21493\" },
>   offer:{ type : query, q: \"+categoryId:21462\" },
>   lodging:{ type : query, q: \"+categoryId:6061\" },
>   event:{ type : query, q: \"+categoryId:21465\" },
>   poi:{ type : query, q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\" },
>   authors:{ type : query, q: \"+categoryId:(21205 21206)\" },
>   partners:{ type : query, q: \"+categoryId:21200\" },
>   list:{ type : query, q: \"+categoryId:21481\" }
> }\
> &rows=0"
>
> @Michael
>
> Thanks for your suggestion but this does not work as
> * the facet module expects an aggregate function (which i simply added by
> embracing your call with sum(...))
> * and (please correct me if I am wrong) the exists() function not stops on
> the first match, but counts the number of results for which the query
> matches a document.

Re: Simulate facet.exists for json query facets

Posted by michael dürr <du...@gmail.com>.

@Erick

Sorry! I chose a simple example as I wanted to reduce complexity.
In detail:
* We have distinct contents like tours, offers, events, etc which
themselves may be categorized: A tour may be a hiking tour, a
mountaineering tour, ...
* We have hundreds of customers that want to facet their searches by those
content types, but often with distinct combinations of categories, i.e.
customer A wants his facet "tours" to only count hiking tours, customer B
only mountaineering tours, customer C a combination of both, etc.
* We use "query" facets as each facet request will be built dynamically (it
is not feasible to aggregate certain categories and add them as an
additional Solr schema field as we have hundreds of different combinations).
* Anyways, our ui only requires adding a toggle to filter for (for example)
"tours" in case a facet result is present. We do not care about the number
of tours.
* As we have millions of contents and dozens of content types (and dozens
of categories per content type) such queries may take a very long time.
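
To illustrate the dynamic build described in the list above, here is a minimal sketch that turns a per-customer mapping of content types to category IDs into "query" facets. The function name, the shape of the mapping, and the customer data are hypothetical; the request body uses the "query", "limit" and "facet" keys of Solr's JSON Request API instead of the form parameters shown in the curl examples.

```python
import json


def build_query_facets(facet_categories):
    """Return a json.facet dict with one 'query' facet per content type.

    facet_categories maps a facet name to (included_ids, excluded_ids).
    """
    facets = {}
    for name, (include, exclude) in facet_categories.items():
        q = "+categoryId:(" + " ".join(str(c) for c in include) + ")"
        if exclude:
            q += " -categoryId:(" + " ".join(str(c) for c in exclude) + ")"
        facets[name] = {"type": "query", "q": q}
    return facets


# Hypothetical configuration for one customer.
customer_a = {
    "tour": ([21450, 21453], []),
    "poi": ([6000], [6061, 21493, 8510]),
}

request_body = {
    "query": "*:*",
    "limit": 0,  # like rows=0: only the facets are of interest
    "facet": build_query_facets(customer_a),
}
print(json.dumps(request_body, indent=2))
```

The generated "poi" query is written as "+categoryId:(6000) -categoryId:(...)", which is intended to match the same documents as the nested form in the examples.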

A complex example may look like this:

q=*:*&json.facet={
  tour:{ type : query, q: \"+categoryId:(21450 21453)\" },
  guide:{ type : query, q: \"+categoryId:(21105 21401 21301 21302 21303 21304 21305 21403 21404)\" },
  story:{ type : query, q: \"+categoryId:21515\" },
  condition:{ type : query, q: \"+categoryId:21514\" },
  hut:{ type : query, q: \"+categoryId:8510\" },
  skiresort:{ type : query, q: \"+categoryId:21493\" },
  offer:{ type : query, q: \"+categoryId:21462\" },
  lodging:{ type : query, q: \"+categoryId:6061\" },
  event:{ type : query, q: \"+categoryId:21465\" },
  poi:{ type : query, q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\" },
  authors:{ type : query, q: \"+categoryId:(21205 21206)\" },
  partners:{ type : query, q: \"+categoryId:21200\" },
  list:{ type : query, q: \"+categoryId:21481\" }
}\
&rows=0"

@Michael

Thanks for your suggestion, but this does not work as
* the facet module expects an aggregate function (which I simply added by
wrapping your call in sum(...))
* and (please correct me if I am wrong) the exists() function does not stop on
the first match, but counts the number of results for which the query
matches a document.

Re: Simulate facet.exists for json query facets

Posted by Michael Gibney <mi...@michaelgibney.net>.

Separately, and in parallel to Erick's question: indeed I'm not aware
of any way to do this currently, but I *can* imagine cases where this
would be useful. I have a sense this could be cleanly implemented as a
stat facet function
(https://lucene.apache.org/solr/guide/8_6/json-facet-api.html#stat-facet-functions),
e.g.:

curl http://localhost:8983/solr/portal/select -d \
"q=*:*\
&json.facet={
  tour: \"exists(+categoryId:6000 -categoryId:(6061 21493 8510))\"
}\
&rows=0"

The return value of the `exists` function could be boolean, which
would be semantically clearer than capping count to 1, as I gather
`facet.exists` does. For the same reason, implementing this as a
function would probably be better than adding this functionality to
the `query` facet type, which carries certain useful assumptions (the
meaning of the "count" attribute in the response, the ability to nest
stats and subfacets, etc.) ... just thinking out loud at the moment
...

On Wed, Oct 28, 2020 at 9:17 AM Erick Erickson <er...@gmail.com> wrote:
>
> This really sounds like an XY problem. The whole point of facets is
> to count the number of documents that have a value in some
> number of buckets. So trying to stop your facet query as soon
> as it matches a hit for the first time seems like an odd thing to do.
>
> So what’s the “X”? In other words, what is the problem you’re trying
> to solve at a high level? Perhaps there’s a better way to figure this
> out.
>
> Best,
> Erick
>
> > On Oct 28, 2020, at 3:48 AM, michael dürr <du...@gmail.com> wrote:
> >
> > Hi,
> >
> > I use json facets of type 'query'. As these queries are pretty slow and I'm
> > only interested in whether there is a match or not, I'd like to restrict
> > the query execution similar to the standard facetting (like with the
> > facet.exists parameter). My simplified query looks something like this (in
> > reality *:* may be replaced by a complex edismax query and multiple
> > subfacets similar to "tour" occur):
> >
> > curl http://localhost:8983/solr/portal/select -d \
> > "q=*:*\
> > &json.facet={
> >  tour:{
> >    type : query,
> >     q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
> >  }
> > }\
> > &rows=0"
> >
> > Is there any possibility to modify my request to ensure that the facet
> > query stops as soon as it matches a hit for the first time?
> >
> > Thanks!
> > Michael
>

Re: Simulate facet.exists for json query facets

Posted by Erick Erickson <er...@gmail.com>.

This really sounds like an XY problem. The whole point of facets is
to count the number of documents that have a value in some
number of buckets. So trying to stop your facet query as soon
as it matches a hit for the first time seems like an odd thing to do.

So what’s the “X”? In other words, what is the problem you’re trying
to solve at a high level? Perhaps there’s a better way to figure this
out.

Best,
Erick

> On Oct 28, 2020, at 3:48 AM, michael dürr <du...@gmail.com> wrote:
> 
> Hi,
> 
> I use json facets of type 'query'. As these queries are pretty slow and I'm
> only interested in whether there is a match or not, I'd like to restrict
> the query execution similar to the standard facetting (like with the
> facet.exists parameter). My simplified query looks something like this (in
> reality *:* may be replaced by a complex edismax query and multiple
> subfacets similar to "tour" occur):
> 
> curl http://localhost:8983/solr/portal/select -d \
> "q=*:*\
> &json.facet={
>  tour:{
>    type : query,
>     q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
>  }
> }\
> &rows=0"
> 
> Is there any possibility to modify my request to ensure that the facet
> query stops as soon as it matches a hit for the first time?
> 
> Thanks!
> Michael