Posted to users@solr.apache.org by Jeremy Buckley - IQ-C <je...@gsa.gov.INVALID> on 2022/03/23 16:28:42 UTC

Representative filtering of very large result sets

We are using the collapse query parser for consolidating results based on a
field value, and are also faceting on a number of other fields.  The
collapse field and the facet fields all have docValues=true. For very large
(millions of documents) result sets, the heap usage gets a little out of
hand, and the resulting GC is problematic.  I am trying to figure out how
to reduce the number of documents that are being faceted over, and still
display facets that are "representative" of the entire result set.
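
Roughly, the request in question looks like this (collection and field names
here are simplified placeholders, not our actual schema):

  curl http://localhost:8983/solr/mycollection/select \
    --data-urlencode 'q=some user query' \
    --data-urlencode 'fq={!collapse field=group_id}' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=category' \
    --data-urlencode 'facet.field=source' \
    --data-urlencode 'rows=10'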

Some sort of filter query seems to be the obvious answer, but what? I don't
want to accidentally exclude my most relevant results.

How can I facet over only the top N results?

Thanks for any tips.

-- 
Jeremy Buckley

Re: Representative filtering of very large result sets

Posted by Jeremy Buckley - IQ-C <je...@gsa.gov.INVALID>.
Thanks, Michael. I think this will work, and it is the direction I am
heading.  We are collapsing for deduplication, sort of.

We do need to search over the full uncollapsed domain, but I am pretty sure
that nobody needs to see 40 million results, and if they're dumb enough to
enter a query that matches that many documents, they deserve whatever they
get.

So my strategy is:
1. Check the query to see if it looks "safe" based on some heuristics.
2. If (1) fails, do a search to get only the result count with rows=0 and no
faceting or sorting. This is usually pretty fast.
3. If the count returned in (2) is above a certain threshold, add my extra
filter query before executing the full faceted search (rough sketch below).
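
Here is roughly what steps 2 and 3 look like as requests. The collection name,
field names, and the example recency filter are placeholders to show the
shape, not our actual setup; whether the collapse fq belongs in the count
probe depends on which count the threshold is meant to apply to.

  # Step 2: count-only probe, rows=0, no faceting or sorting
  curl http://localhost:8983/solr/mycollection/select \
    --data-urlencode 'q=some user query' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=false'

  # Step 3: if numFound from step 2 exceeds the threshold, add a restricting
  # filter before running the full collapsed, faceted search
  curl http://localhost:8983/solr/mycollection/select \
    --data-urlencode 'q=some user query' \
    --data-urlencode 'fq=last_updated_dt:[NOW-1YEAR TO *]' \
    --data-urlencode 'fq={!collapse field=group_id}' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=category' \
    --data-urlencode 'rows=10'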

Thanks, everyone!

On Thu, Mar 24, 2022 at 10:04 AM Michael Gibney <mi...@michaelgibney.net>
wrote:

> Are you determining your "top doc" for each collapsed group based on score?
> If your use case is such that you determine the "top doc" based on a static
> field with a manageable number of values, you may have other options
> available to you. (For some use cases it can be acceptable to "pre-filter"
> the domain with creative fq params. This works iff your "collapse" could be
> considered a type of "deduplication" with doc priority determined by a
> static field; but it's a non-starter if you know you need to search over
> the full uncollapsed domain.)
>
> Michael
>

Re: Representative filtering of very large result sets

Posted by Michael Gibney <mi...@michaelgibney.net>.
Are you determining your "top doc" for each collapsed group based on score?
If your use case is such that you determine the "top doc" based on a static
field with a manageable number of values, you may have other options
available to you. (For some use cases it can be acceptable to "pre-filter"
the domain with creative fq params. This works iff your "collapse" could be
considered a type of "deduplication" with doc priority determined by a
static field; but it's a non-starter if you know you need to search over
the full uncollapsed domain.)
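
As a sketch of what I mean by a pre-filter (all field names below are
invented, and the index-time flag is an assumption about how the priority
could be made static):

  # Collapse picks each group's head by the max of a static numeric field:
  fq={!collapse field=group_id max=priority}

  # If that head can be flagged at index time, a plain filter query can
  # pre-restrict the domain to group heads before (or instead of) collapsing:
  fq=is_group_head:true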

Michael

On Thu, Mar 24, 2022 at 9:11 AM Joel Bernstein <jo...@gmail.com> wrote:

> Yeah, that's a tricky problem. Keeping the result set small without losing
> results. I don't have an answer except the one you already mentioned, which
> would be to limit the query in some way.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Mar 24, 2022 at 8:24 AM Jeremy Buckley - IQ-C
> <je...@gsa.gov.invalid> wrote:
>
> > Thanks, Joel, that is exactly what we are doing.  We have four shards and
> > are sharding on the collapse key.  Performance is fine (subsecond) as long
> > as the result set is relatively small.  I am really looking for the best
> > way to ensure that this is always true.
> >
> > On Wed, Mar 23, 2022 at 10:18 PM Joel Bernstein <jo...@gmail.com>
> > wrote:
> >
> > > To collapse on 30 million distinct values is going to cause memory problems
> > > for sure. If the heap is growing as the result set grows that means you are
> > > likely using a newer version of Solr which collapses into a hashmap. Older
> > > versions of Solr would collapse into an array 30 million in length which
> > > probably would have blown up memory with even small result sets.
> > >
> > > I think you're going to need to shard to get this to perform well. With
> > > SolrCloud you can shard on the collapse key (
> > > https://solr.apache.org/guide/8_7/shards-and-indexing-data-in-solrcloud.html#document-routing
> > > ).
> > > This will send all documents with the same collapse key to the same shard.
> > > Then run the collapse query on the sharded collection.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > >
> >
>

Re: Representative filtering of very large result sets

Posted by Joel Bernstein <jo...@gmail.com>.
Yeah, that's a tricky problem. Keeping the result set small without losing
results. I don't have an answer except the one you already mentioned, which
would be to limit the query in some way.


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Mar 24, 2022 at 8:24 AM Jeremy Buckley - IQ-C
<je...@gsa.gov.invalid> wrote:

> Thanks, Joel, that is exactly what we are doing.  We have four shards and
> are sharding on the collapse key.  Performance is fine (subsecond) as long
> as the result set is relatively small.  I am really looking for the best
> way to ensure that this is always true.
>
> On Wed, Mar 23, 2022 at 10:18 PM Joel Bernstein <jo...@gmail.com>
> wrote:
>
> > To collapse on 30 million distinct values is going to cause memory problems
> > for sure. If the heap is growing as the result set grows that means you are
> > likely using a newer version of Solr which collapses into a hashmap. Older
> > versions of Solr would collapse into an array 30 million in length which
> > probably would have blown up memory with even small result sets.
> >
> > I think you're going to need to shard to get this to perform well. With
> > SolrCloud you can shard on the collapse key (
> > https://solr.apache.org/guide/8_7/shards-and-indexing-data-in-solrcloud.html#document-routing
> > ).
> > This will send all documents with the same collapse key to the same shard.
> > Then run the collapse query on the sharded collection.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
>

Re: Representative filtering of very large result sets

Posted by Jeremy Buckley - IQ-C <je...@gsa.gov.INVALID>.
Thanks, Joel, that is exactly what we are doing.  We have four shards and
are sharding on the collapse key.  Performance is fine (subsecond) as long
as the result set is relatively small.  I am really looking for the best
way to ensure that this is always true.

On Wed, Mar 23, 2022 at 10:18 PM Joel Bernstein <jo...@gmail.com> wrote:

> To collapse on 30 million distinct values is going to cause memory problems
> for sure. If the heap is growing as the result set grows that means you are
> likely using a newer version of Solr which collapses into a hashmap. Older
> versions of Solr would collapse into an array 30 million in length which
> probably would have blown up memory with even small result sets.
>
> I think you're going to need to shard to get this to perform well. With
> SolrCloud you can shard on the collapse key (
> https://solr.apache.org/guide/8_7/shards-and-indexing-data-in-solrcloud.html#document-routing
> ).
> This will send all documents with the same collapse key to the same shard.
> Then run the collapse query on the sharded collection.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>

Re: Representative filtering of very large result sets

Posted by Joel Bernstein <jo...@gmail.com>.
To collapse on 30 million distinct values is going to cause memory problems
for sure. If the heap is growing as the result set grows that means you are
likely using a newer version of Solr which collapses into a hashmap. Older
versions of Solr would collapse into an array 30 million in length which
probably would have blown up memory with even small result sets.

I think you're going to need to shard to get this to perform well. With
SolrCloud you can shard on the collapse key (
https://solr.apache.org/guide/8_7/shards-and-indexing-data-in-solrcloud.html#document-routing).
This will send all documents with the same collapse key to the same shard.
Then run the collapse query on the sharded collection.
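
As a sketch of the routing piece (collection, field, and id values below are
made up for illustration): with the default compositeId router, you get
co-location by prefixing each document's id with its collapse key at index
time, e.g.:

  # Index time: every doc whose id starts with "GROUP123!" hashes to the same shard
  curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
    -H 'Content-Type: application/json' \
    -d '[{"id": "GROUP123!doc456", "group_id": "GROUP123"}]'

  # Query time: the collapse on group_id can then be satisfied shard-locally
  curl http://localhost:8983/solr/mycollection/select \
    --data-urlencode 'q=some user query' \
    --data-urlencode 'fq={!collapse field=group_id}'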

Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, Mar 23, 2022 at 9:42 PM Jeremy Buckley - IQ-C
<je...@gsa.gov.invalid> wrote:

> The number of documents in the collection is about 90 million. The
> collapse field has about 30 million distinct values, so I guess that
> qualifies as high cardinality.  We used to use result grouping but switched
> to collapse for improved performance.
>
> The faceting fields are more of a mix, 5-10 fields ranging from around a
> dozen to around 250,000 distinct values.
>
> On Wed, Mar 23, 2022 at 8:30 PM Joel Bernstein <jo...@gmail.com> wrote:
>
> > It sounds like you are collapsing on a high cardinality field and/or
> > faceting on high cardinality fields. Can you describe the cardinality of
> > the fields so we can get an idea of how large the problem is?
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
>

Re: Representative filtering of very large result sets

Posted by Jeremy Buckley - IQ-C <je...@gsa.gov.INVALID>.
The number of documents in the collection is about 90 million. The
collapse field has about 30 million distinct values, so I guess that
qualifies as high cardinality.  We used to use result grouping but switched
to collapse for improved performance.

The faceting fields are more of a mix, 5-10 fields ranging from around a
dozen to around 250,000 distinct values.

On Wed, Mar 23, 2022 at 8:30 PM Joel Bernstein <jo...@gmail.com> wrote:

> It sounds like you are collapsing on a high cardinality field and/or
> faceting on high cardinality fields. Can you describe the cardinality of
> the fields so we can get an idea of how large the problem is?
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>

Re: Representative filtering of very large result sets

Posted by Joel Bernstein <jo...@gmail.com>.
It sounds like you are collapsing on a high cardinality field and/or
faceting on high cardinality fields. Can you describe the cardinality of
the fields so we can get an idea of how large the problem is?



Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, Mar 23, 2022 at 12:30 PM Jeremy Buckley - IQ-C
<je...@gsa.gov.invalid> wrote:

> We are using the collapse query parser for consolidating results based on a
> field value, and are also faceting on a number of other fields.  The
> collapse field and the facet fields all have docValues=true. For very large
> (millions of documents) result sets, the heap usage gets a little out of
> hand, and the resulting GC is problematic.  I am trying to figure out how
> to reduce the number of documents that are being faceted over, and still
> display facets that are "representative" of the entire result set.
>
> Some sort of filter query seems to be the obvious answer, but what? I don't
> want to accidentally exclude my most relevant results.
>
> How can I facet over only the top N results?
>
> Thanks for any tips.
>
> --
> Jeremy Buckley
>