Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2021/01/11 23:55:03 UTC
[jira] [Updated] (SOLR-15079) Block Collapse (faster collapse code when groups are co-located via Block Join style nested doc indexing)

     [ https://issues.apache.org/jira/browse/SOLR-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-15079:
--------------------------------------
    Attachment: SOLR-15079.patch
        Status: Open  (was: Open)

The main difference between this new approach and the existing collapse approach is that the existing collapse PostFilter maintains a big in-memory data structure of every "group key" (values from the collapse field) it sees in the matching docs, and the "best" matching doc of each group (ie: the current "group head"), along with the selector values corresponding to each of those group head docs that are needed to determine if they are better/worse than any other candidate doc for that group that might come along (this might be the 'score' of each doc w/default collapsing, or some field values if one of the min/max/sort group head selectors are used). Once the PostFilter is done collecting all matching docs, it does another pass over these data structures to delegate collection of just the (final) best "group heads".
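
To make the memory cost concrete, here is a minimal Python sketch of that two-pass idea. It is an illustration only, not the actual CollapsingQParserPlugin code: the function name, the (doc_id, group_key, score) tuple layout, and the use of 'score' as the only selector are all assumptions made for the example.

```python
def collapse_with_map(docs):
    """Collapse (doc_id, group_key, score) tuples, keeping the best-scoring doc per group.

    Hypothetical sketch: every distinct group key stays in memory until the end.
    """
    heads = {}  # group_key -> (doc_id, score); grows with the number of distinct groups
    for doc_id, key, score in docs:
        best = heads.get(key)
        if best is None or score > best[1]:
            heads[key] = (doc_id, score)
    # Second pass: the group heads can only be emitted once every doc has been seen.
    return sorted(doc_id for doc_id, _ in heads.values())


# Groups "A" and "B" are interleaved, so no group is known to be finished early.
docs = [(0, "A", 1.0), (1, "B", 3.0), (2, "A", 2.5), (3, "C", 0.5), (4, "B", 1.0)]
print(collapse_with_map(docs))
```

The `heads` dict is the part that scales with the number of groups matched by the query.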

In the new logic, since we know our grouping field is unique per "block" of indexed documents, no large in-memory data structures are needed to track _all_ groups at once – we can simply record the single best doc / group head selector values for the _current_ group, and once we encounter a doc with a new value in the collapse field (ie: a new "group key"), we can immediately delegate collection of the "previous" group's best matching doc, and throw away its metadata.

This means the new impl uses a *LOT* less RAM than the old impl.
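
A rough Python sketch of the one-pass idea, under the assumption that docs arrive in index order and that all docs sharing a group key are contiguous. The function name and the (doc_id, group_key, score) tuple layout are hypothetical, and this is not the code in the patch:

```python
def collapse_blocks(docs):
    """Yield the best-scoring doc id of each contiguous group.

    Hypothetical sketch: only the current group's head is held in memory.
    """
    current_key = None
    best = None  # (doc_id, score) for the current group only
    for doc_id, key, score in docs:
        if best is not None and key != current_key:
            # New group key: the previous group is complete, so emit its head now.
            yield best[0]
            best = None
        if best is None or score > best[1]:
            current_key, best = key, (doc_id, score)
    if best is not None:
        yield best[0]  # flush the final group


# Each product's docs form one contiguous block.
block_docs = [(0, "p1", 1.0), (1, "p1", 2.0), (2, "p2", 0.7), (3, "p3", 0.2), (4, "p3", 0.9)]
print(list(collapse_blocks(block_docs)))
```

The state kept between docs is a single key and a single (doc_id, score) pair, no matter how many groups the query matches.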
----
I did some benchmarking using an index built from some ecommerce style data containing ~50,000 (Parent) Products, ~8.5 Million (Child) SKUs in collections that had 6 shards, 1 replica each, with each replica hosted on its own Solr node. Test clients issued randomized queries designed to match different permutations of docs, w/varying number of matches per group.
 * Long running query tests against the collection built using nested docs and using block collapse had (cumulative) query times ~45% to 65% lower than a "typical" collection
 ** the relative perf gains of the new impl were higher as the query load (ie: num concurrent clients) increased
 ** the relative perf gains were consistent regardless of how many docs matched the test query, how many unique groups those docs were in, or how many docs in those groups were matched by those queries
 ** there was some notable diff in relative perf based on the number of segments – but that was because the existing impl does significantly better when there are fewer segments (probably due to ordinal mapping?) while the new impl has largely consistent behavior regardless of the number of segments
 * A lot of the "overall gains" probably come from reduced GC/memory contention (which system monitoring demonstrated was notably reduced with the new impl), but even in micro load testing the new implementation is faster on individual requests – which makes sense because it only has to do a single pass over the matching documents (as opposed to the "one pass over matching docs + one pass over matching groups to sort the group head doc ids + one pass over the final docids")
 ** so the more unique groups matched by a query, the faster the new impl is (relatively speaking) compared to the existing impl

----
The attached patch includes this new logic/approach and uses it by default when the collapse field is {{_root_}}, but it also supports a new {{hint=block}} option users can specify if they want this logic for other fields when they know their groups are co-located. This is necessary if you have "deeply nested" documents and you want to group on something that isn't consistent for all descendants of the same {{_root_}} doc, but is consistent for all descendants of particular ancestor docs.

Example: each root (level-0) product doc may have multiple (level-1) SKU "child" docs, and each SKU doc may have its own (level-2) "variant" child docs (ie: grand child of 'product') that include a "sku_s" field which is guaranteed to be consistent in every "variant" doc (and guaranteed to be unique across all unique SKU level documents). You could use {{"hint=block field=sku_s"}} when searching against variant docs to collapse down to the "best" variant for each sku.
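
As a sketch of what such a request might look like from a client, the Python below builds the query parameters using the collapse parser's usual {{fq={!collapse ...}}} local-params form. The host, the "products" collection name, and the "type_s:variant" query are hypothetical placeholders; only {{field=sku_s}} and {{hint=block}} come from the example above.

```python
from urllib.parse import urlencode


def block_collapse_params(query, field):
    """Build select params that collapse on `field` with the block hint."""
    return {
        "q": query,
        # Collapse filter query; hint=block is the new option described above.
        "fq": "{!collapse field=%s hint=block}" % field,
    }


# Hypothetical query matching only the level-2 "variant" docs.
params = block_collapse_params("type_s:variant", "sku_s")
# Hypothetical host and collection name.
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
print(url)
```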

NOTE: This approach is only valid for {{nullPolicy=expand}} or {{nullPolicy=ignore}} (the default). It would not be possible to implement {{nullPolicy=collapse}} with this type of "one pass" approach.
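
One way to see the restriction: docs with no value in the collapse field can be scattered across many blocks, so presumably a single "null group" head could not be picked until every doc had been seen. The Python sketch below illustrates how the two supported policies might fit a one-pass loop. It is an assumption-laden illustration, not the patch's code: the function name and tuple layout are hypothetical, and treating a null-valued doc as a group boundary is a simplification chosen for the example.

```python
def collapse_blocks_with_nulls(docs, null_policy="ignore"):
    """One-pass collapse of (doc_id, group_key, score) tuples; key None means no value."""
    if null_policy not in ("ignore", "expand"):
        # A global "null group" head is unknown until the end, defeating the one-pass design.
        raise ValueError("one-pass block collapse cannot support nullPolicy=%s" % null_policy)
    current_key = None
    best = None  # (doc_id, score) for the current group only
    for doc_id, key, score in docs:
        if best is not None and key != current_key:
            yield best[0]  # previous group is complete
            best = None
        if key is None:
            if null_policy == "expand":
                yield doc_id  # every null-valued doc passes through uncollapsed
            continue  # "ignore": drop the null-valued doc
        if best is None or score > best[1]:
            current_key, best = key, (doc_id, score)
    if best is not None:
        yield best[0]


null_docs = [(0, "s1", 1.0), (1, "s1", 2.0), (2, None, 5.0), (3, "s2", 0.4), (4, None, 0.1)]
print(list(collapse_blocks_with_nulls(null_docs, "ignore")))
print(list(collapse_blocks_with_nulls(null_docs, "expand")))
```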

I feel like the current patch is really solid and ready to commit & backport to 8x, but I welcome any questions/concerns.

> Block Collapse (faster collapse code when groups are co-located via Block Join style nested doc indexing)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-15079
>                 URL: https://issues.apache.org/jira/browse/SOLR-15079
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-15079.patch
>
>
> A while back, JoelB had an idea for an optimized version of the logic in the CollapsingQParserPlugin to take advantage of collapsing on fields where the user knows that all docs with the same collapseKey are contiguous in the index - for example collapsing on the {{_root_}} field.
> Joel whipped up an initial PoC patch internally at Lucidworks that only dealt with some limited cases (string field collapsing w/o any nulls, using default group head selection) to explain the idea, but other priorities prevented him from doing thorough benchmarking or fleshing it out into "production ready" code.
> I took Joel's original PoC and fleshed it out with unit tests, fixed some bugs, and did some benchmarking against large indexes - the results look really good.
> I've since then beefed the code up more to include collapsing on numeric fields, and added support for all group head selector types, as well as adding support for {{nullPolicy=expand}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
