Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2020/08/11 21:00:00 UTC

[jira] [Commented] (SOLR-13807) Caching for term facet counts

    [ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175836#comment-17175836 ] 

Michael Gibney commented on SOLR-13807:
---------------------------------------

After SOLR-13132 was merged to master, it was a bit of a challenge to reconcile it with the complementary "term facet cache" (this issue). I've taken an initial stab at this and pushed to [PR #1357|https://github.com/apache/lucene-solr/pull/1357], and I think it's once again ready for consideration.

Below are some naive performance benchmarks, using [^SOLR-13807-benchmarks.tgz] (based on similar benchmarks for SOLR-13132).

{{filterCache}} is irrelevant for what's illustrated here (all count or sweep collection; single-shard, thus no refinement). The attached scripts include hooks to easily change the filterCache size and termFacetCache size for evaluation. For the purposes of {{relatedness}} evaluation, fgSet == base search result domain. All results discussed here are for single-valued string fields; multi-valued string fields are also included in the benchmark attachment, and their results didn't differ substantially from those for single-valued fields.

There's a row for each docset domain recall percentage (percentage of \*:* domain returned by main query/fg), and a column for each field cardinality. Cell values indicate latency (QTime) in ms against a single core with 3 million docs, no deletes. Each value is the average of 10 repeated invocations of the relevant request (standard deviation isn't captured here, but was quite low, fwiw).
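
For readers who want to reproduce a single cell, the measurement described above could look roughly like the sketch below. This is not the attached check.sh script: {{mean_qtime}} and the {{run_query}} callable are hypothetical names, and the fake query only stands in for a real request so the sketch runs without a Solr instance. The one thing assumed about Solr is that the JSON response reports {{responseHeader.QTime}} in ms.

```python
from statistics import mean
from typing import Any, Callable, Dict


def mean_qtime(run_query: Callable[[], Dict[str, Any]], repeats: int = 10) -> float:
    """Average the QTime (ms) reported over `repeats` identical requests."""
    qtimes = []
    for _ in range(repeats):
        # Hypothetical callable: issues the request and returns the parsed JSON body.
        response = run_query()
        qtimes.append(response["responseHeader"]["QTime"])
    return mean(qtimes)


# Stand-in for a real request (made-up QTime values, not benchmark data).
_fake_qtimes = iter([30, 28, 31, 29, 30, 30, 32, 29, 31, 30])


def fake_query() -> Dict[str, Any]:
    return {"responseHeader": {"status": 0, "QTime": next(_fake_qtimes)}}


avg = mean_qtime(fake_query)
print(avg)
```

Since QTime is reported in whole milliseconds, sub-millisecond requests show up as 0 in the tables.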

The results below are for current master (including SOLR-13132), with no caches (filterCache, if present, would be unused):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, master
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       4
1%      1       0       1       1       2       5
10%     7       7       8       8       10      16
20%     17      14      16      15      19      31
30%     22      19      23      20      24      42
40%     27      26      28      28      32      50
50%     33      32      35      32      38      59
99.99%  65      60      67      62      72      107

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, master
cdnlty: 10      100     1k      10k     100k    1m
.1%     179     174     183     190     192     225
1%      182     177     186     183     194     236
10%     193     191     196     197     226     256
20%     206     200     207     207     234     300
30%     216     210     217     216     239     316
40%     228     225     231     231     253     331
50%     239     234     241     240     266     347
99.99%  285     280     287     287     311     403
{code}

The results below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with _no_ termFacetCache configured (an apples-to-apples comparison, since there are changes in some of the hot facet code paths):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, no_cache
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       3
1%      1       1       1       1       1       6
10%     8       8       9       8       11      14
20%     16      15      16      15      20      32
30%     21      21      23      22      26      42
40%     28      27      31      28      34      53
50%     35      33      37      34      40      63
99.99%  68      64      71      66      74      108

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, no_cache
cdnlty: 10      100     1k      10k     100k    1m
.1%     96      80      89      97      96      129
1%      88      83      90      88      101     133
10%     99      97      103     102     122     162
20%     117     107     113     113     135     194
30%     120     117     123     122     144     211
40%     130     129     134     134     156     232
50%     143     140     147     144     169     249
99.99%  179     175     181     179     201     305
{code}

The results below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with {{solr.termFacetCacheSize=20}} configured:
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, cache size 20
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       2
1%      0       0       0       0       1       10
10%     3       4       4       4       5       16
20%     8       7       8       7       9       20
30%     11      10      12      11      13      25
40%     13      13      15      15      15      28
50%     15      16      16      18      20      32
99.99%  29      30      30      29      32      45

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, cache size 20
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       1       6
1%      0       0       0       1       4       14
10%     3       4       4       5       11      33
20%     9       8       8       8       16      41
30%     10      10      11      12      17      51
40%     13      13      13      14      20      61
50%     16      15      17      17      23      69
99.99%  30      28      30      30      37      101
{code}

The performance boost for sort-by-count has all the normal caveats of any type of caching, but could result in huge practical performance benefits for "main index page" and/or paging requests that use facets.

The performance boost for sort-by-skg, on the other hand, in many cases even transcends normal caching caveats (assuming sweep collection and a relatively static "background set"). With sweep collection, the common-case background set of \*:*, e.g., would be cached and used repeatedly even with a minimal termFacetCache (say, size=10), making for an uncharacteristically consistent cache boost (a good thing!).
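
A toy model may help show why the background entry tends to survive in even a small cache. The sketch below is not the PR's implementation: it assumes a plain LRU policy, a hypothetical {{(field, domain)}} cache key, and a placeholder count array. Each simulated request looks up counts for its own foreground domain and for the shared \*:* background.

```python
from collections import OrderedDict


class LruCache:
    """Tiny LRU cache that records hits and misses per key (illustrative only)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.hits = {}
        self.misses = {}

    def get_or_compute(self, key, compute):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits[key] = self.hits.get(key, 0) + 1
            return self.entries[key]
        self.misses[key] = self.misses.get(key, 0) + 1
        value = compute()
        self.entries[key] = value
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict least recently used
        return value


cache = LruCache(max_size=10)
BACKGROUND = ("category", "*:*")  # hypothetical key shape: (field, domain)

# 100 requests, each with a distinct foreground domain but the same background.
for i in range(100):
    foreground = ("category", f"q{i}")
    cache.get_or_compute(foreground, lambda: [0] * 4)  # placeholder count array
    cache.get_or_compute(BACKGROUND, lambda: [0] * 4)

print(cache.misses[BACKGROUND], cache.hits[BACKGROUND])
```

In this model the background key is touched on every request, so it is never the least recently used entry and is computed only once, while the foreground entries churn through the remaining slots.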

Note that performance of "sort-by-skg" with termFacetCache is comparable to the performance of simple sort-by-count pre-termFacetCache, and consistent across field and domain cardinalities.
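
To put rough numbers on this, the snippet below computes speedup ratios from a few cells copied from the tables above (the 99.99% row, cardinalities 10 and 1m). It is only a convenience calculation over those averages; the {{speedup}} helper is a name introduced here.

```python
# QTime (ms) at 99.99% domain recall, copied from the tables above.
# Keys are field cardinality; values are (master, no_cache, cache size 20).
skg = {"10": (285, 179, 30), "1m": (403, 305, 101)}
count = {"10": (65, 68, 29), "1m": (107, 108, 45)}


def speedup(before, after):
    """Ratio of two latencies, rounded to one decimal place."""
    return round(before / after, 1)


for name, table in (("sort-by-skg", skg), ("sort-by-count", count)):
    for cardinality, (master, no_cache, cached) in table.items():
        print(
            name,
            cardinality,
            "vs master:", speedup(master, cached),
            "vs no_cache:", speedup(no_cache, cached),
        )
```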

> Caching for term facet counts
> -----------------------------
>
>                 Key: SOLR-13807
>                 URL: https://issues.apache.org/jira/browse/SOLR-13807
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: SOLR-13807-benchmarks.tgz, SOLR-13807__SOLR-13132_test_stub.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr does not have a facet count cache; so for _every_ request, term facets are recalculated for _every_ (facet) field, by iterating over _every_ field value for _every_ doc in the result domain, and incrementing the associated count.
> As a result, subsequent requests end up redoing a lot of the same work, including all associated object allocation, GC, etc. This situation could benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet calculation, latency is proportional to the size of the result domain. Consequently, one common/clear manifestation of this issue is high latency for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be observed on a top-level landing page that exposes facets. This type of "static" case is often mitigated by external (to Solr) caching, either with a caching layer between Solr and a front-end application, or within a front-end application, or even with a caching layer between the end user and a front-end application.
> But in addition to the overhead of handling this caching elsewhere in the stack (or, for a new user, even being aware of this as a potential issue to mitigate), any external caching mitigation is really only appropriate for relatively static cases like the "landing page" example described above. A Solr-internal facet count cache (analogous to the {{filterCache}}) would provide the following additional benefits:
>  # ease of use/out-of-the-box configuration to address a common performance concern
>  # compact (specifically caching count arrays, without the extra baggage that accompanies a naive external caching approach)
>  # NRT-friendly (could be implemented to be segment-aware)
>  # modular, capable of reusing the same cached values in conjunction with variant requests over the same result domain (this would support common use cases like paging, but also potentially more interesting direct uses of facets). 
>  # could be used for distributed refinement (i.e., if facet counts over a given domain are cached, a refinement request could simply look up the ordinal value for each enumerated term and directly grab the count out of the count array that was cached during the first phase of facet calculation)
>  # composable (e.g., in aggregate functions that calculate values based on facet counts across different domains, like SKG/relatedness – see SOLR-13132)
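
As a reader's aid for the issue description quoted above, here is a toy model of the idea: count term facets by visiting every doc in the domain, cache the resulting count array per domain, and serve a refinement-style lookup by term ordinal from the cached array. This is not Solr code; the field data, the cache key shape, and all function names are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Toy single-valued string field: doc id -> term ordinal (index into TERMS).
TERMS = ["apple", "banana", "cherry"]
DOC_ORDS = [0, 1, 1, 2, 1, 0]


def count_terms(domain: Sequence[int]) -> List[int]:
    """Uncached path: visit every doc in the domain and increment its term's count."""
    counts = [0] * len(TERMS)
    for doc in domain:
        counts[DOC_ORDS[doc]] += 1
    return counts


# Hypothetical cache: (field, domain key) -> count array.
facet_cache: Dict[Tuple[str, str], List[int]] = {}


def cached_counts(field: str, domain_key: str, domain: Sequence[int]) -> List[int]:
    """Compute the count array once per (field, domain key) and reuse it afterwards."""
    key = (field, domain_key)
    if key not in facet_cache:
        facet_cache[key] = count_terms(domain)
    return facet_cache[key]


def refine(field: str, domain_key: str, domain: Sequence[int], terms: List[str]) -> Dict[str, int]:
    """Refinement-style lookup: map each term to its ordinal and read the cached count."""
    counts = cached_counts(field, domain_key, domain)
    return {term: counts[TERMS.index(term)] for term in terms}


all_docs = range(len(DOC_ORDS))
first_phase = cached_counts("fruit", "*:*", all_docs)
refined = refine("fruit", "*:*", all_docs, ["banana"])
print(first_phase, refined)
```

The second call does no per-document work: it reuses the array cached by the first call, which is the saving the issue description is after.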



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
