Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2020/03/05 00:02:00 UTC
[jira] [Updated] (SOLR-13807) Caching for term facet counts

     [ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-13807:
--------------------------------------
    Attachment: SOLR-13807__SOLR-13132_test_stub.patch
        Status: Open  (was: Open)

Per some conversation in SOLR-13132, Michael is planning on reviving this issue and creating a distinct PR just for the "facet cache" logic.

I'm attaching a "test_stub" patch that builds off of the existing commits in SOLR-13132's PR#751 to do some basic checking of the term cache and checking the resulting stats.
 (it currently has a lot of nocommits that need to be cleaned up – the key one at the moment being that caching does not work correctly with FacetFieldProcessorByArrayUIF)
----
[~mgibney] the other thing i noticed while poking around a bit more with the _structure_ of the entries in the facet cache (in order to understand what cache metrics to expect for a given request) is that the TermFacetCacheRegenerator implementation seems flawed.

If i understand correctly how the cache is designed to work, and how the regenerator _tries_ to work, each "field facet" has a "top level" entry that points to several other "segment level" entries, each of which is an encoded set of all the term counts.

(i have some concerns/hesitations/questions about what that means if/when 'countCacheDf' differs between 2 otherwise identical facet requests: IIUC existing cache values can be mutated by (concurrent!) requests that get a "cache hit". i'll table those for now and focus on the regenerator.)

TermFacetCacheRegenerator is doing a direct copy of any 'old' cache values into the "new" cache for any "segment level" keys that correspond to segments still in use by the top level reader – but unless i'm missing something, that totally ignores the possibility of:
 * deleted documents in an existing segment
 * in-place DV updates (that could either be in fields being faceted on, or fields being queried on, causing completely different sets of documents to be involved in the facet)

am i misunderstanding something about the regenerator? how is this safe?
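
A minimal sketch of the deleted-documents concern, using a hypothetical in-memory model rather than Solr's or Lucene's actual classes (the class name, {{countTerms}}, and the {{BitSet}} standing in for a segment's live docs are all assumptions made for illustration). It only shows that per-segment counts computed before a delete no longer match a recount after it, even though the segment itself is "still in use":

```java
import java.util.Arrays;
import java.util.BitSet;

public class StaleSegmentCountsSketch {

    /**
     * Counts term ordinals over the live docs of one hypothetical segment.
     * docTermOrds[doc] is the single term ordinal of that doc (assumption:
     * a single-valued field, to keep the example short).
     */
    static int[] countTerms(int[] docTermOrds, BitSet liveDocs, int numTerms) {
        int[] counts = new int[numTerms];
        for (int doc = 0; doc < docTermOrds.length; doc++) {
            if (liveDocs.get(doc)) {
                counts[docTermOrds[doc]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] docTermOrds = {0, 1, 1, 2};   // four docs, three distinct terms
        BitSet liveDocs = new BitSet(4);
        liveDocs.set(0, 4);                 // all docs live at first

        // "old" searcher: counts computed and cached for this segment
        int[] cached = countTerms(docTermOrds, liveDocs, 3);

        // a delete lands in the same segment; the segment is still in use
        liveDocs.clear(1);

        // a regenerator that copies by segment key would carry this over as-is
        int[] carriedOver = cached.clone();
        int[] recomputed = countTerms(docTermOrds, liveDocs, 3);

        System.out.println("carried over: " + Arrays.toString(carriedOver));
        System.out.println("recomputed:   " + Arrays.toString(recomputed));
    }
}
```

The same divergence would be expected for an in-place DV update that changes which docs match the domain, since the copied array reflects the old set of documents.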

> Caching for term facet counts
> -----------------------------
>
>                 Key: SOLR-13807
>                 URL: https://issues.apache.org/jira/browse/SOLR-13807
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>    Affects Versions: 8.2, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: SOLR-13807__SOLR-13132_test_stub.patch
>
>
> Solr does not have a facet count cache; so for _every_ request, term facets are recalculated for _every_ (facet) field, by iterating over _every_ field value for _every_ doc in the result domain, and incrementing the associated count.
> As a result, subsequent requests end up redoing a lot of the same work, including all associated object allocation, GC, etc. This situation could benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet calculation, latency is proportional to the size of the result domain. Consequently, one common/clear manifestation of this issue is high latency for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be observed on a top-level landing page that exposes facets. This type of "static" case is often mitigated by external (to Solr) caching, either with a caching layer between Solr and a front-end application, or within a front-end application, or even with a caching layer between the end user and a front-end application.
> But in addition to the overhead of handling this caching elsewhere in the stack (or, for a new user, even being aware of this as a potential issue to mitigate), any external caching mitigation is really only appropriate for relatively static cases like the "landing page" example described above. A Solr-internal facet count cache (analogous to the {{filterCache}}) would provide the following additional benefits:
>  # ease of use/out-of-the-box configuration to address a common performance concern
>  # compact (specifically caching count arrays, without the extra baggage that accompanies a naive external caching approach)
>  # NRT-friendly (could be implemented to be segment-aware)
>  # modular, capable of reusing the same cached values in conjunction with variant requests over the same result domain (this would support common use cases like paging, but also potentially more interesting direct uses of facets). 
>  # could be used for distributed refinement (i.e., if facet counts over a given domain are cached, a refinement request could simply look up the ordinal value for each enumerated term and directly grab the count out of the count array that was cached during the first phase of facet calculation)
>  # composable (e.g., in aggregate functions that calculate values based on facet counts across different domains, like SKG/relatedness – see SOLR-13132)
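
As a rough illustration of the counting loop and of the ordinal lookup mentioned for refinement in the description above, here is a small self-contained sketch. It is not Solr code: the class, the method names, and the array-based model of a field (a sorted term dictionary plus per-doc term ordinals) are assumptions chosen to keep the example short.

```java
import java.util.Arrays;

public class FacetCountCacheSketch {

    /**
     * Builds a count array indexed by term ordinal by visiting every value
     * of every doc in the domain, as described for uncached term faceting.
     */
    static int[] countByOrdinal(int[][] docTermOrds, int[] domain, int numTerms) {
        int[] counts = new int[numTerms];
        for (int doc : domain) {
            for (int ord : docTermOrds[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /**
     * Refinement-style lookup: resolve the term to its ordinal and read the
     * count straight out of a previously computed (cached) count array.
     */
    static int lookupCount(String[] sortedTerms, int[] cachedCounts, String term) {
        int ord = Arrays.binarySearch(sortedTerms, term);
        return ord >= 0 ? cachedCounts[ord] : 0;
    }

    public static void main(String[] args) {
        String[] sortedTerms = {"blue", "green", "red"};
        int[][] docTermOrds = {{0, 2}, {2}, {1}};   // doc 0: blue+red, doc 1: red, doc 2: green
        int[] domain = {0, 1};                      // docs matching the query

        int[] cachedCounts = countByOrdinal(docTermOrds, domain, sortedTerms.length);
        System.out.println(Arrays.toString(cachedCounts));
        System.out.println(lookupCount(sortedTerms, cachedCounts, "red"));
    }
}
```

In this model the cost of {{countByOrdinal}} grows with the size of the domain, while {{lookupCount}} against a cached array costs only a dictionary lookup, which is the asymmetry the proposed cache is meant to exploit.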



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
