Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2020/11/19 15:25:00 UTC

[jira] [Commented] (SOLR-15008) Avoid building OrdinalMap for each facet

    [ https://issues.apache.org/jira/browse/SOLR-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235549#comment-17235549 ] 

Michael Gibney commented on SOLR-15008:
---------------------------------------

Interesting; I'm surprised that profiling indicated {{OrdinalMap}} building, since I'm pretty sure the {{OrdinalMap}} instances (as accessed via {{FacetFieldProcessorByArrayDV}}) are already cached in the way you're suggesting:
# in [FacetFieldProcessorByArrayDV.findStartAndEndOrds(...)|https://github.com/apache/lucene-solr/blob/40e2122b5a5b89f446e51692ef0d72e48c7b71e5/solr/core/src/java/org/apache/solr/search/facet/FacetFieldProcessorByArrayDV.java#L60]
# in [FieldUtil.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/1d85cd783863f75cea133fb9c452302214165a4d/solr/core/src/java/org/apache/solr/search/facet/FieldUtil.java#L55]
# in [SlowCompositeReaderWrapper.getSortedSetDocValues(...)|https://github.com/apache/lucene-solr/blob/c02f07f2d5db5c983c2eedf71febf9516189595d/solr/core/src/java/org/apache/solr/index/SlowCompositeReaderWrapper.java#L197-L211]
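
To illustrate the kind of caching described above, here is a minimal sketch of a per-field cache that builds an expensive structure only on first access. It is not the actual Solr code: {{PerFieldCacheSketch}}, {{builder}} and {{buildCount}} are hypothetical names, and the sketch assumes the cache lives exactly as long as the reader it belongs to.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for a per-reader cache of expensive per-field structures. */
final class PerFieldCacheSketch<V> {
    // One entry per field name; discarded together with the owning reader.
    private final Map<String, V> cache = new ConcurrentHashMap<>();
    // Assumed expensive builder (in the real case, something like building an OrdinalMap).
    private final Function<String, V> builder;
    // Counts how often the builder actually ran, for illustration only.
    private final AtomicInteger builds = new AtomicInteger();

    PerFieldCacheSketch(Function<String, V> builder) {
        this.builder = builder;
    }

    /** Returns the cached value for the field, building it only on first access. */
    V get(String field) {
        return cache.computeIfAbsent(field, f -> {
            builds.incrementAndGet();
            return builder.apply(f);
        });
    }

    int buildCount() {
        return builds.get();
    }
}
```

If the cache behaves like this, only the first facet request per field on a given searcher should pay the build cost; repeated builds would suggest the reader (and with it the cache) is being replaced between requests.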

Do you have more information about the total numbers involved (high-cardinality field -- specifically how high per core? how many documents overall per core? how many cores? does the latency manifest even across a single indexSearcher -- i.e., no intervening updates?). A couple of things that might be worth doing in the meantime, just as a sanity check:
# disable refinement for the facet field ({{"refine":"none"}}) -- among other things, this would take the {{filterCache}} out of the equation
# if possible, try optimizing each replica to a single segment, which should take {{OrdinalMap}} out of the equation (this is of course strictly diagnostic, not a "workaround" suggestion).

{quote}Allow faceting on actual values (a Map) rather than ordinals
{quote}
Interesting -- even if {{OrdinalMap}} is already getting cached (as I think it is?), this would be one way to avoid the overhead of allocating a {{CountSlotArrAcc}} backed by an int array of a size matching the field cardinality (this is why I asked more specifically about the cardinality of the field involved). I'm not sure how big a problem this is in practice, but I imagine a value-Map-based faceting implementation would probably perform better for this type of use case ... not 100% sure though, and not sure how _much_ better ... (I think {{FacetFieldProcessorByHashDV}} was designed to meet a similar use case, but it only works for single-valued fields).
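
The trade-off between array-backed and Map-backed counting can be sketched roughly as follows. This is an illustration under stated assumptions, not Solr's implementation: {{CountingSketch}}, {{countDense}} and {{countSparse}} are hypothetical names, and the per-document ordinals and values are passed in as plain arrays rather than read from docValues.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical comparison of dense (array) and sparse (map) facet counting. */
final class CountingSketch {

    /**
     * Dense counting: allocates one slot per unique value in the field,
     * regardless of how many documents match.
     */
    static int[] countDense(int cardinality, int[][] ordsPerMatchingDoc) {
        int[] counts = new int[cardinality];
        for (int[] ords : ordsPerMatchingDoc) {
            for (int ord : ords) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /**
     * Sparse counting: keyed by the actual value, so memory grows only with
     * the number of distinct values seen in the matching documents.
     */
    static Map<String, Integer> countSparse(String[][] valuesPerMatchingDoc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] values : valuesPerMatchingDoc) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a field cardinality in the millions and only a few hundred matching documents, the dense variant allocates and later scans millions of slots, while the sparse variant holds only a few hundred entries. Whether that difference accounts for the observed latency would still need to be measured.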

> Avoid building OrdinalMap for each facet
> ----------------------------------------
>
>                 Key: SOLR-15008
>                 URL: https://issues.apache.org/jira/browse/SOLR-15008
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Facet Module
>    Affects Versions: 8.7
>            Reporter: Radu Gheorghe
>            Priority: Major
>              Labels: performance
>         Attachments: Screenshot 2020-11-19 at 12.01.55.png
>
>
> I'm running against the following scenario:
>  * [JSON] faceting on a high cardinality field
>  * few matching documents => few unique values
> Yet the query almost always takes a long time. Here's an example taking almost 4s for ~300 documents and unique values (edited a bit):
>  
> {code:java}
>     "QTime":3869,
>     "params":{
>       "json":"{\"query\": \"*:*\",
>       \"filter\": [\"type:test_type\", \"date:[1603670360 TO 1604361599]\", \"unique_id:49866\"],
>       \"facet\": {\"keywords\":{\"type\":\"terms\",\"field\":\"keywords\",\"limit\":20,\"mincount\":20}}}",
>       "rows":"0"}},
>   "response":{"numFound":333,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
>     "count":333,
>     "keywords":{
>       "buckets":[{
>           "val":"value1",
>           "count":124},
>   ...
> {code}
> I did some [profiling with our Sematext Monitoring|https://sematext.com/docs/monitoring/on-demand-profiling/] and it points me to OrdinalMap building (see attached screenshot). If I read the code right, an OrdinalMap is built with every facet. And it's expensive since there are many unique values in the shard (previously, there were more, smaller shards, making latency better, but this approach doesn't scale for this particular use-case).
> If I'm right up to this point, I see a couple of potential improvements, [inspired by Elasticsearch|#search-aggregations-bucket-terms-aggregation-execution-hint]:
>  # *Keep the OrdinalMap cached until the next softCommit*, so that only the first query takes the penalty
>  # *Allow faceting on actual values (a Map) rather than ordinals*, for situations like the one above where we have few matching documents. We could potentially auto-detect this scenario (e.g. by configuring a threshold) and use a Map when there are few documents
> I'm curious about what you're thinking:
>  * would a PR/patch be welcome for either of the two ideas above?
>  * do you see better options? am I missing something?
>  


