You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2020/05/01 20:27:00 UTC
[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

    [ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097660#comment-17097660 ] 

Michael Gibney commented on SOLR-13132:
---------------------------------------

I was worrying about the order of sort keys out of an excess of caution, figuring that if it was possible to preserve order, it'd be worth doing so. If you don't think it's a concern then it's no concern to me (other than that preserving order seems achievable here, and might not be a _bad_ thing :)).
{quote}cleaning the code up to consistently use countAcc.setValues(...) so we can override it is a better approach. if the caller also needs to use countAcc.getCount(...) to decide when to recurse, so be it.
{quote}
This makes sense to me. The above is also related to the question raised here:
{code:java}
        // nocommit: if we're not using sweeping, why do we need to register any mappings?
        // baseSweepingAcc.registerMapping(this, this);
{code}
... indeed we probably could skip this if splitting full-domain accs between {{SweepingAcc}} and {{collectAcc}}. The reason I left this in initially was partly to preserve key order, but also to help clarify behavior of {{registerSweepingAccs(...)}}; specifically, if a processor calls {{registerSweepingAccs()}} on a {{SlotAcc}}, then the caller assumes responsibility of mediating all the output (calls to {{setValues(...)}}) for that acc. Now brainstorming other possible reasons to do this (aside from preserving key order), I can't really come up with anything. _If_ there's still any interest in preserving order, what would you think about modifying the method signature of {{SlotAcc.setValues(SimpleOrderedMap<Object> bucket, int slotNum)}} to return boolean instead of void, in the spirit of the {{setValues(...)}} method recently added to the {{SweepingAcc}} class? Or maybe less invasive, adding an analogous method to {{CountSlotAcc}} that returns boolean? The return value would normally be false, but if "true", would indicate that the method had handled setting values on all full-domain accs (what would formerly have been covered by calling {{collectAcc.setValues(...)}})?
{quote}unless there's another use for CollectSlotAccMappingAware i'm overlooking?
{quote}
The main/initial purpose of this was to provide an interface for notifying FacetProcessors when SlotAccs had been swapped out. Take the case where {{sortAcc}} is aliased to an instance of {{SKGSlotAcc}} which is wrapped in a {{MultiAcc}} {{collectAcc}}. {{collectAcc.registerSweepingAccs(...)}} could return itself (no top-level change -- still containing the other SlotAccs that need full collection, but with {{SKGSlotAcc}} removed). The new {{SweepSKGSlotAcc}} would be registered with the {{SweepingAcc}} (or potentially {{SweepingCountSlotAcc}}), but would still need to register the SlotAcc replacement with the {{FacetProcessor}} so that it could make corresponding changes to {{sortAcc}} or other aliases. In any case I think it's still necessary for a SlotAcc that replaces itself to separately register a new read-access instance with the {{SweepingAcc}} (or {{SweepingCountSlotAcc}}), since the return value of {{registerSweepingAccs(...)}} is used to modify "write" access (on the {{collectAcc}}).
{quote}CountSlotAcc.getBaseSweepingAcc(...) really messy and code-smell-ish right now
{quote}
Ah, yeah. I'm sorry I didn't explicitly address the nocommit comment about that, but I agree and I think your suggestion of "Perhaps a single initSweeping(CollectSlotAccMappingAware notify) for use by processors, that fails if called more then once? All other code paths use getBaseSweepingAcc()" is a good way forward. I was waiting to address that until after discussion of these other questions.
{quote}SweepingCountSlotAcc extends CountSlotArrAcc ... a new concrete CountSlotAcc subclass
{quote}
This seems reasonable. If you don't mind, I can take a crack at seeing what this might look like. I'm particularly interested in doing this in a way that keeps the "sweeping" stuff modular (under the hood, at least), since it's possible that other concrete {{CountSlotAcc}} subclasses (whose core "count" functionality is substantially different) could still use the exact same code for coordinating sweeping. (I'm thinking particularly about cached counts ... definitely conscious of not having that bleed into this issue, just trying to be transparent!).

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain base docSet, and then uses that initial pass as a pre-filter for a second-pass, inverted approach of fetching docSets for each relevant term (i.e., {{count > minCount}}?) and calculating intersection size of those sets with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and set intersection operations increases request latency to the point where relatedness sort may not be usable in practice (for my use case, even after applying the patch for SOLR-13108, for a field with ~220k unique terms per core, QTime for high-cardinality domain docSets were, e.g.: cardinality 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable ~300ms and ~250ms respectively. The approach calculates uninverted facet counts over domain base, foreground, and background docSets in parallel in a single pass. This allows us to take advantage of the efficiencies built into the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org