Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2018/05/10 21:39:00 UTC
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

    [ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471156#comment-16471156 ] 

Hoss Man commented on SOLR-12343:
---------------------------------

Ultimately what seems to be at issue here is a discrepancy between how Yonik designed the "simple" facet algorithm, and how it's implemented – but it's only problematic in these "additional information from refinement can make sort values 'worse'" type situations.

As Yonik noted in SOLR-11733 regarding the design of {{refine:simple|true}} ...
{quote}[compared to facet.field] ...the refinement algorithm being different (and for a single-level facet field, simpler).
 It can be explained as:
 1) find buckets to return as if you weren't doing refinement
 2) for those buckets, make sure all shards have contributed to the statistics
 i.e. simple refinement doesn't change the buckets you get back.
{quote}
But in actuality, adding {{refine:true}} _can_ change the buckets you get back. In my example above, if {{refine:false}} was used, termX would have ultimately been returned (with an unrefined count) – but because of refinement it's not returned, and termY is returned in its place.
----
I've attached a simple test patch demonstrating the problem but I haven't yet dug into the code to figure out the best fix.

I _suspect_ what's needed (to stick to the intent of {{refine:simple}} ) is that after the coordinator picks the buckets that need to be refined, it should prune down the list of "all known" (size {{limit=N + overrequest=R}}) buckets to just the "buckets to return" (size {{limit=N}}) so that once the refinement values come in the _set_ of buckets doesn't change, even if the _order_ of the buckets does.
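
To make the suggestion concrete, here is a minimal Python sketch of the pruning idea. It is not Solr code: the {{apply_refinement}} function, its {{prune}} flag and the sample counts are all hypothetical, and it assumes a {{count asc}} sort with ties broken by term name.

```python
# Hypothetical sketch of the suggested pruning step; NOT Solr's implementation.

def apply_refinement(known, refinements, limit, prune):
    """Merge phase #2 refinement counts into the phase #1 buckets.

    known:       term -> count known after phase #1 (limit + overrequest buckets)
    refinements: term -> extra count reported by other shards in phase #2
    Sort order is 'count asc' (ties broken by term name) for this illustration.
    """
    def by_count(counts):
        return sorted(counts, key=lambda term: (counts[term], term))

    to_return = by_count(known)[:limit]
    if prune:
        # Suggested change: keep only the "buckets to return" BEFORE the
        # refinement values are merged in, so the set of buckets is fixed.
        merged = {term: known[term] for term in to_return}
    else:
        # Behavior described in this issue: every known bucket stays eligible.
        merged = dict(known)

    for term, extra in refinements.items():
        if term in merged:
            merged[term] += extra

    return [(term, merged[term]) for term in by_count(merged)[:limit]]


# Made-up phase #1 state: termY is the "N+1" bucket, only termX needs refining.
known = {"termA": 2, "termX": 2, "termY": 3}
refinements = {"termX": 100}

without_prune = apply_refinement(known, refinements, limit=2, prune=False)
with_prune = apply_refinement(known, refinements, limit=2, prune=True)
```

With these made-up numbers, the unpruned merge lets termY replace termX, while the pruned merge keeps termX (re-ordered by its refined count).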

> JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-12343
>                 URL: https://issues.apache.org/jira/browse/SOLR-12343
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause _refined_ buckets to be "bumped out" of the topN based on the refined counts/stats depending on the sort - causing _unrefined_ buckets originally discounted in phase#2 to bubble up into the topN and be returned to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 'count asc'}} facet:
>  * assume shard1 returns termX & termY in phase#1 because they have very low shard1 counts
>  ** but *not* returned at all by shard2, because these terms both have very high shard2 counts.
>  * Assume termX has a slightly lower shard1 count than termY, such that:
>  ** termX "makes the cut" for the limit=N topN buckets
>  ** termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
>  * termX then gets included in the phase#2 refinement request against shard2
>  ** termX now has a much higher _known_ total count than termY
>  ** the coordinator now sorts termX "worse" in the sorted list of buckets than termY
>  ** which causes termY to bubble up into the topN
>  * termY is ultimately included in the final result _with incomplete count/stat/sub-facet data_ instead of termX
>  ** this is all independent of the possibility that termY may actually have a significantly higher total count than termX across the entire collection
>  ** the key problem is that all/most of the other terms returned to the client have counts/stats that are the cumulation of all shards, but termY only has the contributions from shard1
> Important Notes:
>  * This scenario can happen regardless of the amount of overrequest used. Additional overrequest just increases the number of "extra" terms needed in the index with "better" sort values than termX & termY in shard2
>  * {{sort: 'count asc'}} is not just an exceptional/pathological case:
>  ** any function sort where additional data provided by shards during refinement can cause a bucket to "sort worse" can also cause this problem.
>  ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) asc|desc}} , etc...
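
The {{count asc}} scenario in the description can be reproduced with a toy model. The Python sketch below is not Solr code: the shard contents, the {{shard_top}} and {{simple_refine}} helpers and their parameters are hypothetical. It only mimics the phases as described in this issue (per-shard phase#1 responses, coordinator merge, phase#2 refinement of the topN, then a re-sort of all known buckets), and it assumes ties are broken by term name.

```python
# Toy model of the behavior described in this issue; NOT Solr's implementation.
# All names and data here are hypothetical.

def shard_top(counts, k):
    """Phase #1: a shard returns its k best terms for a 'count asc' sort."""
    best = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:k]
    return dict(best)


def simple_refine(shards, limit, overrequest):
    k = limit + overrequest
    responses = [shard_top(shard, k) for shard in shards]

    # The coordinator merges phase #1 responses into the "all known" buckets.
    known = {}
    for resp in responses:
        for term, count in resp.items():
            known[term] = known.get(term, 0) + count

    # Pick the top `limit` buckets by known count; only these get refined.
    to_refine = sorted(known, key=lambda t: (known[t], t))[:limit]

    # Phase #2: each shard that did not report a bucket now contributes to it.
    for term in to_refine:
        for shard, resp in zip(shards, responses):
            if term not in resp:
                known[term] += shard.get(term, 0)

    # Re-sort ALL known buckets after refinement (the behavior at issue).
    final = sorted(known, key=lambda t: (known[t], t))[:limit]
    return [(term, known[term], term in to_refine) for term in final]


# termX and termY are rare on shard1 but very common on shard2.
shard1 = {"termA": 1, "termX": 2, "termY": 3, "termZ": 50}
shard2 = {"termA": 1, "termB": 5, "termC": 6, "termX": 100, "termY": 200}

# Each tuple is (term, returned count, was the bucket refined?).
result = simple_refine([shard1, shard2], limit=2, overrequest=1)
true_total_termY = shard1["termY"] + shard2["termY"]
```

In this made-up data set termY ends up in the result with only its shard1 count, even though its collection-wide count is far higher than that of termX, the bucket it displaced.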


