Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2018/06/19 00:54:00 UTC
[jira] [Commented] (SOLR-12343) JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
[ https://issues.apache.org/jira/browse/SOLR-12343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516506#comment-16516506 ]
Hoss Man commented on SOLR-12343:
---------------------------------
Updated patch with more tests and some code tweaks based on a few things the new tests caught.
Still outstanding is the question of the new BitSets I added...
{quote}
* buckets now keep track of how many shards contributed to them ...
** there's a nocommit in here about the possibility of re-using the {{Context.sawShard}} BitSet instead – but i couldn't wrap my head around an efficient way to do it so i punted
* ...buckets are excluded if a bucket doesn't have contributions from as many shards as the FacetField...
** again, i needed a new BitSet at the FacetField level to count the shards – because Context.numShards may include shards that never return any results for the facet (ie: an empty shard), so they never merge any data at all
{quote}
I _think_ it should be possible to re-implement the {{FacetBucket.getNumShardsMerged()}} method (i added) using {{Context.sawShard}} by using {{sawShard.get(bucketNum * numShards, bucketNum * numShards + numShards)}} to take a "slice" of the BitSet just for the current bucket and then look at its cardinality. The added cost of taking the slice only for buckets being considered in sorted order is probably a better trade-off than the overhead of creating a new BitSet for every FacetBucket, even if they are never considered for the response.
But I still don't see any way to efficiently figure out the "shards that participated" info needed at the {{FacetField}} level using the existing {{sawShard}} BitSet -- particularly with the changes I had to make to account for the case where a shard has docs participating in a facet, but not matching any buckets (see {{testSortedSubFacetRefinementWhenParentOnlyReturnedByOneShard}} ). Fortunately that's just one new BitSet per FacetField instance (not per bucket).
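For what it's worth, the "slice" idea could look roughly like this (a minimal sketch, not actual Solr code – the {{SawShardSlice}} class and the flat {{bucketNum * numShards + shardNum}} bit layout are assumptions about how {{Context.sawShard}} is indexed):

```java
import java.util.BitSet;

public class SawShardSlice {
    // Hypothetical stand-in for Context.sawShard: one flat BitSet where
    // bit (bucketNum * numShards + shardNum) is set once shardNum has
    // contributed data to bucketNum.
    private final BitSet sawShard;
    private final int numShards;

    public SawShardSlice(BitSet sawShard, int numShards) {
        this.sawShard = sawShard;
        this.numShards = numShards;
    }

    // Candidate replacement for FacetBucket.getNumShardsMerged():
    // slice out just this bucket's bits and count them, instead of
    // maintaining a separate BitSet per bucket.
    public int getNumShardsMerged(int bucketNum) {
        return sawShard.get(bucketNum * numShards,
                            bucketNum * numShards + numShards)
                       .cardinality();
    }
}
```

The slice allocation only happens for buckets actually inspected during sorting, which is the trade-off described above.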
----
I'll look at refactoring {{FacetBucket.getNumShardsMerged()}} to use {{Context.sawShard}} soon.
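As an aside, the {{count asc}} inversion in the issue description below can be modeled in a few lines (a toy sketch only – the class, term names, and counts are made up for illustration and are not Solr code):

```java
import java.util.HashMap;
import java.util.Map;

public class RefinementResort {
    // Toy model of the "count asc" refinement inversion.
    public static Map.Entry<String, Integer> topBucket() {
        final int limit = 1; // topN size

        // phase#1: shard1 returns both terms with low counts;
        // shard2 returns neither (its counts for both are very high).
        Map<String, Integer> merged = new HashMap<>();
        merged.put("termX", 1); // makes the limit=1 cut
        merged.put("termY", 2); // the "N+1" known bucket

        // phase#2: only the topN bucket (termX) is refined against
        // shard2, where its count happens to be 100.
        merged.merge("termX", 100, Integer::sum); // termX -> 101

        // re-sort by count asc after refinement and take the topN:
        // termY (still only shard1's count of 2) now sorts ahead of
        // termX (101) and is returned with an incomplete count.
        return merged.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .limit(limit)
                .findFirst()
                .orElseThrow();
    }
}
```

The bucket that comes back is termY with count 2, even though its true total across both shards would be much higher – exactly the inaccurate-count symptom described below.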
> JSON Field Facet refinement can return incorrect counts/stats for sorted buckets
> --------------------------------------------------------------------------------
>
> Key: SOLR-12343
> URL: https://issues.apache.org/jira/browse/SOLR-12343
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Priority: Major
> Attachments: SOLR-12343.patch, SOLR-12343.patch, SOLR-12343.patch
>
>
> The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause _refined_ buckets to be "bumped out" of the topN based on the refined counts/stats (depending on the sort), causing _unrefined_ buckets originally excluded from phase#2 to bubble up into the topN and be returned to clients *with inaccurate counts/stats*
> The simplest way to demonstrate this bug (in some data sets) is with a {{sort: 'count asc'}} facet:
> * assume shard1 returns termX & termY in phase#1 because they have very low shard1 counts
> ** but they are *not* returned at all by shard2, because these terms both have very high shard2 counts.
> * Assume termX has a slightly lower shard1 count than termY, such that:
> ** termX "makes the cut-off" for the limit=N topN buckets
> ** termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
> * termX then gets included in the phase#2 refinement request against shard2
> ** termX now has a much higher _known_ total count than termY
> ** the coordinator now sorts termX "worse" in the sorted list of buckets than termY
> ** which causes termY to bubble up into the topN
> * termY is ultimately included in the final result _with incomplete count/stat/sub-facet data_ instead of termX
> ** this is all independent of the possibility that termY may actually have a significantly higher total count than termX across the entire collection
> ** the key problem is that all/most of the other terms returned to the client have counts/stats accumulated from all shards, but termY only has the contributions from shard1
> Important Notes:
> * This scenario can happen regardless of the amount of overrequest used. Additional overrequest just increases the number of "extra" terms needed in the index with "better" sort values than termX & termY in shard2
> * {{sort: 'count asc'}} is not just an exceptional/pathological case:
> ** any function sort where additional data provided by shards during refinement can cause a bucket to "sort worse" can also cause this problem.
> ** Examples: {{sum(price_i) asc}} , {{min(price_i) desc}} , {{avg(price_i) asc|desc}} , etc...
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)