Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2017/12/07 12:39:00 UTC
[jira] [Comment Edited] (SOLR-11733) json.facet refinement fails to bubble up some long tail (overrequested) terms?

    [ https://issues.apache.org/jira/browse/SOLR-11733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281258#comment-16281258 ] 

Yonik Seeley edited comment on SOLR-11733 at 12/7/17 12:38 PM:
---------------------------------------------------------------

As I mentioned in SOLR-11729, the refinement algorithm is different (and, for a single-level facet field, simpler).
It can be explained as:
1) find buckets to return as if you weren't doing refinement
2) for those buckets, make sure all shards have contributed to the statistics
i.e. simple refinement doesn't change the buckets you get back.
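
For illustration only, the two steps above can be sketched roughly as follows. This is a toy model, not Solr's implementation: the function names, the in-memory per-shard dicts, and the count-descending sort with ties broken by term are all assumptions made for the example.

```python
from collections import Counter

def shard_top(counts, k):
    """Top-k (term -> count) for one shard, sorted by count desc, then term."""
    return dict(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k])

def simple_refine(shards, limit, overrequest=0):
    """Hypothetical model of 'simple' refinement over in-memory shards."""
    k = limit + overrequest
    # Phase 1: every shard reports only its own top k buckets.
    phase1 = [shard_top(s, k) for s in shards]
    merged = Counter()
    for partial in phase1:
        merged.update(partial)
    # Step 1: choose the buckets exactly as if there were no refinement.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    chosen = [term for term, _ in ranked[:limit]]
    # Step 2: for the chosen buckets only, ask the shards that did not
    # report them, so that every shard has contributed to the count.
    refined = {}
    for term in chosen:
        refined[term] = sum(
            partial[term] if term in partial else counts.get(term, 0)
            for counts, partial in zip(shards, phase1)
        )
    return refined

# "x" is spread evenly: never in a shard's top 2 except on the last shard,
# yet its true total (12) is the second largest overall.
shards = [
    {"a": 10, "b": 9, "x": 4},
    {"a": 10, "c": 9, "x": 4},
    {"b": 1, "c": 1, "x": 4, "d": 5},
]
print(simple_refine(shards, limit=2))                 # {'a': 20, 'b': 10}
print(simple_refine(shards, limit=2, overrequest=1))  # {'a': 20, 'x': 12}
```

In this toy data the refined counts for the returned buckets are exact, but the set of buckets is still whatever phase 1 selected, which is why "x" only appears once the overrequest is raised.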

I started with the simplest approach for the obvious reason: to get something out.  From a correctness POV, smarter faceting is equivalent to increasing the overrequest amount... we still can't make guarantees.
We could easily implement a mode for some field facets that does the "could this possibly be in the top N" logic to consider more buckets in the first phase... but only if it's not a sub-facet of another partial facet (a facet with something like a limit).  If we're sorting by something other than count (like stddev for instance) then I guess we'd have to discard smart pruning and just try to get all buckets we saw in the first phase.
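
As a rough illustration of the "could this possibly be in the top N" check for a single-level facet sorted by count, here is a sketch under stated assumptions: each shard returns its top k buckets by count, so any term a shard did not report can have at most that shard's smallest reported count there (or zero if the shard returned fewer than k buckets). The function name and data layout are hypothetical, and this is not code from Solr.

```python
def candidates_to_refine(phase1, limit, k):
    """Return terms whose best-case total could still reach the top `limit`.

    phase1: list of per-shard dicts (term -> count), each a top-k by count.
    """
    # Lower bound per term: the counts the shards actually reported.
    lower = {}
    for partial in phase1:
        for term, count in partial.items():
            lower[term] = lower.get(term, 0) + count
    # A shard that filled all k slots may hold unreported terms with counts
    # up to its smallest reported count; a shorter list is already complete.
    slack = [min(p.values()) if p and len(p) >= k else 0 for p in phase1]
    # The limit-th largest lower bound is the total a bucket must be able to reach.
    ranked = sorted(lower.values(), reverse=True)
    threshold = ranked[limit - 1] if len(ranked) >= limit else 0
    keep = set()
    for term, low in lower.items():
        # Upper bound: reported counts plus the slack of every silent shard.
        upper = low + sum(s for p, s in zip(phase1, slack) if term not in p)
        if upper >= threshold:
            keep.add(term)
    return keep

# Every term could still make the top 2, so all of them need refinement.
close_race = [{"a": 10, "b": 9}, {"a": 10, "c": 9}, {"d": 5, "x": 4}]
print(sorted(candidates_to_refine(close_race, limit=2, k=2)))

# Here "y" and "z" can reach at most 2, far below 180, so they are pruned.
clear_winners = [{"a": 100, "b": 90, "z": 1}, {"a": 100, "b": 90, "y": 1}]
print(sorted(candidates_to_refine(clear_winners, limit=2, k=3)))
```

The bound relies on the sort being by count; for a sort such as stddev, a shard's reported values say nothing about what an unreported bucket could contribute, which matches the remark above about discarding smart pruning in that case.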

If a partial facet is a sub-facet of another partial-facet, the logic of what one can exclude seems to get harder, and then sub-facets need to add new candidate buckets to parent facets (I think? need to think about it more... but I guess that's part of my point ;-).  Good ideas perhaps, but definitely more difficult to implement.

Other refinement implementations could range all the way to "exact"... guarantee that no buckets are missed, and there's more than one way to go about that too.




was (Author: yseeley@gmail.com):
I mentioned in SOLR-11729 the refinement algorithm being different (and for a single-level facet field, simpler).
It can be explained as:
1) find buckets to return as if you weren't doing refinement
2) for those buckets, make sure all shards have contributed to the statistics

I started with the simplest for obvious reasons... to get something out.  From a correctness POV, smarter faceting is equivalent to increasing the overrequest amount... we still can't make guarantees.
We could easily implement a mode for some field facets that does the "could this possibly be in the top N" logic to consider more buckets in the first phase... but only if it's not a sub-facet of another partial facet (a facet with something like a limit).
If a partial facet is a sub-facet of another partial-facet, the logic of what one can exclude seems to get harder, and then sub-facets need to add new candidate buckets to parent facets (I think? need to think about it more... but I guess that's part of my point ;-).  Good ideas perhaps, but definitely more difficult to implement.

Other refinement implementations could range all the way to "exact"... guarantee that no buckets are missed, and there's more than one way to go about that too.



> json.facet refinement fails to bubble up some long tail (overrequested) terms?
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11733
>                 URL: https://issues.apache.org/jira/browse/SOLR-11733
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>
> Something wonky is happening with {{json.facet}} refinement.
> "Long Tail" terms that may not be in the "top n" on every shard, but are in the "top n + overrequest" for at least 1 shard aren't getting refined and included in the aggregated response in some cases.
> I don't understand the code enough to explain this, but I have some steps to reproduce that I'll post in a comment shortly



