You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2014/07/03 02:12:31 UTC

[jira] [Updated] (SOLR-2894) Implement distributed pivot faceting

     [ https://issues.apache.org/jira/browse/SOLR-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2894:
---------------------------

    Attachment: SOLR-2894.patch


bq. It looks like we need to rethink how the values are encoded into a path for the purpose of refinement so we can account for and differentiate between missing values, the empty string (0 chars), and the literal string "null" (4 chars)

I've been working on this for the last few days - cleaning up how we deal with the "refinement" strings so that facet.missing and/or empty strings ("") in fields won't be problematic.

It's been slow going as i tried to be systematic about refactoring & documenting methods as i went along and started understanding more and more of the code.

The bulk of the changes i made can be summarized as:
# make the "valuePath" tracking more structured via List<String> instead of building up single comma seperated refinement string right off the bat
# refactor the encoding/decoding of the refinement strings into a utility method thta can handle null and empty string.
# refactor the refinement count & subset computation so that it can actually handle facet.missing correctly (before attempts at refining facet.missing were just looking for the term "null" (ie: 4 characters)

Full details on how this patch differs from the lsat one are listed below -- but as things stand right now there is still a nasty bug somewhere in the facet.missing processing that i can't wrap my head arround...  

In short: when facet.missing is enabled in the SPECIAL test i mentioned in my last comment, it's somehow causing the refined counts of of the non-missing SPECIAL value to be wrong (even if the SPECIAL value is a regular string, and not ""). 

I can't really wrap my head arround how that's happening -- it's going to involve some more manual testing & some more unit tests to get to the bottom of it, but in the mean time I wanted to get this patch posted.

If folks could review it & sanity check that i'm not doing something stupid with the refinement that would be appreciated.

----

Detailed changes in this patch iteration...

* PivotFacetHelper
** add new encodeRefinementValuePath & decodeRefinementValuePath methods
*** special encoding to handle empty strings (should be valid when pivoting) and null values (needed for facet.missing refinement)
** add tests in TestPivotHelperCode
* PivotFacetValue & PivotFacetField
** in general, make these a bit more structured
** eliminate "fieldPath" since it's unused
** replace PivotFacetValue.field (String) with a ref to the actual parentPivot (PivotFacetField)
** add PivotFacetField.parentValue (PivotFacetValue) to ref the value this pivot field is nested under (if any)
** replace valuePath with getValuePath() (List<String>) to track the full structure
* FacetComponent
** prune some big chunks of commented out code (alt approaches no longer needed it looks like?)
** use new PivotFacetValue.getValuePath() + PivotFacetHelper.encodeRefinementValuePath instead of PivotFacetValue.valuePath
* SimpleFacets
** make getListedTermCounts(String,String) private again & add javadocs clarifing that it smarSplits the list of terms
** convert getListedTermCounts(String,String,DocSet) -> getListedTermCounts(String,DocSet,List<String>)
*** ie: pull the split logic out of this method, since it's confusing, and some callers don't need it.
*** add javadocs
*** updated SimpleFacets callers to do the split themselves
* PivotFacetProcessor
** refactor subset logic (that dealt with missing values via negatived range query) into "getSubset" helper method
*** add complimentary "getSubsetSize" method as well
** update previous callers of getListedTermCounts(String,String,DocSet) to use getSubsetSize instead in order to correctly handle the refinements of null (ie: facet.missing)
** refactor & cleanup processSingle:
*** have caller do the field splitting & validation (eliminates redundency when refining many values)
*** stop treating empty string as special case, switch conditionals that were looking at first value to look at list size directly
* misc new javadocs on various methods throughout hte above mentioned files

----

Misc notes for the future:

* even if/when we get the refinement logic fixed, we really need some safety check to ensure we've completely eliminated this possibility of an infinite loop on refinement:
** coordinator should assert that if if asks shard for a refinement, that refinement is returned
** shard should assert that if it's asked to refine, the #vals makes sense for the #fields in the pivot
* we need to include more testing of facet.missing:
** randomized testing in in TestCloudPivotFacet
** more usage of it in the Small & Large tests.
* in general, we need more testing that we know triggers refinement
** ie: the "Small" test already does a bunch with facet.missing, but I guess that never caught ny of these bugs, because refinement was never needed?
** randomly set small overrequest values in TestCloudPivotFacet ?
* for completeness, we should do some testing of literal string value "null" (4 chars) in pivot fields
* we aren't doing enough testing of multiple facet.pivot params in a single request - need to make sure that when refinement happens, those aren't colliding
** in particularly i'm wondering about {{facet.pivot=\{!key=aaa\}foo,bar&facet.pivot=\{!key=bbb\}foo,bar}} type stuff


> Implement distributed pivot faceting
> ------------------------------------
>
>                 Key: SOLR-2894
>                 URL: https://issues.apache.org/jira/browse/SOLR-2894
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Erik Hatcher
>            Assignee: Hoss Man
>             Fix For: 4.9, 5.0
>
>         Attachments: SOLR-2894-mincount-minification.patch, SOLR-2894-reworked.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894_cloud_test.patch, dateToObject.patch, pivot_mincount_problem.sh
>
>
> Following up on SOLR-792, pivot faceting currently only supports undistributed mode.  Distributed pivot faceting needs to be implemented.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org