You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/07/02 23:12:05 UTC
[jira] [Updated] (SOLR-4212) Let facet queries hang off of pivots

     [ https://issues.apache.org/jira/browse/SOLR-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-4212:
---------------------------
    Attachment: SOLR-6353-4212.patch

Updated patch wit ha few small changes, details mentioned below.

Comments in no particular order...

----

I reiterate my earlier comment about javadocs -- it's really hard to review a patch this size with so many new methods that don't have any javadocs at all.  Everytime I see a method mentioned in the new code, I look it up to jog my memory about what exactly it does and i wind up needing to re-read/skim each method over and over to remind myself what it does becaase they don't have any javadocs.  (I can't imagine how how any developer will ever be able to make sense of any of this code down the road w/o more method level docs)

On the flip side: PivotFacetHelper.mergeQueryCounts _does_ have javadocs, but it doesn't look like they match what the method does -- I'm pretty sure either the javadocs are wrong, or the method has a bug in it and isn't working as documented.

----

bq. I further refactored the code into into three separate classes to divide the responsibilities clearly. 

Cool ... This organization makes a lot more sense then previous patches.

----

bq. Fixed. I am not sure why that happened but once I refactored the code to do all the parsing and calculation in prepare, I could not reproduce the issue.

Well, the best way to be sure it's fixed is to add a test showing that it works.

I've updated your patch to add a request like this to DistributedFacetPivotLargeTest.

In order to get it to work, I had to also modify the {{Map<String,RangeFacetRequest>}} created by FacetComponent.prepare to be a LinkedHashMap (see more comments on this below) because otherwise it was a huge pain in the ass to try and write code to do anything with teh response in a clean way.

----

bq. Though I do not agree about the user counting on the same order because this is not a list but a complex object in all our responses. But I digress.

I don't really understand the basis for your comment.

My argument is simple:

* Given:
** (top level) Range Facet reults have always been returned as a NamedList, in the same order as specified by the facet.range params that requested them.
** This order is modeled in SolrJ's QueryResponse class as {{List<RangeFacet> getFacetRanges()}}
* Therefore:
** We should be consistent and return Range Facet results with the same predictible order when they are nested under pivots.
** This is already true in the SolrJ modeling in your patch, which adds {{List<RangeFacet> getFacetRanges()}} to PivotField.

If nothing else, having the RangeFacets returned in the same order for both the top level results and the per-pivot results is the only sane/simple way for users to compute ratios between the top level range counts and the per pivot value counts when dealing with many/variable ranges -- otherwise they have to sort them themselves to corrolate.

----

As I mentioned above regarding my test additions, I fixed the consistent ordering problem by changing the {{Map<String,RangeFacetRequest>}} created by FacetComponent.prepare to also be a LinkedHashMap.  But I still don't understand the purpose of this being a Map (where the keys are the original values of the facet.range params, not to be confused with having a Map keyed off of the _tags_ associated with the list of RangeFacetRequests that use that tag discussed below) at all.

As far as i can tell all usages of this {{Map<String,RangeFacetRequest>}} either:

* iterate over every entry in the Map and only look at the values.
* iterate over every facet.range param string (in order), and use it to "get()" the RangeFacetRequest object that corrisponds with that param string.

...wouldn't both of those usages be much simplier if instead of this Map there was just a {{List<RangeFacetRequest>}} that could be iterated over when the caller cares about looping over all RangeFacetRequest instances?

Is there some other purpose for having this {{Map<String,RangeFacetRequest>}} that i'm not understanding?

(see below for comments about creating a reused {{Map<String,List<RangeFacetRequest>>}} for when dealing with tags ... that's one of the current "loop over all RangeFacetRequest" usages that seems wrong to me anyway)

----

I think you should revive your earlier idea of having a "FacetContext" object to put in the request context and use it to track the RangeFacetRequests and FacetBase (FACET_QUERY) objects.  (Either that, or just use the existing FacetInfo class for this since that's pretty much it's current purpose.  Is the reason you didn't already do that because FacetInfo isn't currently in the single node code path?)

Either way (with or w/o a new FacetContext class, with or w/o reusing FacetInfo) the code to access the Map/List of all RangeFacetRequests and FacetBase objects should really be encapsulated in some way -- if nothing else to reduce the amount of casting when consumers have to pull them out of req.getContext(...), but also to help future proof us against attempts to access those structures before they've been initialized by FacetComponent.prepare.  (ie: throw IllegalAccessException if {{!req.getContext().containsKey(WHATEVER_CONTEXT_KEY)}}

example...

{code}
class FacetComponent {
  // ...
  public static List<RangeFacetRequest> getAllRangeFacetRequestsFromContext(SolrQueryRequest req) {
    List<RangeFacetRequest> result = (List<RangeFacetRequest>) req.getContext().get(MAGIC_CONTEXT_KEY);
    if (null == result) {
      throw new IllegalAccessException("RangeFacetRequests can't be accessed from context they are initalized");
    }
    return result;
  }
  // etc...
{code}

...or better still, with the FacetContext container...

{code}
class FacetContext {
  public static FacetContext getFacetContext(SolrQueryRequest req) {
    FacetContext result = (FacetContext) req.getContext().get(MAGIC_CONTEXT_KEY);
    if (null == result) {
      throw new IllegalAccessException("FacetContext can't be accessed before it's initalized in request context");
    }
    return result;
  }
  private final List<RangeFacetRequest> allRangeFacets; // init in constructor
  private final List<FacetBase> allQueryFacets; // init in constructor
  public List<RangeFacetRequest> getAllRangeFacetRequests() { ... }
  public List<FacetBase> getAllQueryFacets() { ... }
 
  // // (see next comment about PivotFacetProcessor.getTaggedRangeFacets)
  // private final Map<String,List<RangeFacetRequest>> taggedRangeFacets; // init in constructor
  // private final Map<String,List<FacetBase>> taggedQueryFacets; // init in constructor
  // public List<RangeFacetRequest> getRangeFacetRequestsForTag(String tag) { ... }
  // public List<FacetBase> getAllQueryFacetsForTag(String tag) { ... }
}
{code}

----

PivotFacetProcessor.getTaggedRangeFacets (and getTaggedQueryFacets) seem really inefficient to me...

* unless i'm missreading these, every facet.pivot param is going to cause the the list(s) of all facet.range RangeFacetRequest objects (and all facet.query FacetBase objects) to be iterated over to say "do you match this tag"
** in the case of facet.pivot refinement requests, that's going to mean a *lot* of looping over the same lists over and over.

* why doesn't this follow the same pattern as PivotFacetProcessor.getTaggedStatsFields (and StatsInfo.getStatsFieldsByTag)?
** when the RangeFacetRequest (or facet.query FacetBase objects) are constructed, why isn't there a {{Map<String,List<RangeFacetRequest>>}} (and {{Map<String,List<FacetBase>>}}) keyed off of the "tag" values returning each each RangeFacetRequest (and each FacetBase) built so that getTaggedRangeFacets and getTaggedQueryFacets can just be fast/simple lookups on these Maps instead of constantly looping over the same lists repeatedly?

(this "tag->RangeFacetRequest" mapping could easily be accessed via the same FacetContext mentioned above, see commented out part of that example code)

----

The way RangeFacetProcessor deals with FacetRangeOther seems really confusing and sketchy.  It took me a while to understand why those Exceptions were "safe to ignore"

Rather then rely on "expected exceptions" when looping over the List of FacetRanges, wouldn't the code be alot easier to understand if we changed FacetRange to look something like...

{code}
/** null if this range doesn't corrispond to an {@link FacetRangeOther} */
public final FacetRange other;
// ...other existing vars...
private FacetRange(FacetRangeOther other, String name, ...) {
  ...
}
public FacetRange(String name,...) {
  this(null, name, ...);
}
public FacetRange(FacetRangeOther other, ...) {
  this(other, other.name, ...);
}
{code}

...And then simplified the response building RangeFacetProcessor to look something like...

{code}
final NamedList<Object> res = new SimpleOrderedMap<>();
final NamedList<Integer> counts = new NamedList<>();
res.add("counts", counts); // built up below

// explicitly return the gap.  compute this early so we are more
// likely to catch parse errors before attempting math
res.add("gap", rfr.getGapObj());

// explicitly return the start and end so all the counts
// (including before/after/between) are meaningful - even if mincount
// has removed the neighboring ranges
res.add("start", rfr.getStartObj());
res.add("end", rfr.getEndObj());

for (RangeFacetRequest.FacetRange range : rfr.getFacetRanges()) {
  final int count = rangeCount(rfr, range);
  if (null == range.getFacetRangeOther()) {
    // normal range, added to count depending on mincount conditions...
    if (count >= rfr.getMinCount()) {
      counts.add(range.name, count);
    }
  } else {
    // special 'other' count, added to higher level res
    res.add(range.name, count);
  }
}

return res;
{code}

?

If I'm missing something, and there's a reason the current dual loops over all FacetRanges w/exception checking is better for some reason, then at a bare minimum those empty catch blocks need to make it a lot more clear why they are there and why the exceptions are there.

----

{code}
    // TODO: slight optimization would prevent double-parsing of any localParams
    Query qobj = QParser.getParser(parsed.facetValue, null, req).getQuery();
{code}

I added a nocommit here to ensure we don't forget to file a jira to track this.

It's not just an issue of localparams, but also the parsing & construction of the Query object -- doing that once per request (just like the way you've changed the range faceting to do all the bucket calculation once) could give us some serious performance savings when facet.query params are hung of of many and/or large pivot trees.  (should just mean a pretty trivial new subclass of FacetBase that also has a {{public Query getQuery()}} accessor)

----

RangeFacetRequest implementation weirdness...

* It's really weird that none of the public accessor methods (getStart, getEnd, getGap, getGapObj, etc...) have any javadocs, even though many of the protected variables they return _do_ have jdocs (and the info in those jdocs seems pretty important to the method callers -- like the distinction betwen end vs endObj).

* The method signature & jdocs for createCalculator look dangerous -- it says it will return a calculator based on the "rangeFacetRequest" argument, but it's a non static method and in a couple places (schemaField, facetOn) it ignores the "rangeFacetRequest" arg and uses properties of "this" (and in other places vice vesa) -- suggesting either this method should be static so it doesn't accidently use "this", or it should be changed to not take any arguments and always refer to "this" (which seems to be the intention based on the place i see this method being called currently.)

* There's a weird bit of "format and immediately reparse" logic in RangeFacetRequest + RangeEndpointCalculator that makes no sense to me...
** RangeEndpointCalculator knows the type specific "end" after looping over the ranges in {{computeRanges()}}
** RangeEndpointCalculator uses "formatValue(end)" to build "endStr"
** RangeEndpointCalculator.getComputedEnd() returns "endStr"
** RangeFacetRequest calls getComputedEnd() and then passes that String to RangeEndpointCalculator.getValue(String) to parse it back into a type specific object
** why not just have a {{public T getComputedEnd()}} method return the original (type specific) "end" object the calculator already figured out?
* likewise: the type specific start value...
** RangeEndpointCalculator.computeRanges() already figures out {{final T start = getValue(rfr.getStart());}}
** so why make RangeFacetRequest ask the calculator to reparse the original start string with {{calculator.getValue(start);}} ?
** why not just have a {{public T getStart()}} method on the calculator to return that value it already parsed?
* likewise: gap
** should change RangeEndpointCalculator.getGap(String) to just getGap()
** won't save us any parsing round trips like the start / end changes suggested above, but will keep the API consistent
** for the same reason the old code had this comment...{code}
// explicitly return the gap.  compute this early so we are more 
// likely to catch parse errors before attempting math
{code}...RangeEndpointCalculator should call & remember the value of parseGap() early in it's constructor, and then getGap() can just be a trivial accessor method.

----

PivotFacetProcessor.addPivotQueriesAndRanges concerns...

* this method was doing a lot of null checking against the facetQueries and facetRanges Lists -- even though there was never any possibility of those lists being null
* (partly because of these null checks that always passed) things like SimpleFacets instances and RangeFacetProcessors were getting constructed for every Pivot value (even on refinement) even when there are no ranges or facet queries hanging off of a pivot.

So in the updated patch I also cleaned this method up a bit to assert the lists are non null and keep the logicc minimal when they are empty. (had to try it myself to be certain i wans't missing anything)

I also tightened up the error handling to be more specific about where the SyntaxErrors are coming from.



> Let facet queries hang off of pivots
> ------------------------------------
>
>                 Key: SOLR-4212
>                 URL: https://issues.apache.org/jira/browse/SOLR-4212
>             Project: Solr
>          Issue Type: Sub-task
>          Components: search
>    Affects Versions: 4.0
>            Reporter: Steve Molloy
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 5.3, Trunk
>
>         Attachments: SOLR-4212-multiple-q.patch, SOLR-4212-multiple-q.patch, SOLR-4212.patch, SOLR-4212.patch, SOLR-4212.patch, SOLR-4212.patch, SOLR-4212.patch, SOLR-4212.patch, SOLR-6353-4212.patch, SOLR-6353-4212.patch, SOLR-6353-6686-4212.patch, SOLR-6353-6686-4212.patch, SOLR-6353-6686-4212.patch, SOLR-6353-6686-4212.patch, SOLR-6353-6686-4212.patch, patch-4212.txt
>
>
> Facet pivot provide hierarchical support for computing data used to populate a treemap or similar visualization. TreeMaps usually offer users extra information by applying an overlay color on top of the existing square sizes based on hierarchical counts. This second count is based on user choices, representing, usually with gradient, the proportion of the square that fits the user's choices.
> The proposition is to use local parameters to specify facet query to apply for pivot which matches a tag set on facet query. Parameter format would look like:
> facet.pivot={!query=r1}category,manufacturer
> facet.query={!tag=r1}somequery
> facet.query={!tag=r1}somedate:[NOW-1YEAR TO NOW]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org