You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/11/13 11:01:25 UTC

[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

    [ https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821139#comment-13821139 ] 

Shai Erera commented on LUCENE-5339:
------------------------------------

Some comments about the patch:

* You have a TODO file which seems to have been 'svn added' - are you aware of it? Just making sure, so it's not accidentally committed :)

* Maybe we should do this work in a branch and avoid the .simple package? It seems that the majority of the patch is about copy-pasting classes over to .simple, which I assume you did so you can work in isolation and have tests pass?

* FieldTypes:
** Can FT.FieldType provide a ctor taking these arguments and then DEFAULT_FIELD_TYPE pass (false,false)? Or, init those two members to false.
** I wonder if FieldTypes won't confuse users w/ Field.FieldType, so maybe you should name it DimType or something? And the upper class FacetsConfig?
** Maybe instead of setHierarchical + setMultiValued you do setDimType(DimType)? Then you could avoid the synchronization?
** Not sure, but I think that SimpleFacetFields adds the dimension's ordinal if the dim is both hierarchical and multi-valued? That's a step back from the default ALL_BUT_DIM that we have today. I think we might want to have a requiresDimValue or something, because I do think the dimension's value (count) is most often unneeded, and it's a waste to encode its ordinal?

* Constants isn't documented, not sure if 'precommit' will like that, but in general I think the constants should have jdocs. Maybe put a nocommit?

* TaxonomyFacetCounts
** .getAllDims() -- If a dimension is not hierarchical, I think SimpleFacetResult.count == 0? In that case, sorting by its count is useless?
*** I think it's also relevant for SortedSet?

* LabelAndValue has a Number as the value rather than double. I see this class is used only in the final step (labeling), but what's wrong w/ the previous 'double' primitive? Is it to avoid casting or something?

On the approach in general - I personally don't think the current API is overwhelming, but looking at your patch, I do think it could use some simplification. But all in all, it allows an app extend the facets capabilities in ways that do not force it to duplicate a lot of code. Let me explain:

+CategoryListIterator+

That's the piece of code which is responsible for decoding the category list. It's currently used for two purposes: (1) allow you to encode the categories in different formats (using IntEncoder/Decoder abstraction) as well as load the categories from a different source. E.g. that's how we were able to test fast cutting over facets to DV from Payload, that's a nice abstraction for letting you load the values from a cache, and so forth.

If you want to specialize code, you can do what FastCountingFacetsAggregator does -- if you encoded w/ DGapVInt and you want counting, it doesn't go through the CLI API. But if you look at SumScoreFacetsAggregator, there's no reason for it to be limited to just DGapVInt encoding, or that you (the app) won't be able to pull the ordinals from a cache, without rewriting SumScoreFacetsAggregator.

Hack, even though it's not implemented yet, we could implement a SortedSetCLI (as far as I can tell) which uses SortedSetDVReaderState to pull the ordinals of document. It's really a useful interface.

So IMO the CLI interface is important. Most apps, I believe, don't extend the facets module (they'll use counting and maybe aggregate by ValueSource pre-built classes). But those that do want to add another aggregation method (e.g. ValueSourceFacetsAggregator), can simply use the CLI API, or copy-paste the DGapVInt decode logic if they want to specialize. Especially that that code is not that trivial, and we've spent a lot of time optimizing it. And the few apps that want to explore other encoding/decoding logic can write their own CLI.

Without that interface, if e.g. you want to encode the ordinals differently, or load them from a cache (e.g. CachedOrds), you'd have to rewrite *every* aggregation method that you want to use. And most probably you'll copy-paste the majority of the code. So why let go of the interface?

+FacetsCollector+

It's pretty much your SimpleFC, only it requires to get a FacetsAccumulator. That was done as a convenience so that apps can send in a collector and then call fc.getFacetResults. But its essense is really the fact that it collects MatchingDocs per segment, which the aggregators later use to aggreate the faces in those matching docs.

Given the changes in this patch, +1 for making it just return the List<MatchingDocs> and let whatever API we decide on to use that list, instead of FC.getFacetResults().

+FacetsAggregator+

This class lets you implement the aggregation function, be it count, sum-score, value-source or by-association. I think it's equivalent to TaxoFacetCounts, e.g. in your approach, if anyone will want to aggregate the ordinals by other than count, that's what they will need to write.

What I like about the approach in your patch is that (and I had similar thoughts on the last round of API changes) it "reverses" the flow. The app says "I want to count the ords in that field, and sum-score the ords in that field" and then the app can ask for the aggregated values from each such aggregator. Whereas today, app does "new CFR(), new CFR(), new CFR(), new SumScore()" and then receives back the List<FacetResult>. Today's API is convenient in that it allows you to pass a List<FR> around in your code and get back a List<FacetResult>. But I don't see this convenience getting away (you'll just pass a List<FacetsAggregator> and then pull the requested values later on). Also, you sort of pulled the specific requests, such as .getSpecificCount(CP) and .getAllDims() up to FacetsAggregator, where today this can be done by FacetResultsHandler (you'd have to write one, as I did on LUCENE-5333).

What I like less about it is that it folds in the logic coded by FacetsAccumulator and FacetResultsHandler:

* FacetsAccumulator
** Invokes FacetsAggregator to aggregate the facets in MatchingDocs
** Whether the request specifies a CP for which we need to rollupValues, it asks the aggregator to do so
** It uses a FacetResultsHandler to compute the top-K facets for each request.

* FacetResultsHandler
** Computes top-K values for each parent category
** Has variants that return a full sub-tree (TopKInEachNodeHandler)
** By means of extension, allows you to get the value of a specific CategoryPath, as well as get values for all dimensions (as I've shown on LUCENE-5333).

So I'm torn here. I really like how the API lets you extend things, and only the things that you care about. If you only want to implement a ValueSourceFacetsAggregator, you shouldn't be worried about whether or not to call rollupValues or how to compute top-K. That seems redundant. You should focus on what you want to extend. I also like less the fact that now every TaxoFacetsSomething will need to be aware of the facets configuration...

But I do like the approach presented in the patch, so I wonder if there's anything we can do to preserve the focused extension points on one hand, yet simplify the app's integration with the API on the other hand. Like, if an app did something like this:

{code}
FacetsAccumulator fa = new TaxoFA(..., matchingDocs, new CountingFacetsAggregator("dvFacets1", [CategoryListIterator]), [FacetResultsHandler]); // or SortedSetFA()
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions(); // does not depend on the aggregation function, only needs the Taxonomy and int[]/float[] array source.

fa = new TaxoFA(..., matchingDocs, new SumScoreFacetsAggregator("dvFacets2", [CategoryListIterator]), [FacetResultsHandler]); // or SortedSetFA()
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions();

fa = new TaxoFA(..., matchingDocs, new SumIntAssociationFacetsAggregator("dvFacets3"), [FacetResultsHandler]); // no CLI or SortedSetFA
fa.getTopK(dimension); // or fa.getTopK(new CategoryPath(...))
fa.getValueOf(CategoryPath);
fa.getAllDimensions();

RangeAccumulator range = new RangeAccumulator(..., matchingDocs, ranges); // RangeAccumulator might not even extend FacetsAccumulator
range.getTopK(dimension);
// I think the rest of the getters make no sense?
{code}

The idea is that internally, FacetsAccumulator uses FacetsAggregator to do the aggregation and rollup if needed, and you can optionally pass a FacetResultsHandler, while it defaults to a FlatFacetResultsHandler (today's silly name DepthOneFRH). You can also pass your CategoryListIterator to the FacetsAggregator (those that need it). That way, app's code looks similar to the one in the patch, only we allow more code reuse between different aggregation functions.

I hope this will allow us to support other aggregation functions by SortedSet too, as there's really no reason why not to do it. There are two differences between SortedSet and TaxoIndex: (1) the Taxonomy implementation (where you pull the children of an ordinal from, how you do labeling etc.) and (2) where the ordinals are encoded in the search index (BinaryDV in TaxoIndex case, SortedSetDV in SortedSet case). The latter can easily be solved by a CLI implementation of SortedSet and former by an abstract Taxonomy (read-only) API which both SortedSet and TaxoIndex implement. We should explore these on a separate issue though.

I think that the problem with your current patch is that the aggregation is done in the constructor, so it sort of eliminates reasonable ways to extend things. But if things were done differently, we could have an abstract FacetsAggregator instead of FacetsAccumulator which let you implement only the .aggregate() method, and take care of rollup + top-K computation itself. But that means you'd need to initialize the class and then call a .aggregate() or .compute() before you call any of the .getTopK() for instance (unless we check in every .getTopK() if it's already computed...).

+FacetArrays+

Unfortunately, there is no efficient way to hold either an int or float in Java, without specifying the type of the array (i.e. long[] or double[] are too expensive), so this class abstracts the way facets are aggregated, so that we can have different FacetAggregators implementations, again, reusing the majority of the irrelevant code (top-K computation, rollup, CategoryListIterator...).

I don't know if we can get rid of it... especially as there might be aggregators that are interested in both arrays (e.g. avgOfSomething). Maybe if we have a FacetsAccumulator abstract class, we can have an IntFA and FloatFA abstract classes which give you access to an int[]/float[] arrays, and matching FacetResultsHandler?

+FacetIndexingParams+

By all means, I'm +1 for simplifying this class and make it more user-friendly. I like how you can specify a DimType (hierarchical, singleValue...) instead of setting the more arbitrary OrdinalPolicy. I do think this class has some value today, e.g. in CategoryListParams (which you nuked), the app doesn't need to know about the field under which facets are indexed at all, only if it cares to change it or index multiple category lists. Not sure it's a blocker, just pointing that out. On the plus side - it makes app more aware about what's going on in its index...

I don't have strong feelings about this class, we can rename it to FacetsConfig or whatever, but let's put the configuration parameters under it (e.g. DimType and drillDownDelimiter)?

> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org