Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/03 15:11:58 UTC
[GitHub] [lucene] shaie commented on pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

shaie commented on PR #841:
URL: https://github.com/apache/lucene/pull/841#issuecomment-1146061771

   Hi Greg, thanks for your comments. Earlier today I tried to play with the
   new API to implement some other use cases just to get a feel for how they
   will work, and I realized why `HyperRectangle` was proposed and implemented
   the way it was (sorry for being too slow!). Let me try to clarify my
   thoughts a bit, from multiple perspectives:
   
   * HyperRectangles are indeed a generic way of matching an N-dimensional
   point. If one wants ranges, one passes a pair where min/max are different.
   If one wants an exact match, one would pass a range where min/max are equal.
     * What I proposed with the `ExactFacetSetMatcher` implementation is
   merely a specialization of the above. So instead of passing ranges where
   min/max are the same, and having the aggregation algo do redundant range
   checks, it just specializes on how the aggregation is done. Additionally,
   from an API perspective, it might be clearer to the user that they only
   need to pass the expected values, and not construct ranges "because that's
   what the API allows".
     * We could have let `LongPair` implement a `match()` API itself, and a
   sugar API for `LongPair.create(min, max)` which will return either a
   `RangeLongPair` or `ExactLongPair` (don't mind the names too much) to
   specialize the impl, but I'm not sure what will perform better -- calling
   the `Pair.match()` or just doing range checks always.
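
To make the range-versus-exact distinction concrete, here is a minimal sketch. All class names in it (`FacetSetMatcherSketch`, `RangeMatcherSketch`, `ExactMatcherSketch`) are hypothetical and are not the classes in the PR. It only illustrates that an exact matcher gives the same answer as a degenerate range (min == max in every dimension) while skipping the per-dimension range comparison.

```java
import java.util.Arrays;

/** Hypothetical extension point: decides whether an N-dimensional point matches. */
interface FacetSetMatcherSketch {
  boolean matches(long[] point);
}

/** Generic hyperrectangle: one inclusive [min, max] range per dimension. */
final class RangeMatcherSketch implements FacetSetMatcherSketch {
  private final long[] mins;
  private final long[] maxs;

  RangeMatcherSketch(long[] mins, long[] maxs) {
    if (mins.length != maxs.length) {
      throw new IllegalArgumentException("mins and maxs must have the same length");
    }
    this.mins = mins.clone();
    this.maxs = maxs.clone();
  }

  @Override
  public boolean matches(long[] point) {
    if (point.length != mins.length) {
      return false;
    }
    for (int i = 0; i < point.length; i++) {
      // Two comparisons per dimension, even when min == max.
      if (point[i] < mins[i] || point[i] > maxs[i]) {
        return false;
      }
    }
    return true;
  }
}

/** Specialization: the caller passes only the expected values, no ranges. */
final class ExactMatcherSketch implements FacetSetMatcherSketch {
  private final long[] expected;

  ExactMatcherSketch(long... expected) {
    this.expected = expected.clone();
  }

  @Override
  public boolean matches(long[] point) {
    // A single equality check per dimension.
    return Arrays.equals(expected, point);
  }
}
```

Whether the specialized matcher is measurably faster than always doing range checks is exactly the open question above; the sketch makes no performance claim.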
   
   * From an API and user perspective, I wonder if HyperRectangle is a
   clear enough name to denote what we're building here. I.e., is it perhaps
   too expert? For instance I initially thought the proposal is for
   geo-something faceting before I realized it has nothing specifically to do
   with geo (again, sorry for being slow :)). Naming is hard, but I _think_
   that `FacetSet` with a bunch of helper classes might make the API clearer.
      * I totally think we should have a `HyperRectangle` impl, maybe call it
   `HyperRectangleFacetSetMatcher` or `RangeFacetSetMatcher`. This is the
   generic catch-all / fallback impl if one cannot find a specialized impl, or
   doesn't know how to write one.
     * I hope that with this API we'll also pave the way for users to realize
   they can implement their own `FacetSetMatcher`, for instance treating the
   first 2 dimensions as range, and 3rd and 4th as exact (again, to create
   specialized matchers).
     * I also think that the proposed API under `facetset` is easier to
   extend, even though I'm sure we can re-structure the `hyperrectangle`
   package to allow for such extension. Essentially you want a _Reader_ which
   knows how to read the `long[]` and a `Filter/Matcher/Whatever` which
   interprets them, returning a boolean or something else. That part is the
   extension point we'd hope users to implement, which will make the
   underlying storage of the points an abstraction that users don't have to
   deal with.
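
A rough sketch of that Reader/Matcher split, under stated assumptions: `PackedPointReader`, `PointMatcher` and `MixedMatcher` are hypothetical names, and the storage layout (consecutive big-endian longs, `dims` per point) is made up for illustration and is not the encoding used in the PR. The matcher shown is the custom one mentioned above: dimensions 0 and 1 are treated as ranges, dimensions 2 and 3 as exact values.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader: hides how points are stored from the matcher. */
final class PackedPointReader {
  /** Decodes consecutive big-endian longs into points of {@code dims} values each. */
  static List<long[]> read(byte[] packed, int dims) {
    if (dims <= 0) {
      throw new IllegalArgumentException("dims must be positive");
    }
    ByteBuffer buf = ByteBuffer.wrap(packed); // big-endian by default
    List<long[]> points = new ArrayList<>();
    while (buf.remaining() >= dims * Long.BYTES) {
      long[] point = new long[dims];
      for (int i = 0; i < dims; i++) {
        point[i] = buf.getLong();
      }
      points.add(point);
    }
    return points;
  }
}

/** Hypothetical extension point: interprets a decoded point. */
interface PointMatcher {
  boolean matches(long[] point);
}

/** Custom matcher: dims 0-1 are inclusive ranges, dims 2-3 are exact values. */
final class MixedMatcher implements PointMatcher {
  private final long min0, max0, min1, max1, exact2, exact3;

  MixedMatcher(long min0, long max0, long min1, long max1, long exact2, long exact3) {
    this.min0 = min0;
    this.max0 = max0;
    this.min1 = min1;
    this.max1 = max1;
    this.exact2 = exact2;
    this.exact3 = exact3;
  }

  @Override
  public boolean matches(long[] p) {
    return p.length == 4
        && p[0] >= min0 && p[0] <= max0
        && p[1] >= min1 && p[1] <= max1
        && p[2] == exact2
        && p[3] == exact3;
  }
}
```

The point of the split is that a user writing `MixedMatcher` never touches the byte layout; only the reader does.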
   
   * Regarding the other use cases I've mentioned, both `HyperRectangle` and
   `MatchingFacetSetCounts` do the same job -- they match an entire set of
   points against a given set of points. The `Matcher` even implements an API
   which returns a `boolean`. True, you can pass for some of the dimensions
   `(NEG_INF, POS_INF)` to denote that you "don't care" about some of the
   dimensions, but still at its core this implementation tells you how many
   docs matched each set.
     * What this impl doesn't let you do is use, say dims 1-3 for matching,
   and 4 for counting so that you can ask "What are the top-3 Years for
   Oscar+Drama awards" (I hope what I wrote makes sense!!). In this example
   you'll want to "match" docs if they have "Oscar" and "Drama" dimensions,
   but then count the "Year" dimension and compute the top-K. This use case
   cannot be implemented with either of the currently proposed impls, since
   they only match docs.
     * What I tried to say is that for this kind of use case we'll need a
   different counting impl (but still use the same on-disk structure!), that's all. One
   that keeps track of the "Year" counts and its `getTopChildren` returns the
   top 3 Years. I hope that makes sense?
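
A minimal sketch of such a "match on some dimensions, count another" impl, to show the shape of the idea. `MatchAndCountSketch` and its `topK` method are hypothetical and are not part of the PR or of Lucene's `Facets` API. For simplicity it counts matching points held in memory rather than documents read from an index, and it uses exact matching on the "fixed" dimensions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MatchAndCountSketch {
  /**
   * Counts the values of {@code countDim} over all points whose {@code matchDims}
   * equal {@code expected}, and returns the k most frequent (value, count) pairs.
   */
  static List<Map.Entry<Long, Integer>> topK(
      List<long[]> points, int[] matchDims, long[] expected, int countDim, int k) {
    Map<Long, Integer> counts = new HashMap<>();
    for (long[] p : points) {
      boolean matched = true;
      for (int i = 0; i < matchDims.length; i++) {
        if (p[matchDims[i]] != expected[i]) {
          matched = false;
          break;
        }
      }
      if (matched) {
        // Aggregate on the counted dimension instead of just recording a hit.
        counts.merge(p[countDim], 1, Integer::sum);
      }
    }
    List<Map.Entry<Long, Integer>> sorted = new ArrayList<>(counts.entrySet());
    // Highest count first; ties broken by the smaller value for determinism.
    sorted.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : Long.compare(a.getKey(), b.getKey());
    });
    return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
  }
}
```

With points laid out as (award, genre, year), calling `topK(points, new int[] {0, 1}, new long[] {OSCAR, DRAMA}, 2, 3)` would answer the "top-3 Years for Oscar+Drama" question, where `OSCAR` and `DRAMA` stand for whatever long encoding is chosen for those values.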
   
   I'll add the HyperRectangle impl to the `facetset` package (I'll reuse the
   existing classes from `hyperrectangle` for now), and then we can see how it works.
   
   On Fri, Jun 3, 2022 at 10:56 AM Greg Miller wrote:
   
   > Trying to catch up on this now. I've been traveling and it's been
   > difficult to find time. Thanks for all your thoughts @shaie
   > <https://github.com/shaie>!
   >
   > I think I'm only half-following your thoughts on the different APIs
   > necessary, and will probably need to look at what you've documented in more
   > detail. But... as a half-baked response, I'm not convinced (yet?) that we
   > need this level of complexity in the API. In my mind, what we're trying to
   > build is a generalization of what is already supported in long/double-range
   > faceting (e.g., LongRangeFacetCounts), where the user specifies all the
   > ranges they want counts for, we count hits against those ranges, and
   > support returning those counts through a couple APIs. Those faceting
   > implementations allow ranges to be specified in a single dimension, and
   > determine which ranges the document points (in one-dimensional space) fall
   > in.
   >
   > So "hyperrectangle faceting"—in my original thinking at least—is just a
   > generalization of this to multiple dimensions. The points associated with
   > the documents are in n-dimensional space, and the user specifies the
   > different "hyperrectangles" they want counts for by providing a [min, max]
   > range in each dimension. For cases like the "automotive parts finder"
   > example, it's perfectly valid for the "hyperrectangles" provided by the
   > user to also be single points (where the min/max are equivalent values in
   > each dimension). But it's also valid to mix-and-match, where some
   > dimensions are single points and some are ranges (e.g., "all auto parts
   > that fit 'Chevy' (single point) for the years 2000 - 2010 (range)").
   >
   > In the situation where a user wants to "fix some dimension" and count over
   > others, it can still be described as a set of "hyperrectangles," but where
   > the specified ranges on some of the dimensions happen to be the same across
   > all of them.
   >
   > So I'm not quite sure if what you're suggesting in the API is just
   > syntactic sugar on top of this idea, or if we're possibly talking about
   > different things here? I'll try to dive into your suggestion more though
   > and understand. I feel like I'm just missing something important and need
   > to catch up on your thinking. Thanks again for sharing! I'll circle back in
   > a few days when I've (hopefully) had some more time to spend on this :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

