You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/04/27 05:34:08 UTC

[GitHub] [lucene] Yuti-G opened a new pull request, #843: Lucene-10538: TopN is not being used in getTopChildren

Yuti-G opened a new pull request, #843:
URL: https://github.com/apache/lucene/pull/843

   # Description
   
   This change fixed the issue that top-n parameter is not being used in the getTopChildren function in RangeFacetCounts. 
   
   # Solution
   
   Evaluate if top-n value is less than the children size, if it does, set the result size to topN.
   
   # Tests
   
   Fixed existing tests that failed to test getTopChildren and added new tests in TestRangeFacetCounts
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request title.
   - [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `main` branch.
   - [X] I have run `./gradlew check`.
   - [X] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1111641082

   Hi @gsmiller, thanks for taking a look at my PR! Yes, I agreed that the current behavior - sorting children(range in this case) by range values instead of counts should be more preferred for the majority of use cases, and my PR keeps this behavior by returning the requested n children sorted by ranges. Indeed, this may create confusion for users who want to return ranges sorted by counts, because that is usually what top-n means. 
   
   However, I think the current getTopChildren functionality takes in a top-n param but does not use it seems buggy to me, and we have use cases where the user wants the first k ranges. I think utilizing the top-n param in the function should not break the current use cases (getAllChildren) as long as the user specifies a large top-n that can return all children. 
   
   Please let me know how you think. Thanks again for the feedback!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1111546935

   Interesting find. But this fix isn't really providing "top" children is it? It's just providing the "first" N children right? Wouldn't we want to provide the "top" ranges based on their counts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts

Posted by GitBox <gi...@apache.org>.
Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
URL: https://github.com/apache/lucene/pull/843


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1113335265

   @Yuti-G thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112806474

   Thanks @gsmiller for confirming! I will create another Jira issue to propose adding getAllChildren to Facets, and revisit this issue after getting feedback from the community. Thanks again for the great suggestion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1110592769

   Please see the benchmark results below:
   ```
                  TaskQPS             baseline      StdDevQPS   candidate  StdDev    Pct diff            p-value
              BrowseMonthTaxoFacets       15.70     (27.9%)       14.64     (22.3%)   -6.7% ( -44% -   60%) 0.400
              BrowseMonthSSDVFacets       12.85     (19.4%)       12.55     (17.2%)   -2.3% ( -32% -   42%) 0.686
               HighTermTitleBDVSort      119.58     (13.6%)      116.91     (14.1%)   -2.2% ( -26% -   29%) 0.610
          BrowseDayOfYearTaxoFacets       15.33     (27.5%)       15.03     (22.7%)   -2.0% ( -40% -   66%) 0.805
               BrowseDateTaxoFacets       14.63     (26.4%)       14.35     (21.8%)   -1.9% ( -39% -   62%) 0.800
                       HighSpanNear       18.33      (3.8%)       18.07      (4.8%)   -1.4% (  -9% -    7%) 0.307
        BrowseRandomLabelTaxoFacets       11.09     (23.3%)       10.95     (20.2%)   -1.2% ( -36% -   55%) 0.863
                   HighSloppyPhrase       14.34      (3.1%)       14.19      (3.0%)   -1.0% (  -6% -    5%) 0.277
                        MedSpanNear       14.23      (2.7%)       14.10      (3.3%)   -0.9% (  -6% -    5%) 0.339
                          MedPhrase       25.36      (2.5%)       25.14      (3.0%)   -0.9% (  -6% -    4%) 0.321
                           HighTerm     1053.63      (5.4%)     1044.60      (5.9%)   -0.9% ( -11% -   11%) 0.631
                         AndHighLow      663.80      (2.2%)      658.98      (2.4%)   -0.7% (  -5% -    3%) 0.316
                          LowPhrase      230.84      (2.9%)      229.27      (2.6%)   -0.7% (  -5% -    4%) 0.432
               MedTermDayTaxoFacets       27.08      (4.8%)       26.90      (4.7%)   -0.7% (  -9% -    9%) 0.655
                         HighPhrase      194.15      (2.3%)      193.03      (2.1%)   -0.6% (  -4% -    3%) 0.405
                    MedSloppyPhrase       44.73      (3.7%)       44.48      (3.3%)   -0.5% (  -7% -    6%) 0.619
                            Respell       49.15      (2.2%)       48.95      (2.6%)   -0.4% (  -5% -    4%) 0.600
           AndHighHighDayTaxoFacets       17.72      (1.6%)       17.66      (1.5%)   -0.4% (  -3% -    2%) 0.455
                             IntNRQ      193.61      (0.9%)      193.14      (1.5%)   -0.2% (  -2% -    2%) 0.525
                        LowSpanNear      116.92      (3.4%)      116.72      (2.9%)   -0.2% (  -6% -    6%) 0.863
                             Fuzzy1       89.66      (1.8%)       89.54      (1.8%)   -0.1% (  -3% -    3%) 0.809
                            MedTerm     1240.40      (5.3%)     1238.75      (6.0%)   -0.1% ( -10% -   11%) 0.941
                          OrHighMed       38.61      (3.2%)       38.57      (3.8%)   -0.1% (  -6% -    7%) 0.917
                    LowSloppyPhrase       69.00      (2.6%)       68.92      (2.3%)   -0.1% (  -4% -    4%) 0.890
            AndHighMedDayTaxoFacets       46.51      (1.6%)       46.48      (1.7%)   -0.1% (  -3% -    3%) 0.919
          BrowseDayOfYearSSDVFacets       11.19     (16.1%)       11.19     (13.4%)   -0.1% ( -25% -   35%) 0.991
                         AndHighMed      120.06      (3.2%)      120.01      (5.1%)   -0.0% (  -8% -    8%) 0.974
                             Fuzzy2       22.55      (2.2%)       22.55      (1.8%)   -0.0% (  -3% -    4%) 0.975
                MedIntervalsOrdered        3.74      (4.5%)        3.74      (4.4%)    0.0% (  -8% -    9%) 0.981
                       OrHighNotLow      864.86      (4.4%)      865.48      (4.8%)    0.1% (  -8% -    9%) 0.961
                LowIntervalsOrdered        6.22      (4.5%)        6.22      (4.0%)    0.1% (  -7% -    8%) 0.943
                       OrHighNotMed      766.34      (4.6%)      767.48      (4.9%)    0.1% (  -9% -   10%) 0.922
             OrHighMedDayTaxoFacets        3.56      (3.8%)        3.57      (3.5%)    0.2% (  -6% -    7%) 0.851
                       OrNotHighMed      667.42      (4.0%)      668.99      (5.3%)    0.2% (  -8% -    9%) 0.875
                      OrNotHighHigh      596.82      (4.0%)      598.76      (5.4%)    0.3% (  -8% -   10%) 0.829
                          OrHighLow      276.08      (2.4%)      277.04      (2.4%)    0.3% (  -4% -    5%) 0.649
               HighIntervalsOrdered       11.75      (5.8%)       11.80      (5.2%)    0.4% ( -10% -   12%) 0.817
                         OrHighHigh       28.05      (2.9%)       28.17      (3.6%)    0.4% (  -5% -    7%) 0.686
               BrowseDateSSDVFacets        2.09     (10.1%)        2.10     (10.3%)    0.4% ( -18% -   23%) 0.890
                      OrHighNotHigh      842.72      (4.0%)      846.75      (4.8%)    0.5% (  -8% -    9%) 0.732
                           Wildcard      101.90      (6.7%)      102.54      (6.9%)    0.6% ( -12% -   15%) 0.770
                           PKLookup      136.17      (2.9%)      137.03      (4.1%)    0.6% (  -6% -    7%) 0.570
                        AndHighHigh       39.46      (3.8%)       39.99      (6.5%)    1.3% (  -8% -   12%) 0.424
                         TermDTSort       93.05     (12.0%)       94.39     (13.3%)    1.4% ( -21% -   30%) 0.719
                            LowTerm     1247.69      (3.6%)     1267.40      (5.9%)    1.6% (  -7% -   11%) 0.311
                       OrNotHighLow     1003.56      (4.0%)     1020.22      (2.9%)    1.7% (  -5% -    8%) 0.134
              HighTermDayOfYearSort      118.75     (12.6%)      121.20     (15.3%)    2.1% ( -23% -   34%) 0.643
        BrowseRandomLabelSSDVFacets        6.81      (5.5%)        6.96      (5.8%)    2.2% (  -8% -   14%) 0.224
                            Prefix3      186.45     (13.6%)      191.69     (14.2%)    2.8% ( -22% -   35%) 0.523
                  HighTermMonthSort       14.22     (13.7%)       15.04     (19.7%)    5.7% ( -24% -   45%) 0.288
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1111552499

   I also suspect the current functionality is more useful for the majority of use-cases than actually truncating by a top-n and sorting the ranges by their counts. I imagine the order of ranges actually means something in many of these cases (i.e., there may be a natural numeric ordering between the ranges that the user wants to maintain). If we instead sorted the resulting ranges by their counts, it might be somewhat challenging for the user to reconcile. It's also pretty trivial work to provide counts for all of the ranges. Unlike other faceting implementations, we don't have to do ordinal -> path lookups or anything for each value we provide.
   
   So I suspect the desired behavior for most users is actually what's implemented today, but I also would agree that the API is pretty wonky and confusing. It feels like we might benefit from a "get all children" type method for this sort of thing. I suspect this is a bit of an artifact of later adding range facet counting by implementing a faceting API more tailored towards taxonomy faceting.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1122824451

   Thanks @Yuti-G !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1121351525

   @Yuti-G would it make sense to close out this PR since I don't think we plan to merge this as it is?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112602017

   Thanks for all your thoughts @Yuti-G! I think I still disagree with changing the current behavior. If we want to implement the contract of `getTopChildren` properly, we'd need to sort by count. That's what "top" means in this API, and I think it's a bit confusing to start enforcing a "top n" parameter when "top" isn't really properly defined here. If you have a use-case that needs to actually truncate to `n` ranges while retaining the existing order of ranges originally provided as a user, why not do it outside of the facet code? There's no reason you can't get back all ranges as it works today and then just truncate to the number you want right? That's essentially what you're proposing putting in the faceting code, but I'm not sure doing that is the right thing to do here.
   
   All this said, I'm in favor of exploring the addition of a `getAllChildren` method that would retain this functionality and then properly implementing `getTopChildren` that orders by count if there's a use-case for that. I suspect there is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112633027

   Hi @gsmiller, thanks for your feedback! Yes, I was proposing keeping the existing behavior of getTopChildren but utilizing the top-n param, as I thought it would safe because we would not break the current use cases, but you are absolutely right that we should stick with what top-n means here. 
   
   Please correct me if misunderstood your proposal. Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks!
   
   Best,
   Yuting


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112797399

   > Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks!
   
   I'm suggesting the former, but I think that's a bigger project that you meant to capture with this PR and associated Jira. I would recommend opening a separate Jira issue to propose the idea of adding `getAllChildren` to `Facets`. We have use-cases for that functionality here (but it's just sort of improperly implemented under `getTopChildren`) and also in `LongValueFacetCounts` (see `#getAllChildrenSortByValue`). I think it's reasonable to add that functionality to all faceting implementations and then come up with a migration plan to move the current `RangeFacetCounts#getTopChildren` over and properly implement `RangeFacetCounts#getTopChildren`. But by opening a Jira, maybe others will weigh in and agree/disagree there, which would be a healthy conversation.
   
   Thanks again for digging into this and working on improving faceting! This is great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org