You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/04/27 05:34:08 UTC
[GitHub] [lucene] Yuti-G opened a new pull request, #843: Lucene-10538: TopN is not being used in getTopChildren
Yuti-G opened a new pull request, #843:
URL: https://github.com/apache/lucene/pull/843
# Description
This change fixed the issue that top-n parameter is not being used in the getTopChildren function in RangeFacetCounts.
# Solution
Evaluate if top-n value is less than the children size, if it does, set the result size to topN.
# Tests
Fixed existing tests that failed to test getTopChildren and added new tests in TestRangeFacetCounts
# Checklist
Please review the following and check all that apply:
- [X] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability.
- [X] I have created a Jira issue and added the issue ID to my pull request title.
- [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [X] I have developed this patch against the `main` branch.
- [X] I have run `./gradlew check`.
- [X] I have added tests for my changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1111641082
Hi @gsmiller, thanks for taking a look at my PR! Yes, I agreed that the current behavior - sorting children(range in this case) by range values instead of counts should be more preferred for the majority of use cases, and my PR keeps this behavior by returning the requested n children sorted by ranges. Indeed, this may create confusion for users who want to return ranges sorted by counts, because that is usually what top-n means.
However, I think the current getTopChildren functionality takes in a top-n param but does not use it seems buggy to me, and we have use cases where the user wants the first k ranges. I think utilizing the top-n param in the function should not break the current use cases (getAllChildren) as long as the user specifies a large top-n that can return all children.
Please let me know how you think. Thanks again for the feedback!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1111546935
Interesting find. But this fix isn't really providing "top" children is it? It's just providing the "first" N children right? Wouldn't we want to provide the "top" ranges based on their counts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
Posted by GitBox <gi...@apache.org>.
Yuti-G closed pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
URL: https://github.com/apache/lucene/pull/843
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1113335265
@Yuti-G thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112806474
Thanks @gsmiller for confirming! I will create another Jira issue to propose adding getAllChildren to Facets, and revisit this issue after getting feedback from the community. Thanks again for the great suggestion!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1110592769
Please see the benchmark results below:
```
TaskQPS baseline StdDevQPS candidate StdDev Pct diff p-value
BrowseMonthTaxoFacets 15.70 (27.9%) 14.64 (22.3%) -6.7% ( -44% - 60%) 0.400
BrowseMonthSSDVFacets 12.85 (19.4%) 12.55 (17.2%) -2.3% ( -32% - 42%) 0.686
HighTermTitleBDVSort 119.58 (13.6%) 116.91 (14.1%) -2.2% ( -26% - 29%) 0.610
BrowseDayOfYearTaxoFacets 15.33 (27.5%) 15.03 (22.7%) -2.0% ( -40% - 66%) 0.805
BrowseDateTaxoFacets 14.63 (26.4%) 14.35 (21.8%) -1.9% ( -39% - 62%) 0.800
HighSpanNear 18.33 (3.8%) 18.07 (4.8%) -1.4% ( -9% - 7%) 0.307
BrowseRandomLabelTaxoFacets 11.09 (23.3%) 10.95 (20.2%) -1.2% ( -36% - 55%) 0.863
HighSloppyPhrase 14.34 (3.1%) 14.19 (3.0%) -1.0% ( -6% - 5%) 0.277
MedSpanNear 14.23 (2.7%) 14.10 (3.3%) -0.9% ( -6% - 5%) 0.339
MedPhrase 25.36 (2.5%) 25.14 (3.0%) -0.9% ( -6% - 4%) 0.321
HighTerm 1053.63 (5.4%) 1044.60 (5.9%) -0.9% ( -11% - 11%) 0.631
AndHighLow 663.80 (2.2%) 658.98 (2.4%) -0.7% ( -5% - 3%) 0.316
LowPhrase 230.84 (2.9%) 229.27 (2.6%) -0.7% ( -5% - 4%) 0.432
MedTermDayTaxoFacets 27.08 (4.8%) 26.90 (4.7%) -0.7% ( -9% - 9%) 0.655
HighPhrase 194.15 (2.3%) 193.03 (2.1%) -0.6% ( -4% - 3%) 0.405
MedSloppyPhrase 44.73 (3.7%) 44.48 (3.3%) -0.5% ( -7% - 6%) 0.619
Respell 49.15 (2.2%) 48.95 (2.6%) -0.4% ( -5% - 4%) 0.600
AndHighHighDayTaxoFacets 17.72 (1.6%) 17.66 (1.5%) -0.4% ( -3% - 2%) 0.455
IntNRQ 193.61 (0.9%) 193.14 (1.5%) -0.2% ( -2% - 2%) 0.525
LowSpanNear 116.92 (3.4%) 116.72 (2.9%) -0.2% ( -6% - 6%) 0.863
Fuzzy1 89.66 (1.8%) 89.54 (1.8%) -0.1% ( -3% - 3%) 0.809
MedTerm 1240.40 (5.3%) 1238.75 (6.0%) -0.1% ( -10% - 11%) 0.941
OrHighMed 38.61 (3.2%) 38.57 (3.8%) -0.1% ( -6% - 7%) 0.917
LowSloppyPhrase 69.00 (2.6%) 68.92 (2.3%) -0.1% ( -4% - 4%) 0.890
AndHighMedDayTaxoFacets 46.51 (1.6%) 46.48 (1.7%) -0.1% ( -3% - 3%) 0.919
BrowseDayOfYearSSDVFacets 11.19 (16.1%) 11.19 (13.4%) -0.1% ( -25% - 35%) 0.991
AndHighMed 120.06 (3.2%) 120.01 (5.1%) -0.0% ( -8% - 8%) 0.974
Fuzzy2 22.55 (2.2%) 22.55 (1.8%) -0.0% ( -3% - 4%) 0.975
MedIntervalsOrdered 3.74 (4.5%) 3.74 (4.4%) 0.0% ( -8% - 9%) 0.981
OrHighNotLow 864.86 (4.4%) 865.48 (4.8%) 0.1% ( -8% - 9%) 0.961
LowIntervalsOrdered 6.22 (4.5%) 6.22 (4.0%) 0.1% ( -7% - 8%) 0.943
OrHighNotMed 766.34 (4.6%) 767.48 (4.9%) 0.1% ( -9% - 10%) 0.922
OrHighMedDayTaxoFacets 3.56 (3.8%) 3.57 (3.5%) 0.2% ( -6% - 7%) 0.851
OrNotHighMed 667.42 (4.0%) 668.99 (5.3%) 0.2% ( -8% - 9%) 0.875
OrNotHighHigh 596.82 (4.0%) 598.76 (5.4%) 0.3% ( -8% - 10%) 0.829
OrHighLow 276.08 (2.4%) 277.04 (2.4%) 0.3% ( -4% - 5%) 0.649
HighIntervalsOrdered 11.75 (5.8%) 11.80 (5.2%) 0.4% ( -10% - 12%) 0.817
OrHighHigh 28.05 (2.9%) 28.17 (3.6%) 0.4% ( -5% - 7%) 0.686
BrowseDateSSDVFacets 2.09 (10.1%) 2.10 (10.3%) 0.4% ( -18% - 23%) 0.890
OrHighNotHigh 842.72 (4.0%) 846.75 (4.8%) 0.5% ( -8% - 9%) 0.732
Wildcard 101.90 (6.7%) 102.54 (6.9%) 0.6% ( -12% - 15%) 0.770
PKLookup 136.17 (2.9%) 137.03 (4.1%) 0.6% ( -6% - 7%) 0.570
AndHighHigh 39.46 (3.8%) 39.99 (6.5%) 1.3% ( -8% - 12%) 0.424
TermDTSort 93.05 (12.0%) 94.39 (13.3%) 1.4% ( -21% - 30%) 0.719
LowTerm 1247.69 (3.6%) 1267.40 (5.9%) 1.6% ( -7% - 11%) 0.311
OrNotHighLow 1003.56 (4.0%) 1020.22 (2.9%) 1.7% ( -5% - 8%) 0.134
HighTermDayOfYearSort 118.75 (12.6%) 121.20 (15.3%) 2.1% ( -23% - 34%) 0.643
BrowseRandomLabelSSDVFacets 6.81 (5.5%) 6.96 (5.8%) 2.2% ( -8% - 14%) 0.224
Prefix3 186.45 (13.6%) 191.69 (14.2%) 2.8% ( -22% - 35%) 0.523
HighTermMonthSort 14.22 (13.7%) 15.04 (19.7%) 5.7% ( -24% - 45%) 0.288
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1111552499
I also suspect the current functionality is more useful for the majority of use-cases than actually truncating by a top-n and sorting the ranges by their counts. I imagine the order of ranges actually means something in many of these cases (i.e., there may be a natural numeric ordering between the ranges that the user wants to maintain). If we instead sorted the resulting ranges by their counts, it might be somewhat challenging for the user to reconcile. It's also pretty trivial work to provide counts for all of the ranges. Unlike other faceting implementations, we don't have to do ordinal -> path lookups or anything for each value we provide.
So I suspect the desired behavior for most users is actually what's implemented today, but I also would agree that the API is pretty wonky and confusing. It feels like we might benefit from a "get all children" type method for this sort of thing. I suspect this is a bit of an artifact of later adding range facet counting by implementing a faceting API more tailored towards taxonomy faceting.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1122824451
Thanks @Yuti-G !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1121351525
@Yuti-G would it make sense to close out this PR since I don't think we plan to merge this as it is?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112602017
Thanks for all your thoughts @Yuti-G! I think I still disagree with changing the current behavior. If we want to implement the contract of `getTopChildren` properly, we'd need to sort by count. That's what "top" means in this API, and I think it's a bit confusing to start enforcing a "top n" parameter when "top" isn't really properly defined here. If you have a use-case that needs to actually truncate to `n` ranges while retaining the existing order of ranges originally provided as a user, why not do it outside of the facet code? There's no reason you can't get back all ranges as it works today and then just truncate to the number you want right? That's essentially what you're proposing putting in the faceting code, but I'm not sure doing that is the right thing to do here.
All this said, I'm in favor of exploring the addition of a `getAllChildren` method that would retain this functionality and then properly implementing `getTopChildren` that orders by count if there's a use-case for that. I suspect there is.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112633027
Hi @gsmiller, thanks for your feedback! Yes, I was proposing keeping the existing behavior of getTopChildren but utilizing the top-n param, as I thought it would safe because we would not break the current use cases, but you are absolutely right that we should stick with what top-n means here.
Please correct me if misunderstood your proposal. Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks!
Best,
Yuting
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-1112797399
> Are you suggesting that I add another getAllChildren API to retain the current behavior of getTopChildren, and then fix getTopChildren by returning ranges sorted by counts, or do you mean we should just keep this API untouched? Thanks!
I'm suggesting the former, but I think that's a bigger project that you meant to capture with this PR and associated Jira. I would recommend opening a separate Jira issue to propose the idea of adding `getAllChildren` to `Facets`. We have use-cases for that functionality here (but it's just sort of improperly implemented under `getTopChildren`) and also in `LongValueFacetCounts` (see `#getAllChildrenSortByValue`). I think it's reasonable to add that functionality to all faceting implementations and then come up with a migration plan to move the current `RangeFacetCounts#getTopChildren` over and properly implement `RangeFacetCounts#getTopChildren`. But by opening a Jira, maybe others will weigh in and agree/disagree there, which would be a healthy conversation.
Thanks again for digging into this and working on improving faceting! This is great!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org