You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/06/14 07:08:21 UTC
[GitHub] [lucene] gautamworah96 opened a new pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyWriter
gautamworah96 opened a new pull request #179:
URL: https://github.com/apache/lucene/pull/179
<!--
_(If you are a project committer then you may remove some/all of the following template.)_
Before creating a pull request, please file an issue in the ASF Jira system for Lucene:
* https://issues.apache.org/jira/projects/LUCENE
You will need to create an account in Jira in order to create an issue.
The title of the PR should reference the Jira issue number in the form:
* LUCENE-####: <short description of problem or changes>
LUCENE must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
# Description
In LUCENE-9450 we switched the Taxonomy index from Stored Fields to BinaryDocValues. In the resulting implementation of the getPath code, we create a new BinaryDocValues's values instance for each ordinal.
It may happen that we may traverse over the same nodes over and over again if the getPath API is called multiple times for ordinals in the same segment/with the same readerIndex.
This PR takes advantage of that fact by sorting ordinals and then trying to find out if some of the ordinals are present in the same segment/have the same readerIndex (by trying to advanceExact to the correct position and not failing) thereby allowing us to reuse the previous BinaryDocValues object.
**This PR was originally proposed in the older `lucene-solr` repo [here](https://github.com/apache/lucene-solr/pull/2247). I discontinued it because we could not find a good "real" use case for it.**
**This new PR uses the `getBulkPath` API inside `IntTaxonomyFacetCounts` and `FloatTaxonomyFacetCounts` classes.** In the `getTopChildren` method within these classes, we usually collect all the `top-n` ordinals first and then translate them into `FacetLabels` one by one using an iterative API. This call has been replaced with the `getBulkPath` method in this PR.
# Solution
Steps:
1. Sort all ordinals and remember their position so as to store the path in the correct position
2. Try to advanceExact to the correct position with the previously calculated readerIndex. If the operation fails, try to find the correct segment for the ordinal and then advanceExact to the desired position.
3. Store this position for future ordinals.
# Tests
1. Added a new randomized test for the API that compares the individual `getPath` results from ordinals with the bulk `FacetLabels` returned by the getBulkPath API
2. Added a backwards compatibility test for the API that reads an older StoredFields based index
Benchmarks:
The stock `luceneutil` taxonomy tasks [call](https://github.com/mikemccand/luceneutil/blob/4d285784b41ea73f771aeffbc0addf7b626a3acf/src/main/perf/SearchTask.java#L208) `getTopChildren(10, ..)` for testing the performance of taxonomy facet counting classes. Since `FastTaxonomyFacetCounts` inherits from `IntTaxonomyFacetCounts` (and uses the same `getTopChildren` logic) the user should be able to reap the benefits of the bulk API directly.
`luceneutil` that tests the top 10 children:
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy1 44.07 (6.0%) 42.44 (7.2%) -3.7% ( -16% - 10%) 0.078
HighTermMonthSort 22.69 (12.1%) 21.99 (11.3%) -3.1% ( -23% - 23%) 0.402
HighTermTitleBDVSort 19.59 (12.4%) 18.99 (12.4%) -3.1% ( -24% - 24%) 0.434
HighTermDayOfYearSort 54.33 (15.7%) 53.11 (16.3%) -2.3% ( -29% - 35%) 0.656
TermDTSort 15.58 (9.1%) 15.24 (10.9%) -2.2% ( -20% - 19%) 0.488
HighTerm 1010.56 (4.0%) 988.75 (5.2%) -2.2% ( -10% - 7%) 0.142
OrHighNotHigh 422.64 (4.4%) 415.30 (5.6%) -1.7% ( -11% - 8%) 0.274
MedTerm 734.85 (5.5%) 722.45 (4.8%) -1.7% ( -11% - 9%) 0.301
Fuzzy2 14.51 (3.6%) 14.39 (2.9%) -0.9% ( -7% - 5%) 0.399
AndHighHigh 18.22 (3.4%) 18.07 (2.6%) -0.8% ( -6% - 5%) 0.388
AndHighLow 181.77 (1.7%) 180.51 (2.3%) -0.7% ( -4% - 3%) 0.277
Respell 30.50 (2.0%) 30.36 (1.7%) -0.5% ( -4% - 3%) 0.438
OrNotHighLow 347.94 (4.7%) 346.46 (3.6%) -0.4% ( -8% - 8%) 0.747
LowSpanNear 3.83 (1.9%) 3.82 (2.2%) -0.3% ( -4% - 3%) 0.626
HighPhrase 39.95 (7.0%) 39.85 (7.3%) -0.3% ( -13% - 15%) 0.909
OrHighNotMed 405.04 (4.9%) 404.00 (5.4%) -0.3% ( -10% - 10%) 0.874
OrHighLow 110.15 (3.0%) 109.87 (3.2%) -0.2% ( -6% - 6%) 0.800
OrHighMed 61.11 (3.1%) 60.96 (2.9%) -0.2% ( -6% - 5%) 0.795
MedSloppyPhrase 5.47 (6.0%) 5.46 (3.1%) -0.2% ( -8% - 9%) 0.890
MedSpanNear 12.25 (1.6%) 12.24 (2.0%) -0.1% ( -3% - 3%) 0.887
IntNRQ 25.39 (1.3%) 25.37 (2.0%) -0.1% ( -3% - 3%) 0.880
AndHighMed 64.96 (3.1%) 64.95 (2.7%) -0.0% ( -5% - 6%) 0.977
HighSpanNear 17.37 (2.0%) 17.38 (3.7%) 0.0% ( -5% - 5%) 0.963
LowSloppyPhrase 55.39 (4.0%) 55.42 (2.6%) 0.1% ( -6% - 6%) 0.958
OrHighHigh 9.85 (2.1%) 9.86 (2.9%) 0.1% ( -4% - 5%) 0.918
Wildcard 29.06 (4.5%) 29.08 (3.4%) 0.1% ( -7% - 8%) 0.942
HighIntervalsOrdered 7.97 (2.5%) 7.98 (3.1%) 0.1% ( -5% - 5%) 0.907
LowTerm 883.85 (4.4%) 884.97 (6.1%) 0.1% ( -9% - 11%) 0.940
Prefix3 40.49 (7.1%) 40.55 (5.2%) 0.2% ( -11% - 13%) 0.939
OrHighNotLow 440.52 (5.1%) 441.49 (4.8%) 0.2% ( -9% - 10%) 0.888
MedPhrase 138.15 (4.3%) 138.51 (4.7%) 0.3% ( -8% - 9%) 0.855
LowPhrase 111.07 (2.9%) 111.40 (3.0%) 0.3% ( -5% - 6%) 0.751
OrNotHighMed 375.26 (6.7%) 376.64 (6.9%) 0.4% ( -12% - 15%) 0.865
BrowseMonthTaxoFacets 0.60 (5.2%) 0.60 (5.6%) 0.4% ( -9% - 11%) 0.806
HighSloppyPhrase 2.11 (5.7%) 2.12 (3.5%) 0.5% ( -8% - 10%) 0.730
BrowseDateTaxoFacets 0.57 (5.1%) 0.58 (5.5%) 0.5% ( -9% - 11%) 0.746
BrowseDayOfYearTaxoFacets 0.57 (5.2%) 0.58 (5.5%) 0.6% ( -9% - 11%) 0.742
OrNotHighHigh 391.99 (5.2%) 394.52 (5.3%) 0.6% ( -9% - 11%) 0.701
PKLookup 113.96 (3.1%) 115.11 (3.4%) 1.0% ( -5% - 7%) 0.320
BrowseMonthSSDVFacets 2.62 (1.0%) 2.69 (1.8%) 3.0% ( 0% - 5%) 0.000
BrowseDayOfYearSSDVFacets 2.46 (0.3%) 2.54 (2.7%) 3.3% ( 0% - 6%) 0.000
```
Since, the previous results did not show any gain in the `Browse*TaxoFacets` tasks, I realized that maybe it was because the ordinalCache with a stock size of 4000 had all ordinals cached which led to the bulk API behaving the same as the iterative `getPath` call.
The next benchmark tested the `getTopChildren` call with 10,000 children
`luceneutil` that tests the top 10000 children:
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Fuzzy2 35.18 (4.4%) 34.81 (7.4%) -1.0% ( -12% - 11%) 0.588
HighTermMonthSort 47.71 (12.1%) 47.28 (11.8%) -0.9% ( -22% - 26%) 0.812
AndHighMed 33.50 (3.4%) 33.35 (2.6%) -0.4% ( -6% - 5%) 0.648
OrHighMed 31.86 (2.2%) 31.75 (1.9%) -0.3% ( -4% - 3%) 0.595
HighIntervalsOrdered 2.59 (1.8%) 2.58 (1.4%) -0.3% ( -3% - 2%) 0.540
OrHighLow 175.82 (2.7%) 175.43 (3.4%) -0.2% ( -6% - 5%) 0.816
AndHighHigh 29.36 (3.9%) 29.33 (2.7%) -0.1% ( -6% - 6%) 0.915
MedSloppyPhrase 18.63 (2.0%) 18.61 (2.6%) -0.1% ( -4% - 4%) 0.909
TermDTSort 37.97 (10.0%) 37.97 (10.2%) 0.0% ( -18% - 22%) 0.997
Fuzzy1 24.97 (4.4%) 24.97 (4.1%) 0.0% ( -8% - 8%) 0.983
MedSpanNear 51.30 (2.0%) 51.35 (1.6%) 0.1% ( -3% - 3%) 0.860
HighTermDayOfYearSort 45.86 (16.5%) 45.92 (16.4%) 0.1% ( -28% - 39%) 0.980
OrHighHigh 8.55 (2.1%) 8.56 (1.9%) 0.1% ( -3% - 4%) 0.826
HighSpanNear 9.30 (1.7%) 9.32 (1.7%) 0.2% ( -3% - 3%) 0.766
BrowseDateTaxoFacets 0.58 (4.7%) 0.58 (4.8%) 0.2% ( -8% - 10%) 0.915
BrowseMonthTaxoFacets 0.60 (4.9%) 0.60 (4.8%) 0.2% ( -9% - 10%) 0.894
BrowseDayOfYearTaxoFacets 0.57 (4.7%) 0.58 (4.8%) 0.3% ( -8% - 10%) 0.861
PKLookup 116.98 (3.6%) 117.30 (3.1%) 0.3% ( -6% - 7%) 0.794
LowSloppyPhrase 5.58 (4.0%) 5.60 (3.9%) 0.3% ( -7% - 8%) 0.785
LowSpanNear 11.11 (1.6%) 11.15 (1.7%) 0.4% ( -2% - 3%) 0.451
HighTermTitleBDVSort 25.79 (7.2%) 25.89 (8.7%) 0.4% ( -14% - 17%) 0.875
IntNRQ 25.27 (1.5%) 25.37 (1.3%) 0.4% ( -2% - 3%) 0.347
LowPhrase 40.67 (2.2%) 40.84 (1.4%) 0.4% ( -3% - 4%) 0.478
HighSloppyPhrase 5.99 (4.1%) 6.01 (4.1%) 0.4% ( -7% - 8%) 0.745
Wildcard 20.27 (3.5%) 20.41 (3.4%) 0.7% ( -5% - 7%) 0.500
Respell 34.15 (2.1%) 34.41 (1.4%) 0.8% ( -2% - 4%) 0.175
AndHighLow 358.53 (3.8%) 361.84 (3.8%) 0.9% ( -6% - 8%) 0.438
OrNotHighLow 369.77 (4.3%) 374.69 (2.7%) 1.3% ( -5% - 8%) 0.239
MedTerm 980.66 (4.2%) 994.40 (5.2%) 1.4% ( -7% - 11%) 0.347
OrHighNotHigh 388.09 (4.6%) 394.10 (4.0%) 1.5% ( -6% - 10%) 0.258
OrNotHighMed 419.20 (3.9%) 426.30 (3.9%) 1.7% ( -5% - 9%) 0.173
MedPhrase 281.59 (4.2%) 287.03 (3.3%) 1.9% ( -5% - 9%) 0.107
LowTerm 758.64 (5.7%) 775.00 (4.9%) 2.2% ( -8% - 13%) 0.200
HighPhrase 69.25 (6.5%) 70.85 (6.2%) 2.3% ( -9% - 16%) 0.250
OrNotHighHigh 517.94 (5.7%) 530.26 (3.5%) 2.4% ( -6% - 12%) 0.111
OrHighNotMed 438.39 (5.5%) 449.26 (3.8%) 2.5% ( -6% - 12%) 0.096
HighTerm 650.61 (4.9%) 667.25 (4.4%) 2.6% ( -6% - 12%) 0.081
OrHighNotLow 308.34 (4.3%) 317.07 (5.0%) 2.8% ( -6% - 12%) 0.053
Prefix3 88.13 (8.6%) 91.32 (8.5%) 3.6% ( -12% - 22%) 0.181
BrowseMonthSSDVFacets 2.61 (1.0%) 2.71 (1.4%) 3.9% ( 1% - 6%) 0.000
BrowseDayOfYearSSDVFacets 2.46 (0.4%) 2.56 (2.1%) 4.1% ( 1% - 6%) 0.000
```
I surprisingly see some gain in `BrowseDayOfYearSSDVFacets` and `BrowseMonthSSDVFacets` tasks but I think that is noise (because taxonomy facets have a different impl. as compared to `SortedSetDocValuesFacets`).
`Browse*TaxoFacets` tasks don't show gain.
# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
sonatype-lift[bot] commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697042596
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
*THREAD_SAFETY_VIOLATION:* Read/Write race. Non-private method `DirectoryTaxonomyReader.getBulkPath(...)` indirectly reads without synchronization from `this.categoryCache`. Potentially races with write in method `DirectoryTaxonomyReader.doClose()`.
Reporting because this access may occur on a background thread.
(at-me [in a reply](https://help.sonatype.com/lift) with `help` or `ignore`)
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -318,23 +322,16 @@ public FacetLabel getPath(int ordinal) throws IOException {
// doOpenIfChanged, we need to ensure that the ordinal is one that this DTR
// instance recognizes. Therefore we do this check up front, before we hit
// the cache.
- if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
- return null;
- }
+ checkOrdinalBounds(ordinal);
- // TODO: can we use an int-based hash impl, such as IntToObjectMap,
- // wrapped as LRU?
- Integer catIDInteger = Integer.valueOf(ordinal);
- synchronized (categoryCache) {
- FacetLabel res = categoryCache.get(catIDInteger);
- if (res != null) {
- return res;
- }
+ FacetLabel[] ordinalPath = getPathFromCache(ordinal);
Review comment:
*THREAD_SAFETY_VIOLATION:* Read/Write race. Non-private method `DirectoryTaxonomyReader.getPath(...)` indirectly reads without synchronization from `this.categoryCache`. Potentially races with write in method `DirectoryTaxonomyReader.doClose()`.
Reporting because this access may occur on a background thread.
(at-me [in a reply](https://help.sonatype.com/lift) with `help` or `ignore`)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r661074967
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
Review comment:
This was a big miss. Thanks for catching this.
Here is why the test was passing:
When the ordinals were all in the first leaf, the ordinal was < leafReaderMaxDoc and the wrong code worked correctly.
When the ordinals started spilling into the second leaf, the wrong code would recalculate the leaf each time and then find the correct `Label`. Thus were not actually making use of the increasing values of the ordinals (and were recalculating the `leafContext` each time).
This would go on for all other leaves as well.
I'll rerun benchmarks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698080163
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
Review comment:
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged :D
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698055359
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
##########
@@ -89,6 +89,28 @@ private void createNewTaxonomyIndex(String dirName) throws IOException {
dir.close();
}
+ // Opens up a pre-existing index and tries to run getBulkPath on it
+ public void testGetBulkPathOnOlderCodec() throws Exception {
+ Path indexDir = createTempDir(oldTaxonomyIndexName);
+ TestUtil.unzip(getDataInputStream(oldTaxonomyIndexName + ".zip"), indexDir);
+ Directory dir = newFSDirectory(indexDir);
+
+ DirectoryTaxonomyWriter writer = new DirectoryTaxonomyWriter(dir);
+ DirectoryTaxonomyReader reader = new DirectoryTaxonomyReader(writer);
+
+ FacetLabel[] facetLabels = {
+ new FacetLabel("a"), new FacetLabel("b"),
+ };
+ int[] ordinals =
+ new int[] {reader.getOrdinal(facetLabels[0]), reader.getOrdinal(facetLabels[1])};
+ // The zip file already contains a category "a" and a category "b" stored with the older
Review comment:
We have already added this condition in the `createNewTaxonomyIndex()` test (i.e the back compat test we had added for the BDV switch). See [here](https://github.com/apache/lucene/blob/56eb76dbaf7b2e9a4fbf7c1a25af2ba81ab1bc92/lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java#L74)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #179:
URL: https://github.com/apache/lucene/pull/179#issuecomment-869081261
That is really odd that the `SSDVFacets` show ~4% gain! I wonder if there is a bug in `luceneutil` mis-labeling which facets task is using which impl!? You could run a benchmark, filtering tasks to only do these `SSDVFacets` tasks, and then check whether the taxo facet impls are being used at all maybe.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r660156740
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
Review comment:
In the previous incarnation of this PR, @shaie had [suggested](https://github.com/apache/lucene-solr/pull/2247#discussion_r564733802) that I give it a more descriptive name that actually describes what it does.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697622283
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
Review comment:
I actually picked this up from tests in `TestDirectoryTaxonomyWriter`. Fixed both places in the next iteration
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r699665391
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
And maybe if we wind up removing this cache entirely, we don't need to do this issue!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on pull request #179:
URL: https://github.com/apache/lucene/pull/179#issuecomment-912671850
> I think we should backport this one to 8.x
This change relies on the older repo having the "LUCENE-9450 Switch taxonomy index to use BinaryDocValues" change.
We had initially decided that we will not backport it to 8x because it was a major change to the index type/format. Thus this change also can't be backported :/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #179:
URL: https://github.com/apache/lucene/pull/179#issuecomment-916098137
> > I think we should backport this one to 8.x
>
> This change relies on the older repo having the "LUCENE-9450 Switch taxonomy index to use BinaryDocValues" change.
>
> We had initially decided that we will not backport it to 8x because it was a major change to the index type/format. Thus this change also can't be backported :/
Ahh, yes, sorry! We cannot backport this change then. So this is 9.0 only optimization.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand merged pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand merged pull request #179:
URL: https://github.com/apache/lucene/pull/179
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698080163
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
Review comment:
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged.
The fix should be trivial actually. We can just use the class variable `useOlderStoredFieldsIndex` from `TaxonomyDirectoryWriter`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696973661
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ float PROBABILITY_OF_COMMIT = 0.5f;
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
Review comment:
Sure. I modified the test to that it uses random multiple threads, each thread use random iterations, each iteration tests the path of random number of ordinals . Debugging this test will be a task if it fails :|
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r661068707
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
Review comment:
I actually did not understand the "wrapped as an LRU" part of this preexisting comment.
Could you elaborate a bit?
Hppc has an existing class called `IntObjectHashMap` but it does not have the provisions to act as a cache today. When you insert a new key it allocates new memory and then adds the key.
On the other hand, the existing `LinkedHashMap` behaves as a cache but replacing the LRU key
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696168048
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
Review comment:
Opened LUCENE-10068
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r691713240
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
Review comment:
Fixed this in the next commit 2f4d4ba
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696167851
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
Review comment:
Yeah. It is indeed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698061294
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
Ahh. This can indeed happen. Opened https://issues.apache.org/jira/browse/LUCENE-10077
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697508778
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
Review comment:
Oh sorry I thought it was a correctness issue! Since it was just performance, and very tricky to test, let's skip the test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696169642
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ float PROBABILITY_OF_COMMIT = 0.5f;
Review comment:
Sure. I'll fix
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r699792786
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
Benchmarks show that the category cache is not very effective. Looks like we might indeed have to remove it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on pull request #179:
URL: https://github.com/apache/lucene/pull/179#issuecomment-901547047
Hey @mikemccand , my apologies for leaving this PR hanging! The unexpected benchmark results threw me off..
I've updated the PR to fix the bug you had previously discovered and updated the PR to bring it on par with mainline.
I re-ran benchmarks after fixing the `leafReaderDocBase` not added to the `leafReaderMaxDoc` bug:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighLow 394.22 (4.1%) 390.60 (4.7%) -0.9% ( -9% - 8%) 0.508
OrHighNotHigh 393.24 (5.0%) 393.11 (4.4%) -0.0% ( -9% - 9%) 0.982
HighIntervalsOrdered 2.23 (1.3%) 2.23 (1.2%) 0.1% ( -2% - 2%) 0.768
HighPhrase 169.85 (4.9%) 170.09 (5.7%) 0.1% ( -9% - 11%) 0.932
BrowseDateTaxoFacets 0.57 (1.6%) 0.57 (1.6%) 0.2% ( -2% - 3%) 0.684
HighSpanNear 3.57 (2.2%) 3.58 (2.1%) 0.2% ( -3% - 4%) 0.711
LowSpanNear 11.32 (1.8%) 11.35 (1.4%) 0.3% ( -2% - 3%) 0.604
OrNotHighMed 398.53 (6.1%) 399.62 (7.5%) 0.3% ( -12% - 14%) 0.900
MedSpanNear 11.37 (1.8%) 11.41 (1.2%) 0.3% ( -2% - 3%) 0.522
LowTerm 949.41 (5.7%) 952.69 (5.2%) 0.3% ( -9% - 11%) 0.841
OrHighHigh 27.45 (2.5%) 27.55 (2.1%) 0.4% ( -4% - 5%) 0.615
IntNRQ 19.98 (30.1%) 20.06 (29.6%) 0.4% ( -45% - 85%) 0.967
Respell 24.48 (2.4%) 24.58 (2.5%) 0.4% ( -4% - 5%) 0.575
LowSloppyPhrase 19.15 (2.6%) 19.23 (2.3%) 0.4% ( -4% - 5%) 0.569
OrHighMed 23.02 (2.9%) 23.14 (1.8%) 0.5% ( -4% - 5%) 0.482
BrowseDayOfYearTaxoFacets 6311.77 (3.7%) 6349.39 (4.2%) 0.6% ( -7% - 8%) 0.634
Fuzzy1 42.80 (3.9%) 43.06 (5.0%) 0.6% ( -7% - 9%) 0.666
OrHighNotMed 446.39 (5.5%) 449.32 (5.6%) 0.7% ( -9% - 12%) 0.707
HighSloppyPhrase 3.57 (4.4%) 3.59 (3.8%) 0.7% ( -7% - 9%) 0.610
OrNotHighLow 418.72 (3.6%) 421.68 (3.6%) 0.7% ( -6% - 8%) 0.535
LowPhrase 12.47 (2.8%) 12.56 (2.8%) 0.8% ( -4% - 6%) 0.391
OrHighNotLow 494.57 (4.9%) 498.46 (5.4%) 0.8% ( -9% - 11%) 0.629
HighTermDayOfYearSort 7.07 (12.8%) 7.13 (10.2%) 0.9% ( -19% - 27%) 0.799
MedSloppyPhrase 21.85 (2.8%) 22.06 (2.5%) 0.9% ( -4% - 6%) 0.266
Fuzzy2 35.04 (3.6%) 35.43 (3.7%) 1.1% ( -5% - 8%) 0.330
HighTermTitleBDVSort 39.93 (8.7%) 40.40 (6.9%) 1.2% ( -13% - 18%) 0.638
Wildcard 38.16 (4.4%) 38.60 (3.9%) 1.2% ( -6% - 9%) 0.370
AndHighMed 47.87 (2.7%) 48.45 (3.6%) 1.2% ( -4% - 7%) 0.224
OrNotHighHigh 367.90 (4.8%) 372.56 (3.9%) 1.3% ( -7% - 10%) 0.362
AndHighHigh 20.63 (3.2%) 20.94 (4.6%) 1.5% ( -6% - 9%) 0.221
HighTermMonthSort 25.15 (12.1%) 25.54 (13.8%) 1.6% ( -21% - 31%) 0.703
MedPhrase 54.64 (6.1%) 55.54 (6.9%) 1.6% ( -10% - 15%) 0.424
TermDTSort 16.01 (14.6%) 16.28 (16.7%) 1.7% ( -25% - 38%) 0.737
BrowseMonthSSDVFacets 2.62 (0.9%) 2.66 (2.1%) 1.7% ( -1% - 4%) 0.001
AndHighLow 252.16 (3.3%) 256.71 (3.3%) 1.8% ( -4% - 8%) 0.086
HighTerm 998.70 (6.8%) 1020.66 (5.8%) 2.2% ( -9% - 15%) 0.271
BrowseMonthTaxoFacets 6354.02 (5.9%) 6495.78 (5.4%) 2.2% ( -8% - 14%) 0.211
PKLookup 113.56 (4.2%) 116.24 (4.4%) 2.4% ( -6% - 11%) 0.084
BrowseDayOfYearSSDVFacets 2.46 (0.5%) 2.52 (2.9%) 2.5% ( 0% - 5%) 0.000
MedTerm 815.18 (4.9%) 837.70 (6.0%) 2.8% ( -7% - 14%) 0.109
Prefix3 144.90 (10.0%) 149.12 (11.5%) 2.9% ( -16% - 27%) 0.392
The gain in taxonomy based facets is about 2%. I ran it multiple times locally and it gave around 2% gain in those runs as well. SSDV facets still show a 3% gain which is unexplained.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697510650
##########
File path: lucene/CHANGES.txt
##########
@@ -137,6 +137,9 @@ API Changes
Improvements
+* LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve FacetLabels for multiple
+ facet ordinals at once (Gautam Worah, Mike McCandless)
+
Review comment:
Well, if you look at the Jira numbers then they may appear out of order lol. But the "rough" ordering is order that we fixed things, i.e. each of us tries to append our new CHANGES entry when we fix an issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696973661
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ float PROBABILITY_OF_COMMIT = 0.5f;
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
Review comment:
Sure. I modified the test so that it uses random multiple threads, each thread use random iterations, each iteration tests the path of random number of ordinals .
Thanks for suggesting this idea.
Especially helpful in cases like ours where the cache storage/retrieval operations are complex
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697014864
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
Review comment:
So this was a performance bug and not a correctness bug.
**The performance bug was that for leaves 2 and all leaves after that, we were recalculating the leaf for each ordinal.** Instead what the correct code does now is that it checks if the current leaf is sufficient for calculating the values, if not, **then** it tries to find the next leaf.
It is a bit hard to write a test that checks if we are entering that if condition or no. Do you have any ideas on testing these kinds of conditionals?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697880899
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java
##########
@@ -366,9 +366,9 @@ public void testReaderBasic() throws Exception {
}
}
// (also test invalid ordinals:)
- assertNull(tr.getPath(-1));
- assertNull(tr.getPath(tr.getSize()));
- assertNull(tr.getPath(TaxonomyReader.INVALID_ORDINAL));
+ expectThrows(IllegalArgumentException.class, () -> tr.getPath(-1));
Review comment:
I guess this is a small API break? Previously this API leniently returned `null` but now it throws exception. I think that's OK, but could you advertise this in the `CHANGES.txt` entry?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
Review comment:
It's sort of weird to do this check per-segment, since we can know up front, based on indexed created version, whether it's stored fields or doc values? I.e. this is not a segment by segment decision. But let's do it in follow-on issue. I think this one is ready!
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestBackwardsCompatibility.java
##########
@@ -89,6 +89,28 @@ private void createNewTaxonomyIndex(String dirName) throws IOException {
dir.close();
}
+ // Opens up a pre-existing index and tries to run getBulkPath on it
+ public void testGetBulkPathOnOlderCodec() throws Exception {
+ Path indexDir = createTempDir(oldTaxonomyIndexName);
+ TestUtil.unzip(getDataInputStream(oldTaxonomyIndexName + ".zip"), indexDir);
+ Directory dir = newFSDirectory(indexDir);
+
+ DirectoryTaxonomyWriter writer = new DirectoryTaxonomyWriter(dir);
+ DirectoryTaxonomyReader reader = new DirectoryTaxonomyReader(writer);
+
+ FacetLabel[] facetLabels = {
+ new FacetLabel("a"), new FacetLabel("b"),
+ };
+ int[] ordinals =
+ new int[] {reader.getOrdinal(facetLabels[0]), reader.getOrdinal(facetLabels[1])};
+ // The zip file already contains a category "a" and a category "b" stored with the older
Review comment:
Maybe the test should assert that `ordinals[0] != -1` and same for `ordinals[1]`?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
Yeah good point ...
Though I do think it's weird that we set `categoryCache = null` in `doClose` because this means if the application is (illegally) still trying to do faceting in one query thread, while another thread closes, then the query threads will hit `NullPointerException`. I think it'd be better to make `categoryCache` `final` and throw `AlreadyClosedException`. But let's not do that here/now! Can you open a followon issue?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698080163
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
Review comment:
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged.
The fix should be trivial actually. We can just use the class variable `useOlderStoredFieldsIndex` from `DirectoryTaxonomyWriter`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on pull request #179:
URL: https://github.com/apache/lucene/pull/179#issuecomment-911588393
Merged! @gautamworah96 I think we should backport this one to 8.x? It is just adding a faster API.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r660154028
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
Review comment:
`gradle tidy` did this! I'll see if I can get gradle to work with something else
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r692233337
##########
File path: lucene/CHANGES.txt
##########
@@ -137,6 +137,9 @@ API Changes
Improvements
+* LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve FacetLabels for multiple
+ facet ordinals at once (Gautam Worah, Mike McCandless)
+
Review comment:
We usually append these entries at the bottom of the section, not inserting at the top :)
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
Review comment:
Can you change this so that we `sync` once on the `categoryCache`, do all the lookups, then release the lock?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
Review comment:
Awesome! Such a useful sorting utility class...
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ float PROBABILITY_OF_COMMIT = 0.5f;
Review comment:
How about randomizing this? I.e. `float PROBABILITY_OF_COMMIT = (float) random().nextDouble();`?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
Review comment:
Did you add a test case that exposed this bug (missing the `docBase` before)?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
##########
@@ -146,10 +147,18 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
}
LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+ int[] ordinals = new int[labelValues.length];
+ float[] values = new float[labelValues.length];
+
for (int i = labelValues.length - 1; i >= 0; i--) {
TopOrdAndFloatQueue.OrdAndValue ordAndValue = q.pop();
- FacetLabel child = taxoReader.getPath(ordAndValue.ord);
- labelValues[i] = new LabelAndValue(child.components[cp.length], ordAndValue.value);
+ ordinals[i] = ordAndValue.ord;
+ values[i] = ordAndValue.value;
+ }
+
+ FacetLabel[] bulkPath = ((DirectoryTaxonomyReader) taxoReader).getBulkPath(ordinals);
Review comment:
Hmm OK but this is quite messy. This shows the cost of abstractions actually :) Because we have these two separated abstractions, even though there is only one example of the `abstract` base class today, it is dangerous to do a hard cast like this. If a future alternative `TaxonomyReader` comes along, then this code will fail?
How about implementing `getBulkPath` in `TaxonomyReader`, falling back to looking up the individual paths in the base class, but overriding to the more efficient implementation (the one you already wrote) in `DirectoryTaxonomyReader`?
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
Review comment:
OK :)
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ float PROBABILITY_OF_COMMIT = 0.5f;
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
Review comment:
Could you randomize what paths you look up, instead of always looking up all paths? Randomly pick a (randomly sized) subset each time?
Also, could you run multiple threads here, each running multiple iterations, given the sort of complex thread concurrency with the locking to find cache misses / to insert them after?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
Review comment:
Well, the existing `LinkedHashMap` is horribly RAM-inefficient. I.e. it has high overhead since we are caching little tiny objects. `IntObjectHashMap` would be better. One simple approach to make an approximate LRU from this HPPC class is a "double barrel" cache. At all times there are two `IntObjectHashMap`s. One is primary, the other is secondary. On get we first check primary. If it's not there, check secondary and if that was a hit, also store it in primary. If it's a full cache miss, store in primary. And then periodically, when there are too many items in both maps, drop secondary, move primary to secondary, and make a new empty map as primary.
This is not a "true" LRU, only approximate, but I bet given the big RAM efficiency gains we'd see from `IntObjectHashMap`, probably worth it.
But let's at least open an issue to get the pebble rolling?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return getBulkPathForOlderIndexes(ordinals);
+ }
+ }
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ // add the value to the categoryCache after computation
+ synchronized (categoryCache) {
Review comment:
I think a better solution would be to accumulate, into a temporary array, the `int`/`FacetLabel` pairs that need to be inserted.
Then, outside this `for` loop, if there were any of those, acquire the `sync` lock once, iterate over those cache misses and insert them. This way we are 1) only acquiring/releasing `sync` lock once, and 2) only if there was at least one cache miss, and 3) only actually iterating on those cache misses.
These sorts of very fine grained lock acquire/release can lead to serious contention with many threads/cores. Let's take advantage of the "bulkness" of this new API to also increase lock granularity?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697584332
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
+ for (int z = 0; z < addThreads.length; z++) {
+ addThreads[z] =
+ new Thread() {
+ @Override
+ public void run() {
+ // each thread iterates for maxThreadIterations times
+ int maxThreadIterations = random().nextInt(10);
+ for (int threadIterations = 0;
+ threadIterations < maxThreadIterations;
+ threadIterations++) {
+
+ // length of the FacetLabel array that we are going to check
+ int numOfOrdinalsToCheck = random().nextInt(allOrdinals.length);
Review comment:
Yeah could be 0 at times which is also good.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698080163
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
Review comment:
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697599711
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
Not sure if I understand this concern. The `getPathFromCache` method accesses the cache under a `synchronized` block.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697513953
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java
##########
@@ -212,6 +212,19 @@ public int getOrdinal(String dim, String... path) throws IOException {
/** Returns the path name of the category with the given ordinal. */
public abstract FacetLabel getPath(int ordinal) throws IOException;
+ /**
+ * Returns the path names of the list of ordinals associated with different categories. The
+ * implementation in {@link org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader} is
+ * generally faster than iteratively calling {@link #getPath(int)}
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
Review comment:
Maybe add a comment that this is a default (slow) implementation that does just iteratively call `getPath`?
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
+ for (int z = 0; z < addThreads.length; z++) {
+ addThreads[z] =
+ new Thread() {
+ @Override
+ public void run() {
+ // each thread iterates for maxThreadIterations times
+ int maxThreadIterations = random().nextInt(10);
+ for (int threadIterations = 0;
+ threadIterations < maxThreadIterations;
+ threadIterations++) {
+
+ // length of the FacetLabel array that we are going to check
+ int numOfOrdinalsToCheck = random().nextInt(allOrdinals.length);
Review comment:
Sometimes this will be `0`, which I guess is OK :) We confirm that passing empty `int[]` is acceptable.
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return super.getBulkPath(ordinals);
+ }
+ }
+ // values is leaf specific so you only advance till you reach the target within the leaf
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ uncachedOrdinalPositions.add(i);
+ }
+ }
+
+ synchronized (categoryCache) {
Review comment:
Can we add `if (uncachedOrdinalPositions.isEmpty() == false) { ... }` around this? This way we don't try to lock if there is everything was already cached (hopefully a common case). It is possible hotspot might do this opto for us but let's be explicit :)
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
This is a real hazard, but I think we can ignore it because users are supposed to ensure all search/faceting threads complete before closing the taxonomy reader.
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
+ for (int z = 0; z < addThreads.length; z++) {
+ addThreads[z] =
+ new Thread() {
+ @Override
+ public void run() {
+ // each thread iterates for maxThreadIterations times
+ int maxThreadIterations = random().nextInt(10);
Review comment:
Maybe rename to `numThreadIterations` since it is exactly how many iterations this thread will do?
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextBoolean()) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
+ allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+ }
+
+ // create multiple threads to check result correctness and thread contention in the cache
+ Thread[] addThreads = new Thread[atLeast(4)];
Review comment:
Hmm `atLeast(4)` is a bit dangerous because in NIGHTLY build that might generate some really massive ints. How about `RandomNumbers.randomIntBetween(random(), 1, 12)` instead?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r659242645
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
Review comment:
Hmm, leftover semi-colon?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
Review comment:
Hmm, the incoming `int indexReaderMaxDoc` is always `indexReader.maxDoc()` right? Maybe remove these (redundant) parameter?
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
Review comment:
Maybe just rename to `testBulkPath`?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
Review comment:
Could you open an issue for this? This seems like a low-hanging-fruit, now that we use lots of HPPC classes in the `facet` module today! And this would improve RAM efficiency of this cache, allowing us to make it larger maybe.
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return getBulkPathForOlderIndexes(ordinals);
+ }
+ }
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ // add the value to the categoryCache after computation
+ synchronized (categoryCache) {
Review comment:
Could you add a `// TODO` to also bulk-insert into the cached the newly retrieved `ordinal -> FacetLabel`, rather than one at a time like this? It might improve thread contention on this lock for concurrent searches.
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
Review comment:
Hmm, don't we need to compare `ordinals[i] >= leafReaderMaxDoc + leafReaderDocBase` instead? I.e. the ordinal is in the global `IndexReader` docid space, but the `leafReaderMaxDoc` is just for this leaf. How come tests are not failing due to this :)
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
##########
@@ -146,10 +147,18 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
}
LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+ int[] ordinals = new int[labelValues.length];
+ float[] values = new float[labelValues.length];
+
for (int i = labelValues.length - 1; i >= 0; i--) {
TopOrdAndFloatQueue.OrdAndValue ordAndValue = q.pop();
- FacetLabel child = taxoReader.getPath(ordAndValue.ord);
- labelValues[i] = new LabelAndValue(child.components[cp.length], ordAndValue.value);
+ ordinals[i] = ordAndValue.ord;
+ values[i] = ordAndValue.value;
+ }
+
+ FacetLabel[] bulkPath = ((DirectoryTaxonomyReader) taxoReader).getBulkPath(ordinals);
Review comment:
Hmm is this cast always safe? Are there other implementations of `TaxonomyReader`?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
Review comment:
Maybe also include the maximum ordinal in this exception message too?
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
Review comment:
Can you change to `index is based on BinaryDocValues`, and note the Lucene version that first switched to writing `BinaryDocValues` for the taxonomy index?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r699665391
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
And maybe if we wind up removing this cache entirely, we don't need to do this issue!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697620977
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return super.getBulkPath(ordinals);
+ }
+ }
+ // values is leaf specific so you only advance till you reach the target within the leaf
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ uncachedOrdinalPositions.add(i);
+ }
+ }
+
+ synchronized (categoryCache) {
Review comment:
Sure
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696495553
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i]);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
Review comment:
Yes. This should obviously be better. Fixed in the next commit
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r661068707
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
Review comment:
I actually did not understand the "wrapped as an LRU" part of this preexisting comment.
Could you elaborate a bit?
Hppc has an existing class called `IntObjectHashMap` but it does not have the provisions to act as a cache today. When you insert a new key it allocates new memory and then adds the key.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698056689
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyCombined.java
##########
@@ -366,9 +366,9 @@ public void testReaderBasic() throws Exception {
}
}
// (also test invalid ordinals:)
- assertNull(tr.getPath(-1));
- assertNull(tr.getPath(tr.getSize()));
- assertNull(tr.getPath(TaxonomyReader.INVALID_ORDINAL));
+ expectThrows(IllegalArgumentException.class, () -> tr.getPath(-1));
Review comment:
Sure. Added it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r660154028
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
Review comment:
`gradle tidy` did this! I'll see if I can get gradle to work with something else
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
Review comment:
In the previous incarnation of this PR, @shaie had [suggested](https://github.com/apache/lucene-solr/pull/2247#discussion_r564733802) that I give it a more descriptive name that actually describes what it does.
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
##########
@@ -146,10 +147,18 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
}
LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+ int[] ordinals = new int[labelValues.length];
+ float[] values = new float[labelValues.length];
+
for (int i = labelValues.length - 1; i >= 0; i--) {
TopOrdAndFloatQueue.OrdAndValue ordAndValue = q.pop();
- FacetLabel child = taxoReader.getPath(ordAndValue.ord);
- labelValues[i] = new LabelAndValue(child.components[cp.length], ordAndValue.value);
+ ordinals[i] = ordAndValue.ord;
+ values[i] = ordAndValue.value;
+ }
+
+ FacetLabel[] bulkPath = ((DirectoryTaxonomyReader) taxoReader).getBulkPath(ordinals);
Review comment:
Yes this cast is safe. `DirectoryTaxonomyReader` is the only implementation of `TaxonomyReader`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696168048
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
Review comment:
Created [LUCENE-10068](https://issues.apache.org/jira/browse/LUCENE-10068) with some ideas on what was discussed and a few implementation specs.
I actually think, once we have implemented this double array cache we may be able to use it in all places where the current `LRUCache` is used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696969300
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return getBulkPathForOlderIndexes(ordinals);
+ }
+ }
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ // add the value to the categoryCache after computation
+ synchronized (categoryCache) {
Review comment:
Sure. I've implemented this in the next iteration where the `uncachedOrdinalPositions` array stores the positions of ordinals that were not cached.
We later on use these positions in one go to bulk insert paths into the cache.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r699792786
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
Review comment:
Benchmarks show that the category cache is not very effective. Looks like we might indeed have to remove it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r661073094
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return getBulkPathForOlderIndexes(ordinals);
+ }
+ }
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ // add the value to the categoryCache after computation
+ synchronized (categoryCache) {
Review comment:
One of the reasons why we decided to put the keys one at at a time was that we could only insert the keys that were not present in the cache (and is better when all ordinals are in the cache already). [Discussion in previous PR](https://github.com/apache/lucene-solr/pull/2247#discussion_r574162026)
If we decided to bulk insert keys into the cache we will reinsert keys that were already present.
The good news is that there is already a `putAll` bulk kinda API for the cache
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r661073094
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +349,139 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel getPathFromCache(int ordinal) {
+ // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ return categoryCache.get(ordinal);
+ }
+ }
+
+ private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc)
+ throws IllegalArgumentException {
+ if (ordinal < 0 || ordinal >= indexReaderMaxDoc) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString());
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on DocValues.
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+ int indexReaderMaxDoc = indexReader.maxDoc();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ // check whether the ordinal is valid before accessing the cache
+ checkOrdinalBounds(ordinals[i], indexReaderMaxDoc);
+ // check the cache before trying to find it in the index
+ FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+ if (ordinalPath != null) {
+ bulkPath[i] = ordinalPath;
+ }
+ }
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+ ;
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderMaxDoc then we find the next leaf that contains our ordinal
+ */
+ if (values == null || ordinals[i] >= leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
+ return getBulkPathForOlderIndexes(ordinals);
+ }
+ }
+ boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase);
+ assert success;
+ bulkPath[originalPosition[i]] =
+ new FacetLabel(FacetsConfig.stringToPath(values.binaryValue().utf8ToString()));
+
+ // add the value to the categoryCache after computation
+ synchronized (categoryCache) {
Review comment:
One of the reasons why we decided to put the keys one at at a time was that we could only insert the keys that were not present in the cache. [Discussion in previous PR](https://github.com/apache/lucene-solr/pull/2247#discussion_r574162026)
If we decided to bulk insert keys into the cache we will reinsert keys that were already present.
The good news is that there is already a `putAll` bulk kinda API for the cache
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696169458
##########
File path: lucene/CHANGES.txt
##########
@@ -137,6 +137,9 @@ API Changes
Improvements
+* LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve FacetLabels for multiple
+ facet ordinals at once (Gautam Worah, Mike McCandless)
+
Review comment:
Sure. I think I had seen entries that were unordered :/ I'll fix
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r698080163
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##########
@@ -351,12 +348,142 @@ public FacetLabel getPath(int ordinal) throws IOException {
}
synchronized (categoryCache) {
- categoryCache.put(catIDInteger, ret);
+ categoryCache.put(ordinal, ret);
}
return ret;
}
+ private FacetLabel[] getPathFromCache(int... ordinals) {
+ FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+ // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+ // wrapped as LRU?
+ synchronized (categoryCache) {
+ for (int i = 0; i < ordinals.length; i++) {
+ facetLabels[i] = categoryCache.get(ordinals[i]);
+ }
+ }
+ return facetLabels;
+ }
+
+ /**
+ * Checks if the ordinals in the array are >=0 and < {@code
+ * DirectoryTaxonomyReader#indexReader.maxDoc()}
+ *
+ * @param ordinals Integer array of ordinals
+ * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+ * out of bounds
+ */
+ private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+ for (int ordinal : ordinals) {
+ if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+ throw new IllegalArgumentException(
+ "ordinal "
+ + ordinal
+ + " is out of the range of the indexReader "
+ + indexReader.toString()
+ + ". The maximum possible ordinal number is "
+ + (indexReader.maxDoc() - 1));
+ }
+ }
+ }
+
+ /**
+ * Returns an array of FacetLabels for a given array of ordinals.
+ *
+ * <p>This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+ * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+ * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+ * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+ *
+ * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+ * index
+ */
+ @Override
+ public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ ensureOpen();
+ checkOrdinalBounds(ordinals);
+
+ int ordinalsLength = ordinals.length;
+ FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+ // remember the original positions of ordinals before they are sorted
+ int[] originalPosition = new int[ordinalsLength];
+ Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+ getPathFromCache(ordinals);
+
+ /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+ new InPlaceMergeSorter() {
+ @Override
+ protected void swap(int i, int j) {
+ int x = ordinals[i];
+ ordinals[i] = ordinals[j];
+ ordinals[j] = x;
+
+ x = originalPosition[i];
+ originalPosition[i] = originalPosition[j];
+ originalPosition[j] = x;
+ }
+
+ @Override
+ public int compare(int i, int j) {
+ return Integer.compare(ordinals[i], ordinals[j]);
+ }
+ }.sort(0, ordinalsLength);
+
+ int readerIndex;
+ int leafReaderMaxDoc = 0;
+ int leafReaderDocBase = 0;
+ LeafReader leafReader;
+ LeafReaderContext leafReaderContext;
+ BinaryDocValues values = null;
+ List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+ for (int i = 0; i < ordinalsLength; i++) {
+ if (bulkPath[originalPosition[i]] == null) {
+ /*
+ If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+ Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+ (which is the maxDoc of the specific leaf)
+ */
+ if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+ readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+ leafReaderContext = indexReader.leaves().get(readerIndex);
+ leafReader = leafReaderContext.reader();
+ leafReaderMaxDoc = leafReader.maxDoc();
+ leafReaderDocBase = leafReaderContext.docBase;
+ values = leafReader.getBinaryDocValues(Consts.FULL);
+
+ /*
+ If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+ */
+ if (values == null) {
Review comment:
Sure. I'll create it after this PR is merged. Otherwise, it just feels weird to create an issue for a fix in a piece of code that has not been merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r660157061
##########
File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java
##########
@@ -146,10 +147,18 @@ public FacetResult getTopChildren(int topN, String dim, String... path) throws I
}
LabelAndValue[] labelValues = new LabelAndValue[q.size()];
+ int[] ordinals = new int[labelValues.length];
+ float[] values = new float[labelValues.length];
+
for (int i = labelValues.length - 1; i >= 0; i--) {
TopOrdAndFloatQueue.OrdAndValue ordAndValue = q.pop();
- FacetLabel child = taxoReader.getPath(ordAndValue.ord);
- labelValues[i] = new LabelAndValue(child.components[cp.length], ordAndValue.value);
+ ordinals[i] = ordAndValue.ord;
+ values[i] = ordAndValue.value;
+ }
+
+ FacetLabel[] bulkPath = ((DirectoryTaxonomyReader) taxoReader).getBulkPath(ordinals);
Review comment:
Yes this cast is safe. `DirectoryTaxonomyReader` is the only implementation of `TaxonomyReader`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
Posted by GitBox <gi...@apache.org>.
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r696973661
##########
File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##########
@@ -567,4 +567,39 @@ public void testAccountable() throws Exception {
taxoReader.close();
dir.close();
}
+
+ public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+ float PROBABILITY_OF_COMMIT = 0.5f;
+ Directory src = newDirectory();
+ DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+ String randomArray[] = new String[random().nextInt(1000)];
+ // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+ // test cases
+ Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+ FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+ int allOrdinals[] = new int[randomArray.length];
+
+ for (int i = 0; i < randomArray.length; i++) {
+ allPaths[i] = new FacetLabel(randomArray[i]);
+ w.addCategory(allPaths[i]);
+ // add random commits to create multiple segments in the index
+ if (random().nextFloat() < PROBABILITY_OF_COMMIT) {
+ w.commit();
+ }
+ }
+ w.commit();
+ w.close();
+
+ DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+ for (int i = 0; i < allPaths.length; i++) {
Review comment:
Sure. I modified the test so that it uses random multiple threads, each thread use random iterations, each iteration tests the path of random number of ordinals . Debugging this test will be a task if it fails :|
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org