You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by nknize <gi...@git.apache.org> on 2018/09/14 18:51:21 UTC
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
GitHub user nknize opened a pull request:
https://github.com/apache/lucene-solr/pull/451
LUCENE-8496: Add selective indexing to BKD/POINTS codec
PR for [LUCENE-8496](https://issues.apache.org/jira/browse/LUCENE-8496) enabling fields to use the first `numIndexDimensions` from `numDataDimensions` while creating the BKD index.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nknize/lucene-solr selectiveIndexing
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucene-solr/pull/451.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #451
----
commit a2f88a22297ca9c4a1471beb23e4c598704bc144
Author: Nicholas Knize <nicholas knize>
Date: 2018-09-14T17:39:40Z
LUCENE-8496: Add selective indexing to BKD/POINTS codec
Allows fields to use the first `numIndexDimensions` from `numDataDimensions` when creating the BKD index.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by jpountz <gi...@git.apache.org>.
Github user jpountz commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r218197102
--- Diff: lucene/core/src/java/org/apache/lucene/codecs/lucene60/Lucene60FieldInfosFormat.java ---
@@ -149,9 +149,14 @@ public FieldInfos read(Directory directory, SegmentInfo segmentInfo, String segm
attributes = lastAttributes;
}
lastAttributes = attributes;
- int pointDimensionCount = input.readVInt();
+ int packedDimensionCount = input.readVInt();
+ int pointDataDimensionCount = 0x0000FFFF & packedDimensionCount;
+ int pointIndexDimensionCount = packedDimensionCount >>> 16;
+ if (pointIndexDimensionCount == 0) {
+ pointIndexDimensionCount = pointDataDimensionCount;
+ }
--- End diff --
could we use a new version number instead to handle backcompat?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r221582158
--- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
@@ -442,11 +442,12 @@ void visitDocValues(int[] commonPrefixLengths, byte[] scratchPackedValue1, byte[
readCommonPrefixes(commonPrefixLengths, scratchPackedValue1, in);
if (numIndexDims != 1 && version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
- byte[] minPackedValue = scratchPackedValue1;
- byte[] maxPackedValue = scratchPackedValue2;
+ byte[] minPackedValue = new byte[packedIndexBytesLength];
+ byte[] maxPackedValue = new byte[packedIndexBytesLength];
--- End diff --
Maybe we should try to avoid to allocate a new array for every visited leaf node. We can create byte array helpers on IntersectState with the correct length.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r217970618
--- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ---
@@ -981,15 +988,15 @@ public long finish(IndexOutput out) throws IOException {
assert pointCount / numLeaves <= maxPointsInLeafNode: "pointCount=" + pointCount + " numLeaves=" + numLeaves + " maxPointsInLeafNode=" + maxPointsInLeafNode;
// Sort all docs once by each dimension:
- PathSlice[] sortedPointWriters = new PathSlice[numDims];
+ PathSlice[] sortedPointWriters = new PathSlice[numDataDims];
// This is only used on exception; on normal code paths we close all files we opened:
List<Closeable> toCloseHeroically = new ArrayList<>();
boolean success = false;
try {
//long t0 = System.nanoTime();
- for(int dim=0;dim<numDims;dim++) {
+ for(int dim=0;dim<numDataDims;dim++) {
--- End diff --
I think we should only create sorted point writers for indexed dims. There is no point sort data dims as it only creates overhead when writing the tree.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r221581265
--- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
@@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
Directory dir = newFSDirectory(createTempDir());
int numDocs = 100000;
- BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
+ BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
--- End diff --
I have realised that the above assertion is incorrect for multivalued docs so please discard it. Maybe we can just assert the number of bytes in packedValue.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by jpountz <gi...@git.apache.org>.
Github user jpountz commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r218199848
--- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
@@ -52,10 +53,17 @@
/** Caller must pre-seek the provided {@link IndexInput} to the index location that {@link BKDWriter#finish} returned */
public BKDReader(IndexInput in) throws IOException {
version = CodecUtil.checkHeader(in, BKDWriter.CODEC_NAME, BKDWriter.VERSION_START, BKDWriter.VERSION_CURRENT);
- numDims = in.readVInt();
+ int packedDims = in.readVInt();
+ if (version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
+ numDataDims = packedDims & 0x0000FFFF;
+ numIndexDims = packedDims >>> 16;
--- End diff --
let's read two vints instead?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r217971501
--- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
@@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
Directory dir = newFSDirectory(createTempDir());
int numDocs = 100000;
- BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
+ BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
--- End diff --
Couldn't we add a mode on the random tests where dataDim != indexedDim. The test can only check that the data dims are properly propagated to the leaf nodes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr issue #451: LUCENE-8496: Add selective indexing to BKD/POINTS co...
Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on the issue:
https://github.com/apache/lucene-solr/pull/451
closing with commit 1118299c338253cea09640acdc48dc930dc27fda
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr issue #451: LUCENE-8496: Add selective indexing to BKD/POINTS co...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on the issue:
https://github.com/apache/lucene-solr/pull/451
I have only one comment for this version that I think we should address. We are allocation two byte arrays for each leaf we are visiting and I think we can prevent it. Apart from that it looks there!
Regarding the Lucene version I have no strong opinion. Because it is a backward compatible change I think there is no harm to commit it to Lucene 7.x branch as well.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r218112746
--- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
@@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
Directory dir = newFSDirectory(createTempDir());
int numDocs = 100000;
- BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
+ BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
--- End diff --
I have just seen there is already a test using different index and data dimensions, still it would be worthy to check if possible that data dimensions are properly propagated.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr issue #451: LUCENE-8496: Add selective indexing to BKD/POINTS co...
Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on the issue:
https://github.com/apache/lucene-solr/pull/451
Updated the PR and attached an update patch to [LUCENE-8496](https://issues.apache.org/jira/browse/LUCENE-8496) along with a patch that modifies `LatLonShape.Triangle` encoding to take advantage of selective indexing. I think this is close. Wouldn't mind another round of review along with thoughts on whether we should commit this for a Lucene 7.6 release or hold off for 8.0.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r221972799
--- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
@@ -442,11 +442,12 @@ void visitDocValues(int[] commonPrefixLengths, byte[] scratchPackedValue1, byte[
readCommonPrefixes(commonPrefixLengths, scratchPackedValue1, in);
if (numIndexDims != 1 && version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
- byte[] minPackedValue = scratchPackedValue1;
- byte[] maxPackedValue = scratchPackedValue2;
+ byte[] minPackedValue = new byte[packedIndexBytesLength];
+ byte[] maxPackedValue = new byte[packedIndexBytesLength];
--- End diff --
hmmm.. this is outdated. Commit 8517380 reverted that change.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r220196714
--- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
@@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
Directory dir = newFSDirectory(createTempDir());
int numDocs = 100000;
- BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
+ BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
--- End diff --
Thanks @nknize!
Would it be possible to add the following assertion in the method for IntersectVisitor#visit. That would signal that we are propagating the index and data dimensions to the method?
```
byte[][] expectedDocs = docValues[docID];
for (int dim =0; dim < numDataDims; dim++) {
assertTrue(FutureArrays.compareUnsigned(packedValue, dim * numBytesPerDim, dim * numBytesPerDim + numBytesPerDim, expectedDocs[dim], 0, numBytesPerDim) == 0);
}
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by nknize <gi...@git.apache.org>.
Github user nknize closed the pull request at:
https://github.com/apache/lucene-solr/pull/451
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r219310315
--- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ---
@@ -981,15 +988,15 @@ public long finish(IndexOutput out) throws IOException {
assert pointCount / numLeaves <= maxPointsInLeafNode: "pointCount=" + pointCount + " numLeaves=" + numLeaves + " maxPointsInLeafNode=" + maxPointsInLeafNode;
// Sort all docs once by each dimension:
- PathSlice[] sortedPointWriters = new PathSlice[numDims];
+ PathSlice[] sortedPointWriters = new PathSlice[numDataDims];
// This is only used on exception; on normal code paths we close all files we opened:
List<Closeable> toCloseHeroically = new ArrayList<>();
boolean success = false;
try {
//long t0 = System.nanoTime();
- for(int dim=0;dim<numDims;dim++) {
+ for(int dim=0;dim<numDataDims;dim++) {
--- End diff --
I made the change but then I remembered that the leaf nodes (which write all dimensions) require even the non indexed dimensions be sorted for compression (in order to compute the common prefix). I could explore delaying that sort operation only when its needed, but I think the recursion still needs sorted point writers passed through PathSlice[] so this is still necessary?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by jpountz <gi...@git.apache.org>.
Github user jpountz commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r218184521
--- Diff: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextBKDReader.java ---
@@ -53,15 +54,16 @@
final int version;
protected final int packedBytesLength;
- public SimpleTextBKDReader(IndexInput in, int numDims, int maxPointsInLeafNode, int bytesPerDim, long[] leafBlockFPs, byte[] splitPackedValues,
+ public SimpleTextBKDReader(IndexInput in, int numDataDims, int numIndexDims, int maxPointsInLeafNode, int bytesPerDim, long[] leafBlockFPs, byte[] splitPackedValues,
byte[] minPackedValue, byte[] maxPackedValue, long pointCount, int docCount) throws IOException {
this.in = in;
- this.numDims = numDims;
+ this.numDataDims = numDataDims;
+ this.numIndexDims = numIndexDims;
this.maxPointsInLeafNode = maxPointsInLeafNode;
this.bytesPerDim = bytesPerDim;
// no version check here because callers of this API (SimpleText) have no back compat:
- bytesPerIndexEntry = numDims == 1 ? bytesPerDim : bytesPerDim + 1;
- packedBytesLength = numDims * bytesPerDim;
+ bytesPerIndexEntry = numDataDims == 1 ? bytesPerDim : bytesPerDim + 1;
+ packedBytesLength = numDataDims * bytesPerDim;
--- End diff --
can we make it numIndexDims on the two above lines?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...
Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/451#discussion_r217970947
--- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
@@ -52,10 +53,17 @@
/** Caller must pre-seek the provided {@link IndexInput} to the index location that {@link BKDWriter#finish} returned */
public BKDReader(IndexInput in) throws IOException {
version = CodecUtil.checkHeader(in, BKDWriter.CODEC_NAME, BKDWriter.VERSION_START, BKDWriter.VERSION_CURRENT);
- numDims = in.readVInt();
+ int packedDims = in.readVInt();
+ if (version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
+ numDataDims = packedDims & 0x0000FFFF;
+ numIndexDims = packedDims >>> 16;
+ } else {
+ numDataDims = packedDims;
+ numIndexDims = numDataDims;
+ }
--- End diff --
Should we create a new version of the tree for this change and not mix it up with VERSION_LEAF_STORES_BOUNDS?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org