You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by nknize <gi...@git.apache.org> on 2018/09/14 18:51:21 UTC

[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

GitHub user nknize opened a pull request:

    https://github.com/apache/lucene-solr/pull/451

    LUCENE-8496: Add selective indexing to BKD/POINTS codec

    PR for [LUCENE-8496](https://issues.apache.org/jira/browse/LUCENE-8496) enabling fields to use the first `numIndexDimensions` from `numDataDimensions` while creating the BKD index.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nknize/lucene-solr selectiveIndexing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucene-solr/pull/451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #451
    
----
commit a2f88a22297ca9c4a1471beb23e4c598704bc144
Author: Nicholas Knize <nicholas knize>
Date:   2018-09-14T17:39:40Z

    LUCENE-8496: Add selective indexing to BKD/POINTS codec
    
    Allows fields to use the first `numIndexDimensions` from `numDataDimensions` when creating the BKD index.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by jpountz <gi...@git.apache.org>.
Github user jpountz commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218197102
  
    --- Diff: lucene/core/src/java/org/apache/lucene/codecs/lucene60/Lucene60FieldInfosFormat.java ---
    @@ -149,9 +149,14 @@ public FieldInfos read(Directory directory, SegmentInfo segmentInfo, String segm
                 attributes = lastAttributes;
               }
               lastAttributes = attributes;
    -          int pointDimensionCount = input.readVInt();
    +          int packedDimensionCount = input.readVInt();
    +          int pointDataDimensionCount = 0x0000FFFF & packedDimensionCount;
    +          int pointIndexDimensionCount = packedDimensionCount >>> 16;
    +          if (pointIndexDimensionCount == 0) {
    +            pointIndexDimensionCount = pointDataDimensionCount;
    +          }
    --- End diff --
    
    could we use a new version number instead to handle backcompat?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r221582158
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
    @@ -442,11 +442,12 @@ void visitDocValues(int[] commonPrefixLengths, byte[] scratchPackedValue1, byte[
         readCommonPrefixes(commonPrefixLengths, scratchPackedValue1, in);
     
         if (numIndexDims != 1 && version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
    -      byte[] minPackedValue = scratchPackedValue1;
    -      byte[] maxPackedValue = scratchPackedValue2;
    +      byte[] minPackedValue = new byte[packedIndexBytesLength];
    +      byte[] maxPackedValue = new byte[packedIndexBytesLength];
    --- End diff --
    
    Maybe we should try to avoid to allocate a new array for every visited leaf node. We can create byte array helpers on IntersectState with the correct length.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r217970618
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ---
    @@ -981,15 +988,15 @@ public long finish(IndexOutput out) throws IOException {
         assert pointCount / numLeaves <= maxPointsInLeafNode: "pointCount=" + pointCount + " numLeaves=" + numLeaves + " maxPointsInLeafNode=" + maxPointsInLeafNode;
     
         // Sort all docs once by each dimension:
    -    PathSlice[] sortedPointWriters = new PathSlice[numDims];
    +    PathSlice[] sortedPointWriters = new PathSlice[numDataDims];
     
         // This is only used on exception; on normal code paths we close all files we opened:
         List<Closeable> toCloseHeroically = new ArrayList<>();
     
         boolean success = false;
         try {
           //long t0 = System.nanoTime();
    -      for(int dim=0;dim<numDims;dim++) {
    +      for(int dim=0;dim<numDataDims;dim++) {
    --- End diff --
    
    I think we should only create sorted point writers for indexed dims. There is no point sort data dims as it only creates overhead when writing the tree.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r221581265
  
    --- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
    @@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
     
         Directory dir = newFSDirectory(createTempDir());
         int numDocs = 100000;
    -    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
    +    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
    --- End diff --
    
    I have realised that the above assertion is incorrect for multivalued docs so please discard it. Maybe we can just assert the number of bytes in packedValue. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by jpountz <gi...@git.apache.org>.
Github user jpountz commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218199848
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
    @@ -52,10 +53,17 @@
       /** Caller must pre-seek the provided {@link IndexInput} to the index location that {@link BKDWriter#finish} returned */
       public BKDReader(IndexInput in) throws IOException {
         version = CodecUtil.checkHeader(in, BKDWriter.CODEC_NAME, BKDWriter.VERSION_START, BKDWriter.VERSION_CURRENT);
    -    numDims = in.readVInt();
    +    int packedDims = in.readVInt();
    +    if (version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
    +      numDataDims = packedDims & 0x0000FFFF;
    +      numIndexDims = packedDims >>> 16;
    --- End diff --
    
    let's read two vints instead?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r217971501
  
    --- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
    @@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
     
         Directory dir = newFSDirectory(createTempDir());
         int numDocs = 100000;
    -    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
    +    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
    --- End diff --
    
    Couldn't we add a mode on the random tests where dataDim != indexedDim. The test can only check that the data dims are properly propagated to the leaf nodes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr issue #451: LUCENE-8496: Add selective indexing to BKD/POINTS co...

Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on the issue:

    https://github.com/apache/lucene-solr/pull/451
  
    closing with commit 1118299c338253cea09640acdc48dc930dc27fda


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr issue #451: LUCENE-8496: Add selective indexing to BKD/POINTS co...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on the issue:

    https://github.com/apache/lucene-solr/pull/451
  
    I have only one comment for this version that I think we should address. We are allocation two byte arrays for each leaf we are visiting and I think we can prevent it. Apart from that it looks there!
    
    Regarding the Lucene version I have no strong opinion. Because it is a backward compatible change I think there is no harm to commit it to Lucene 7.x branch as well. 
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218112746
  
    --- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
    @@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
     
         Directory dir = newFSDirectory(createTempDir());
         int numDocs = 100000;
    -    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
    +    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
    --- End diff --
    
    I have just seen there is already a test using different index and data dimensions, still it would be worthy to check if possible that data dimensions are properly propagated.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr issue #451: LUCENE-8496: Add selective indexing to BKD/POINTS co...

Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on the issue:

    https://github.com/apache/lucene-solr/pull/451
  
    Updated the PR and attached an update patch to [LUCENE-8496](https://issues.apache.org/jira/browse/LUCENE-8496) along with a patch that modifies `LatLonShape.Triangle` encoding to take advantage of selective indexing. I think this is close. Wouldn't mind another round of review along with thoughts on whether we should commit this for a Lucene 7.6 release or hold off for 8.0. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r221972799
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
    @@ -442,11 +442,12 @@ void visitDocValues(int[] commonPrefixLengths, byte[] scratchPackedValue1, byte[
         readCommonPrefixes(commonPrefixLengths, scratchPackedValue1, in);
     
         if (numIndexDims != 1 && version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
    -      byte[] minPackedValue = scratchPackedValue1;
    -      byte[] maxPackedValue = scratchPackedValue2;
    +      byte[] minPackedValue = new byte[packedIndexBytesLength];
    +      byte[] maxPackedValue = new byte[packedIndexBytesLength];
    --- End diff --
    
    hmmm.. this is outdated. Commit 8517380 reverted that change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r220196714
  
    --- Diff: lucene/core/src/test/org/apache/lucene/util/bkd/TestBKD.java ---
    @@ -1063,7 +1064,7 @@ public void testWastedLeadingBytes() throws Exception {
     
         Directory dir = newFSDirectory(createTempDir());
         int numDocs = 100000;
    -    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, bytesPerDim, 32, 1f, numDocs, true);
    +    BKDWriter w = new BKDWriter(numDocs+1, dir, "tmp", numDims, numDims, bytesPerDim, 32, 1f, numDocs, true);
    --- End diff --
    
    Thanks @nknize! 
    Would it be possible to add the following assertion in the method for IntersectVisitor#visit. That would signal that we are propagating the index and data dimensions to the method?
    
    ```
       byte[][] expectedDocs = docValues[docID];
       for (int dim =0; dim < numDataDims; dim++) {
            assertTrue(FutureArrays.compareUnsigned(packedValue, dim * numBytesPerDim, dim * numBytesPerDim + numBytesPerDim, expectedDocs[dim], 0, numBytesPerDim) == 0);
        }
    ```
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by nknize <gi...@git.apache.org>.
Github user nknize closed the pull request at:

    https://github.com/apache/lucene-solr/pull/451


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by nknize <gi...@git.apache.org>.
Github user nknize commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r219310315
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ---
    @@ -981,15 +988,15 @@ public long finish(IndexOutput out) throws IOException {
         assert pointCount / numLeaves <= maxPointsInLeafNode: "pointCount=" + pointCount + " numLeaves=" + numLeaves + " maxPointsInLeafNode=" + maxPointsInLeafNode;
     
         // Sort all docs once by each dimension:
    -    PathSlice[] sortedPointWriters = new PathSlice[numDims];
    +    PathSlice[] sortedPointWriters = new PathSlice[numDataDims];
     
         // This is only used on exception; on normal code paths we close all files we opened:
         List<Closeable> toCloseHeroically = new ArrayList<>();
     
         boolean success = false;
         try {
           //long t0 = System.nanoTime();
    -      for(int dim=0;dim<numDims;dim++) {
    +      for(int dim=0;dim<numDataDims;dim++) {
    --- End diff --
    
    I made the change but then I remembered that the leaf nodes (which write all dimensions) require even the non indexed dimensions be sorted for compression (in order to compute the common prefix). I could explore delaying that sort operation only when its needed, but I think the recursion still needs sorted point writers passed through PathSlice[] so this is still necessary?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by jpountz <gi...@git.apache.org>.
Github user jpountz commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r218184521
  
    --- Diff: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextBKDReader.java ---
    @@ -53,15 +54,16 @@
       final int version;
       protected final int packedBytesLength;
     
    -  public SimpleTextBKDReader(IndexInput in, int numDims, int maxPointsInLeafNode, int bytesPerDim, long[] leafBlockFPs, byte[] splitPackedValues,
    +  public SimpleTextBKDReader(IndexInput in, int numDataDims, int numIndexDims, int maxPointsInLeafNode, int bytesPerDim, long[] leafBlockFPs, byte[] splitPackedValues,
                                  byte[] minPackedValue, byte[] maxPackedValue, long pointCount, int docCount) throws IOException {
         this.in = in;
    -    this.numDims = numDims;
    +    this.numDataDims = numDataDims;
    +    this.numIndexDims = numIndexDims;
         this.maxPointsInLeafNode = maxPointsInLeafNode;
         this.bytesPerDim = bytesPerDim;
         // no version check here because callers of this API (SimpleText) have no back compat:
    -    bytesPerIndexEntry = numDims == 1 ? bytesPerDim : bytesPerDim + 1;
    -    packedBytesLength = numDims * bytesPerDim;
    +    bytesPerIndexEntry = numDataDims == 1 ? bytesPerDim : bytesPerDim + 1;
    +    packedBytesLength = numDataDims * bytesPerDim;
    --- End diff --
    
    can we make it numIndexDims on the two above lines?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[GitHub] lucene-solr pull request #451: LUCENE-8496: Add selective indexing to BKD/PO...

Posted by iverase <gi...@git.apache.org>.
Github user iverase commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/451#discussion_r217970947
  
    --- Diff: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java ---
    @@ -52,10 +53,17 @@
       /** Caller must pre-seek the provided {@link IndexInput} to the index location that {@link BKDWriter#finish} returned */
       public BKDReader(IndexInput in) throws IOException {
         version = CodecUtil.checkHeader(in, BKDWriter.CODEC_NAME, BKDWriter.VERSION_START, BKDWriter.VERSION_CURRENT);
    -    numDims = in.readVInt();
    +    int packedDims = in.readVInt();
    +    if (version >= BKDWriter.VERSION_LEAF_STORES_BOUNDS) {
    +      numDataDims = packedDims & 0x0000FFFF;
    +      numIndexDims = packedDims >>> 16;
    +    } else {
    +      numDataDims = packedDims;
    +      numIndexDims = numDataDims;
    +    }
    --- End diff --
    
    Should we create a new version of the tree for this change and not mix it up with VERSION_LEAF_STORES_BOUNDS? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org