Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/06/11 18:24:55 UTC

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1543: LUCENE-9378: Disable compression on binary values whose length is less than 32.

mikemccand commented on a change in pull request #1543:
URL: https://github.com/apache/lucene-solr/pull/1543#discussion_r438928868



##########
File path: lucene/CHANGES.txt
##########
@@ -218,6 +218,10 @@ Optimizations
 * LUCENE-9087: Build always trees with full leaves and lower the default value for maxPointsPerLeafNode to 512.
   (Ignacio Vera)
 
+* LUCENE-9378: Disabled compression on short binary values, as compression

Review comment:
       Maybe say `Disable doc values compression on short binary values, ...`?  (To make it clear we are talking about doc values and not, say, stored fields).

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -404,32 +406,51 @@ private void flushData() throws IOException {
         // Write offset to this block to temporary offsets file
         totalChunks++;
         long thisBlockStartPointer = data.getFilePointer();
-        
-        // Optimisation - check if all lengths are same
-        boolean allLengthsSame = true;
-        for (int i = 1; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; i++) {
-          if (docLengths[i] != docLengths[i-1]) {
-            allLengthsSame = false;
+
+        final int avgLength = uncompressedBlockLength >>> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT;
+        int offset = 0;
+        // Turn docLengths into deltas from expected values from the average length
+        for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+          offset += docLengths[i];
+          docLengths[i] = offset - avgLength * (i + 1);
+        }
+        int numBytes = 0;
+        for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+          if (docLengths[i] < Byte.MIN_VALUE || docLengths[i] > Byte.MAX_VALUE) {
+            numBytes = Integer.BYTES;
             break;
+          } else if (docLengths[i] != 0) {
+            numBytes = Math.max(numBytes, Byte.BYTES);
           }
         }
-        if (allLengthsSame) {
-            // Only write one value shifted. Steal a bit to indicate all other lengths are the same
-            int onlyOneLength = (docLengths[0] <<1) | 1;
-            data.writeVInt(onlyOneLength);
-        } else {
-          for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; i++) {
-            if (i == 0) {
-              // Write first value shifted and steal a bit to indicate other lengths are to follow
-              int multipleLengths = (docLengths[0] <<1);
-              data.writeVInt(multipleLengths);              
-            } else {
-              data.writeVInt(docLengths[i]);
-            }
+        data.writeVLong((((long) uncompressedBlockLength) << 4) | numBytes);
+
+        if (numBytes == Integer.BYTES) {
+          // encode deltas as ints
+          for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+            data.writeInt(Integer.reverseBytes(docLengths[i]));
+          }
+        } else if (numBytes == 1) {
+          for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+            data.writeByte((byte) docLengths[i]);
           }
+        } else if (numBytes != 0) {
+          throw new AssertionError();
         }
+
         maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength);
-        LZ4.compress(block, 0, uncompressedBlockLength, data, ht);
+
+        // Compression proved to hurt latency in some cases, so we're only
+        // enabling it on long inputs for now. Can we reduce the compression
+        // overhead and enable compression again, e.g. by building shared
+        // dictionaries that allow decompressing one value at once instead of
+        // forcing 32 values to be decompressed even when you only need one?
+        if (uncompressedBlockLength >= BINARY_LENGTH_COMPRESSION_THRESHOLD * numDocsInCurrentBlock) {
+          LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
+        } else {
+          LZ4.compress(block, 0, uncompressedBlockLength, data, noCompHt);

Review comment:
       Hmm, do we know that our new `LZ4.NoCompressionHashTable` is actually really close to doing nothing?  I don't understand `LZ4` well enough to know that e.g. returning `-1` from the `int get(int offset)` method is really a no-op overall...
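
For reference, a minimal sketch of why a match table that always answers `-1` should reduce LZ4 to a near-copy: with no back-references available, the block format only permits literal runs, so the output is a one-byte token, a few length bytes, and the input copied verbatim. This is an illustration only; the class below is hypothetical, not the actual `LZ4.NoCompressionHashTable`.

```java
import java.io.ByteArrayOutputStream;

// Hedged illustration only -- not the real Lucene class. If the match finder
// never reports an earlier occurrence of the current window (always -1), the
// encoder can only emit literals.
final class LiteralOnlyLz4Sketch {

  // Pretend match finder mirroring the idea under review: never find a match.
  static int get(int offset) {
    return -1;
  }

  // Encodes src as a single literal-only LZ4 sequence: token + length bytes + literals.
  static byte[] encodeLiteralsOnly(byte[] src) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int len = src.length;
    out.write(Math.min(len, 15) << 4);        // token: literal length in the high nibble
    for (int remaining = len - 15; remaining >= 0; remaining -= 255) {
      out.write(Math.min(remaining, 255));    // length extension bytes; 255 means "more follow"
    }
    out.write(src, 0, len);                   // the literals themselves, copied verbatim
    return out.toByteArray();
  }
}
```

So the on-disk overhead of "compressing" with a never-matching table should be a handful of header bytes per block; whether the Java-level work around it is equally negligible is the open question.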

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -762,6 +764,97 @@ public BytesRef binaryValue() throws IOException {
   // Decompresses blocks of binary values to retrieve content
   class BinaryDecoder {
     
+    private final LongValues addresses;
+    private final IndexInput compressedData;
+    // Cache of last uncompressed block 
+    private long lastBlockId = -1;
+    private final ByteBuffer deltas;
+    private int numBytes;
+    private int uncompressedBlockLength;  
+    private int avgLength;
+    private final byte[] uncompressedBlock;
+    private final BytesRef uncompressedBytesRef;
+    private final int docsPerChunk;
+    private final int docsPerChunkShift;
+    
+    public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize, int docsPerChunkShift) {
+      super();
+      this.addresses = addresses;
+      this.compressedData = compressedData;
+      // pre-allocate a byte array large enough for the biggest uncompressed block needed.
+      this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+      uncompressedBytesRef = new BytesRef(uncompressedBlock);
+      this.docsPerChunk = 1 << docsPerChunkShift;
+      this.docsPerChunkShift = docsPerChunkShift;
+      deltas = ByteBuffer.allocate((docsPerChunk + 1) * Integer.BYTES);
+      deltas.order(ByteOrder.LITTLE_ENDIAN);
+    }
+
+    private void decodeBlock(int blockId) throws IOException {
+      long blockStartOffset = addresses.get(blockId);
+      compressedData.seek(blockStartOffset);
+
+      final long token = compressedData.readVLong();
+      uncompressedBlockLength = (int) (token >>> 4);
+      avgLength = uncompressedBlockLength >>> docsPerChunkShift;
+      numBytes = (int) (token & 0x0f);
+      switch (numBytes) {
+        case Integer.BYTES:
+          deltas.putInt(0, (int) 0);
+          compressedData.readBytes(deltas.array(), Integer.BYTES, docsPerChunk * Integer.BYTES);
+          break;
+        case Byte.BYTES:
+          compressedData.readBytes(deltas.array(), Byte.BYTES, docsPerChunk * Byte.BYTES);
+          break;
+        case 0:
+          break;
+        default:
+          throw new CorruptIndexException("Invalid number of bytes: " + numBytes, compressedData);
+      }
+
+      if (uncompressedBlockLength == 0) {
+        uncompressedBytesRef.offset = 0;
+        uncompressedBytesRef.length = 0;
+      } else {
+        assert uncompressedBlockLength <= uncompressedBlock.length;
+        LZ4.decompress(compressedData, uncompressedBlockLength, uncompressedBlock);
+      }
+    }
+
+    BytesRef decode(int docNumber) throws IOException {
+      int blockId = docNumber >> docsPerChunkShift; 
+      int docInBlockId = docNumber % docsPerChunk;
+      assert docInBlockId < docsPerChunk;
+      
+      
+      // already read and uncompressed?
+      if (blockId != lastBlockId) {
+        decodeBlock(blockId);
+        lastBlockId = blockId;
+      }
+
+      int startDelta = 0, endDelta = 0;
+      switch (numBytes) {
+        case Integer.BYTES:
+          startDelta = deltas.getInt(docInBlockId * Integer.BYTES);
+          endDelta = deltas.getInt((docInBlockId + 1) * Integer.BYTES);

Review comment:
       Doesn't that mean we need to write an extra value (33 total)?  I tried to find where we are doing that (above) but could not.
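
For context, a small sketch of the offset arithmetic the writer and reader hunks above appear to imply (assuming 32 docs per block and a 33-slot `deltas` buffer whose first slot is the implicit zero the reader supplies itself, so only 32 deltas go on disk; the class name is illustrative):

```java
// Hedged sketch of the reconstruction implied by the hunks above: deltas has
// 33 entries, deltas[0] == 0 is synthesized on the reading side, and the
// remaining 32 entries are the deltas read from disk.
final class BlockOffsetsSketch {
  /** Returns {startOffset, length} of value docInBlockId inside the uncompressed block. */
  static int[] valueBounds(int docInBlockId, int avgLength, int[] deltas) {
    int start = docInBlockId * avgLength + deltas[docInBlockId];
    int end = (docInBlockId + 1) * avgLength + deltas[docInBlockId + 1];
    return new int[] { start, end - start };
  }
}
```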

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -64,6 +63,8 @@
 /** writer for {@link Lucene80DocValuesFormat} */
 final class Lucene80DocValuesConsumer extends DocValuesConsumer implements Closeable {
 
+  private static final int BINARY_LENGTH_COMPRESSION_THRESHOLD = 32;

Review comment:
       Maybe add a comment?  This threshold is applied to the average length across all (32) docs in each block?
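
One possible wording, matching how the constant is used in the writer hunk above (`uncompressedBlockLength >= BINARY_LENGTH_COMPRESSION_THRESHOLD * numDocsInCurrentBlock`, i.e. an average-length test); this is only a suggestion:

```java
/**
 * Blocks of binary doc values are only LZ4-compressed when the average value
 * length across the 32 docs in the block is at least this many bytes.
 */
private static final int BINARY_LENGTH_COMPRESSION_THRESHOLD = 32;
```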

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -404,32 +406,51 @@ private void flushData() throws IOException {
         // Write offset to this block to temporary offsets file
         totalChunks++;
         long thisBlockStartPointer = data.getFilePointer();
-        
-        // Optimisation - check if all lengths are same

Review comment:
       I wonder if removing this optimization might be hurting our usage ... not certain.

   With this change, when all lengths are the same, do we now write 32 zero bytes?
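
As a point of reference, a minimal sketch of the writer's delta computation for the all-equal-lengths case (class name illustrative; it mirrors the loop in the hunk further up, where equal lengths make every delta zero, so `numBytes` stays 0 and nothing beyond the token vlong is written):

```java
// Mirrors the delta/numBytes loop from the writer hunk above. If all 32 values
// in the block share the same length L, then avgLength == L, every delta is 0,
// and numBytes stays 0, i.e. no per-doc length bytes are written at all.
final class DeltaWidthSketch {
  static int computeNumBytes(int[] docLengths) {
    int uncompressedBlockLength = 0;
    for (int len : docLengths) {
      uncompressedBlockLength += len;
    }
    final int avgLength = uncompressedBlockLength >>> 5;  // 32 docs per block
    int offset = 0;
    int numBytes = 0;
    for (int i = 0; i < docLengths.length; ++i) {
      offset += docLengths[i];
      int delta = offset - avgLength * (i + 1);
      if (delta < Byte.MIN_VALUE || delta > Byte.MAX_VALUE) {
        return Integer.BYTES;   // deltas need 4 bytes each
      } else if (delta != 0) {
        numBytes = Byte.BYTES;  // 1-byte deltas suffice
      }
    }
    return numBytes;
  }
}
```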

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org