Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/06/11 20:33:27 UTC

[GitHub] [lucene-solr] jpountz commented on a change in pull request #1543: LUCENE-9378: Disable compression on binary values whose length is less than 32.

jpountz commented on a change in pull request #1543:
URL: https://github.com/apache/lucene-solr/pull/1543#discussion_r439048284



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -404,32 +406,51 @@ private void flushData() throws IOException {
         // Write offset to this block to temporary offsets file
         totalChunks++;
         long thisBlockStartPointer = data.getFilePointer();
-        
-        // Optimisation - check if all lengths are same

Review comment:
       If all docs are the same length, then `numBytes` would be 0 below and we only encode the average length, so this case is still optimized.
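
       To illustrate (a minimal sketch of my own with made-up constants, not code from the PR): with uniform lengths, every delta against the expected cumulative length is zero, so `numBytes` stays 0 and only the block length carried in the token (from which the average is derived) gets written.

```java
// Sketch: uniform doc lengths => all deltas are zero => numBytes stays 0.
int docsPerBlock = 32;                           // BINARY_DOCS_PER_COMPRESSED_BLOCK
int[] docLengths = new int[docsPerBlock];
java.util.Arrays.fill(docLengths, 10);           // every doc is 10 bytes long
int uncompressedBlockLength = 10 * docsPerBlock; // 320
int avgLength = uncompressedBlockLength >>> 5;   // shift of 5 since 32 docs per block -> 10

int offset = 0;
int numBytes = 0;
for (int i = 0; i < docsPerBlock; ++i) {
  offset += docLengths[i];
  int delta = offset - avgLength * (i + 1);      // always 0 when all lengths are equal
  if (delta != 0) {
    numBytes = Math.max(numBytes, Byte.BYTES);
  }
}
// numBytes == 0: beyond the token, nothing per-doc is written for this block.
```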

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -762,6 +764,97 @@ public BytesRef binaryValue() throws IOException {
   // Decompresses blocks of binary values to retrieve content
   class BinaryDecoder {
     
+    private final LongValues addresses;
+    private final IndexInput compressedData;
+    // Cache of last uncompressed block 
+    private long lastBlockId = -1;
+    private final ByteBuffer deltas;
+    private int numBytes;
+    private int uncompressedBlockLength;  
+    private int avgLength;
+    private final byte[] uncompressedBlock;
+    private final BytesRef uncompressedBytesRef;
+    private final int docsPerChunk;
+    private final int docsPerChunkShift;
+    
+    public BinaryDecoder(LongValues addresses, IndexInput compressedData, int biggestUncompressedBlockSize, int docsPerChunkShift) {
+      super();
+      this.addresses = addresses;
+      this.compressedData = compressedData;
+      // pre-allocate a byte array large enough for the biggest uncompressed block needed.
+      this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
+      uncompressedBytesRef = new BytesRef(uncompressedBlock);
+      this.docsPerChunk = 1 << docsPerChunkShift;
+      this.docsPerChunkShift = docsPerChunkShift;
+      deltas = ByteBuffer.allocate((docsPerChunk + 1) * Integer.BYTES);
+      deltas.order(ByteOrder.LITTLE_ENDIAN);
+    }
+
+    private void decodeBlock(int blockId) throws IOException {
+      long blockStartOffset = addresses.get(blockId);
+      compressedData.seek(blockStartOffset);
+
+      final long token = compressedData.readVLong();
+      uncompressedBlockLength = (int) (token >>> 4);
+      avgLength = uncompressedBlockLength >>> docsPerChunkShift;
+      numBytes = (int) (token & 0x0f);
+      switch (numBytes) {
+        case Integer.BYTES:
+          deltas.putInt(0, (int) 0);
+          compressedData.readBytes(deltas.array(), Integer.BYTES, docsPerChunk * Integer.BYTES);
+          break;
+        case Byte.BYTES:
+          compressedData.readBytes(deltas.array(), Byte.BYTES, docsPerChunk * Byte.BYTES);
+          break;
+        case 0:
+          break;
+        default:
+          throw new CorruptIndexException("Invalid number of bytes: " + numBytes, compressedData);
+      }
+
+      if (uncompressedBlockLength == 0) {
+        uncompressedBytesRef.offset = 0;
+        uncompressedBytesRef.length = 0;
+      } else {
+        assert uncompressedBlockLength <= uncompressedBlock.length;
+        LZ4.decompress(compressedData, uncompressedBlockLength, uncompressedBlock);
+      }
+    }
+
+    BytesRef decode(int docNumber) throws IOException {
+      int blockId = docNumber >> docsPerChunkShift; 
+      int docInBlockId = docNumber % docsPerChunk;
+      assert docInBlockId < docsPerChunk;
+      
+      
+      // already read and uncompressed?
+      if (blockId != lastBlockId) {
+        decodeBlock(blockId);
+        lastBlockId = blockId;
+      }
+
+      int startDelta = 0, endDelta = 0;
+      switch (numBytes) {
+        case Integer.BYTES:
+          startDelta = deltas.getInt(docInBlockId * Integer.BYTES);
+          endDelta = deltas.getInt((docInBlockId + 1) * Integer.BYTES);

Review comment:
       The trick I'm using is that I'm reading the 32 values into the buffer starting at offset 1, with a zero planted at offset 0. This avoids a special case for the first value of the block, but we're still writing/reading only 32 values.
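
       For reference, a minimal sketch of my own (made-up values; the avgLength arithmetic is inferred from the consumer code above) of how planting a zero at slot 0 and loading the 32 stored deltas into slots 1..32 removes the special case for the first doc:

```java
// Sketch: the extra leading zero makes doc 0 look like every other doc.
int docsPerChunk = 32;
int avgLength = 10;                          // example average doc length
int[] storedDeltas = new int[docsPerChunk];  // the 32 deltas read from disk

int[] deltas = new int[docsPerChunk + 1];
deltas[0] = 0;                               // implicit delta before the first doc
System.arraycopy(storedDeltas, 0, deltas, 1, docsPerChunk);

int docInBlockId = 0;                        // works unchanged for the first doc
int startOffset = avgLength * docInBlockId + deltas[docInBlockId];
int endOffset = avgLength * (docInBlockId + 1) + deltas[docInBlockId + 1];
int length = endOffset - startOffset;        // the doc's length, computed uniformly
```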

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -404,32 +406,51 @@ private void flushData() throws IOException {
         // Write offset to this block to temporary offsets file
         totalChunks++;
         long thisBlockStartPointer = data.getFilePointer();
-        
-        // Optimisation - check if all lengths are same
-        boolean allLengthsSame = true;
-        for (int i = 1; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; i++) {
-          if (docLengths[i] != docLengths[i-1]) {
-            allLengthsSame = false;
+
+        final int avgLength = uncompressedBlockLength >>> Lucene80DocValuesFormat.BINARY_BLOCK_SHIFT;
+        int offset = 0;
+        // Turn docLengths into deltas from expected values from the average length
+        for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+          offset += docLengths[i];
+          docLengths[i] = offset - avgLength * (i + 1);
+        }
+        int numBytes = 0;
+        for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+          if (docLengths[i] < Byte.MIN_VALUE || docLengths[i] > Byte.MAX_VALUE) {
+            numBytes = Integer.BYTES;
             break;
+          } else if (docLengths[i] != 0) {
+            numBytes = Math.max(numBytes, Byte.BYTES);
           }
         }
-        if (allLengthsSame) {
-            // Only write one value shifted. Steal a bit to indicate all other lengths are the same
-            int onlyOneLength = (docLengths[0] <<1) | 1;
-            data.writeVInt(onlyOneLength);
-        } else {
-          for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; i++) {
-            if (i == 0) {
-              // Write first value shifted and steal a bit to indicate other lengths are to follow
-              int multipleLengths = (docLengths[0] <<1);
-              data.writeVInt(multipleLengths);              
-            } else {
-              data.writeVInt(docLengths[i]);
-            }
+        data.writeVLong((((long) uncompressedBlockLength) << 4) | numBytes);
+
+        if (numBytes == Integer.BYTES) {
+          // encode deltas as ints
+          for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+            data.writeInt(Integer.reverseBytes(docLengths[i]));
+          }
+        } else if (numBytes == 1) {
+          for (int i = 0; i < Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK; ++i) {
+            data.writeByte((byte) docLengths[i]);
           }
+        } else if (numBytes != 0) {
+          throw new AssertionError();
         }
+
         maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength);
-        LZ4.compress(block, 0, uncompressedBlockLength, data, ht);
+
+        // Compression proved to hurt latency in some cases, so we're only
+        // enabling it on long inputs for now. Can we reduce the compression
+        // overhead and enable compression again, e.g. by building shared
+        // dictionaries that allow decompressing one value at once instead of
+        // forcing 32 values to be decompressed even when you only need one?
+        if (uncompressedBlockLength >= BINARY_LENGTH_COMPRESSION_THRESHOLD * numDocsInCurrentBlock) {
+          LZ4.compress(block, 0, uncompressedBlockLength, data, highCompHt);
+        } else {
+          LZ4.compress(block, 0, uncompressedBlockLength, data, noCompHt);

Review comment:
       I verified this. In that case the compressed stream consists only of a length encoded as a variable-length integer, followed by the raw bytes. I also checked whether it helped to just call `IndexOutput#writeBytes`, but the benchmark didn't suggest any performance improvement (or regression), so I kept this approach, which keeps a single format to handle on the read side.
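
       To put a number on the overhead (back-of-the-envelope arithmetic of my own, assuming the stream for an incompressible block is just the variable-length length header followed by the raw bytes, as described above):

```java
// Sketch: cost of the length header vs. a bare IndexOutput#writeBytes.
int uncompressedBlockLength = 32 * 15;       // e.g. a block of 32 docs of ~15 bytes each
int headerBytes = 0;
for (int v = uncompressedBlockLength; ; v >>>= 7) {
  headerBytes++;                             // a VInt encodes 7 bits per byte
  if ((v & ~0x7F) == 0) {
    break;
  }
}
// headerBytes == 2 for a 480-byte block, i.e. well under 1% overhead per block,
// consistent with the benchmark showing no measurable difference.
```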




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org