You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/22 11:31:46 UTC

[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r562555734



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
       meta.writeLong(data.getFilePointer() - start); // ordsLength
     }
 
-    addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
+    int valuesCount = values.getValueCount();
+    switch (mode) {

Review comment:
       Could we have a "if" instead of a switch? The if could call the right method on the SortedSetDocValues singleton.
   (for a switch, we should handle the "default" case with an exception but I don't think it useful here)

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
+        bufferedOutput = maybeGrowBuffer(bufferedOutput, reservedSize);
+        bufferedOutput.writeByte(
+            (byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
+
+        if (prefixLength >= 15) {
+          bufferedOutput.writeVInt(prefixLength - 15);
+        }
+        if (suffixLength >= 16) {
+          bufferedOutput.writeVInt(suffixLength - 16);
+        }
+        bufferedOutput.writeBytes(term.bytes, term.offset + prefixLength, suffixLength);
+      }
+      maxLength = Math.max(maxLength, termLength);
+      previous.copyBytes(term);
+      ++ord;
+    }
+
+    // Compress and write out the last block
+    if (bufferedOutput.getPosition() > 0) {
+      int uncompressedLength = bufferedOutput.getPosition();
+      data.writeVInt(uncompressedLength);
+      maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+      long before = data.getFilePointer();
+      LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+      int compressedLength = (int) (data.getFilePointer() - before);
+      maxBlockLength = Math.max(maxBlockLength, compressedLength);
+    }
+
+    writer.finish();
+    meta.writeInt(maxLength);
+    // Write one more int for storing max block length. For compressed terms dict only.
+    meta.writeInt(maxBlockLength);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+    start = data.getFilePointer();
+    addressBuffer.copyTo(data);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+
+    // Now write the reverse terms index
+    writeTermsIndex(values);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int newLength = (originalLength + termLength) << 1;

Review comment:
       This is the job of ArrayUtil.grow to double the size and pad if needed. We can simply call ArrayUtil.grow(termsDictBuffer, termsDictBuffer.length + termLength) and remove line 903.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       What is PADDING_LENGTH?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -370,6 +378,11 @@ public void close() throws IOException {
     long termsIndexLength;
     long termsIndexAddressesOffset;
     long termsIndexAddressesLength;
+
+    boolean compressed;
+    // Reserved for support other compressors.
+    int compressorCode;

Review comment:
       I'm not sure we need this compressorCode field. The current code does not use it and this future seems too fuzzy.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.

Review comment:
       This first time I read this line I thought it was reserved in the output stream, but no it's in the buffer. Maybe it would be clearer to just put this suffixLength + 11 directly in the next line.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -280,12 +275,25 @@ private SortedSetEntry readSortedSet(IndexInput meta) throws IOException {
 
   private static void readTermDict(IndexInput meta, TermsDictEntry entry) throws IOException {
     entry.termsDictSize = meta.readVLong();
-    entry.termsDictBlockShift = meta.readInt();
+    int termsDictBlockCode = meta.readInt();
+    if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
+      // This is a LZ4 compressed block.
+      entry.compressed = true;
+      entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE >> 16;

Review comment:
       This could be simply (my prefered option)
   entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
   entry.compressorCode = Lucene80DocValuesFormat.TERMS_DICT_COMPRESSOR_LZ4_CODE;
   
   or if we want to prepare a future support of another compression (which seems not obvious to me because we would probably need more code change)
   line 279: if (termsDictBlockCode > 1 << 16) { // with a 1 << 16 constant which means "compression code"
   and
   entry.termsDictBlockShift = termsDictBlockCode >> 16;
   entry.compressorCode = termsDictBlockCode & 0xFFFF;
   

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);

Review comment:
       This line could be grouped with line 839 with a double Math.max.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {

Review comment:
       This method shares many lines identical to addTermsDict(). As a reviewer I had to check line by line for the diff.
   IMO it would be clearer to keep only one method and insert some compression-conditional code in addTermsDict().

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -962,6 +1078,21 @@ public SortedDocValues getSorted(FieldInfo field) throws IOException {
     addressesWriter.finish();
     meta.writeLong(data.getFilePointer() - start); // addressesLength
 
-    addTermsDict(values);
+    long valuesCount = values.getValueCount();

Review comment:
       Could we factor this code? It is nearly identical to the new code in doAddSortedField().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org