Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/17 16:08:02 UTC

[GitHub] [lucene-solr] jaisonbi opened a new pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

jaisonbi opened a new pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213


   # Description
   
   Elasticsearch's keyword field uses SortedSet DocValues. In our applications, "keyword" is the most frequently used field type.
   LUCENE-7081 added prefix compression for the doc values terms dictionary. We can do better by adding LZ4 compression on top of the current prefix compression. In one of our applications, the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
   
   This feature targets high-cardinality fields only.
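As a rough sketch of the redundancy this change exploits (plain Java with a hypothetical helper, not the actual Lucene code): consecutive terms in a sorted dictionary often share a long prefix, so prefix compression stores only suffixes, and LZ4 then squeezes the remaining repetition within a block.

```java
// Illustration only: computes how much of each term is shared with its
// predecessor in a sorted terms dictionary. Hypothetical helper, not Lucene code.
public class PrefixSharingDemo {
  // Length of the common prefix of two byte arrays.
  static int sharedPrefixLength(byte[] a, byte[] b) {
    int limit = Math.min(a.length, b.length);
    int i = 0;
    while (i < limit && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  public static void main(String[] args) {
    byte[] prev = "keyword_2021_01_16".getBytes();
    byte[] curr = "keyword_2021_01_17".getBytes();
    int prefix = sharedPrefixLength(prev, curr);
    // Only the suffix after the shared prefix needs to be stored.
    System.out.println("shared=" + prefix + " suffix=" + (curr.length - prefix));
  }
}
```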
   
   # Tests
   
   See Jira LUCENE-9663 for details.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564318463



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -280,12 +275,25 @@ private SortedSetEntry readSortedSet(IndexInput meta) throws IOException {
 
   private static void readTermDict(IndexInput meta, TermsDictEntry entry) throws IOException {
     entry.termsDictSize = meta.readVLong();
-    entry.termsDictBlockShift = meta.readInt();
+    int termsDictBlockCode = meta.readInt();
+    if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
+      // This is a LZ4 compressed block.
+      entry.compressed = true;
+      entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE >> 16;

Review comment:
       Will choose the first option; it will be easier to read...






[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771296030


   If I understood correctly, the route via PerFieldDocValuesFormat needs a change to how SortedSetDocValues is used.
   The idea is to add another constructor for enabling terms dict compression, as below:
   ```
   public SortedSetDocValuesField(String name, BytesRef bytes, boolean compression) {
       super(name, compression ? COMPRESSION_TYPE: TYPE);
       fieldsData = bytes;
   }
   ```
   And below is the definition of COMPRESSION_TYPE:
   ```
   public static final FieldType COMPRESSION_TYPE = new FieldType();
     static {
       COMPRESSION_TYPE.setDocValuesType(DocValuesType.SORTED_SET);
       // add one new attribute for telling PerFieldDocValuesFormat that terms dict compression is enabled for this field
       COMPRESSION_TYPE.putAttribute("docvalue.sortedset.compression", "true");
       COMPRESSION_TYPE.freeze();
     }
   ```
   Not sure if I've got it right :)
   
   @msokolov @bruno-roustant 
   




[GitHub] [lucene-solr] jaisonbi closed pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi closed pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213


   




[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564319900



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.

Review comment:
       Makes sense; no need to introduce the new variable "reservedSize"...






[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r562555734



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
       meta.writeLong(data.getFilePointer() - start); // ordsLength
     }
 
-    addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
+    int valuesCount = values.getValueCount();
+    switch (mode) {

Review comment:
       Could we have an "if" instead of a switch? The "if" could call the right method on the SortedSetDocValues singleton.
   (For a switch we should handle the "default" case with an exception, but I don't think that is useful here.)

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
+        bufferedOutput = maybeGrowBuffer(bufferedOutput, reservedSize);
+        bufferedOutput.writeByte(
+            (byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
+
+        if (prefixLength >= 15) {
+          bufferedOutput.writeVInt(prefixLength - 15);
+        }
+        if (suffixLength >= 16) {
+          bufferedOutput.writeVInt(suffixLength - 16);
+        }
+        bufferedOutput.writeBytes(term.bytes, term.offset + prefixLength, suffixLength);
+      }
+      maxLength = Math.max(maxLength, termLength);
+      previous.copyBytes(term);
+      ++ord;
+    }
+
+    // Compress and write out the last block
+    if (bufferedOutput.getPosition() > 0) {
+      int uncompressedLength = bufferedOutput.getPosition();
+      data.writeVInt(uncompressedLength);
+      maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+      long before = data.getFilePointer();
+      LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+      int compressedLength = (int) (data.getFilePointer() - before);
+      maxBlockLength = Math.max(maxBlockLength, compressedLength);
+    }
+
+    writer.finish();
+    meta.writeInt(maxLength);
+    // Write one more int for storing max block length. For compressed terms dict only.
+    meta.writeInt(maxBlockLength);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+    start = data.getFilePointer();
+    addressBuffer.copyTo(data);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+
+    // Now write the reverse terms index
+    writeTermsIndex(values);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int newLength = (originalLength + termLength) << 1;

Review comment:
       This is the job of ArrayUtil.grow, which doubles the size and pads if needed. We can simply call ArrayUtil.grow(termsDictBuffer, termsDictBuffer.length + termLength) and remove line 903.
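For illustration, the growth behavior that makes a single ArrayUtil.grow call sufficient here can be sketched in plain Java (a simplified stand-in; Lucene's actual ArrayUtil uses a more careful oversizing heuristic):

```java
import java.util.Arrays;

public class GrowDemo {
  // Simplified stand-in for ArrayUtil.grow(byte[], minSize): returns the same
  // array if it is already large enough, otherwise a copy of at least minSize
  // bytes, grown geometrically so repeated appends stay amortized O(1).
  static byte[] grow(byte[] array, int minSize) {
    if (array.length >= minSize) {
      return array;
    }
    int newSize = Math.max(minSize, array.length + (array.length >> 1)); // ~1.5x growth
    return Arrays.copyOf(array, newSize);
  }

  public static void main(String[] args) {
    byte[] buf = new byte[16];
    buf = grow(buf, 100); // the caller only states the minimum size it needs
    System.out.println(buf.length);
  }
}
```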

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       What is PADDING_LENGTH?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -370,6 +378,11 @@ public void close() throws IOException {
     long termsIndexLength;
     long termsIndexAddressesOffset;
     long termsIndexAddressesLength;
+
+    boolean compressed;
+    // Reserved for support other compressors.
+    int compressorCode;

Review comment:
       I'm not sure we need this compressorCode field. The current code does not use it, and this future use seems too fuzzy.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.

Review comment:
       The first time I read this line I thought it was reserved in the output stream, but no, it's in the buffer. Maybe it would be clearer to just put suffixLength + 11 directly in the next line.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -280,12 +275,25 @@ private SortedSetEntry readSortedSet(IndexInput meta) throws IOException {
 
   private static void readTermDict(IndexInput meta, TermsDictEntry entry) throws IOException {
     entry.termsDictSize = meta.readVLong();
-    entry.termsDictBlockShift = meta.readInt();
+    int termsDictBlockCode = meta.readInt();
+    if (Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE == termsDictBlockCode) {
+      // This is a LZ4 compressed block.
+      entry.compressed = true;
+      entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE >> 16;

Review comment:
       This could simply be (my preferred option):
   ```
   entry.termsDictBlockShift = Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
   entry.compressorCode = Lucene80DocValuesFormat.TERMS_DICT_COMPRESSOR_LZ4_CODE;
   ```
   Or, if we want to prepare future support of another compression (which seems not obvious to me because we would probably need more code changes), at line 279:
   ```
   if (termsDictBlockCode > 1 << 16) { // with a 1 << 16 constant which means "compression code"
   ```
   and
   ```
   entry.termsDictBlockShift = termsDictBlockCode >> 16;
   entry.compressorCode = termsDictBlockCode & 0xFFFF;
   ```
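The packing scheme under discussion (block shift in the high 16 bits, compressor code in the low 16) can be illustrated with plain arithmetic; the concrete values below are made up for the example, not the real Lucene80DocValuesFormat constants:

```java
public class BlockCodeDemo {
  // Packs a block shift and a compressor code into one int:
  // high 16 bits = shift, low 16 bits = compressor code.
  static int pack(int blockShift, int compressorCode) {
    return (blockShift << 16) | compressorCode;
  }

  public static void main(String[] args) {
    int code = pack(6, 1); // hypothetical shift = 6, LZ4 compressor code = 1
    boolean compressed = code > (1 << 16); // high bits set => a compression code
    int blockShift = code >>> 16;
    int compressorCode = code & 0xFFFF;
    System.out.println(compressed + " shift=" + blockShift + " compressor=" + compressorCode);
  }
}
```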
   

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);

Review comment:
       This line could be grouped with line 839 with a double Math.max.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {

Review comment:
       This method shares many lines identical to addTermsDict(); as a reviewer I had to check line by line to find the diff.
   IMO it would be clearer to keep only one method and insert some compression-conditional code into addTermsDict().

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -962,6 +1078,21 @@ public SortedDocValues getSorted(FieldInfo field) throws IOException {
     addressesWriter.finish();
     meta.writeLong(data.getFilePointer() - start); // addressesLength
 
-    addTermsDict(values);
+    long valuesCount = values.getValueCount();

Review comment:
       Could we factor this code? It is nearly identical to the new code in doAddSortedField().






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564317038



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);

Review comment:
       OK, it looks better :)






[GitHub] [lucene-solr] bruno-roustant commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771782427


   I expect you don't need to change Lucene code, but rather write a new custom codec (with a specific name) which provides a custom DocValuesFormat. It extends PerFieldDocValuesFormat and implements the method
   `DocValuesFormat getDocValuesFormatForField(String field)`.
   This method provides either the standard Lucene80DocValuesFormat (no compression) or another new custom DocValuesFormat (with a specific name to write in the index) extending Lucene80DocValuesFormat with BEST_COMPRESSION mode.
   The choice can be made either based on a config (e.g. a file) which lists all compressed DocValues-based fields, or based on a naming convention.
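The field-to-format dispatch could be sketched roughly like this; strings stand in for DocValuesFormat instances so the example stays dependency-free, and the format names and the `.zip` suffix convention are purely illustrative assumptions:

```java
public class FormatSelectionDemo {
  // Stand-in for getDocValuesFormatForField(String field): returns the name of
  // the DocValuesFormat to use. A real implementation would return actual
  // DocValuesFormat instances (e.g. Lucene80DocValuesFormat with or without
  // compression); strings keep this sketch free of Lucene dependencies.
  static String formatForField(String field) {
    // Hypothetical naming convention: fields ending in ".zip" get the
    // compressed terms dict format, everything else the default format.
    return field.endsWith(".zip") ? "CompressedTermsDictFormat" : "Lucene80";
  }

  public static void main(String[] args) {
    System.out.println(formatForField("tags.zip"));  // compressed format
    System.out.println(formatForField("username"));  // default format
  }
}
```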




[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-767522530


   @bruno-roustant I've updated the patch according to your comments, thanks a lot:)




[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-773179495


   > To be clear, I'm not suggesting that we add a new Codec in Lucene, but that you create a new custom Codec on your side, as a Lucene expert user
   
   I think this improvement is also valuable to other Lucene/ES users, so I hope I can complete this PR :-) I just committed the new changes, but some checks are still running... In the latest commit, I removed the compression mode from Lucene80DocValuesFormat but added separate switches for terms-dict compression and binary DocValues compression.
   
   Thanks very much @bruno-roustant 




[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-772993254


   Thanks for the detailed suggestion. @bruno-roustant 
   
   Will add a **new custom codec** to support compressed and uncompressed modes. I will name it "Lucene90ExpertCodec"; with this codec, users will have more compression choices, such as terms dict compression and binary DocValues compression.
   
   One doubt: PerFieldDocValuesFormat provides a single method for getting the format for a field name:
   `public abstract DocValuesFormat getDocValuesFormatForField(String field);`
   so we would need to use the approach @bruno-roustant suggested:
   > The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.
   
   We would need to maintain the config file, and the field name list may change over time. Regarding the "naming convention" approach, we would need to impose rules on field name definitions (please correct me if I am wrong). I am afraid it's not simple enough :)
   
   There are two options:
   1. Use a global switch in "Lucene90ExpertCodec", so terms-dict compression/binary DocValues compression can be enabled easily.
   2. Based on PerFieldDocValuesFormat, but add one more method:
        `public abstract DocValuesFormat getDocValuesFormatForField(FieldInfo field);`
       We could add a new attribute to FieldInfo to identify compression-enabled fields.
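Option 2 could be sketched roughly like this. FieldInfo is simulated here with a plain attribute map, and the `dv.compression` attribute key and the format names are made-up placeholders for illustration, not real Lucene identifiers:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of option 2: pick a per-field DocValues format based on
// a field attribute instead of the field name alone. The attribute key and
// format names below are assumptions, not part of Lucene.
public class PerFieldFormatChooser {

  // Stand-in for FieldInfo#getAttribute(String): attributes as a plain map.
  static String formatForField(Map<String, String> fieldAttributes) {
    // Fields explicitly tagged with lz4 get the compressing format.
    if ("lz4".equals(fieldAttributes.get("dv.compression"))) {
      return "CompressedTermsDictFormat";
    }
    return "Lucene80";
  }

  public static void main(String[] args) {
    Map<String, String> tagged = new HashMap<>();
    tagged.put("dv.compression", "lz4");
    System.out.println(formatForField(tagged));          // CompressedTermsDictFormat
    System.out.println(formatForField(new HashMap<>())); // Lucene80
  }
}
```

The point of this variant is that the decision travels with the field metadata, so no external config file or naming rule needs to be maintained.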



[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r565127459



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       I didn't find a good solution to this problem, since terms dict compression cannot currently be enabled by default. There are two options:
   1. Introduce more configuration/modes; this would make things more complicated.
   2. Optimize BinaryDocValues compression.
   
   We will try this feature in our cluster first and see the actual performance, so can we resolve this in another issue? Or should we introduce more configuration in this patch?





[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563785114



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       Ok, in this case can we rename it LZ4_DECOMPRESSOR_PADDING and add this comment about the decompression speed?





[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-773200323


   Sorry, I did something wrong while reverting; I will create a new PR. @bruno-roustant 



[GitHub] [lucene-solr] bruno-roustant commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-769672391


   +1 to customizing via PerFieldDocValuesFormat. This approach is the preferred design, as discussed earlier in LUCENE-9378 https://github.com/apache/lucene-solr/pull/2069/files#r520035432.



[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r565404249



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       @zhaih could you be more specific about this? Could you post something in the Jira issue discussion to explain how this compression is harmful for BinaryDocValues? It needs to be more visible than just in this review.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -790,6 +833,28 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private int compressAndGetTermsDictBlockLength(
+      ByteArrayDataOutput bufferedOutput, LZ4.FastCompressionHashTable ht) throws IOException {
+    int uncompressedLength = bufferedOutput.getPosition();
+    data.writeVInt(uncompressedLength);
+    long before = data.getFilePointer();
+    LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+    int compressedLength = (int) (data.getFilePointer() - before);
+    // Block length will be used for creating buffer for decompression, one corner case is that
+    // compressed length might be bigger than un-compressed length, so just return the bigger one.
+    return Math.max(uncompressedLength, compressedLength);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int targetLength = (originalLength + termLength) << 1;

Review comment:
       The idea is to remove this line 851 (double the size) to let the ArrayUtil.oversize() inside ArrayUtil.grow() decide the appropriate size growth (exponential growth with less memory waste than the classic 2x).
   The following line would become something like:
   termsDictBuffer = ArrayUtil.grow(termsDictBuffer, pos + termLength)





[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564271912



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       ok..will rename and add this comment:)






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563698470



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
       meta.writeLong(data.getFilePointer() - start); // ordsLength
     }
 
-    addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
+    int valuesCount = values.getValueCount();
+    switch (mode) {

Review comment:
       yes, should use "if" instead of "switch", thanks:)





[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r565760915



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       Terms dict compression and BinaryDocValues compression currently share the same configuration, which means the two are enabled and disabled together. But BinaryDocValues compression causes a performance reduction; please check the discussion in [LUCENE-9378](https://issues.apache.org/jira/browse/LUCENE-9378) :)





[GitHub] [lucene-solr] zhaih commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r568246420



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       Sorry for the late response. I agree we could solve it in a follow-up issue, and I could still test this via a customized PerFieldDocValuesFormat. Thank you!





[GitHub] [lucene-solr] bruno-roustant commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771782427


   I expect you don't need to change Lucene code but just write a new custom codec (with a specific name) which provides a custom DocValuesFormat. It extends PerFieldDocValuesFormat and implements the method
   DocValuesFormat getDocValuesFormatForField(String field).
   This method provides either a standard Lucene80DocValuesFormat (no compression) or another new custom DocValuesFormat (with a specific name to write in the index) extending Lucene80DocValuesFormat with BEST_COMPRESSION mode.
   The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.
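A minimal standalone sketch of the naming-convention variant might look like the following; the "_c" suffix and the format names are illustrative assumptions, not Lucene conventions:

```java
// Hypothetical sketch: route fields to a compressing DocValuesFormat purely
// by a field-name convention, so no external config file is needed. The
// suffix and format names here are invented for illustration.
public class NamingConventionChooser {

  static String docValuesFormatNameForField(String field) {
    // Convention (assumed): field names ending in "_c" use the compressed format.
    return field.endsWith("_c") ? "CompressedDocValuesFormat" : "Lucene80";
  }

  public static void main(String[] args) {
    System.out.println(docValuesFormatNameForField("keyword_c")); // CompressedDocValuesFormat
    System.out.println(docValuesFormatNameForField("status"));    // Lucene80
  }
}
```

In the real codec, the same predicate would sit inside getDocValuesFormatForField(String field) and return one of two DocValuesFormat instances instead of a name.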



[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-768786435


   I also want to use separate configurations, so that we can enable terms dict compression only. I plan to remove the enum Mode from Lucene80DocValuesFormat and add a new config class as below:
   
   ```
   public enum CompressionMode {
       NONE,
       LZ4
     }
   
     public static class DocValuesConfig {
       // Compression mode for terms dict from SortedSet/Sorted DocValues.
       private final CompressionMode termsDictCompressionMode;
       // Compression mode for binary DocValues.
       private final CompressionMode binaryDocValueCompressionMode;
   
       public DocValuesConfig() {
         this(CompressionMode.NONE, CompressionMode.NONE);
       }
   
       public DocValuesConfig(CompressionMode termsDictCompressionMode, CompressionMode binaryDocValueCompressionMode) {
         this.termsDictCompressionMode = termsDictCompressionMode;
         this.binaryDocValueCompressionMode = binaryDocValueCompressionMode;
       }
   
       public CompressionMode getBinaryDocValueCompressionMode() {
         return binaryDocValueCompressionMode;
       }
   
       public boolean isTermsDictCompressionEnabled() {
         return CompressionMode.LZ4 == this.termsDictCompressionMode;
       }
   
       public boolean isBinaryDocValueCompressionEnabled() {
         return CompressionMode.LZ4 == this.binaryDocValueCompressionMode;
       }
     }
   ```
   Before I commit the new changes, please share your thoughts on this.  @bruno-roustant  @zhaih 
   Thanks.
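As a quick standalone illustration of how the two switches would behave independently (condensed from the class above, not the actual patch):

```java
// Condensed, self-contained version of the proposed DocValuesConfig, just to
// show that terms-dict and binary DocValues compression toggle independently.
public class DocValuesConfigDemo {
  enum CompressionMode { NONE, LZ4 }

  static final class DocValuesConfig {
    final CompressionMode termsDict;
    final CompressionMode binary;

    DocValuesConfig(CompressionMode termsDict, CompressionMode binary) {
      this.termsDict = termsDict;
      this.binary = binary;
    }

    boolean isTermsDictCompressionEnabled() { return termsDict == CompressionMode.LZ4; }
    boolean isBinaryDocValueCompressionEnabled() { return binary == CompressionMode.LZ4; }
  }

  public static void main(String[] args) {
    // Terms-dict compression on, binary DocValues compression off.
    DocValuesConfig config = new DocValuesConfig(CompressionMode.LZ4, CompressionMode.NONE);
    System.out.println(config.isTermsDictCompressionEnabled());      // true
    System.out.println(config.isBinaryDocValueCompressionEnabled()); // false
  }
}
```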
   



[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564311781



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -962,6 +1078,21 @@ public SortedDocValues getSorted(FieldInfo field) throws IOException {
     addressesWriter.finish();
     meta.writeLong(data.getFilePointer() - start); // addressesLength
 
-    addTermsDict(values);
+    long valuesCount = values.getValueCount();

Review comment:
       will remove this:)





[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-772993254


   Thanks for the detailed suggestion:)  @bruno-roustant 
   
   Will add a **new custom codec** to support compressed and uncompressed modes. I will name it "Lucene90ExpertCodec"; with this codec, users will have more compression choices, such as terms dict compression and binary DocValues compression.
   
   One doubt: PerFieldDocValuesFormat provides a single method for getting the format for a field name:
   `public abstract DocValuesFormat getDocValuesFormatForField(String field);`
   so we would need to use the approach @bruno-roustant suggested:
   > The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.
   
   We would need to maintain the config file, and the field name list may change over time. Regarding the "naming convention" approach, we would need to impose rules on field name definitions (please correct me if I am wrong). I am afraid it's a heavy burden for users who want to enable terms-dict compression for existing indices.
   
   There are two other options:
   1. Use a global switch in "Lucene90ExpertCodec", so terms-dict compression/binary DocValues compression can be enabled easily.
   2. Based on PerFieldDocValuesFormat, but add one more method:
        `public abstract DocValuesFormat getDocValuesFormatForField(FieldInfo field);`
       We could add a new attribute to FieldInfo to identify compression-enabled fields.



[GitHub] [lucene-solr] msokolov commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r566092357



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
+        bufferedOutput = maybeGrowBuffer(bufferedOutput, reservedSize);
+        bufferedOutput.writeByte(
+            (byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
+
+        if (prefixLength >= 15) {
+          bufferedOutput.writeVInt(prefixLength - 15);
+        }
+        if (suffixLength >= 16) {
+          bufferedOutput.writeVInt(suffixLength - 16);
+        }
+        bufferedOutput.writeBytes(term.bytes, term.offset + prefixLength, suffixLength);
+      }
+      maxLength = Math.max(maxLength, termLength);
+      previous.copyBytes(term);
+      ++ord;
+    }
+
+    // Compress and write out the last block
+    if (bufferedOutput.getPosition() > 0) {
+      int uncompressedLength = bufferedOutput.getPosition();
+      data.writeVInt(uncompressedLength);
+      maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+      long before = data.getFilePointer();
+      LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+      int compressedLength = (int) (data.getFilePointer() - before);
+      maxBlockLength = Math.max(maxBlockLength, compressedLength);
+    }
+
+    writer.finish();
+    meta.writeInt(maxLength);
+    // Write one more int for storing max block length. For compressed terms dict only.
+    meta.writeInt(maxBlockLength);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+    start = data.getFilePointer();
+    addressBuffer.copyTo(data);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+
+    // Now write the reverse terms index
+    writeTermsIndex(values);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int newLength = (originalLength + termLength) << 1;

Review comment:
       I don't completely understand the discussion here - line 851 is the call to ArrayUtil.grow? It looks like there is still a line 851 :) So -- I'll just say +1 to using Lucene's existing array-growing utility. It has an exponential growth that is < 2x, and unless this use case has some very special requirement, we should stick with that. If you feel the need to try something different, please substantiate it with tests on real data that show the value.
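For readers unfamiliar with it, the sub-2x oversizing idea can be sketched in isolation like this; the 1/8 headroom used here is an illustrative approximation of what ArrayUtil.oversize does, not Lucene's exact policy:

```java
// Standalone sketch of sub-2x exponential buffer growth, to contrast with the
// "(length + termLength) << 1" doubling in the patch. The exact growth factor
// is an assumption for illustration, not Lucene's actual implementation.
public class GrowthSketch {

  // Grow to the minimum plus ~12.5% headroom, in the spirit of ArrayUtil.oversize.
  static int oversize(int minSize) {
    return minSize + (minSize >>> 3);
  }

  // Return the array itself if it is already big enough, else a bigger copy.
  static byte[] grow(byte[] array, int minSize) {
    if (array.length >= minSize) {
      return array;
    }
    byte[] bigger = new byte[oversize(minSize)];
    System.arraycopy(array, 0, bigger, 0, array.length);
    return bigger;
  }

  public static void main(String[] args) {
    byte[] buffer = new byte[16];
    buffer = grow(buffer, 100); // needs room for at least 100 bytes
    System.out.println(buffer.length >= 100); // true
  }
}
```

Doubling wastes up to ~50% of the allocation in the worst case, while a smaller exponential factor still keeps the amortized cost of repeated appends linear.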





[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771296030


   If I understood correctly, the PerFieldDocValuesFormat route needs a change to how SortedSetDocValues fields are used.
   The idea is to add another constructor for enabling terms dict compression, as below:
   ```
   public SortedSetDocValuesField(String name, BytesRef bytes, boolean compression) {
       super(name, compression ? COMPRESSION_TYPE : TYPE);
       fieldsData = bytes;
   }
   ```
   In COMPRESSION_TYPE, add a new attribute telling PerFieldDocValuesFormat that terms dict compression is enabled for this field. Not sure if I've got it right :)
   @msokolov @bruno-roustant 
   



[GitHub] [lucene-solr] zhaih commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
zhaih commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771265310


   I see, so I think for now I can test it via a customized PerFieldDocValuesFormat; I'll give the PerFieldDocValuesFormat route a try then.
   
   Though IMO I would prefer a simpler configuration (as proposed by @jaisonbi) rather than customizing via PerFieldDocValuesFormat in the future, if these two compressions show different performance characteristics. If my understanding is correct, to enable only terms dict compression using PerFieldDocValuesFormat we would need to enumerate all SSDV field names in that class? That sounds not very maintainable if fields are regularly added and deleted. Please correct me if I'm wrong, as I'm not very familiar with the codec part...
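For concreteness, below is a rough sketch of the dispatch a customized PerFieldDocValuesFormat would do in its getDocValuesFormatForField method, with the hard-coded field-name set that raises the maintainability concern. The field names and format labels are hypothetical stand-ins, not the actual Lucene API:

```java
import java.util.Set;

// Hypothetical sketch: a hand-maintained set of SSDV field names decides which
// doc-values format each field gets. Any newly added field silently falls back
// to the default until someone updates this set.
class PerFieldSketch {
    // Hypothetical field names; in a real codec these would be your index's SSDV fields.
    static final Set<String> COMPRESSED_SSDV_FIELDS = Set.of("keyword", "host", "tag");

    // Stand-in for returning a BEST_COMPRESSION vs BEST_SPEED DocValuesFormat instance.
    static String formatFor(String field) {
        return COMPRESSED_SSDV_FIELDS.contains(field) ? "BEST_COMPRESSION" : "BEST_SPEED";
    }
}
```

This illustrates the objection above: with regular field addition/deletion, the set must be kept in sync with the schema by hand.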




[GitHub] [lucene-solr] bruno-roustant commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-773199705


   @jaisonbi I didn't want to close this PR. I wanted to merge the first part related to LZ4 compression, and let you push a PR on another jira issue with the remaining commits you proposed.
   Do you want to reopen the PR?
   Or do you think the commits must not be separated?






[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771296030


   If I understood correctly, the route via PerFieldDocValuesFormat needs to change the usage of SortedSetDocValues.
   The idea is to add another constructor for enabling terms dict compression, as below:
   ```
   public SortedSetDocValuesField(String name, BytesRef bytes, boolean compression) {
       super(name, compression ? COMPRESSION_TYPE: TYPE);
       fieldsData = bytes;
   }
   ```
   And below is the definition of COMPRESSION_TYPE:
   
     ```
   public static final FieldType COMPRESSION_TYPE = new FieldType();
     static {
       COMPRESSION_TYPE.setDocValuesType(DocValuesType.SORTED_SET);
       // add one new attribute for telling PerFieldDocValuesFormat that terms dict compression is enabled for this field
       COMPRESSION_TYPE.putAttribute("docvalue.sortedset.compression", "true");
       COMPRESSION_TYPE.freeze();
     }
   ```
   Not sure if I've got it right :)
   
   @msokolov @bruno-roustant 
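If the attribute route works, the codec-side check could be as simple as the sketch below, where a plain Map stands in for Lucene's per-field attributes and the attribute key is the hypothetical one proposed above:

```java
import java.util.Map;

// Hypothetical sketch of attribute-driven dispatch: instead of enumerating field
// names, the codec reads a per-field attribute. A plain Map stands in for
// Lucene's FieldInfo attributes; the key "docvalue.sortedset.compression" is the
// one proposed in this thread, not an existing Lucene constant.
class AttributeDispatchSketch {
    // Stand-in for returning a BEST_COMPRESSION vs BEST_SPEED DocValuesFormat instance.
    static String formatFor(Map<String, String> fieldAttributes) {
        boolean compress = "true".equals(fieldAttributes.get("docvalue.sortedset.compression"));
        return compress ? "BEST_COMPRESSION" : "BEST_SPEED";
    }
}
```

Unlike a hard-coded field-name list, this keeps the decision with the field definition itself, so new fields need no codec change.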
   




[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r565767001



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -790,6 +833,28 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private int compressAndGetTermsDictBlockLength(
+      ByteArrayDataOutput bufferedOutput, LZ4.FastCompressionHashTable ht) throws IOException {
+    int uncompressedLength = bufferedOutput.getPosition();
+    data.writeVInt(uncompressedLength);
+    long before = data.getFilePointer();
+    LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+    int compressedLength = (int) (data.getFilePointer() - before);
+    // Block length will be used for creating buffer for decompression, one corner case is that
+    // compressed length might be bigger than un-compressed length, so just return the bigger one.
+    return Math.max(uncompressedLength, compressedLength);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int targetLength = (originalLength + termLength) << 1;

Review comment:
       I was just worried that relying on ArrayUtil.grow would be slower. I re-tested this method to understand its growing behavior:
   suppose the original length is 16 * 1024 and the termLength is 112, so the new length will be 18560.
   
   So removing line 851 makes sense to me; will change it immediately. Thanks @bruno-roustant
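For context, the 18560 figure follows from ArrayUtil.grow delegating to an oversize heuristic: grow by about 1/8 of the requested size, then align to 8 bytes on a 64-bit JVM. Below is a rough sketch of that arithmetic (my reimplementation for illustration; the real ArrayUtil.oversize also handles overflow and other element sizes):

```java
// Rough sketch of the growth arithmetic behind ArrayUtil.grow/oversize for byte
// arrays on a 64-bit JVM: grow by ~1/8 of the requested minimum, then round up
// to a multiple of 8. With minTargetSize = 16 * 1024 + 112 = 16496 this yields
// 16496 + 2062 = 18558, aligned up to 18560, matching the number observed above.
class GrowSketch {
    static int oversizeBytes(int minTargetSize) {
        int extra = minTargetSize >> 3;  // grow by ~12.5%
        if (extra < 3) {
            extra = 3;                   // very small arrays still grow a little
        }
        int newSize = minTargetSize + extra;
        return (newSize + 7) & ~7;       // align up to a multiple of 8
    }
}
```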






[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r562555734



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
       meta.writeLong(data.getFilePointer() - start); // ordsLength
     }
 
-    addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
+    int valuesCount = values.getValueCount();
+    switch (mode) {

Review comment:
       Could we have a "if" instead of a switch? The if could call the right method on the SortedSetDocValues singleton.
   (for a switch, we should handle the "default" case with an exception but I don't think it's useful here)






[GitHub] [lucene-solr] zhaih commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r565026524



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       Is it possible to enable compression just for `SortedDocValues` but not `BinaryDocValues`? Compressed `BinaryDocValues` has proved to be harmful for our service, but we still want to try this one out...






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563700602



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -370,6 +378,11 @@ public void close() throws IOException {
     long termsIndexLength;
     long termsIndexAddressesOffset;
     long termsIndexAddressesLength;
+
+    boolean compressed;
+    // Reserved for support other compressors.
+    int compressorCode;

Review comment:
       Will remove this; I just thought we could support more types of compression algorithms here...






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r566531012



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
+        bufferedOutput = maybeGrowBuffer(bufferedOutput, reservedSize);
+        bufferedOutput.writeByte(
+            (byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
+
+        if (prefixLength >= 15) {
+          bufferedOutput.writeVInt(prefixLength - 15);
+        }
+        if (suffixLength >= 16) {
+          bufferedOutput.writeVInt(suffixLength - 16);
+        }
+        bufferedOutput.writeBytes(term.bytes, term.offset + prefixLength, suffixLength);
+      }
+      maxLength = Math.max(maxLength, termLength);
+      previous.copyBytes(term);
+      ++ord;
+    }
+
+    // Compress and write out the last block
+    if (bufferedOutput.getPosition() > 0) {
+      int uncompressedLength = bufferedOutput.getPosition();
+      data.writeVInt(uncompressedLength);
+      maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+      long before = data.getFilePointer();
+      LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+      int compressedLength = (int) (data.getFilePointer() - before);
+      maxBlockLength = Math.max(maxBlockLength, compressedLength);
+    }
+
+    writer.finish();
+    meta.writeInt(maxLength);
+    // Write one more int for storing max block length. For compressed terms dict only.
+    meta.writeInt(maxBlockLength);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+    start = data.getFilePointer();
+    addressBuffer.copyTo(data);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+
+    // Now write the reverse terms index
+    writeTermsIndex(values);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int newLength = (originalLength + termLength) << 1;

Review comment:
       > I don't completely understand the discussion here - line 851 is the call to ArrayUtil.grow? 
   
   In the old solution:
   ```
   int newLength = (originalLength + termLength) << 1;
   ArrayUtil.grow(termsDictBuffer, newLength);
   ```
   I doubled the array length before calling ArrayUtil.grow because I was worried it might call ArrayUtil.grow many times for long-value, high-cardinality fields.
   After some tests, I found that ArrayUtil.grow grows the array with its own heuristic... so I removed the line `int newLength = (originalLength + termLength) << 1;` in the latest commit :)






[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r566003862



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
+        bufferedOutput = maybeGrowBuffer(bufferedOutput, reservedSize);
+        bufferedOutput.writeByte(
+            (byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
+
+        if (prefixLength >= 15) {
+          bufferedOutput.writeVInt(prefixLength - 15);
+        }
+        if (suffixLength >= 16) {
+          bufferedOutput.writeVInt(suffixLength - 16);
+        }
+        bufferedOutput.writeBytes(term.bytes, term.offset + prefixLength, suffixLength);
+      }
+      maxLength = Math.max(maxLength, termLength);
+      previous.copyBytes(term);
+      ++ord;
+    }
+
+    // Compress and write out the last block
+    if (bufferedOutput.getPosition() > 0) {
+      int uncompressedLength = bufferedOutput.getPosition();
+      data.writeVInt(uncompressedLength);
+      maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+      long before = data.getFilePointer();
+      LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+      int compressedLength = (int) (data.getFilePointer() - before);
+      maxBlockLength = Math.max(maxBlockLength, compressedLength);
+    }
+
+    writer.finish();
+    meta.writeInt(maxLength);
+    // Write one more int for storing max block length. For compressed terms dict only.
+    meta.writeInt(maxBlockLength);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+    start = data.getFilePointer();
+    addressBuffer.copyTo(data);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+
+    // Now write the reverse terms index
+    writeTermsIndex(values);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int newLength = (originalLength + termLength) << 1;

Review comment:
       I try to be consistent with the approach taken in Lucene: not systematically doubling the buffer size, but letting a common tool uniformly apply a chosen heuristic. Maybe @jpountz @msokolov you have an opinion to share here.






[GitHub] [lucene-solr] zhaih commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
zhaih commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-768505413


   @bruno-roustant Sorry, I didn't make it clear. I didn't mean this is harmful for `BinaryDocValues`; I meant that compression of `BinaryDocValues` makes our searching slower.
   But we do want to test this out, so I want to know whether it's possible to configure the codec so that it enables compression of `SortedDocValues` only, without enabling other compressions. It seems that for now they are bound together in the doc values compression mode.




[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563698470



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -731,7 +731,22 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
       meta.writeLong(data.getFilePointer() - start); // ordsLength
     }
 
-    addTermsDict(DocValues.singleton(valuesProducer.getSorted(field)));
+    int valuesCount = values.getValueCount();
+    switch (mode) {

Review comment:
       Yes, we should use "if" instead of "switch", thanks :)

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -370,6 +378,11 @@ public void close() throws IOException {
     long termsIndexLength;
     long termsIndexAddressesOffset;
     long termsIndexAddressesLength;
+
+    boolean compressed;
+    // Reserved for support other compressors.
+    int compressorCode;

Review comment:
       Will remove this; I just thought we could support more types of compression algorithms here...

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       Just referencing CompressionMode$LZ4_DECOMPRESSOR... it says adding 7 padding bytes can help decompression run faster...

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {

Review comment:
       I will try to optimize this method...thanks for the comment.






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563702718



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       Just referencing CompressionMode$LZ4_DECOMPRESSOR... it says adding 7 padding bytes can help decompression run faster...
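For context on the 7 bytes: a fast LZ4 decompressor copies in 8-byte chunks, so writing n bytes can touch up to the next multiple of 8, i.e. at most 7 bytes past the logical end of the output. A small illustration of that bound (my own sketch, not the Lucene code):

```java
// Illustration of why 7 padding bytes suffice: copying n bytes in 8-byte-wide
// chunks touches ceil(n/8)*8 bytes, which overshoots the logical length by at
// most 7. So a decompression buffer sized maxLength + 7 is always safe for
// such wide copies.
class PaddingSketch {
    static int bytesTouchedBy8ByteCopies(int n) {
        return ((n + 7) / 8) * 8; // round n up to a multiple of 8
    }

    static int worstCaseOvershoot(int n) {
        return bytesTouchedBy8ByteCopies(n) - n; // at most 7
    }
}
```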






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r566042201



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {
+    final long size = values.getValueCount();
+    meta.writeVLong(size);
+    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_CODE);
+
+    ByteBuffersDataOutput addressBuffer = new ByteBuffersDataOutput();
+    ByteBuffersIndexOutput addressOutput =
+        new ByteBuffersIndexOutput(addressBuffer, "temp", "temp");
+    meta.writeInt(DIRECT_MONOTONIC_BLOCK_SHIFT);
+    long numBlocks =
+        (size + Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK)
+            >>> Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_SHIFT;
+    DirectMonotonicWriter writer =
+        DirectMonotonicWriter.getInstance(
+            meta, addressOutput, numBlocks, DIRECT_MONOTONIC_BLOCK_SHIFT);
+
+    LZ4.FastCompressionHashTable ht = new LZ4.FastCompressionHashTable();
+    ByteArrayDataOutput bufferedOutput = new ByteArrayDataOutput(termsDictBuffer);
+    long ord = 0;
+    long start = data.getFilePointer();
+    int maxLength = 0;
+    TermsEnum iterator = values.termsEnum();
+    int maxBlockLength = 0;
+    BytesRefBuilder previous = new BytesRefBuilder();
+    for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
+      int termLength = term.length;
+      if ((ord & Lucene80DocValuesFormat.TERMS_DICT_BLOCK_LZ4_MASK) == 0) {
+        if (bufferedOutput.getPosition() > 0) {
+          int uncompressedLength = bufferedOutput.getPosition();
+          data.writeVInt(uncompressedLength);
+          maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+          long before = data.getFilePointer();
+          // Compress block
+          LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+          int compressedLength = (int) (data.getFilePointer() - before);
+          // Corner case: Compressed length might be bigger than un-compressed length.
+          maxBlockLength = Math.max(maxBlockLength, compressedLength);
+          bufferedOutput.reset(termsDictBuffer);
+        }
+
+        writer.add(data.getFilePointer() - start);
+        data.writeVInt(termLength);
+        data.writeBytes(term.bytes, term.offset, termLength);
+      } else {
+        final int prefixLength = StringHelper.bytesDifference(previous.get(), term);
+        final int suffixLength = term.length - prefixLength;
+        assert suffixLength > 0; // terms are unique
+        int reservedSize = suffixLength + 11; // 1 byte + 2 vint.
+        bufferedOutput = maybeGrowBuffer(bufferedOutput, reservedSize);
+        bufferedOutput.writeByte(
+            (byte) (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)));
+
+        if (prefixLength >= 15) {
+          bufferedOutput.writeVInt(prefixLength - 15);
+        }
+        if (suffixLength >= 16) {
+          bufferedOutput.writeVInt(suffixLength - 16);
+        }
+        bufferedOutput.writeBytes(term.bytes, term.offset + prefixLength, suffixLength);
+      }
+      maxLength = Math.max(maxLength, termLength);
+      previous.copyBytes(term);
+      ++ord;
+    }
+
+    // Compress and write out the last block
+    if (bufferedOutput.getPosition() > 0) {
+      int uncompressedLength = bufferedOutput.getPosition();
+      data.writeVInt(uncompressedLength);
+      maxBlockLength = Math.max(maxBlockLength, uncompressedLength);
+      long before = data.getFilePointer();
+      LZ4.compress(termsDictBuffer, 0, uncompressedLength, data, ht);
+      int compressedLength = (int) (data.getFilePointer() - before);
+      maxBlockLength = Math.max(maxBlockLength, compressedLength);
+    }
+
+    writer.finish();
+    meta.writeInt(maxLength);
+    // Write one more int for storing max block length. For compressed terms dict only.
+    meta.writeInt(maxBlockLength);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+    start = data.getFilePointer();
+    addressBuffer.copyTo(data);
+    meta.writeLong(start);
+    meta.writeLong(data.getFilePointer() - start);
+
+    // Now write the reverse terms index
+    writeTermsIndex(values);
+  }
+
+  private ByteArrayDataOutput maybeGrowBuffer(ByteArrayDataOutput bufferedOutput, int termLength) {
+    int pos = bufferedOutput.getPosition(), originalLength = termsDictBuffer.length;
+    if (pos + termLength >= originalLength - 1) {
+      int newLength = (originalLength + termLength) << 1;

Review comment:
       Just copied my comment from another conversation: 
   > I re-tested this method and tried to understand the growing behavior: suppose the original length is 16 * 1024 and the termLength is 112, then the new length will be 18560, so removing line 851 makes sense to me.
   
   So the earlier description "...it will call ArrayUtil.grow too many times..." is incorrect. I've removed line 851.
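For reference, the per-term header byte in the quoted addCompressedTermsDict hunk packs min(prefixLength, 15) into the low 4 bits and min(suffixLength - 1, 15) into the high 4 bits, with any overflow written as trailing VInts. A standalone sketch of that packing (illustrative only, not the actual Lucene class):

```java
public class TermHeaderSketch {
  // Low 4 bits: prefix length capped at 15; high 4 bits: (suffix length - 1) capped at 15.
  static int packHeader(int prefixLength, int suffixLength) {
    return (Math.min(prefixLength, 15) | (Math.min(15, suffixLength - 1) << 4)) & 0xFF;
  }

  // A decoded prefix of 15 means "read a following VInt for the remainder".
  static int unpackPrefix(int header) {
    return header & 0x0F;
  }

  // A decoded suffix of 16 means "read a following VInt for the remainder".
  static int unpackSuffix(int header) {
    return (header >>> 4) + 1;
  }

  public static void main(String[] args) {
    int h = packHeader(3, 7);
    System.out.println(unpackPrefix(h) + " " + unpackSuffix(h));   // 3 7
    int h2 = packHeader(40, 100); // both saturate; VInts would carry the overflow
    System.out.println(unpackPrefix(h2) + " " + unpackSuffix(h2)); // 15 16
  }
}
```

This is why the diff reserves `suffixLength + 11` bytes per term: one header byte plus up to two VInts (worst case 5 bytes each).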




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-772993254


   Thanks for the detailed suggestion :) @bruno-roustant 
   
   I will add a **new custom codec** to support compressed and uncompressed modes. I'd name it "Lucene90ExpertCodec"; with this codec, users will have more compression choices, like terms dict compression and binary DocValues compression.
   
   One doubt: PerFieldDocValuesFormat provides a single method that picks the format by field name,
   `public abstract DocValuesFormat getDocValuesFormatForField(String field);`
   so we would need the approach @bruno-roustant suggested:
   > The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.
   
   With a config file we would have to maintain it, and the field name list may change. With the "naming convention" approach, we would have to impose rules on field names (please correct me if I am wrong). I am afraid that is a heavy burden for users who want to enable terms dict compression for existing indices.
   
   There are two other options:
   1. Use a global switch in "Lucene90ExpertCodec", so terms dict compression and binary DocValues compression can be enabled easily.
   2. Build on PerFieldDocValuesFormat, but add one more method:
        `public abstract DocValuesFormat getDocValuesFormatForField(FieldInfo field);`
       We could add a new attribute in FieldInfo to identify the compression-enabled fields.
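A rough sketch of option 2's selection logic, using stand-ins rather than real Lucene types (the attribute key and format names below are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class FieldInfoAttributeSketch {
  static final String COMPRESSION_ATTR = "terms_dict_compressed"; // hypothetical attribute key

  // Minimal stand-in for the attribute map carried by org.apache.lucene.index.FieldInfo.
  static class FieldInfoStub {
    private final Map<String, String> attributes = new HashMap<>();
    String getAttribute(String key) { return attributes.get(key); }
    void putAttribute(String key, String value) { attributes.put(key, value); }
  }

  // The proposed getDocValuesFormatForField(FieldInfo) would branch like this.
  static String formatNameFor(FieldInfoStub field) {
    return "true".equals(field.getAttribute(COMPRESSION_ATTR))
        ? "Lucene80Compressed" // hypothetical compressing format name
        : "Lucene80";
  }

  public static void main(String[] args) {
    FieldInfoStub keyword = new FieldInfoStub();
    keyword.putAttribute(COMPRESSION_ATTR, "true");
    System.out.println(formatNameFor(keyword));             // Lucene80Compressed
    System.out.println(formatNameFor(new FieldInfoStub())); // Lucene80
  }
}
```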




[GitHub] [lucene-solr] jaisonbi closed pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi closed pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213


   




[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-771296030


   If I understood correctly, the route via PerFieldDocValuesFormat needs a change in how SortedSetDocValues is used. The idea is to add another constructor that enables terms dict compression, as below:
   ```
   public SortedSetDocValuesField(String name, BytesRef bytes, boolean compression) {
       super(name, compression ? COMPRESSION_TYPE : TYPE);
       fieldsData = bytes;
   }
   ```
   In COMPRESSION_TYPE, add one new attribute telling PerFieldDocValuesFormat that terms dict compression is enabled for this field. Not sure if I've got it right :)
   @msokolov @bruno-roustant 
   




[GitHub] [lucene-solr] msokolov commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
msokolov commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-769042377


   @jaisonbi I'm curious what your plan is for surfacing this DocValuesConfig configuration in the API - are you thinking of adding it to the DocValuesFormat in place of the existing mode? Also, I think we do have a customization route via  PerFieldDocValuesFormat. It does require users to create a custom Codec, so perhaps a bit expert for most, but if we begin to add this kind of customization support as a convenience, it probably ought to go in FieldType?
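The per-field route mentioned above hinges on PerFieldDocValuesFormat.getDocValuesFormatForField(String). A library-free sketch of one way a custom codec could make that choice, here by a made-up field-name convention (the "_c" suffix and format names are illustrative only):

```java
import java.util.function.Function;

public class NamingConventionSketch {
  // Maps a field name to a format name; stands in for getDocValuesFormatForField(String).
  static Function<String, String> formatChooser(String compressedFormat, String defaultFormat) {
    // A field opts into terms dict compression by its name ending in "_c" (made-up rule).
    return field -> field.endsWith("_c") ? compressedFormat : defaultFormat;
  }

  public static void main(String[] args) {
    Function<String, String> choose = formatChooser("CompressedTermsDictDV", "Lucene80");
    System.out.println(choose.apply("keyword_c")); // CompressedTermsDictDV
    System.out.println(choose.apply("title"));     // Lucene80
  }
}
```

In the real API the chooser would return a DocValuesFormat instance from inside a custom Codec, which is the "a bit expert" part of this route.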




[GitHub] [lucene-solr] bruno-roustant commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-773199705


   @jaisonbi I didn't want to close this PR. I wanted to merge the first part related to LZ4 compression, and let you push a PR on another jira issue with the remaining commits you proposed.
   Do you want to reopen the PR?
   Or do you think the commits must not be separated?




[GitHub] [lucene-solr] zhaih commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
zhaih commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r568246420



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -736,49 +736,92 @@ private void doAddSortedField(FieldInfo field, DocValuesProducer valuesProducer)
   private void addTermsDict(SortedSetDocValues values) throws IOException {
     final long size = values.getValueCount();
     meta.writeVLong(size);
-    meta.writeInt(Lucene80DocValuesFormat.TERMS_DICT_BLOCK_SHIFT);
+    boolean compress =
+        Lucene80DocValuesFormat.Mode.BEST_COMPRESSION == mode

Review comment:
       Sorry for the late response. I agree we could solve it in a follow-up issue, and I could still test this via a customized PerFieldDocValuesFormat. Thank you!






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563708394



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {
     writeTermsIndex(values);
   }
 
+  private void addCompressedTermsDict(SortedSetDocValues values) throws IOException {

Review comment:
       I will try to optimize this method...thanks for the comment.






[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-769534164


   > I'm curious what your plan is for surfacing this DocValuesConfig configuration in the API - are you thinking of adding it to the DocValuesFormat in place of the existing mode?
   
   I just want to add different configurations, so a user can enable terms dict compression only (without enabling binary DocValues compression). Yes, I planned to add it to DocValuesFormat and remove Mode from DocValuesFormat.
   
   I will check the route via PerFieldDocValuesFormat. Thanks @msokolov 




[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564272513



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {

Review comment:
       Suppose the original termsDictBuffer length is 32 and the term length is 3; the new length after calling ArrayUtil.grow will be 40, so only 8 bytes are added to the original array.
   
   I'm just worried that 16 KB is too small for some long-value, high-cardinality fields, so it would call ArrayUtil.grow too many times. Keeping line 903 would make it faster. What do you think? 
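For a sense of scale in the discussion above: Lucene's ArrayUtil.grow sizes arrays via ArrayUtil.oversize, which adds roughly an extra 1/8 of the requested size. The helper below approximates that behavior (it is not the exact Lucene formula) to show that a single grow call on the 16 KB buffer already adds about 2 KB:

```java
public class GrowthSketch {
  // Approximation of ArrayUtil.oversize: grow ~12.5% past the requested minimum.
  static int oversizeApprox(int minTargetSize) {
    return minTargetSize + (minTargetSize >> 3);
  }

  public static void main(String[] args) {
    int original = 16 * 1024; // initial termsDictBuffer size
    int termLength = 112;
    // One grow call, without any extra "<< 1" doubling on top:
    System.out.println(oversizeApprox(original + termLength)); // 18558
  }
}
```

This matches the ~18560 figure quoted in the earlier review comment, which is why plain ArrayUtil.grow was judged sufficient.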






[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r564272513



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -791,6 +806,107 @@ private void addTermsDict(SortedSetDocValues values) throws IOException {

Review comment:
       Makes sense. Calling ArrayUtil.grow is enough.






[GitHub] [lucene-solr] jaisonbi commented on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-773189456


   > 1- We complete this PR for the scope of adding LZ4 compression to SortedSet/Sorted DocValues. We remove the latest commit about the configuration. This is the public scope of the Jira description which has been discussed.
   
   OK, I will revert the new commits and create another Jira issue for flexible configurations. @bruno-roustant 




[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#issuecomment-772993254


   Thanks for the detailed suggestion. @bruno-roustant 
   
   I will add a **new custom codec** to support compressed and uncompressed modes. I'd name it "Lucene90ExpertCodec"; with this codec, users will have more compression choices, like terms dict compression and binary DocValues compression.
   
   One doubt: PerFieldDocValuesFormat provides a single method that picks the format by field name,
   `public abstract DocValuesFormat getDocValuesFormatForField(String field);`
   so we would need the approach @bruno-roustant suggested:
   > The choice can be made either based on a config (e.g. file) which lists all compressed DocValue based fields, or based on a naming convention.
   
   With a config file we would have to maintain it, and the field name list may change. With the "naming convention" approach, we would have to impose rules on field names (please correct me if I am wrong). I am afraid it's not simple enough :)
   
   There are two options:
   1. Use a global switch in "Lucene90ExpertCodec", so terms dict compression and binary DocValues compression can be enabled easily.
   2. Build on PerFieldDocValuesFormat, but add one more method:
        `public abstract DocValuesFormat getDocValuesFormatForField(FieldInfo field);`
       We could add a new attribute in FieldInfo to identify the compression-enabled fields.




[GitHub] [lucene-solr] bruno-roustant commented on a change in pull request #2213: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on a change in pull request #2213:
URL: https://github.com/apache/lucene-solr/pull/2213#discussion_r563785114



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesProducer.java
##########
@@ -1144,6 +1157,7 @@ public TermsEnum termsEnum() throws IOException {
   }
 
   private static class TermsDict extends BaseTermsEnum {
+    static final int PADDING_LENGTH = 7;

Review comment:
       OK, in this case can we rename it to LZ4_DECOMPRESSOR_PADDING and add this comment about the decompression speed?
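For context on the proposed LZ4_DECOMPRESSOR_PADDING name: a common reason for a 7-byte pad is that fast LZ4 decompressors copy 8 bytes at a time, so the final wide copy may write up to 7 bytes past the logical end of the output. A sketch of the resulting buffer sizing (illustrative only; this is not the Lucene producer code):

```java
public class PaddingSketch {
  static final int LZ4_DECOMPRESSOR_PADDING = 7; // 8-byte copies may overshoot by up to 7 bytes

  // Size a decompression buffer so an 8-byte copy that starts at the last
  // logical byte still stays in bounds.
  static byte[] newDecompressionBuffer(int maxBlockLength) {
    return new byte[maxBlockLength + LZ4_DECOMPRESSOR_PADDING];
  }

  public static void main(String[] args) {
    System.out.println(newDecompressionBuffer(100).length); // 107
  }
}
```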



