You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/04 13:44:40 UTC

[GitHub] [lucene-solr] jaisonbi opened a new pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

jaisonbi opened a new pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302


   # Description
   
   Elasticsearch keyword field uses SortedSet DocValues. In our applications, “keyword” is the most frequently used field type.
   LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do better by adding LZ4 compression to current prefix-compression. In one of our application, the dvd files were ~41% smaller with this change(from 1.95 GB to 1.15 GB).
   
   This improvement is only for the high-cardinality fields.
   
   # Tests
   
   See Jira [LUCENE-9663](https://issues.apache.org/jira/browse/LUCENE-9663) for details.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-773326313


   @bruno-roustant  I checked all the review comments from #2213, and ensure all the changes included in this PR.  The main items:
   1. Rely ArrayUtil.grow to grow the buffer.
   2. Remove method Lucene80DocValuesConsumer#addCompressedTermsDict, added some compression-conditinal code in Lucene80DocValuesConsumer#addTermsDict. Try to share the same lines across different mode.
   3. Remove "switch" from Lucene80DocValuesConsumer#addSortedSetField and Lucene80DocValuesConsumer#doAddSortedField.
   4. Rename PADDING_LENGTH(from Lucene80DocValuesProducer$TermsDict) to LZ4_DECOMPRESSOR_PADDING, and add comment.
   
   Sorry for the inconvenience:-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#discussion_r570467210



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -43,11 +43,7 @@
 import org.apache.lucene.index.TermsEnum;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.SortedSetSelector;
-import org.apache.lucene.store.ByteBuffersDataOutput;
-import org.apache.lucene.store.ByteBuffersIndexOutput;
-import org.apache.lucene.store.ChecksumIndexInput;
-import org.apache.lucene.store.IOContext;
-import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.*;

Review comment:
       hmmm did this pass spotless check? I don't think we typically use wildcard imports




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-773733335


   > It looks to me as if when compression is not enabled, the index format is unchanged, so this should be backwards-compatible. Is that correct? 
   
   Correct.
   
   And if terms-dict compression is enabled, it is only for high-cardinality fields...so normally, some fields kept un-compressed even if terms-dict compression is enabled...The compression state is stored in *dvm file per field, so during reading, we know which fields need to be decompressed..
   
   >  I think we should have a test that proves that, but perhaps there is already a back-compat test that covers this case? 
   
   I already added one test to prove that merge works fine if both old segments(before this improvement) and the new segments involved...Please see TestDocValuesCompression#testMergeWithUncompressedSegment...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] howardhuanghua commented on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
howardhuanghua commented on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-881204743


   Merged in https://issues.apache.org/jira/browse/LUCENE-9663.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#discussion_r570467210



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -43,11 +43,7 @@
 import org.apache.lucene.index.TermsEnum;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.SortedSetSelector;
-import org.apache.lucene.store.ByteBuffersDataOutput;
-import org.apache.lucene.store.ByteBuffersIndexOutput;
-import org.apache.lucene.store.ChecksumIndexInput;
-import org.apache.lucene.store.IOContext;
-import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.*;

Review comment:
       hmmm did this pass spotless check? I don't think we typically use wildcard imports




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#discussion_r570670271



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -43,11 +43,7 @@
 import org.apache.lucene.index.TermsEnum;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.SortedSetSelector;
-import org.apache.lucene.store.ByteBuffersDataOutput;
-import org.apache.lucene.store.ByteBuffersIndexOutput;
-import org.apache.lucene.store.ChecksumIndexInput;
-import org.apache.lucene.store.IOContext;
-import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.*;

Review comment:
       Yes, the check is okay...I will revert the changes here if we should not use wildcard imports.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-773326313






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on a change in pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on a change in pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#discussion_r570670271



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene80/Lucene80DocValuesConsumer.java
##########
@@ -43,11 +43,7 @@
 import org.apache.lucene.index.TermsEnum;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.search.SortedSetSelector;
-import org.apache.lucene.store.ByteBuffersDataOutput;
-import org.apache.lucene.store.ByteBuffersIndexOutput;
-import org.apache.lucene.store.ChecksumIndexInput;
-import org.apache.lucene.store.IOContext;
-import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.*;

Review comment:
       Yes, the check is okay...I will revert the changes here if we should not use wildcard imports.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-773733335


   >  I think we should have a test that proves that, but perhaps there is already a back-compat test that covers this case? 
   
   I already added one test to prove that merge works fine if both old segments(before this improvement) and the new segments involved...Please see TestDocValuesCompression#testMergeWithUncompressedSegment...
   
   And if terms-dict compression is enabled, it is only for high-cardinality fields...so normally, some fields kept un-compressed even if terms-dict compression is enabled...The compression state is stored in *dvm file per field, so during reading, we know which fields need to be decompressed..
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi commented on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi commented on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-774575260


   @bruno-roustant  I got it, since TestDocValuesCompression#testMergeWithUncompressedSegment() uses Lucene90Codec to write the uncompressed segment.  Thanks. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant commented on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
bruno-roustant commented on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-773991718


   @msokolov I confirm it is backward compatible, not thanks to TestDocValuesCompression#testMergeWithUncompressedSegment() but thanks to TestBackwardsCompatibility.testSearchOldIndex(). Indeed TestDocValuesCompression#testMergeWithUncompressedSegment writes and reads compressed/uncompressed with the new code. So it would not detect an incompatibility if, say 1 additional byte is written and read. However TestBackwardsCompatibility.testSearchOldIndex() reads and searches saved old indexes, including all DocValues types, so it would detect this 1 additional/missing byte incompatibility.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] jaisonbi edited a comment on pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
jaisonbi edited a comment on pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302#issuecomment-773733335


   > It looks to me as if when compression is not enabled, the index format is unchanged, so this should be backwards-compatible. Is that correct? 
   
   Correct.
   
   And if terms-dict compression is enabled, it is only for high-cardinality fields...so normally, some fields kept un-compressed even if terms-dict compression is enabled...The compression state is stored in *dvm file per field, so during reading, we know which fields need to be decompressed..
   
   >  I think we should have a test that proves that, but perhaps there is already a back-compat test that covers this case? 
   
   I already added one test to prove that merge works fine if both old segments(before this improvement) and the new segments involved...Please see TestDocValuesCompression#testMergeWithUncompressedSegment...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] bruno-roustant closed pull request #2302: LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted D…

Posted by GitBox <gi...@apache.org>.
bruno-roustant closed pull request #2302:
URL: https://github.com/apache/lucene-solr/pull/2302


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org