Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/04/26 13:19:53 UTC

[GitHub] [lucene] jpountz opened a new pull request, #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

jpountz opened a new pull request, #838:
URL: https://github.com/apache/lucene/pull/838

   Doc values terms dictionaries keep the first term of each block uncompressed so
   that they can somewhat efficiently perform binary searches across blocks.
   Suffixes of the other 63 terms are compressed together using LZ4 to leverage
   redundancy across suffixes. This change improves compression a bit by using the
   first (uncompressed) term of each block as a dictionary when compressing
   suffixes of the 63 other terms. This helps with compressing the first few
   suffixes when there's not much context yet that can be leveraged to find
   duplicates.
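
   The effect described above can be sketched outside of Lucene. The snippet
   below is only an illustration under stated assumptions: it swaps LZ4 for
   raw deflate from Python's standard zlib module (which also supports a
   preset dictionary), and the block contents (first_term, suffixes) and the
   helper names deflate_raw / inflate_raw are made up for the example. It is
   not the code from this pull request.

```python
import zlib


def deflate_raw(data: bytes, dictionary: bytes = b"") -> bytes:
    """Raw-deflate `data`, optionally priming the window with `dictionary`."""
    if dictionary:
        comp = zlib.compressobj(level=9, wbits=-15, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9, wbits=-15)
    return comp.compress(data) + comp.flush()


def inflate_raw(data: bytes, dictionary: bytes = b"") -> bytes:
    """Inverse of deflate_raw; the same dictionary must be supplied."""
    if dictionary:
        decomp = zlib.decompressobj(wbits=-15, zdict=dictionary)
    else:
        decomp = zlib.decompressobj(wbits=-15)
    return decomp.decompress(data) + decomp.flush()


# Hypothetical block: the first term is stored uncompressed, the
# suffixes of the remaining terms are compressed together.
first_term = b"https://lucene.apache.org/core/9_1_0/index.html"
suffixes = b"".join([b"core/index.html", b"demo/index.html", b"queries/index.html"])

without_dict = deflate_raw(suffixes)
with_dict = deflate_raw(suffixes, first_term)

# The first suffix can already be encoded as back-references into the
# first term, instead of as literals.
print(len(suffixes), len(without_dict), len(with_dict))
```

   The gain comes entirely from the first few suffixes: once enough suffixes
   have been written, later ones find their duplicates in earlier suffixes
   whether or not a dictionary is present.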
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz merged pull request #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

Posted by GitBox <gi...@apache.org>.
jpountz merged PR #838:
URL: https://github.com/apache/lucene/pull/838


[GitHub] [lucene] jpountz commented on pull request #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #838:
URL: https://github.com/apache/lucene/pull/838#issuecomment-1109956838

   > Is it just the case for old indexes that, because we indexed with an initially empty dictionary, anything we put in the dictionary will be effectively "ignored" (e.g. basically pushed out eventually by new stuff that then gets references)?
   
   This is correct.


[GitHub] [lucene] rmuir commented on pull request #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #838:
URL: https://github.com/apache/lucene/pull/838#issuecomment-1109826778

   Nice idea, but is it really backwards-compatible? Shouldn't we force a new version (preferred) or at least bump an internal constant?


[GitHub] [lucene] jpountz commented on pull request #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #838:
URL: https://github.com/apache/lucene/pull/838#issuecomment-1124690454

   I used the new OrdinalMapBenchmark in luceneutil to check if this change would make things any slower, and as far as I can tell there wasn't any difference.


[GitHub] [lucene] rmuir commented on pull request #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #838:
URL: https://github.com/apache/lucene/pull/838#issuecomment-1109947215

   I'm just asking questions because I have only worked with the deflate dictionaries, not the LZ4 ones. Is it just the case for old indexes that, because we indexed with an initially empty dictionary, anything we put in the dictionary will be effectively "ignored" (e.g. basically pushed out eventually by new stuff that then gets references)? Sorry for the paranoia; no, I'm not asking for unnecessary back compat, I just want to be sure.


[GitHub] [lucene] jpountz commented on pull request #838: LUCENE-10536: Slightly better compression of doc values' terms dictionaries.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #838:
URL: https://github.com/apache/lucene/pull/838#issuecomment-1109906929

   It's backward compatible: old indices will just never have references to content that lives in the compression dictionary.
   
   I'm happy to still create a new Lucene92Codec+Lucene92DocValuesFormat if you like that approach better.
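
   The backward-compatibility argument can be checked with a small experiment.
   This is a sketch under assumptions, not Lucene code: raw deflate from
   Python's zlib module stands in for LZ4, and the block contents and the
   helper name decode_with_dictionary are hypothetical. An "old" block is
   written without a dictionary, then decoded by a "new" reader that always
   primes its window with the first term.

```python
import zlib


def decode_with_dictionary(data: bytes, dictionary: bytes) -> bytes:
    """'New' reader: always primes the window with the block's first term."""
    decomp = zlib.decompressobj(wbits=-15, zdict=dictionary)
    return decomp.decompress(data) + decomp.flush()


# Hypothetical block contents.
first_term = b"https://lucene.apache.org/core/9_1_0/index.html"
suffixes = b"core/index.html" b"demo/index.html" b"queries/index.html"

# 'Old' writer: compresses the suffixes with no dictionary at all.
old_writer = zlib.compressobj(level=9, wbits=-15)
old_block = old_writer.compress(suffixes) + old_writer.flush()

# Back-references in old_block only ever point into the block's own
# output, so the extra dictionary bytes are simply never addressed.
decoded = decode_with_dictionary(old_block, first_term)
print(decoded == suffixes)
```

   The reverse direction does not hold in general: a block written with a
   dictionary may reference dictionary bytes, so a reader that does not
   supply the same dictionary cannot be expected to decode it.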

