Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/27 16:44:32 UTC

[GitHub] [lucene] jpountz opened a new pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

jpountz opened a new pull request #227:
URL: https://github.com/apache/lucene/pull/227


   This moves doc values to an approach that is more similar to postings, where
   values are grouped in blocks of 128 values that are compressed together.
   Decoding a single value requires decoding the entire block that contains the
   value.
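
A minimal sketch of the block idea described above, not the code from this PR: values are split into blocks of 128, each block stores its minimum plus bit-packed deltas (frame-of-reference), and reading one value decodes the whole block that contains it. The class and method names (`BlockForSketch`, `Block`, `Reader`, `encodeAll`) are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;

/** Hypothetical sketch of block-wise frame-of-reference encoding; not Lucene's implementation. */
final class BlockForSketch {
  static final int BLOCK_SIZE = 128; // block size mentioned in the PR description

  /** One encoded block: the minimum, the bits used per delta, and the packed deltas. */
  static final class Block {
    final long min;
    final int bitsPerValue;
    final int count;
    final long[] packed;

    Block(long min, int bitsPerValue, int count, long[] packed) {
      this.min = min;
      this.bitsPerValue = bitsPerValue;
      this.count = count;
      this.packed = packed;
    }
  }

  static Block encode(long[] values, int offset, int count) {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
    for (int i = 0; i < count; i++) {
      min = Math.min(min, values[offset + i]);
      max = Math.max(max, values[offset + i]);
    }
    // The range is treated as unsigned, so 0 to 64 bits per value are possible.
    int bpv = 64 - Long.numberOfLeadingZeros(max - min);
    long[] packed = new long[(count * bpv + 63) / 64];
    for (int i = 0; bpv > 0 && i < count; i++) {
      long delta = values[offset + i] - min;
      int bitPos = i * bpv, word = bitPos >>> 6, shift = bitPos & 63;
      packed[word] |= delta << shift;
      if (shift + bpv > 64) { // delta straddles two 64-bit words
        packed[word + 1] |= delta >>> (64 - shift);
      }
    }
    return new Block(min, bpv, count, packed);
  }

  /** Decodes every value of the block; there is no per-value random access. */
  static long[] decode(Block b) {
    long[] out = new long[b.count];
    if (b.bitsPerValue == 0) { // all values in the block are equal
      Arrays.fill(out, b.min);
      return out;
    }
    long mask = b.bitsPerValue == 64 ? -1L : (1L << b.bitsPerValue) - 1;
    for (int i = 0; i < b.count; i++) {
      int bitPos = i * b.bitsPerValue, word = bitPos >>> 6, shift = bitPos & 63;
      long delta = b.packed[word] >>> shift;
      if (shift + b.bitsPerValue > 64) {
        delta |= b.packed[word + 1] << (64 - shift);
      }
      out[i] = b.min + (delta & mask);
    }
    return out;
  }

  static Block[] encodeAll(long[] values) {
    Block[] blocks = new Block[(values.length + BLOCK_SIZE - 1) / BLOCK_SIZE];
    for (int b = 0; b < blocks.length; b++) {
      int offset = b * BLOCK_SIZE;
      blocks[b] = encode(values, offset, Math.min(BLOCK_SIZE, values.length - offset));
    }
    return blocks;
  }

  /** Reads single values by decoding the enclosing block and keeping it buffered. */
  static final class Reader {
    private final Block[] blocks;
    private int currentBlock = -1;
    private long[] buffer;

    Reader(Block[] blocks) {
      this.blocks = blocks;
    }

    long get(int index) {
      int block = index / BLOCK_SIZE;
      if (block != currentBlock) { // pay the full block decode once per block
        buffer = decode(blocks[block]);
        currentBlock = block;
      }
      return buffer[index % BLOCK_SIZE];
    }
  }
}
```

In this sketch, sequential access pays one decode per 128 values, while random access to scattered documents pays a full block decode for almost every lookup, which is the trade-off the description points at.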


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] weizijun commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Posted by GitBox <gi...@apache.org>.
weizijun commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-890784238


   @jpountz Could you consider using LZ4 or Zstd to compress the blocks directly? After index sorting by time series ID, compressing the blocks with LZ4 or Zstd might give a large compression ratio.




[GitHub] [lucene] gsmiller commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-888503779


   > and explicitly rejects numbers of bits per value > 32
   
   Ah right, of course this would be an issue here. Thanks for clarifying!




[GitHub] [lucene] jpountz commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-891574486


   I suspect that general-purpose compression algorithms like LZ4 or Zstd would not be good fits for this, but it could indeed be interesting to see if we can reuse ideas from these compression algorithms e.g. to be able to detect cycles in the data.
   
   For now I'm focusing on not making queries too much slower with this change so that it has a chance of making it to the default codec. I don't plan on adding more fancy compression schemes, which tend to make things slower. I'd rather look into things like that in a follow-up.




[GitHub] [lucene] gsmiller commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Posted by GitBox <gi...@apache.org>.
gsmiller commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-888465657


   This is really interesting/exciting!
   
   I'm working through this PR now but I notice you've used a slightly different approach to the FOR encoding (compared to what's done in the postings). Is this intentional for some reason, or is it more to get something out quickly for benchmarking (results were interesting by the way!)? Is there a reason you chose not to use the existing `ForUtil` directly?




[GitHub] [lucene] janhoy commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Posted by GitBox <gi...@apache.org>.
janhoy commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-932077876


   I see this JIRA is closed; please close this PR as well.




[GitHub] [lucene] jpountz commented on pull request #227: LUCENE-10033: Encode numeric doc values and ordinals of SORTED(_SET) doc values in blocks.

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #227:
URL: https://github.com/apache/lucene/pull/227#issuecomment-888479181


   Indeed I wanted to get something out quickly for benchmarking where I could easily play with different block sizes, while ForUtil is very rigid (hardcoded block size of 128 and explicitly rejects numbers of bits per value > 32).
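
To illustrate why a cap of 32 bits per value matters for doc values, here is a small sketch of how the bits needed for a block's deltas could be computed. `BitsRequiredSketch` and `bitsRequired` are hypothetical names, not part of `ForUtil` or of this PR.

```java
/** Hypothetical helper; not part of Lucene. */
final class BitsRequiredSketch {
  /** Bits needed to store any value in [min, max] as an unsigned delta from min. */
  static int bitsRequired(long min, long max) {
    return 64 - Long.numberOfLeadingZeros(max - min);
  }
}
```

For example, millisecond timestamps spanning 60 days cover a range of 5,184,000,000, which is above 2^32, so a block with that spread would need 33 bits per value and could not be handled by an encoder limited to 32.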

