You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Greg Miller (Jira)" <ji...@apache.org> on 2021/03/26 04:24:00 UTC

[jira] [Updated] (LUCENE-9877) Explore increasing the allowable exceptions in PForUtil

     [ https://issues.apache.org/jira/browse/LUCENE-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Miller updated LUCENE-9877:
--------------------------------
    Description: 
Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.

PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a two full bytes, using a maximum of 6 bytes per block.

It seems to me like 7 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 7 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. So with 14 bytes used to encode the exception values (7 x 2 bytes per exception), we would save a two bytes in total (just slightly better than breaking even). If we need fewer than the 7 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.

Also note that 7 exceptions is the max we can encode with the 3 bits we currently have available for the number of exceptions. So moving to 8 exceptions would not only take 16 bytes to encode the exceptions (if using all of them), but we'd need one more byte per block to encode the exception count. So in the worst case of using all 8 exceptions to save 1 bit per value, we'd actually be worse off.

I'll post some results here for discussion or at least for public record of my work for future reference.

  was:
Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.

PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a full byte, using a maximum of 3 bytes per block.

It seems to me like 14 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 14 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. To encode the exception count, we'd need to use an extra byte per block since we need more than the current 3 bits. So with that extra byte in addition to the 14 bytes used to encode the exception values, we would save a single byte in total (just slightly better than breaking even). If we need fewer than the 14 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings.

As a first experiment, I think testing 7 exceptions might be a good middle-ground. Seven exceptions is the most we can encode with the 3 bits we have allocated today. So for 7 exceptions, we don't need the extra byte for encoding the exception count, and in the worst case of needing all 7 to save 1 bit-per-value, we'd save a total of 9 bytes in a block. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.

I'll post some results here for discussion or at least for public record of my work for future reference.


> Explore increasing the allowable exceptions in PForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-9877
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9877
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: Greg Miller
>            Priority: Minor
>
> Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.
> PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a two full bytes, using a maximum of 6 bytes per block.
> It seems to me like 7 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 7 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. So with 14 bytes used to encode the exception values (7 x 2 bytes per exception), we would save a two bytes in total (just slightly better than breaking even). If we need fewer than the 7 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.
> Also note that 7 exceptions is the max we can encode with the 3 bits we currently have available for the number of exceptions. So moving to 8 exceptions would not only take 16 bytes to encode the exceptions (if using all of them), but we'd need one more byte per block to encode the exception count. So in the worst case of using all 8 exceptions to save 1 bit per value, we'd actually be worse off.
> I'll post some results here for discussion or at least for public record of my work for future reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org