You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Greg Miller (Jira)" <ji...@apache.org> on 2021/03/25 22:33:00 UTC

[jira] [Created] (LUCENE-9877) Explore increasing the allowable exceptions in PForUtil

Greg Miller created LUCENE-9877:
-----------------------------------

             Summary: Explore increasing the allowable exceptions in PForUtil
                 Key: LUCENE-9877
                 URL: https://issues.apache.org/jira/browse/LUCENE-9877
             Project: Lucene - Core
          Issue Type: Task
          Components: core/codecs
    Affects Versions: main (9.0)
            Reporter: Greg Miller


Piggybacking a little off of the investigation I was doing over in LUCENE-9850 I thought it might also be worth-while exploring the impact of increasing the number of allowable exceptions in PForUtil. The aim of this investigation is to see if we could reduce index size by allowing for more exceptions without significant negative impact to performance.

PForUtil currently allows for up to 3 exceptions, and it only uses 3 bits to encode the number of exceptions (using the remaining 3 bits of the byte used to also encode the number of bits-per-value, which requires 5 bits). Each exception used is encoded with a full byte, using a maximum of 3 bytes per block.

It seems to me like 14 might be a more ideal number of exceptions if index size is the driving motivation. My thought process is that, in the worst-case, 14 exceptions would be used to save only a single bit-per-value in the corresponding block. With 128 entries per block, this would save 16 bytes. To encode the exception count, we'd need to use an extra byte per block since we need more than the current 3 bits. So with that extra byte in addition to the 14 bytes used to encode the exception values, we would save a single byte in total (just slightly better than breaking even). If we need fewer than the 14 exceptions, or if we're able to save more than 1 bit-per-value, it's all additional savings.

As a first experiment, I think testing 7 exceptions might be a good middle-ground. Seven exceptions is the most we can encode with the 3 bits we have allocated today. So for 7 exceptions, we don't need the extra byte for encoding the exception count, and in the worst case of needing all 7 to save 1 bit-per-value, we'd save a total of 9 bytes in a block. I suppose the question is what kind of performance hit we might observe due to decoding more exceptions.

I'll post some results here for discussion or at least for public record of my work for future reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org