You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Bruno Roustant (Jira)" <ji...@apache.org> on 2019/09/16 14:14:00 UTC

[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

    [ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930601#comment-16930601 ] 

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

I tried a quick benchmark to evaluate the load factor (num labels / array size) we can expect for open-addressing. This gives us a good estimate of the space requirement.

This load factor depends on the number of tries (called L above) we accept - this defines the worst case perf. For each power-of-two array size, the benchmark tries various L, and outputs the max load factor (above this load factor, the open-addressing aborts and we fallback to binary search) (benchmark print in next post below).

Outcome:
 * Load factor slightly above 50% for good perf (for L = log(N)/2 + 1), as expected for open-addressing in literature.
 * It is also possible to use a power-of-two x 1.5 array size with efficient hash (to stay at load factor 50% in all cases).
 * We have to encode the “empty slot” (no label in a given array slot). Probably with both a 0 label and 0 value (if the node needs that, then abort and fallback to binary search).

This means we have to expect a space increase of 2x (compared to binary search), for better perf than binary search (L = log(N)/2 + 1, which is the worst case, most of the time the open-addressing stops when encountering an empty slot before L).

 

To me this balance between space and performance cannot be hardcoded. This depends on the use-case. There should be a balance tuning parameter in the FST constructor (e.g. max-perf, perf-over-space, space-over-perf, min-space). And based on this balance we could set the values of a couple of thresholds that define when to use direct-addressing, open-addressing, binary-search, compact-list.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which make gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label -> id and then id->arc offset instead of doing label->arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org