Posted to issues@lucene.apache.org by "Michael Sokolov (Jira)" <ji...@apache.org> on 2019/10/11 17:14:00 UTC

[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

    [ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949647#comment-16949647 ] 

Michael Sokolov edited comment on LUCENE-8920 at 10/11/19 5:13 PM:
-------------------------------------------------------------------

For posterity, this is the worst-case test that spreads out terms:

for (int i = 0; i < 1000000; ++i) {
  byte[] b = new byte[5];
  random().nextBytes(b);
  for (int j = 0; j < b.length; ++j) {
    b[j] &= 0xfc; // make this byte a multiple of 4
  }
  entries.add(new BytesRef(b));
}

buildFST(entries).ramBytesUsed();
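
As written, the snippet depends on the surrounding test class for entries, random() and buildFST(). Below is a minimal, self-contained sketch of the same experiment, assuming the Lucene 8.x FST Builder API; the buildFST helper here is a stand-in that builds an FST with no outputs and may differ from the test's actual helper.

import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.NoOutputs;
import org.apache.lucene.util.fst.Outputs;
import org.apache.lucene.util.fst.Util;

public class WorstCaseFstSize {

  public static void main(String[] args) throws Exception {
    Random random = new Random(42);
    // FST inputs must be added in sorted order; a TreeSet keeps them sorted.
    Set<BytesRef> entries = new TreeSet<>();
    for (int i = 0; i < 1000000; ++i) {
      byte[] b = new byte[5];
      random.nextBytes(b);
      for (int j = 0; j < b.length; ++j) {
        b[j] &= 0xfc; // make each byte a multiple of 4, spreading out the arc labels
      }
      entries.add(new BytesRef(b));
    }
    System.out.println("FST RAM bytes used: " + buildFST(entries).ramBytesUsed());
  }

  // Stand-in for the test's buildFST helper: builds an FST over the terms with no outputs.
  static FST<Object> buildFST(Set<BytesRef> entries) throws Exception {
    Outputs<Object> outputs = NoOutputs.getSingleton();
    Builder<Object> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();
    for (BytesRef term : entries) {
      builder.add(Util.toIntsRef(term, scratch), outputs.getNoOutput());
    }
    return builder.finish();
  }
}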



> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance, I noticed that arc metadata is pretty large in some cases (in the 10-20 byte range), which makes gaps very costly. Associating each label with a dense id and having an intermediate lookup, i.e. lookup label -> id and then id -> arc offset instead of doing label -> arc directly, could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?
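
The first suggestion quoted above, tracking the size increase while building in order to decide whether to apply direct addressing, could take the form of a simple per-node credit/budget check. The sketch below is illustrative only: the class, field and parameter names are hypothetical and this is not Lucene's actual implementation.

// Hypothetical sketch of the "track the size increase while building" idea:
// bank the bytes saved when direct addressing is no larger than the packed
// encoding, and spend that credit (plus a configurable oversizing allowance)
// when direct addressing would expand a node.
final class DirectAddressingBudget {
  private final float maxOversizingFactor; // e.g. 1.0f allows up to 1x the packed size as overhead
  private long credit; // bytes of expansion we can still afford

  DirectAddressingBudget(float maxOversizingFactor) {
    this.maxOversizingFactor = maxOversizingFactor;
  }

  // packedBytes: size of the node's arcs with the dense packed-array encoding
  // directBytes: size of the node's arcs with direct addressing (label gaps included)
  boolean useDirectAddressing(long packedBytes, long directBytes) {
    long expansion = directBytes - packedBytes;
    if (expansion <= 0) {
      credit -= expansion; // direct addressing is no larger here: bank the savings
      return true;
    }
    long allowance = credit + (long) (maxOversizingFactor * packedBytes);
    if (expansion <= allowance) {
      credit = Math.max(0, credit - expansion);
      return true;
    }
    return false;
  }
}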



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
