Posted to dev@lucene.apache.org by "Mike Sokolov (JIRA)" <ji...@apache.org> on 2019/07/17 01:28:00 UTC

[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

    [ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886603#comment-16886603 ] 

Mike Sokolov edited comment on LUCENE-8920 at 7/17/19 1:27 AM:
---------------------------------------------------------------

Yes, that makes sense. Because we reverted the "current version" in FST.java, we can no longer read FSTs created with the newer version, so we need to revert the dictionary file. I'll do that and run a full suite of tests just to make sure nothing else is still broken. Thanks for pointing this out, [~hossman], and for finding the fix, [~tomoko], and sorry for not being more careful with the "fix" the first time!
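
To illustrate why the revert makes the bundled dictionary unreadable, here is a minimal sketch of the kind of version gate involved (this is not the actual FST.java code; the constant names and values are illustrative): the reader only accepts file versions up to its own VERSION_CURRENT, so a file written with the bumped version is rejected once the constant is rolled back.

import java.io.DataInput;
import java.io.IOException;

final class FstVersionCheck {

  static final int VERSION_START = 6;    // oldest on-disk version still readable (hypothetical value)
  static final int VERSION_CURRENT = 6;  // reverted; the new encoding would have bumped this to 7

  /** Reads the stored version and rejects anything outside [VERSION_START, VERSION_CURRENT]. */
  static int checkVersion(DataInput in) throws IOException {
    int actual = in.readInt();
    if (actual < VERSION_START || actual > VERSION_CURRENT) {
      // A dictionary FST written with version 7 trips this check once the
      // reader has been reverted to 6, hence the need to revert the file too.
      throw new IOException("unsupported FST version: " + actual
          + " (must be between " + VERSION_START + " and " + VERSION_CURRENT + ")");
    }
    return actual;
  }
}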


was (Author: sokolov):
Yes, that makes sense. Because we reverted the "current version" in FST.java, we can no longer read FSTs created with the newer version, so we need to revert the dictionary file. I'll do that and run a full suite of tests just to make sure nothing else is still broken.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps doing a preliminary pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance, I noticed that arc metadata is pretty large in some cases (in the 10-20 byte range), which makes gaps very costly. Associating each label with a dense id and having an intermediate lookup, i.e. label -> id and then id -> arc offset instead of label -> arc directly, could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?
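
The first suggestion above (deciding per FST whether the direct-addressing layout is worth its gaps) can be made concrete with a rough sketch. This is not Lucene's actual builder code; the class, method, field names, and the 1.66 budget are all illustrative, and the real implementation may account for arc sizes differently.

final class DirectAddressingPolicy {

  private static final double MAX_OVERSIZE_FACTOR = 1.66; // hypothetical per-FST expansion budget

  private long bytesUsed = 0;        // bytes committed so far for this FST's arcs
  private long bytesIfListOnly = 0;  // bytes the plain arc-list encoding would have used

  /**
   * Called for each node while building. numArcs is the node's arc count,
   * labelRange is (maxLabel - minLabel + 1), bytesPerArc is the widest arc's size.
   */
  boolean shouldUseDirectAddressing(int numArcs, int labelRange, int bytesPerArc) {
    long direct = (long) labelRange * bytesPerArc; // one slot per label, gaps included
    long packed = (long) numArcs * bytesPerArc;    // only the arcs that actually exist

    // Keep using direct addressing only while the cumulative size stays within budget.
    boolean useDirect =
        (bytesUsed + direct) <= (bytesIfListOnly + packed) * MAX_OVERSIZE_FACTOR;

    bytesUsed += useDirect ? direct : packed;
    bytesIfListOnly += packed;
    return useDirect; // direct addressing buys O(1) label -> arc lookup at the cost of empty slots
  }
}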



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
