You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Bruno Roustant (Jira)" <ji...@apache.org> on 2019/11/08 17:36:00 UTC
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

    [ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970444#comment-16970444 ] 

Bruno Roustant edited comment on LUCENE-8920 at 11/8/19 5:35 PM:
-----------------------------------------------------------------

It works. I removed the labels for direct-addressing, except the first label of the arc.
 It makes direct-addressing encoding even more compact. Now we have 48% of the nodes with fixed length arcs encoded with direct-encoding by using only +12% memory (instead of +23% with my previous patch).

I also did several things:
 The presence bits table is now a reused structure (does not create long[] anymore each time we read an arc).
 I touched more code, but I think the code is now a bit clearer with more comments.
 I moved the bit twiddling methods to BitUtil.

Thanks reviewers!

 

I noticed the direct-addressing limitation: it is less performant than binary search for reverse lookup (lookup by output).

Is there a periodic automated monitoring of the performance / memory of commits? Could you give me the link? I would like to watch the change with this PR.

 

 I definitely would like to clean more FST. But that will be the subject of another Jira issue later. Too much duplicated code that makes reasoning and modifications hard. Also missing docs and method contracts (e.g. expected position in the byte input).


was (Author: bruno.roustant):
It works. I removed the labels for direct-addressing, except the first label of the arc.
It makes direct-addressing encoding even more compact. Now we have 48% of the nodes with fixed length arcs encoded with direct-encoding by using only +12% memory (instead of +23% with my previous patch).

I also did several things:
The presence bits table is now a reused structure (does not create long[] anymore each time we read an arc).
I touched more code, but I think the code is now a bit clearer with more comments.
I moved the bit twiddling methods to BitUtil.

Thanks reviewers!

 

I noticed the direct-addressing limitation: it is less performant than binary search for reverse lookup (lookup by output).

Is there an periodic automated monitoring of the performance / memory of commits? Could you give me the link? I would like to watch the change with this PR.

 

 I definitely would like to clean more FST. But that will be the subject of another Jira issue later. Too much duplicated code that makes reasoning and modifications hard. Also missing docs and method contracts (e.g. expected position in the byte input).

> Reduce size of FSTs due to use of direct-addressing encoding 
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Minor
>         Attachments: TestTermsDictRamBytesUsed.java
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve thesituation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which make gaps very costly. Associating each label with a dense id and having an intermediate lookup, ie. lookup label -> id and then id->arc offset instead of doing label->arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org