You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Hendrik Muhs (Jira)" <ji...@apache.org> on 2021/11/22 08:47:00 UTC
[jira] [Commented] (LUCENE-10247) Reduce FST size by using absolute and relative coding for target pointers

    [ https://issues.apache.org/jira/browse/LUCENE-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447253#comment-17447253 ] 

Hendrik Muhs commented on LUCENE-10247:
---------------------------------------

POC: https://github.com/apache/lucene/pull/460 

> Reduce FST size by using absolute and relative coding for target pointers
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-10247
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10247
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Hendrik Muhs
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> FST's use various tricks to reduce size. One more trick that can be added is using relative coding for the target pointers that connect nodes.
> Currently if a node has a target and it's not optimized using the `next` flag, it writes a VLong which contains the address of the child. This is an absolute address. A relative address can be shorter, relative address means taking the address of the node and expressing the result of child node as diff between the parent and the child. E.g. if the child is _10099_ and the parent {_}10000{_}, an absolute pointer requires 2 bytes, but the relative pointer ({_}10099 - 10000{_}) requires only 1 byte. Of course that's not always true, the absolute pointer could be smaller. Therefore the idea is to switch between the two, dependent on what's smaller.
> (FWIW: From a theory standpoint this works nicely, because the FST is constructed from the leafs to the root, child node addresses are always smaller than parent ones and due to good locality, connected nodes are usually stored close to each other )
> I have a prototype for this, it will be pushed as a draft PR and added as link here. It introduces a new bit in _flags_ and changes the compiler to use relative coding if possible. The lookup side needs the current node address and checks one more bit flag. Both additions seem reasonable cheap and do not add a lot of runtime.
> In my measurements I was able to reduce the size of a 4.2 million key dictionary from 92MB ot 87MB, so saved 5MB, which is roughly 5.5%. If you have better reference datasets, let me know. It obviously works better the more data.
> The POC implementation only supports 1 output type (positive int output) and I just made it compile. For this 1 type of dictionary it's possible to create and dump dictionaries correctly.
> Before spending more time on this I like to hear feedback and get agreement on whether this is something you like to have. It seems to me that a size reduction of FST's is always welcome and as pointed out, the runtime additions are reasonable small. If you wonder about the implementation, I happily discuss the nitty gritty details on gh, it had some interesting challenges.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org