Posted to issues@hbase.apache.org by "Duo Zhang (JIRA)" <ji...@apache.org> on 2016/02/27 08:36:18 UTC

[jira] [Commented] (HBASE-15352) FST BlockEncoder

    [ https://issues.apache.org/jira/browse/HBASE-15352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170424#comment-15170424 ] 

Duo Zhang commented on HBASE-15352:
-----------------------------------

I have thought about this before, but there is an important difference between Lucene's index and our index.

For Lucene, the index only needs to report hit or miss, and on a hit, return the offset. The algorithm is straightforward on a sorted list.
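
To make the contrast concrete, here is a minimal sketch of that exact-match lookup (hypothetical class and names, just for illustration, not Lucene's actual code):

import java.util.Arrays;

// Minimal sketch (hypothetical, not Lucene code): exact-match lookup on a
// sorted key array; a miss is simply rejected by returning -1.
public class ExactIndex {
    private final String[] keys;   // sorted ascending
    private final long[] offsets;  // offsets[i] is the offset for keys[i]

    public ExactIndex(String[] keys, long[] offsets) {
        this.keys = keys;
        this.offsets = offsets;
    }

    // Returns the offset for an exact match, or -1 if the key is not indexed.
    public long lookup(String key) {
        int pos = Arrays.binarySearch(keys, key);
        return pos >= 0 ? offsets[pos] : -1L;
    }
}

With keys {"aa", "az"} and offsets {10, 20}, lookup("aa") returns 10 and lookup("ab") is rejected with -1.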

But our index does not reject any input; that means for every input we need to return an offset.

For example, consider two pairs:

aa -> 10
az -> 20

For Lucene, input 'aa' will return 10 and input 'az' will return 20; inputs such as 'ab', 'ac' or 'aaz' will simply be rejected.

But for us, besides 'aa' and 'az', we need to return 10 for every input that falls between 'aa' and 'az'. So we need to modify the algorithm, otherwise we would end up with an infinite number of states, and how to compress those states is the key issue here.
And I'm not even sure whether this can be represented by an FSA... maybe it needs a PDA? If so, that's a completely different problem...
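
For comparison, the lookup semantics we need is a floor lookup rather than an exact match; a minimal sketch (hypothetical class, just to show the semantics, not a proposed implementation):

import java.util.Map;
import java.util.TreeMap;

// Minimal sketch (hypothetical, not the real block index): return the offset
// of the last indexed key that is <= the probe key, i.e. a floor lookup.
public class FloorIndex {
    private final TreeMap<String, Long> entries = new TreeMap<>();

    public void add(String firstKey, long offset) {
        entries.put(firstKey, offset);
    }

    // Returns the offset for the greatest indexed key <= the given key,
    // or -1 if the key sorts before every indexed key.
    public long seek(String key) {
        Map.Entry<String, Long> e = entries.floorEntry(key);
        return e == null ? -1L : e.getValue();
    }
}

With 'aa' -> 10 and 'az' -> 20, seek("ab"), seek("ac") and seek("aaz") all return 10, which is exactly the behavior that does not map directly onto an exact-match lookup.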

Thanks.

> FST BlockEncoder
> ----------------
>
>                 Key: HBASE-15352
>                 URL: https://issues.apache.org/jira/browse/HBASE-15352
>             Project: HBase
>          Issue Type: New Feature
>          Components: regionserver
>            Reporter: Nick Dimiduk
>             Fix For: 2.0.0, 0.98.19, 1.4.0
>
>
> We could improve on the existing [PREFIX_TREE block|http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/codec/prefixtree/package-summary.html] encoder by upgrading the persistent data structure from a trie to a finite state transducer. This would theoretically allow us to reuse bytes not just for rowkey prefixes, but for infixes and suffixes as well. My read of the literature suggests we may also be able to encode values, further reducing storage size when values are repeated (i.e., a "customer id" field with very low cardinality -- probably happens a lot in our denormalized world). There's a really nice [blog post|http://blog.burntsushi.net/transducers/] about this data structure, and apparently our siblings in Lucene make heavy use of [their implementation|http://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/fst/package-summary.html#package_description].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)