You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2012/06/01 00:27:22 UTC

[jira] [Commented] (HBASE-6014) Support for block-granularity bitmap indexes

    [ https://issues.apache.org/jira/browse/HBASE-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287006#comment-13287006 ] 

Todd Lipcon commented on HBASE-6014:
------------------------------------

I had some airplane time last week so I spent some of it thinking about how to go about this. It's not quite as straight-forward in HBase as it is in your standard RDBMS -- at least on the read path:

This is because a given row may actually span multiple "layers" of the stack of storefiles. For example, for a query "col_a = 1 AND col_b = 1", we may have the following files in a region:

HFile 1: row1: col_a="1", no value for col_b
HFile 2: row1: col_a="2"  col_b="1"
(so the "current" version of the row has col_a=1 and col_b=1

So, if we naively apply the bitmap index to each HFile in turn, we would end up excluding the block in both, and not see the correct result.

Instead, I think we have to apply each predicate to each HFile in turn to come up with a set of ranges:

Predicate col_a = 1, HFile 1: output key range for block containing row1
Predicate col_a = 1, HFile 2: no output

Predicate col_b = 1, HFile 1: no output
Predicate col_b = 1, Hfile 2: output key range for block containing row1

For each predicate, we take the union of the key ranges across HFiles. Then, we take the intersection across predicates, to come up with a total set of applicable key ranges. Then, we can push this key range list down into the scanner stack to provide skip-ahead hints in the filter.

Any interns out there interested in this? :)
                
> Support for block-granularity bitmap indexes
> --------------------------------------------
>
>                 Key: HBASE-6014
>                 URL: https://issues.apache.org/jira/browse/HBASE-6014
>             Project: HBase
>          Issue Type: New Feature
>          Components: regionserver
>            Reporter: Todd Lipcon
>
> This came up in a discussion with Kannan today, so I promised to write something brief on JIRA -- this was suggested as a potential summer intern project. The idea is as follows:
> We have several customers who periodically run full table scan MR jobs against large HBase tables while applying fairly restrictive predicates. The predicates are often reasonably simple boolean expressions across known columns, and those columns often are enum-typed or otherwise have a fairly restricted range of values. For example, a real time process may mark rows as dirty, and a background MR job may scan for dirty rows in order to perform further processing like rebuilding inverted indexes.
> One way to speed up this type of query is to add bitmap indexes. In the context of HBase, I would envision this as a new type of metadata block included in the HFile which has a series of tuples: (qualifier, value range, compressed bitmap). A 1 bit in the bitmap indicates that the corresponding HFile block has at least one cell for which a column with the given qualifier falls within the given range. Queries which have an equality or comparison predicate against an indexed qualifier can then use the bitmap index to seek directly to those blocks which may contain relevant data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira