Posted to dev@hive.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2015/02/03 01:02:36 UTC

[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

    [ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302507#comment-14302507 ] 

Owen O'Malley commented on HIVE-9188:
-------------------------------------

Suggestions:
* Pick m to always be a multiple of 64 (since you are using longs as the representation)
* change the representation of BloomFilter in orc_proto to record the number of hash functions and not the size or fpp.
* use fixed64 for the bit field
* you'll also need to update the specification in the wiki with the change to the format (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification)
* revert the spurious change to CliDriver.java
* revert the spurious change to .gitignore
* it seems suboptimal to convert long values to bytes before hashing
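
To make a few of these points concrete, here is a rough sketch, not the code from the patch. It rounds m up to a multiple of 64 so the bit set fills whole longs (which would map naturally onto a repeated fixed64 field), keeps the number of hash functions k as the only parameter a reader needs besides the bits, and hashes a long directly rather than converting it to bytes first. The class name BloomSketch, the method names, and the choice of the MurmurHash3 64-bit finalizer as the mixing function are all assumptions made for illustration. The int casts also assume the filter stays well under 2^31 bits.

```java
// Illustrative sketch only; names and the hash choice are hypothetical.
final class BloomSketch {
    private final long[] bits;          // each long holds 64 bits of the filter
    private final int numBits;          // m, always a multiple of 64
    private final int numHashFunctions; // k

    BloomSketch(long expectedEntries, double fpp) {
        this.numBits = optimalNumBits(expectedEntries, fpp);
        this.numHashFunctions = optimalNumHashFunctions(expectedEntries, numBits);
        this.bits = new long[numBits / 64];
    }

    // Standard sizing m = -n * ln(p) / (ln 2)^2, rounded up to a multiple of 64.
    static int optimalNumBits(long n, double p) {
        double m = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        long rounded = ((long) Math.ceil(m) + 63) / 64 * 64;
        return (int) Math.max(64, rounded);
    }

    // Standard choice k = (m / n) * ln 2, with a floor of one hash function.
    static int optimalNumHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Mixes a long directly (MurmurHash3 64-bit finalizer); no byte[] is allocated.
    static long hash64(long v) {
        v ^= v >>> 33;
        v *= 0xff51afd7ed558ccdL;
        v ^= v >>> 33;
        v *= 0xc4ceb9fe1a85ec53L;
        v ^= v >>> 33;
        return v;
    }

    // Derives the i-th bit position from the two 32-bit halves of one 64-bit hash.
    private int position(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        int combined = h1 + i * h2;
        if (combined < 0) {
            combined = ~combined; // flip to a non-negative value
        }
        return combined % numBits;
    }

    void addLong(long value) {
        long h = hash64(value);
        for (int i = 1; i <= numHashFunctions; i++) {
            int pos = position(h, i);
            bits[pos >>> 6] |= 1L << pos; // Java uses only the low 6 bits of the shift
        }
    }

    // Returns false only if the value was definitely never added.
    boolean testLong(long value) {
        long h = hash64(value);
        for (int i = 1; i <= numHashFunctions; i++) {
            int pos = position(h, i);
            if ((bits[pos >>> 6] & (1L << pos)) == 0) {
                return false;
            }
        }
        return true;
    }

    int getNumBits() { return numBits; }
    int getNumHashFunctions() { return numHashFunctions; }
    long[] getBitSet() { return bits; }
}
```

With this layout, a reader that is given the long array and k can recover m as 64 times the array length, so neither the size nor the fpp would need to be stored.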


> BloomFilter in ORC row group index
> ----------------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set membership checking. We can use Bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases, min/max based elimination is not effective (for example, unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)