Posted to issues@orc.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2021/10/10 10:13:00 UTC
[jira] [Commented] (ORC-1024) BloomFilter hash computation is inconsistent between Java and C++ clients

    [ https://issues.apache.org/jira/browse/ORC-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426786#comment-17426786 ] 

Quanlong Huang commented on ORC-1024:
-------------------------------------

I simply changed the C++ code to use {{int64_t}}, and it now behaves the same as the Java client. Code changes: [https://github.com/stiga-huang/orc/tree/bloom-filter-debug]

Note that this issue only happens when the writer and reader use different clients, e.g. a Java writer and a C++ reader. The C++ bloom filter code was added in 1.6.0 (ORC-488), so there may already be many ORC files generated by the C++ writer with bloom filters. Simply changing {{uint64_t}} to {{int64_t}} is not enough. We need a compatible change so that new versions of the C++ reader can still make use of the bloom filters in those files.
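
To illustrate the {{uint64_t}} vs {{int64_t}} difference, here is a minimal sketch. It assumes the hash is the shift-and-add 64-bit mix found in the BloomFilter sources linked in the issue description. The names {{longHashUnsigned}}, {{longHashSigned}} and {{arithmeticShiftRight}} are hypothetical and are not the ORC API. The two functions differ only in how {{>>}} fills the top bits, which matters as soon as an intermediate value has its top bit set.

```cpp
#include <cstdint>

// Unsigned variant: `>>` on uint64_t is a logical shift (zero-fill).
inline uint64_t longHashUnsigned(uint64_t key) {
  key = (~key) + (key << 21);
  key = key ^ (key >> 24);
  key = (key + (key << 3)) + (key << 8);
  key = key ^ (key >> 14);
  key = (key + (key << 2)) + (key << 4);
  key = key ^ (key >> 28);
  key = key + (key << 31);
  return key;
}

// Sign-extending right shift, like `>>` on a Java long.
// Guaranteed arithmetic since C++20; implementation-defined before that,
// but arithmetic on mainstream two's-complement compilers.
inline uint64_t arithmeticShiftRight(uint64_t value, int bits) {
  return static_cast<uint64_t>(static_cast<int64_t>(value) >> bits);
}

// Signed variant: additions and left shifts are done on uint64_t to avoid
// signed-overflow undefined behaviour; only the right shifts are signed.
inline int64_t longHashSigned(int64_t input) {
  uint64_t key = static_cast<uint64_t>(input);
  key = (~key) + (key << 21);
  key = key ^ arithmeticShiftRight(key, 24);
  key = (key + (key << 3)) + (key << 8);
  key = key ^ arithmeticShiftRight(key, 14);
  key = (key + (key << 2)) + (key << 4);
  key = key ^ arithmeticShiftRight(key, 28);
  key = key + (key << 31);
  return static_cast<int64_t>(key);
}
```

With this sketch, {{longHashSigned(18000000000)}} gives -1097054448615658549 and {{longHashUnsigned(18000000000)}} gives 15298148493198126027, the two values reported in the issue description. For this key the two variants first diverge at the {{>> 14}} step, because the preceding step leaves the top bit set.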

CC [~wgtmac]

> BloomFilter hash computation is inconsistent between Java and C++ clients
> -------------------------------------------------------------------------
>
>                 Key: ORC-1024
>                 URL: https://issues.apache.org/jira/browse/ORC-1024
>             Project: ORC
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.6.4, 1.6.5, 1.6.6, 1.7.0, 1.6.7, 1.6.8, 1.6.9, 1.6.10, 1.6.11
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: id_name_with_bloom_filters.orc
>
>
> [~drorke] found that the C++ reader could incorrectly filter out some rows (RowGroup) when reading Hive-generated ORC files with SearchArgument "x = value" for certain values. It only happens when Hive generates bloom filters in these files.
> I finally reproduced this by using the Java tool (with ORC-1023) to generate an ORC file with bloom filters and reading it with the C++ reader. The ORC file is attached (id_name_with_bloom_filters.orc). It contains 2 columns and 3 rows:
> {code:java}
> {"id": 0, "name": "Alice"}
> {"id": 1, "name": "Bob"}
> {"id": 18000000000, "name": "Mike"}
> {code}
> Using SearchArgument "id = 18000000000" in the C++ reader, no rows are returned.
> Looking into the code, the Java code uses {{long}} as the hash key, while the C++ code uses {{uint64_t}}. {{long}} in Java is signed, so it should correspond to {{int64_t}} in C++. I think this causes the issue.
> In the Java code, the hash key of 18000000000 is -1097054448615658549. In the C++ code, it is 15298148493198126027. This leads to different results in testHash().
> Java code:
>  [https://github.com/apache/orc/blob/93b7aa67830104d6bd7fc55399947ee938549f55/java/core/src/java/org/apache/orc/util/BloomFilter.java#L195-L204]
> C++ code:
>  [https://github.com/apache/orc/blob/93b7aa67830104d6bd7fc55399947ee938549f55/c%2B%2B/src/BloomFilter.cc#L106-L115]


