Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/06/03 06:34:00 UTC

[jira] [Commented] (IMPALA-10641) Support reading Parquet Bloom filters - missing types

    [ https://issues.apache.org/jira/browse/IMPALA-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356216#comment-17356216 ] 

ASF subversion and git services commented on IMPALA-10641:
----------------------------------------------------------

Commit 817ca5920d93442b591a00bb8b280bd4e05f470c in impala's branch refs/heads/master from Daniel Becker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=817ca59 ]

IMPALA-10640: Support reading Parquet Bloom filters - most common types

This change adds read support for Parquet Bloom filters for the types
that can reasonably be supported in Impala. Other types, such as
CHAR(N), would be very difficult to support: the length may differ
between Parquet and Impala, so values get truncated or padded, which
changes the hash and makes the Bloom filter unusable.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---------------------------------------
|Parquet type |  Impala type            |
|---------------------------------------|
|INT32        |  TINYINT, SMALLINT, INT |
|INT64        |  BIGINT                 |
|FLOAT        |  FLOAT                  |
|DOUBLE       |  DOUBLE                 |
|BYTE_ARRAY   |  STRING                 |
 ---------------------------------------
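
To illustrate the table above, here is a minimal Python sketch (not
Impala code) of how a value of a supported Impala type could be turned
into the bytes that get hashed for a Bloom filter lookup. It assumes
that the hash is computed over the little-endian PLAIN encoding of the
Parquet physical type, and that BYTE_ARRAY values are hashed without a
length prefix. The names IMPALA_TO_PARQUET and bytes_to_hash are
hypothetical.

```python
import struct

# Parquet physical type -> struct format of its little-endian PLAIN encoding.
_PLAIN_FORMATS = {
    "INT32": "<i",
    "INT64": "<q",
    "FLOAT": "<f",
    "DOUBLE": "<d",
}

# Impala type -> Parquet physical type, copied from the table above.
IMPALA_TO_PARQUET = {
    "TINYINT": "INT32",
    "SMALLINT": "INT32",
    "INT": "INT32",
    "BIGINT": "INT64",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "STRING": "BYTE_ARRAY",
}


def bytes_to_hash(impala_type, value):
    """Return the bytes to hash, or None if the type is not in the table."""
    parquet_type = IMPALA_TO_PARQUET.get(impala_type)
    if parquet_type is None:
        return None  # unsupported type: the Bloom filter must be ignored
    if parquet_type == "BYTE_ARRAY":
        # Assumption: the raw bytes are hashed, without a length prefix.
        return bytes(value)
    return struct.pack(_PLAIN_FORMATS[parquet_type], value)
```

Under these assumptions a TINYINT 5 and an INT 5 yield the same four
bytes, which is why the narrower integer types can share the INT32
filter.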

The following types are not supported for the given reasons:

 ----------------------------------------------------------------
|Impala type |  Problem                                          |
|----------------------------------------------------------------|
|VARCHAR(N)  | truncation can change hash                        |
|CHAR(N)     | padding / truncation can change hash              |
|DECIMAL     | multiple encodings supported                      |
|TIMESTAMP   | multiple encodings supported, timezone conversion |
|DATE        | not considered yet                                |
 ----------------------------------------------------------------

Support may be added for these types later, see IMPALA-10641.
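
The padding / truncation problem listed for CHAR(N) and VARCHAR(N) can
be sketched in Python as follows. This is an illustration only: the
helper names hash64 and as_char are hypothetical, and BLAKE2b from the
standard library stands in for the xxHash function used by Parquet
Bloom filters, which is not part of the standard library.

```python
import hashlib


def hash64(data: bytes) -> int:
    # Stand-in 64-bit hash (assumption); Parquet Bloom filters use xxHash.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def as_char(value: bytes, n: int) -> bytes:
    """Truncate or space-pad value to exactly n bytes, as CHAR(N) would."""
    return value[:n].ljust(n, b" ")


written = b"abc"                 # bytes the Parquet writer stored and hashed
padded = as_char(written, 5)     # value as seen through CHAR(5)
truncated = as_char(written, 2)  # value as seen through CHAR(2)

# Neither variant hashes to what the writer inserted into the filter, so a
# lookup with them could wrongly report "not present".
mismatch_padded = hash64(padded) != hash64(written)
mismatch_truncated = hash64(truncated) != hash64(written)
```

Because a Bloom filter lookup is only as good as the hash it is given,
a reader that hashes the padded or truncated form could discard a row
group that actually contains the value.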

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used, because the dictionary already
gives exact filtering results.
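
The preference for the dictionary over the Bloom filter could look
roughly like the Python sketch below. The ColumnChunk class, its fields,
and can_skip_row_group are hypothetical simplifications for an equality
predicate, not the structures used in the Impala backend.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class ColumnChunk:
    # Hypothetical model of the metadata a reader has for one column chunk.
    fully_dict_encoded: bool
    dictionary: Optional[Set[bytes]] = None
    bloom_might_contain: Optional[Callable[[bytes], bool]] = None


def can_skip_row_group(chunk: ColumnChunk, value: bytes) -> bool:
    """Return True only if the chunk certainly does not contain value."""
    if chunk.fully_dict_encoded and chunk.dictionary is not None:
        # The dictionary lists every distinct value, so the answer is exact
        # and the Bloom filter is not consulted.
        return value not in chunk.dictionary
    if chunk.bloom_might_contain is not None:
        # A Bloom filter can give false positives but no false negatives.
        return not chunk.bloom_might_contain(value)
    return False  # no usable information: the row group must be read
```

With this ordering, a false positive from the Bloom filter cannot
prevent a skip that the dictionary would allow.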

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
    whether Parquet Bloom filtering works for the supported types and
    that we do not incorrectly discard row groups for the unsupported
    type VARCHAR. The Parquet file used in the test was generated with
    an external tool.
  - Added unit tests for ParquetBloomFilter in file
    be/src/util/parquet-bloom-filter-test.cc
  - Made a minor, unrelated change in
    be/src/util/bloom-filter-test.cc: the MakeRandom() function had
    return type uint64_t and its documentation claimed it returned a
    64-bit random number, but it actually produces only 32 random bits,
    which is what the tests intend. The return type and documentation
    have been corrected to 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Csaba Ringhofer <cs...@cloudera.com>


> Support reading Parquet Bloom filters - missing types
> -----------------------------------------------------
>
>                 Key: IMPALA-10641
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10641
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Daniel Becker
>            Priority: Major
>
> This Jira tracks the addition of read support for Parquet Bloom filters for the types not dealt with in IMPALA-10640.


