Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/07/14 13:12:01 UTC
[jira] [Commented] (IMPALA-10640) Support reading Parquet Bloom filters - most common types
[ https://issues.apache.org/jira/browse/IMPALA-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380583#comment-17380583 ]
ASF subversion and git services commented on IMPALA-10640:
----------------------------------------------------------
Commit a5de2acc47723fdaee4ebe6d904d16be505b7cfb in impala's branch refs/heads/master from Daniel Becker
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a5de2ac ]
IMPALA-10642: Write support for Parquet Bloom filters - most common types
This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.
Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
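The three values can be read as a small decision rule. The following Python sketch is illustrative only and is not taken from the Impala code base: the enum, the function name and its parameters are hypothetical, and it assumes that a column is never given a Bloom filter unless it is listed in the table properties.

```python
from enum import Enum


class BloomFilterWrite(Enum):
    """Hypothetical mirror of the 'parquet_bloom_filter_write' values."""
    NEVER = "NEVER"
    IF_NO_DICT = "IF_NO_DICT"
    ALWAYS = "ALWAYS"


def should_write_bloom_filter(option, column_in_tbl_props, fully_dict_encoded):
    """Return True if a Bloom filter would be written for a column.

    option              -- a BloomFilterWrite value (the query option)
    column_in_tbl_props -- column is listed in 'parquet.bloom.filter.columns'
    fully_dict_encoded  -- the row group is fully dictionary encoded
    """
    # NEVER disables writing; unlisted columns are assumed to get no filter.
    if option is BloomFilterWrite.NEVER or not column_in_tbl_props:
        return False
    # IF_NO_DICT writes only when dictionary encoding was abandoned.
    if option is BloomFilterWrite.IF_NO_DICT:
        return not fully_dict_encoded
    # ALWAYS writes regardless of the encoding.
    return True
```

For example, should_write_bloom_filter(BloomFilterWrite.IF_NO_DICT, True, True) evaluates to False in this sketch, because the row group is still fully dictionary encoded.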
The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The optional 'bytes' part gives the size
of the Bloom filter's bitset in bytes. If the size is omitted, the
maximum Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.
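As a rough illustration of the property format, the Python sketch below parses such a string into a mapping from column name to bitset size. It is not Impala's parser: the function name is hypothetical, ASSUMED_MAX_BYTES is a placeholder standing in for ParquetBloomFilter::MAX_BYTES (whose actual value is not stated here), and no validation of the sizes is attempted.

```python
# Placeholder for ParquetBloomFilter::MAX_BYTES; the real value is an assumption.
ASSUMED_MAX_BYTES = 128 * 1024 * 1024


def parse_bloom_filter_columns(prop, default_bytes=ASSUMED_MAX_BYTES):
    """Parse a 'col_name:bytes' list into {col_name: bitset_size_in_bytes}."""
    result = {}
    for entry in prop.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas and surrounding whitespace
        # partition() yields an empty separator when no ':' is present.
        name, sep, size = entry.partition(":")
        result[name.strip()] = int(size) if sep else default_bytes
    return result


sizes = parse_bloom_filter_columns("col1:1024,col2,col4:100")
```

With the example string from the commit message, 'col1' and 'col4' get their explicit sizes and 'col2' falls back to the default. An entry with a colon but no number (such as 'col1:') raises ValueError in this sketch.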
Testing:
 - Added a test in tests/query_test/test_parquet_bloom_filter.py that
   uses Impala to write the same table as in the test file
   'testdata/data/parquet-bloom-filtering.parquet' and checks whether
   the Parquet Bloom filter header and bitset are identical.
 - 'test_fallback_from_dict' tests falling back from dict encoding to
   plain and using Bloom filters.
 - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
   from dict encoding to plain when Bloom filters are NOT enabled.
Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Reviewed-on: http://gerrit.cloudera.org:8080/17262
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Csaba Ringhofer <cs...@cloudera.com>
> Support reading Parquet Bloom filters - most common types
> ---------------------------------------------------------
>
> Key: IMPALA-10640
> URL: https://issues.apache.org/jira/browse/IMPALA-10640
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Reporter: Daniel Becker
> Assignee: Daniel Becker
> Priority: Major
> Labels: parquet
>
> Support reading Parquet Bloom filters for the most common types: integers, float, double and Impala strings. Supporting these types is relatively easy in comparison to most other types. Support for other types may be added later.