Posted to reviews@impala.apache.org by "Daniel Becker (Code Review)" <ge...@cloudera.org> on 2021/04/12 15:11:49 UTC

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Daniel Becker has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17262 )


Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  TBL_PROPS  - write Parquet Bloom filters as set in table properties
  IF_NO_DICT - write Parquet Bloom filters if the row group is not
               fully dictionary encoded
  ALWAYS     - always write Parquet Bloom filters, even if the row
               group is fully dictionary encoded

TODO: Implement table properties involving Parquet Bloom filters.

TODO: Decide size of Parquet Bloom filter based on NDV heuristics or
configuration.

Testing:
  TODO

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
12 files changed, 313 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 20:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9059/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 20
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Jul 2021 09:32:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8661/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 29 Apr 2021 17:13:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 23: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7286/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 17:08:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#17). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
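
To illustrate how IF_NO_DICT and ALWAYS differ in practice, here is a minimal Python sketch of a column writer that starts out dictionary encoded and falls back to plain encoding once the dictionary is full. ToyColumnWriter, max_dict_size and the Python set standing in for the Bloom filter are all hypothetical; the patch itself is C++ and may order these steps differently.

```python
from enum import Enum


class BloomFilterWrite(Enum):
    """Mirrors the three values of the query option described above."""
    NEVER = 0
    IF_NO_DICT = 1
    ALWAYS = 2


class ToyColumnWriter:
    """Hypothetical stand-in for a Parquet column writer (not Impala code)."""

    def __init__(self, mode, in_tbl_props, max_dict_size):
        self.mode = mode
        self.in_tbl_props = in_tbl_props  # column listed in the table property?
        self.max_dict_size = max_dict_size
        self.dictionary = set()   # distinct values seen while dict encoding
        self.dict_encoded = True  # becomes False after falling back to plain
        self.bloom = set()        # a set stands in for the real Bloom filter

    def _bloom_wanted(self):
        return self.in_tbl_props and self.mode is not BloomFilterWrite.NEVER

    def append(self, value):
        if self.dict_encoded:
            self.dictionary.add(value)
            if len(self.dictionary) > self.max_dict_size:
                # Dictionary overflow: fall back to plain encoding and copy
                # everything seen so far into the filter (assumed behaviour).
                self.dict_encoded = False
                if self._bloom_wanted():
                    self.bloom.update(self.dictionary)
            elif self._bloom_wanted() and self.mode is BloomFilterWrite.ALWAYS:
                self.bloom.add(value)
        elif self._bloom_wanted():
            self.bloom.add(value)

    def finish(self):
        """Return the values covered by the filter, or None if none is written."""
        if not self._bloom_wanted():
            return None
        if self.mode is BloomFilterWrite.IF_NO_DICT and self.dict_encoded:
            return None  # row group stayed fully dictionary encoded
        return self.bloom
```

With IF_NO_DICT and max_dict_size=2, writing the values 1 and 2 produces no filter, while writing a third distinct value triggers the fallback and yields a filter covering all three values.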

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".
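
A minimal Python sketch of how such a property value could be parsed into a column-to-size map. The function name, the whitespace handling and the skipping of empty entries are assumptions for illustration; the parser in this change lives in HdfsTableSink.java and may validate its input differently. The value of ParquetBloomFilter::MAX_BYTES is not given in this thread, so it is passed in as a parameter.

```python
def parse_bloom_filter_columns(tbl_prop, max_bytes):
    """Parse 'col1:1024,col2' into {'col1': 1024, 'col2': max_bytes}.

    Hypothetical helper, not the Impala implementation. Columns without an
    explicit size get max_bytes. A non-numeric size raises ValueError.
    """
    result = {}
    for entry in tbl_prop.split(","):
        entry = entry.strip()
        if not entry:
            continue  # assumption: silently ignore empty entries
        name, sep, size = entry.partition(":")
        result[name.strip()] = int(size) if sep else max_bytes
    return result
```

For the example value above and a placeholder max_bytes of 4096, the sketch returns {'col1': 1024, 'col2': 4096, 'col4': 100}.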

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 961 insertions(+), 83 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/17

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 19:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9051/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Jul 2021 15:17:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/17262/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17262/3//COMMIT_MSG@16
PS3, Line 16: TBL_PROPS
TBL_PROPS will configure bloom filters per column, right?


http://gerrit.cloudera.org:8080/#/c/17262/3//COMMIT_MSG@16
PS3, Line 16:   TBL_PROPS  - write Parquet Bloom filters as set in table properties
            :   IF_NO_DICT - write Parquet Bloom filters if the row group is not
            :                fully dictionary encoded
What is the relation between TBL_PROPS and IF_NO_DICT? E.g. with TBL_PROPS, will you write a bloom filter even if the column is dictionary encoded?


http://gerrit.cloudera.org:8080/#/c/17262/1/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/1/be/src/exec/parquet/hdfs-parquet-table-writer.cc@605
PS1, Line 605:   // The ParquetBloomFilter object if one is being written. If
             :   // 'ShouldInitParquetBloomFilter()' is false, the combination of the impala type and the
             :   // parquet type is not supported or some error occurs during the initialisation of the
             :   // ParquetBloomFilter object, it is set to NULL.
             :   unique_ptr<ParquetB
> I find using dst_ptr a bit risky:
I like the solution, but I disagree here:
"in case of int8 and int16 we need to pad them to int32 because parquet doesn't support smaller ints, so in these cases we cannot use the buffer, adding even more special cases."
Why can't we use the buffer? dst_ptr points to the plain encoded page buffer, where the i16/i8 are already converted to i32
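
The point about the page buffer can be sketched in Python, assuming standard Parquet PLAIN encoding in which TINYINT and SMALLINT columns are stored as 4-byte little-endian INT32 values. The helper names are hypothetical and only model the byte layout, not the writer's actual buffer handling.

```python
import struct


def plain_encode_int32(values):
    """PLAIN-encode small integers as consecutive 4-byte little-endian INT32s."""
    return b"".join(struct.pack("<i", v) for v in values)


def value_bytes_at(buffer, index):
    """Return the 4 bytes of the index-th value in a PLAIN INT32 buffer."""
    return buffer[4 * index: 4 * index + 4]
```

In this model an int8 value of -1 occupies the four bytes ff ff ff ff in the buffer, which are the same bytes an INT32 -1 would have, so a hash computed over the buffer slice already sees the widened value.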


http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@493
PS3, Line 493: if (parent_
I would prefer to move this to a separate function like FlushDictionaryToBloomFilterIfNeeded()

ProcessValue() is a critical function when trying to understand how Impala writes Parquet files, so I think that we should try to minimize noise.


http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@542
PS3, Line 542:     if (ShouldUpdateParquetBloomFilter()) {
I would prefer to move this to a separate function like UpdateBloomFilterIfNeeded()


http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@597
PS3, Line 597: whwn
typo


http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/parquet-bloom-filter-util.cc
File be/src/exec/parquet/parquet-bloom-filter-util.cc:

http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/parquet-bloom-filter-util.cc@220
PS3, Line 220: {
nit: brace placement




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Apr 2021 14:03:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8937/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Jun 2021 14:28:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 19:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/17262/19/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/19/be/src/exec/parquet/hdfs-parquet-table-writer.cc@708
PS19, Line 708:     if (parquet_bloom_filter_state_ == ParquetBloomFilterState::WAIT_FOR_FALLBACK_FROM_DICT) {
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17262/19/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/19/tests/query_test/test_parquet_bloom_filter.py@122
PS19, Line 122: ;
flake8: E703 statement ends with a semicolon


http://gerrit.cloudera.org:8080/#/c/17262/19/tests/query_test/test_parquet_bloom_filter.py@125
PS19, Line 125: ;
flake8: E703 statement ends with a semicolon


http://gerrit.cloudera.org:8080/#/c/17262/19/tests/query_test/test_parquet_bloom_filter.py@132
PS19, Line 132: T
flake8: E501 line too long (95 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/17262/19/tests/query_test/test_parquet_bloom_filter.py@160
PS19, Line 160: ;
flake8: E703 statement ends with a semicolon


http://gerrit.cloudera.org:8080/#/c/17262/19/tests/query_test/test_parquet_bloom_filter.py@162
PS19, Line 162: ;
flake8: E703 statement ends with a semicolon




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Jul 2021 14:57:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 7:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py@28
PS6, Line 28: p
> flake8: E126 continuation line over-indented for hanging indent
Done


http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py@126
PS6, Line 126: s
> flake8: E226 missing whitespace around arithmetic operator
Done


http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py@145
PS6, Line 145: w
> flake8: F841 local variable 'bloom_filter_header' is assigned to but never 
Done




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 26 Apr 2021 17:44:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

Introduced the 'parquet.bloom.filter.columns' table property. It is a
comma separated list of 'col_name:bytes' pairs. The 'bytes' part means
the size of the bitset of the Bloom filter, and is optional. If the size
is not given, it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".
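
To show what the 'bytes' size governs, here is a rough Python sketch of a split-block Bloom filter as described in the Parquet format specification: the bitset is divided into 32-byte blocks of eight 32-bit words, and each inserted hash sets one bit per word in a single block. The class and method names are hypothetical, the caller is assumed to supply a 64-bit hash of the value, and the sketch simply rejects sizes that are not a multiple of 32 bytes. How the patch treats a size such as 100 is not stated in this thread.

```python
# Salt constants from the Parquet split-block Bloom filter specification.
SALT = (0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31)
BYTES_PER_BLOCK = 32  # 8 words of 32 bits each


class SplitBlockBloomFilter:
    """Illustrative model only; not the ParquetBloomFilter class in the patch."""

    def __init__(self, num_bytes):
        if num_bytes <= 0 or num_bytes % BYTES_PER_BLOCK != 0:
            raise ValueError("size must be a positive multiple of 32 bytes")
        self.num_blocks = num_bytes // BYTES_PER_BLOCK
        self.words = [0] * (self.num_blocks * 8)

    @staticmethod
    def _mask(x):
        # One bit per word, chosen by the top 5 bits of x * salt (mod 2**32).
        return [1 << (((x * s) & 0xFFFFFFFF) >> 27) for s in SALT]

    def _block_start(self, h):
        # The upper 32 bits of the hash select the block.
        return (((h >> 32) * self.num_blocks) >> 32) * 8

    def insert(self, h):
        start = self._block_start(h)
        for i, bit in enumerate(self._mask(h & 0xFFFFFFFF)):
            self.words[start + i] |= bit

    def might_contain(self, h):
        start = self._block_start(h)
        mask = self._mask(h & 0xFFFFFFFF)
        return all(self.words[start + i] & bit for i, bit in enumerate(mask))
```

A 1024-byte filter in this model has 32 blocks. A larger bitset spreads inserted values over more blocks, which lowers the false positive rate at the cost of file size.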

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - TODO: Test falling back from dict encoding to plain and using Bloom
    filters.
  - Test table properties.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 603 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/8

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 12:

(6 comments)

The change looks good. I have added a few comments, mostly nits.

http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/exec/parquet/hdfs-parquet-table-writer.cc@632
PS12, Line 632:   bool ShouldUpdateParquetBloomFilter() const {
              :     if (parquet_bloom_filter_ == nullptr) {
              :       // There is no initialised ParquetBloomFilter, an error may have occured earlier.
              :       return false;
              :     }
              : 
              :     DCHECK(parquet_bloom_filter_tbl_prop_enabled_);
              :     switch (parent_->state_->query_options().parquet_bloom_filter_write) {
              :       case TParquetBloomFilterWrite::NEVER: {
              :         return false;
              :       }
              :       case TParquetBloomFilterWrite::IF_NO_DICT: {
              :         return !IsDictionaryEncoding(current_encoding_);
              :       }
              :       case TParquetBloomFilterWrite::ALWAYS: {
              :         return true;
              :       }
              :       default: {
              :         DCHECK(false) << "Unexpected enum variant: " << PrintThriftEnum(
              :             parent_->state_->query_options().parquet_bloom_filter_write);
              :         return false;
              :       }
              :     }
              :   }
              : 
              :   bool ShouldInitParquetBloomFilter() const {
              :     if (!parquet_bloom_filter_tbl_prop_enabled_) return false;
              : 
              :     switch (parent_->state_->query_options().parquet_bloom_filter_write) {
              :       case TParquetBloomFilterWrite::NEVER: {
              :         return false;
              :       }
              :       case TParquetBloomFilterWrite::IF_NO_DICT: {
              :         return true;
              :       }
              :       case TParquetBloomFilterWrite::ALWAYS: {
              :         return true;
              :       }
              :       default: {
              :         DCHECK(false) << "Unexpected enum variant: " << PrintThriftEnum(
              :             parent_->state_->query_options().parquet_bloom_filter_write);
              :         return false;
              :       }
              :     }
              :   }
We can probably combine these two checks into something like isParquetBloomFilterAllowed(boolean update), where update is true for the update operation and false for init. That may reduce the code duplication.
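
A possible shape for the combined check, sketched in Python rather than C++ for brevity. The function and parameter names are hypothetical, and the branches only restate the two quoted functions. One difference is that a disabled table property returns False in both modes, where the quoted update path uses a DCHECK instead.

```python
from enum import Enum


class BloomFilterWrite(Enum):
    """Stand-in for the TParquetBloomFilterWrite Thrift enum."""
    NEVER = 0
    IF_NO_DICT = 1
    ALWAYS = 2


def is_bloom_filter_allowed(mode, tbl_prop_enabled, *, update,
                            filter_initialised=True, dict_encoded=False):
    """Single predicate covering both the init and the update decision."""
    if not tbl_prop_enabled:
        return False
    if update and not filter_initialised:
        # No filter object exists; an earlier error may have occurred.
        return False
    if mode is BloomFilterWrite.NEVER:
        return False
    if mode is BloomFilterWrite.IF_NO_DICT:
        # Init is always allowed; updates wait until dict encoding is abandoned.
        return not (update and dict_encoded)
    if mode is BloomFilterWrite.ALWAYS:
        return True
    raise ValueError("unexpected enum variant: %r" % (mode,))
```

Whether a boolean flag reads better than two small functions is a judgment call; the flag removes the duplicated switch but makes each call site slightly less self-describing.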


http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/exec/parquet/hdfs-parquet-table-writer.cc@715
PS12, Line 715:     if (parquet_bloom_filter_tbl_prop_enabled_
If this method is invoked when parquet_bloom_filter_ is nullptr (maybe due to an Init or Update failure), I think it might run into issues. We should probably check for that. A check like ShouldUpdateParquetBloomFilter might be required.


http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/util/parquet-bloom-filter.h
File be/src/util/parquet-bloom-filter.h:

http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/util/parquet-bloom-filter.h@41
PS12, Line 41:   /// If 'always_false' is true, the implementation assumes that the directory is empty.
nit: 'always_false' -> 'always_false_'


http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/util/parquet-bloom-filter.h@42
PS12, Line 42: 'always_false'
nit: same as above


http://gerrit.cloudera.org:8080/#/c/17262/12/be/src/util/parquet-bloom-filter.h@122
PS12, Line 122:   bool always_false_;
A comment above this member field would help.


http://gerrit.cloudera.org:8080/#/c/17262/12/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

http://gerrit.cloudera.org:8080/#/c/17262/12/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java@249
PS12, Line 249:   private Map<String, Long> parseParquetBloomFilterWritingTblProp(final String tbl_prop) {
Is it possible to write unit tests for this parsing logic that cover various scenarios?




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 12
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 15 Jun 2021 16:08:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 706 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/13
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

Introduced the 'parquet.bloom.filter.columns' table property. It is a
comma separated list of 'col_name:bytes' pairs. The 'bytes' part means
the size of the bitset of the Bloom filter, and is optional. If the size
is not given, it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 694 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/9
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 24:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7291/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 24
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 22:43:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8981/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jun 2021 08:54:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17262/9/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/9/tests/query_test/test_parquet_bloom_filter.py@108
PS9, Line 108: h
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/17262/9/tests/query_test/test_parquet_bloom_filter.py@128
PS9, Line 128: d
flake8: E122 continuation line missing indentation or outdented
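
For context, E126 and E122 both concern the indentation of continuation lines.
The Python sketch below shows two layouts that pycodestyle accepts under its
default settings (4-space indent); the 'describe' function and the values are
made up for illustration and are not taken from test_parquet_bloom_filter.py.

```python
def describe(name, size):
    """Format a column name and size as 'name:size' (illustrative helper)."""
    return "%s:%d" % (name, size)


# Hanging indent: nothing follows the opening bracket, and each continuation
# line is indented by exactly one level relative to the statement.
entry = describe(
    "col1",
    1024,
)

# Visual indent: continuation lines align with the first argument.
other = describe("col2",
                 2048)
```

Indenting the hanging lines further than one level would trigger E126, and
placing a continuation line at or left of the statement's own indentation
would trigger E122.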



-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 29 Apr 2021 16:54:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
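
The interaction of the query option with the table property, as listed above,
can be sketched as a small decision function. This is a hedged Python
illustration of the described semantics, not the code in
hdfs-parquet-table-writer.cc; the function and parameter names are
hypothetical.

```python
def should_write_bloom_filter(option, column_in_tbl_props, fully_dict_encoded):
    """Decide whether a Bloom filter is written for one column of a row group.

    option: value of 'parquet_bloom_filter_write' (NEVER, IF_NO_DICT, ALWAYS).
    column_in_tbl_props: True if the column is listed in the
        'parquet.bloom.filter.columns' table property.
    fully_dict_encoded: True if the row group is fully dictionary encoded.
    """
    if option == "NEVER":
        return False
    if option not in ("IF_NO_DICT", "ALWAYS"):
        raise ValueError("unknown option value: %r" % option)
    # Both remaining values only apply to columns named in the table property.
    if not column_in_tbl_props:
        return False
    if option == "IF_NO_DICT":
        return not fully_dict_encoded
    return True  # ALWAYS
```

Under this reading, IF_NO_DICT and ALWAYS differ only for row groups that are
fully dictionary encoded.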

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 694 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/10
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 10
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 961 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/16
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 16
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 23:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9074/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 16:11:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#25). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 997 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/25
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 25
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 26: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7293/


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 26
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 15:10:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has removed a vote on this change.

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Removed Verified-1 by Impala Public Jenkins <im...@cloudera.com>
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 26
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 25: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 25
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 09:04:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8641/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 26 Apr 2021 18:03:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 990 insertions(+), 83 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/19
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 15:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17262/15/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
File fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java:

http://gerrit.cloudera.org:8080/#/c/17262/15/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java@30
PS15, Line 30:   private static final Logger LOG = Logger.getLogger(ParquetBloomFilterTblPropParserTest.class);
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17262/15/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java@77
PS15, Line 77:     final Map<String, Long> res = HdfsTableSink.parseParquetBloomFilterWritingTblProp(tbl_props);
line too long (97 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 23 Jun 2021 08:33:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8730/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 11
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Mon, 17 May 2021 13:17:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#23). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
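
A minimal sketch of how the three option values above combine with the table
property and the dictionary state. This is illustrative Python, not the code
in hdfs-parquet-table-writer.cc; the helper name and its boolean parameters
are assumptions made for the example.

```python
from enum import Enum


class BloomFilterWrite(Enum):
    """Values of the 'parquet_bloom_filter_write' query option."""
    NEVER = "NEVER"
    IF_NO_DICT = "IF_NO_DICT"
    ALWAYS = "ALWAYS"


def should_write_bloom_filter(option, in_tbl_props, fully_dict_encoded):
    """Hypothetical helper: decide for one column in one row group.

    in_tbl_props       -- the column is listed in 'parquet.bloom.filter.columns'
    fully_dict_encoded -- the row group never fell back from dict encoding
    """
    # NEVER wins over everything; otherwise the column must be listed
    # in the table properties.
    if option is BloomFilterWrite.NEVER or not in_tbl_props:
        return False
    # IF_NO_DICT writes a filter only when dictionary encoding was abandoned.
    if option is BloomFilterWrite.IF_NO_DICT:
        return not fully_dict_encoded
    # ALWAYS: write regardless of the dictionary state.
    return True
```

With this reading, a column that is not listed in the table property never
gets a Bloom filter, whichever option value is set.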

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is optional and gives
the size of the bitset of the Bloom filter. If the size is not given,
the maximal Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.
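
A rough sketch of parsing such a property value, written in Python for
brevity. It is not the parser added to HdfsTableSink.java: the function name,
the whitespace handling and the skipping of empty entries are assumptions, and
EXAMPLE_MAX_BYTES is an arbitrary placeholder for
ParquetBloomFilter::MAX_BYTES, whose real value is not stated here.

```python
# Arbitrary placeholder; NOT the real ParquetBloomFilter::MAX_BYTES.
EXAMPLE_MAX_BYTES = 4096


def parse_bloom_filter_columns(prop_value, max_bytes):
    """Hypothetical parser for a 'parquet.bloom.filter.columns' value.

    Returns a dict mapping column name to bitset size in bytes.
    Columns listed without a size get max_bytes.
    """
    result = {}
    for item in prop_value.split(","):
        item = item.strip()
        if not item:
            continue  # assumption: tolerate empty entries
        name, sep, size = item.partition(":")
        name = name.strip()
        if not name:
            raise ValueError("missing column name in %r" % item)
        # int() raises ValueError for a malformed or empty size.
        result[name] = int(size) if sep else max_bytes
    return result


sizes = parse_bloom_filter_columns("col1:1024,col2,col4:100", EXAMPLE_MAX_BYTES)
# sizes == {"col1": 1024, "col2": 4096, "col4": 100}
```

The sketch only splits the string; whether a given size is acceptable to the
writer is a separate question that it does not address.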

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 996 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/23

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8667/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 10
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 30 Apr 2021 10:29:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  TBL_PROPS  - write Parquet Bloom filters as set in table properties
  IF_NO_DICT - write Parquet Bloom filters if the row group is not
               fully dictionary encoded
  ALWAYS     - always write Parquet Bloom filters, even if the row
               group is fully dictionary encoded

TODO: Implement table properties involving Parquet Bloom filters.

TODO: Decide size of Parquet Bloom filter based on NDV heuristics or
configuration.

Testing:
  TODO

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
12 files changed, 321 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/4

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 4
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 20: Code-Review+2

Thanks for the changes!



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 20
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Jul 2021 12:34:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 23:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7286/ DRY_RUN=false



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 16:55:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17262/12/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/12/tests/query_test/test_parquet_bloom_filter.py@170
PS12, Line 170: )
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/17262/12/tests/query_test/test_parquet_bloom_filter.py@195
PS12, Line 195: d
flake8: E126 continuation line over-indented for hanging indent




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 12
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Jun 2021 10:34:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#15). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is optional and gives
the size of the bitset of the Bloom filter. If the size is not given,
the maximal Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 959 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/15

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 23: Code-Review+2



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 16:55:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is optional and gives
the size of the bitset of the Bloom filter. If the size is not given,
the maximal Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 709 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/14

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 14
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is optional and gives
the size of the bitset of the Bloom filter. If the size is not given,
the maximal Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 999 insertions(+), 83 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/22

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 22
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is optional and gives
the size of the bitset of the Bloom filter. If the size is not given,
the maximal Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 694 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/11

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 11
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 21: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7283/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 21
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Jul 2021 18:38:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py@28
PS6, Line 28: g
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py@126
PS6, Line 126: *
flake8: E226 missing whitespace around arithmetic operator


http://gerrit.cloudera.org:8080/#/c/17262/6/tests/query_test/test_parquet_bloom_filter.py@145
PS6, Line 145: b
flake8: F841 local variable 'bloom_filter_header' is assigned to but never used




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 23 Apr 2021 14:46:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 17:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9041/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Jul 2021 11:36:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 17:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/17262/16//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17262/16//COMMIT_MSG@35
PS16, Line 35:     'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
nit: wrap at 72


http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/hdfs-table-sink.h@284
PS16, Line 284: 
nit: extra row


http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/parquet/hdfs-parquet-table-writer.cc@632
PS16, Line 632:   /// Returns whether we are using Parquet Bloom filtering. It can be called in two cases:
              :   /// when deciding whether to initialise 'parquet_bloom_filter_' and when deciding
              :   /// whether it should be updated (a new value is to be inserted). In the first case,
              :   /// 'init' should be set to true, otherwise to false.
optional: using an enum to represent the Bloom filter state machine seems easier to understand to me. It could have states like uninitialized/enabled/disabled/wait_for_fallback_from_dictionary_encoding and make parquet_bloom_filter_tbl_prop_enabled_ unnecessary.
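
A minimal sketch of the suggested state machine, written in Python for brevity although the writer itself is C++. The state names come from the comment above; the class name, the two helper functions and the string form of the query option values are hypothetical and are not taken from the patch.

```python
from enum import Enum


class BloomFilterState(Enum):
    """Hypothetical states, named after the ones listed in the review comment."""
    UNINITIALIZED = "uninitialized"
    ENABLED = "enabled"
    DISABLED = "disabled"
    WAIT_FOR_FALLBACK_FROM_DICT = "wait_for_fallback_from_dictionary_encoding"


def initial_state(write_option, column_in_tbl_props):
    """Pick the state when a column writer is initialised (assumed logic)."""
    if write_option not in ("NEVER", "IF_NO_DICT", "ALWAYS"):
        raise ValueError("unknown option value: {}".format(write_option))
    if write_option == "NEVER" or not column_in_tbl_props:
        return BloomFilterState.DISABLED
    if write_option == "ALWAYS":
        return BloomFilterState.ENABLED
    # IF_NO_DICT: stay idle until the writer gives up on dictionary encoding.
    return BloomFilterState.WAIT_FOR_FALLBACK_FROM_DICT


def on_dictionary_fallback(state):
    """Transition taken when the writer falls back from dictionary encoding."""
    if state is BloomFilterState.WAIT_FOR_FALLBACK_FROM_DICT:
        return BloomFilterState.ENABLED
    return state
```

With a single state value, the "should the filter be initialised" and "should the filter be updated" questions both reduce to checking the current state, which is presumably what would make the separate boolean member unnecessary.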


http://gerrit.cloudera.org:8080/#/c/17262/16/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
File fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java:

http://gerrit.cloudera.org:8080/#/c/17262/16/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java@37
PS16, Line 37:     exp.put("col1", HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES);
             :     exp.put("col2", HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES);
             :     exp.put("col3", HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES);
optional: it could be a bit nicer to fill the map with values in one command.
We are already using Guava's ImmutableMap.of(...), see https://github.com/apache/impala/blob/bb3062197b134f33e2796fac603e3367ab8bef1a/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java#L299

see https://stackoverflow.com/questions/6802483/how-to-directly-initialize-a-hashmap-in-a-literal-way for more details


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@113
PS16, Line 113:         reference_col_to_bloom_filter, col_to_bloom_filter)
Can you add a select query and check the profile similarly to test_fallback_from_dict? This is probably redundant as the original Parquet file has the exact same Bloom filters, but it would still be worth verifying that we can use the Bloom filters written by Impala.


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@196
PS16, Line 196:       db=unique_database, tbl=tbl_name, col_name=column_name, size=bitset_size)
nit: +2 indentation


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@225
PS16, Line 225: _create_empty_test_database
naming: This creates a table and not a database.


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@225
PS16, Line 225: unique_database
naming: here and at other helper functions: I would prefer db_name to unique_database, as unique_database has a specific meaning for test functions


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@240
PS16, Line 240: _populate_database
naming: this populates a table, not a database


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@320
PS16, Line 320:           with open("exp_dir.bin", "wb") as out:
              :             out.write(exp_directory)
              :           with open("dir.bin", "wb") as out:
              :             out.write(directory)
I don't get why we need to write these to files. What is the type of exp_directory and directory? If they are byte arrays, can't we simply compare them? If printing non-UTF-8 bytes is a problem, then we could compare them and print their hex-encoded versions if there is a problem.
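
A sketch of what the suggested comparison might look like, assuming exp_directory and directory are plain byte strings. The helper name assert_same_bytes and its 'what' parameter are hypothetical and are not part of the test file.

```python
import binascii


def assert_same_bytes(expected, actual, what="Bloom filter directory"):
    """Compare two byte strings and report a mismatch in hex, without temp files."""
    if expected != actual:
        raise AssertionError(
            "{} mismatch:\n  expected: {}\n  actual:   {}".format(
                what,
                binascii.hexlify(expected).decode("ascii"),
                binascii.hexlify(actual).decode("ascii")))
```

In the test this could be called as assert_same_bytes(exp_directory, directory). Hex output keeps the failure message printable even when the bitset contains arbitrary bytes.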


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@325
PS16, Line 325: column index
bloom filter?




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Jul 2021 11:28:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8645/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 27 Apr 2021 13:26:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
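
The option values above can be read as a small decision table. The following Python sketch is only an illustration of that reading; the function name, its parameters and the string representation of the option values are assumptions, and the actual writer is implemented in C++.

```python
def should_write_bloom_filter(write_option, column_in_tbl_props, fully_dict_encoded):
    """Return True if a Bloom filter would be written for a column's row group.

    write_option:        'NEVER', 'IF_NO_DICT' or 'ALWAYS' (query option value).
    column_in_tbl_props: whether the column is listed in the table property.
    fully_dict_encoded:  whether the row group stayed fully dictionary encoded.
    """
    if write_option not in ("NEVER", "IF_NO_DICT", "ALWAYS"):
        raise ValueError("unknown option value: {}".format(write_option))
    if write_option == "NEVER" or not column_in_tbl_props:
        return False
    if write_option == "ALWAYS":
        return True
    # IF_NO_DICT: only write when dictionary encoding was abandoned.
    return not fully_dict_encoded
```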

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".
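
To illustrate the property format, here is a minimal Python sketch of a parser for it. It is not the parser from the patch (which lives in the Java frontend); the function name is hypothetical, the default size is passed in by the caller rather than hard-coded because the value of ParquetBloomFilter::MAX_BYTES is not stated here, and the tolerance for surrounding whitespace and empty entries is an assumption.

```python
def parse_bloom_filter_columns(prop_value, default_bytes):
    """Parse 'col_name:bytes' pairs into a dict of column name -> bitset size."""
    result = {}
    for entry in prop_value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # assumption: silently skip empty entries
        name, sep, size = entry.partition(":")
        # The size is optional; fall back to the default when it is missing.
        result[name.strip()] = int(size) if sep else default_bytes
    return result
```

For the example value above with a default of 4096, this yields 1024 for col1, 4096 for col2 and 100 for col4. A malformed size such as 'col1:abc' raises ValueError in this sketch.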

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 717 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/12

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 12
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 16:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8997/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 16
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 24 Jun 2021 08:34:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 26: Code-Review+2



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 26
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 09:04:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 25:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9081/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 25
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 09:18:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@495
PS3, Line 495:           // Write dictionary keys to Parquet Bloom filter if we haven't been filling it so
line too long (91 > 90)




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Apr 2021 15:12:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#24). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma separated
list of 'col_name:bytes' pairs. The 'bytes' part means the size of the
bitset of the Bloom filter, and is optional. If the size is not given,
it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100".

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 997 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/24

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 24
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 26:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7293/ DRY_RUN=false



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 26
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 09:04:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 24: Code-Review+2



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 24
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 22:41:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 24: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7291/



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 24
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 04:47:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 22:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9073/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 22
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 16:14:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 14:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8938/ : Initial code review checks failed. See linked job for details on the failure.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 14
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 16 Jun 2021 14:30:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 21:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7283/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 21
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Jul 2021 12:35:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

Introduced the 'parquet.bloom.filter.columns' table property. It is a
comma-separated list of 'col_name:bytes' pairs. The 'bytes' part is the
size of the Bloom filter's bitset in bytes and is optional. If the size
is not given, the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.
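
A minimal Python sketch of how a property value in this format could be
parsed, for illustration only. It is not the parser added by this
change. The function name and ASSUMED_MAX_BYTES are hypothetical, and
the constant is a placeholder rather than the real value of
ParquetBloomFilter::MAX_BYTES.

```python
# Placeholder for ParquetBloomFilter::MAX_BYTES; the real value is not
# stated in this message, so this number is an assumption.
ASSUMED_MAX_BYTES = 128 * 1024 * 1024


def parse_bloom_filter_columns(prop, max_bytes=ASSUMED_MAX_BYTES):
    """Parse 'col_name[:bytes]' entries separated by commas.

    Returns a dict mapping column name to bitset size in bytes.
    Columns without an explicit size get max_bytes.
    """
    sizes = {}
    for entry in prop.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty entries such as a trailing comma
        name, sep, size = entry.partition(":")
        name = name.strip()
        if not name:
            raise ValueError("missing column name in entry: %r" % entry)
        # int() raises ValueError for a missing or non-numeric size.
        sizes[name] = int(size) if sep else max_bytes
    return sizes


print(parse_bloom_filter_columns("col1:1024,col2,col4:100"))
```

With the example value above, col1 and col4 get their explicit sizes
and col2 falls back to the maximum.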

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - TODO: Test falling back from dict encoding to plain and using Bloom
    filters.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 584 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/7
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17262/7/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/7/be/src/exec/parquet/hdfs-parquet-table-writer.cc@453
PS7, Line 453:       parquet_bloom_filter_bytes_ = parent->parquet_bloom_filter_col_sizes_[column_name()];
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 26 Apr 2021 17:45:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 24:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9078/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 24
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 20:03:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  TBL_PROPS  - write Parquet Bloom filters as set in table properties
  IF_NO_DICT - write Parquet Bloom filters if the row group is not
               fully dictionary encoded
  ALWAYS     - always write Parquet Bloom filters, even if the row
               group is fully dictionary encoded

TODO: Implement table properties involving Parquet Bloom filters.

TODO: Decide size of Parquet Bloom filter based on NDV heuristics or
configuration.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M tests/query_test/test_parquet_bloom_filter.py
16 files changed, 486 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 21: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 21
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Jul 2021 12:35:00 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8630/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 23 Apr 2021 15:03:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8588/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 4
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 15 Apr 2021 13:18:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded
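
The interaction between the query option and the table property can be
summarised by the following Python sketch. It only restates the rules
listed above. The function and parameter names are hypothetical and do
not correspond to the code in hdfs-parquet-table-writer.cc.

```python
def should_write_bloom_filter(option, column_in_tbl_props, fully_dict_encoded):
    """Decide whether to write a Bloom filter for a column in a row group.

    option: value of 'parquet_bloom_filter_write'
            ("NEVER", "IF_NO_DICT" or "ALWAYS").
    column_in_tbl_props: True if the column is listed in
            'parquet.bloom.filter.columns'.
    fully_dict_encoded: True if the row group is fully dictionary
            encoded for this column.
    """
    if option not in ("NEVER", "IF_NO_DICT", "ALWAYS"):
        raise ValueError("unknown option value: %r" % option)
    # Both remaining modes require the column to be listed in the
    # table property.
    if option == "NEVER" or not column_in_tbl_props:
        return False
    if option == "IF_NO_DICT":
        return not fully_dict_encoded
    return True  # ALWAYS


for opt in ("NEVER", "IF_NO_DICT", "ALWAYS"):
    for dict_encoded in (False, True):
        print(opt, dict_encoded, should_write_bloom_filter(opt, True, dict_encoded))
```

Note that under these rules no mode writes a Bloom filter for a column
that is absent from the table property.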

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is the size of the
Bloom filter's bitset in bytes and is optional. If the size is not
given, the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Reviewed-on: http://gerrit.cloudera.org:8080/17262
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Csaba Ringhofer <cs...@cloudera.com>
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 997 insertions(+), 82 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved
  Csaba Ringhofer: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 27
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8549/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Apr 2021 15:32:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8845/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 12
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Comment-Date: Fri, 04 Jun 2021 10:55:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded (the number of distinct values exceeds
               the maximum dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is the size of the
Bloom filter's bitset in bytes and is optional. If the size is not
given, the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether
    the Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back
    from dict encoding to plain when Bloom filters are NOT enabled.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
A be/src/util/parquet-bloom-filter-avx2.cc
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
A fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
M tests/query_test/test_parquet_bloom_filter.py
23 files changed, 992 insertions(+), 83 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/20
-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 20
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )

Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................


Patch Set 26: Verified+1 Code-Review+2

The only failed test seems to be a flaky test: IMPALA-10754.
That issue is marked as resolved, but there have been other occurrences since the supposed fix.


-- 
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 26
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Jul 2021 12:24:22 +0000
Gerrit-HasComments: No