You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2023/05/17 17:21:19 UTC

[Impala-ASF-CR] IMPALA-8253: Parquet delta encoding and decoding.

Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12621 )

Change subject: IMPALA-8253: Parquet delta encoding and decoding.
......................................................................


Patch Set 19:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/benchmarks/parquet-delta-benchmark.cc
File be/src/benchmarks/parquet-delta-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/benchmarks/parquet-delta-benchmark.cc@1060
PS19, Line 1060:   const std::vector<int> strides = {4, 8, 12, 16, 20, 30, 40, 50, 80, 100, 120, 150, 180, 200, 400};
nit: long line


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/benchmarks/parquet-delta-benchmark.cc@1062
PS19, Line 1062:   /// This is used to tune which configurations should be measured.
               :   auto filter = [] (const Config& config) -> bool {
               :     return true
               :         && config.parquet_type == Config::INT32
               :         && config.out_type == Config::int32
               :         // && (int) config.parquet_type == (int) config.out_type
               :         && config.access == Config::BATCH
               :         // && config.encoding == Config::PLAIN && config.mean_delta == 1
               :         && (config.stride == 4 /* || config.stride == 8 || config.stride == 20 || config.stride == 100 || config.stride == 400 */)
               :         && config.mean_delta == 1
               :         ;
What is the long term intention with this filter?
It would be nice to find 8-16 combinations to run and insert the results to the beginning of this file


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-decoder.cc
File be/src/exec/parquet/parquet-delta-decoder.cc:

http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-decoder.cc@181
PS19, Line 181:     num_buffered_values_ - next_buffered_value_index_;
nit: +2 indentation


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-decoder.cc@298
PS19, Line 298: template int ParquetDeltaDecoder<int32_t>::NextValuesConverted<int8_t>(int num_values,
> I'm not sure it is a good idea to explicitly instantiate NextValuesConverte
We may be able to remove most int64_t ones, see my comment in bit-packing.cc


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.h
File be/src/exec/parquet/parquet-delta-encoder.h:

http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.h@54
PS19, Line 54:     static constexpr int MAX_TOTAL_VALUE_COUNT = 16000;
Note that there is a query option for setting max row count in page: parquet_page_row_count_limit


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.cc
File be/src/exec/parquet/parquet-delta-encoder.cc:

http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.cc@117
PS19, Line 117:   std::memmove(new_data_start_address, old_data_start_address, data_len);
Can you skip this if reserved_space_for_header_ == actual_header_size? Even if memmove would optimize this case I think that the code would be clearer.

Alternatively we could avoid the copy and point to a bit later point, similarly to Arrow


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/exec/parquet/parquet-delta-encoder.cc@289
PS19, Line 289:       miniblock_index * miniblock_size_in_values_;
nit: indentation


http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/util/bit-packing.cc
File be/src/util/bit-packing.cc:

http://gerrit.cloudera.org:8080/#/c/12621/19/be/src/util/bit-packing.cc@75
PS19, Line 75: INSTANTIATE_UNPACK_AND_DELTA_DECODE(int8_t, int64_t);
             : INSTANTIATE_UNPACK_AND_DELTA_DECODE(int16_t, int64_t);
             : INSTANTIATE_UNPACK_AND_DELTA_DECODE(int32_t, int64_t);
Do we need these combinations?
We only seem to support BIGINT with int64 Parquet columns:
https://github.com/apache/impala/blob/cf28a4c5292fdb3504d1fe11183c78243ed148a4/be/src/exec/parquet/parquet-metadata-utils.cc#L54



-- 
To view, visit http://gerrit.cloudera.org:8080/12621
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie7378ac1a490a6c89a0a4349aae86cbc0fbc80f8
Gerrit-Change-Number: 12621
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 May 2023 17:21:19 +0000
Gerrit-HasComments: Yes