You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2019/04/03 15:22:20 UTC
[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 8: Code-Review+1

(12 comments)

I have only minor comments, mainly about naming and line breaking.

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119
PS7, Line 119:   num_stats_filtered_row_groups_by_page_index_counter_ =
> Good idea! Added counter for it.
I am not sure how to track this, but it is also possible that some columns have page indexes, while some do not. For example Parquet-mr does not write stats for int96 timestamps, and I think that it is also not written for double/float if there is a NaN value.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117
PS8, Line 117: NumStatsFilteredRowGroups
This name is not precise anymore, as it only means row groups filtered by column index. I am not sure how we should handle this, as keeping the old name is probably useful for compatibility reasons. A possibility is to count both cases in this stat, and have two separate stats for the two reasons.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@120
PS8, Line 120: NumStatsFilteredRowGroupsByPageIndex
Maybe "completely" could be added to the name to make its meaning more exact. The fact that row groups that survived column chunk stats can still be completely eliminated by page filtering may not be obvious for the reader at first, and the current name could be interpreted as the number of row groups where page filtering was applied.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@517
PS8, Line 517:     const parquet::ColumnOrder* col_order = col_idx < col_orders.size() ?
             :         &col_orders[col_idx] : nullptr;
nit: I prefer breaking after =


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@525
PS8, Line 525:     ColumnStatsReader stat_reader = CreateColumnStatsReader(col_chunk, col_type,
nit: I prefer breaking after =


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@752
PS8, Line 752: =
nit: I prefer breaking after =


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@755
PS8, Line 755: col_order, *node->element);
nit: I prefer breaking after =


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@773
PS8, Line 773:         skip_ranges.push_back(GetRowRangeForPage(row_group, scalar_reader->offset_index_,
nit: I would prefer breaking before GetRowRangeForPage


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h@503
PS8, Line 503: IsLastCandidateRange
I think that adding "InRowGroup" to the name would make the call site clearer.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h@526
PS8, Line 526: CandidateRowsRemainingInCurrentPage
The name mislead me a bit, I have assumed that it returns an integer. Maybe something like PageHasRemainingCandidateRows would be better?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc@1732
PS8, Line 1732:   num_values_read_ -= num_buffered_values_;
This seems to be the only place where num_values_read_ is decreased, so it looks like

num_values_read_ = total_number_of_values_in_row_group - number_of_values_in_completely_skipped_pages - number_of_values_skipped_at_the_end_of_pages.

It seems that we only use this variable in cases that are irrelevant with PageFiltering(), so I would remove this subtraction, and add some comments to  num_values_read_ about its role if there is no page filtering, and its irrelevance if there is page filtering.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@329
PS8, Line 329:     str_len += sizeof(int32_t);
             :     if (UNLIKELY(str_len < 0 || buffer_end - buffer < str_len)) return -1;
I am not sure if this matters but str_len of -4 is accepted as OK.



-- 
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Apr 2019 15:22:20 +0000
Gerrit-HasComments: Yes