Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2018/04/12 00:23:08 UTC

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 10:

(9 comments)

Overall this is looking good. I have specific concerns about a few of the nitty-gritty details.

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301:   std::vector<std::string> min_values_;
I'm still concerned about the amount of untracked memory from min_values_ and max_values_, even if we truncate the string values to 1KB or similar - it seems like we could end up with multiple MB of untracked memory. We could probably live with it since it's smaller than the actual data, but it's a step in the wrong direction.

Maybe we could store min_values_ and max_values_ as StringValues backed by memory from per_file_mem_pool_ and then only convert to strings when writing out each column to the page index?
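
To put a rough number on the untracked-memory concern above, here is a back-of-the-envelope sketch. The function name and the column and page counts are hypothetical, chosen only for illustration. It counts only the value bytes and ignores any per-std::string overhead, so it is a rough worst case for one file and not a measurement of the patch.

```python
def untracked_stats_bytes(num_columns, pages_per_column, max_value_len):
    """Worst-case bytes held by per-page min and max strings for one file."""
    values_per_page = 2  # one min value and one max value per page
    return num_columns * pages_per_column * values_per_page * max_value_len


# Hypothetical file: 50 string columns, 100 pages each, values truncated to 1KB.
worst_case = untracked_stats_bytes(50, 100, 1024)
print(f"{worst_case / (1024 * 1024):.1f} MiB")
```

With these assumed numbers the total is 10,240,000 bytes, i.e. just under 10 MiB per file being written.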


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735
PS10, Line 735:     min_values_.push_back(std::string(""));
I don't know if we need the call to std::string() here. I think it should work if we just call emplace_back() with no arguments to construct an empty string in place.


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227
PS10, Line 1227:   for (auto& column : columns_) {
nit: can fit loop on one line.


http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159
PS10, Line 159:   // If true, min/max values are ascending.
Maybe briefly mention why they both start off true, and that both can be true at the same time? It's slightly subtle.
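
A minimal sketch of the subtlety, written in Python for brevity and not taken from the patch: PageOrderTracker and its fields are hypothetical names. The assumption is that each flag means "no page so far has violated this ordering", which is why both start as true and why both stay true when there is a single page or when every page has the same min/max.

```python
class PageOrderTracker:
    """Tracks whether per-page min/max values are ascending and/or descending."""

    def __init__(self):
        # With zero or one page seen, neither ordering has been violated yet,
        # so both flags start off true.
        self.ascending = True
        self.descending = True
        self._prev_min = None
        self._prev_max = None

    def add_page(self, page_min, page_max):
        if self._prev_min is not None:
            # A decrease in either the min or the max breaks "ascending".
            if page_min < self._prev_min or page_max < self._prev_max:
                self.ascending = False
            # An increase in either the min or the max breaks "descending".
            if page_min > self._prev_min or page_max > self._prev_max:
                self.descending = False
        self._prev_min, self._prev_max = page_min, page_max


tracker = PageOrderTracker()
tracker.add_page(1, 5)
tracker.add_page(1, 5)   # equal stats: still both ascending and descending
tracker.add_page(2, 6)   # now only ascending
```

Equal neighbouring values violate neither ordering, so a column with constant page stats ends up with both flags true.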


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37
PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
We've got a lot of good coverage in this test.

I'm wondering if we're missing some basic tests that confirm that the values in each page match the min/max values in the page index. The existing validations might not catch some kinds of bugs, e.g. min/max values in the index that are somehow out of sync with the pages. Most bugs that I can imagine would get caught by one validation or another, but it would be nice to have a sanity test where we confirm that the values in each page match the values in the page index.
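
Roughly what I have in mind for the sanity test, as a sketch only. check_pages_match_index and its arguments are hypothetical; it assumes the test can already decode the non-null values of each data page and read the per-page min/max lists from the page index, neither of which is shown here. The exact flag is an assumption to cope with truncated string stats, where the index only has to bound the page values.

```python
def check_pages_match_index(pages, index_mins, index_maxs, exact=True):
    """Return a list of (page_number, actual, expected) for mismatching pages.

    pages: one list of decoded non-null values per data page.
    index_mins / index_maxs: per-page min and max values from the page index.
    exact: if False, only require that the index values bound the page values.
    """
    if not (len(pages) == len(index_mins) == len(index_maxs)):
        raise ValueError("page count differs from page index entry count")

    mismatches = []
    for page_number, values in enumerate(pages):
        if not values:
            continue  # all-null page: nothing to compare in this sketch
        actual = (min(values), max(values))
        expected = (index_mins[page_number], index_maxs[page_number])
        if exact:
            ok = actual == expected
        else:
            ok = expected[0] <= actual[0] and actual[1] <= expected[1]
        if not ok:
            mismatches.append((page_number, actual, expected))
    return mismatches
```

The test would then assert that the returned list is empty for every column of every file written.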


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177
PS10, Line 177: previouse_value
typo in variable name: should be previous_value


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: falied
nit: failed


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: column_info_schema
this variable isn't defined - did you mean column_info?


http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244
PS10, Line 244: chars_formats
chars_formats is weird in that it's created by a different test - TestCharsFormats. I.e. it's not present unless that test ran before this one. Maybe we should change it so that the table is loaded during normal data loading?



-- 
To view, visit http://gerrit.cloudera.org:8080/9693

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Apr 2018 00:23:08 +0000
Gerrit-HasComments: Yes