You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org> on 2017/06/02 17:46:11 UTC

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics in the parquet table writer

Pooja Nilangekar has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7058

Change subject: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer

The null_count in the statistics field is updated each time a null
value is encountered by paquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
6 files changed, 209 insertions(+), 97 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#3).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by paquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
6 files changed, 180 insertions(+), 100 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics in the parquet table writer

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#2).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer

The null_count in the statistics field is updated each time a null
value is encountered by paquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
6 files changed, 207 insertions(+), 96 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7058/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 553:       // We need to get min stats.
> Where did "to get" go?
Done


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 197: CK(page_stats_ba
> I think changing this to a DCHECK is preferable, since it always has to be 
Done


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 162:   void Update(const T& min_value, const T& max_value);
> My idea was to keep this line, but change it to
Done


http://gerrit.cloudera.org:8080/#/c/7058/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS6, Line 60:  w
> nit: double space
Done


PS6, Line 99: appended 
> nit: typo
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#8).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by parquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
4 files changed, 129 insertions(+), 119 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7058

to look at the new patch set (#9).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by parquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
4 files changed, 134 insertions(+), 126 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 8: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 4:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/7058/4//COMMIT_MSG
Commit Message:

Line 10: value is encountered by paquet table writer. The value is written
nit typo


http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 197:     if(page_stats_base_ != nullptr) {
nit: single line


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 79:   static bool ReadCountFromThrift(
> If the return type is int64, what would the return value be in case the sca
For frequently called functions it's often better to return a bool instead of the more heavy-weight Status. In this case you would probably pass an int64_t* to return the value.

Let's remove the method here altogether and re-introduce it in a change that handles reading the stats.


http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS4, Line 63: NULL_COUNT
remove


PS4, Line 105: row
column?


PS4, Line 105: Initilizes
typo


PS4, Line 116: NULL
nit: lowercase


http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 32: inline void ColumnStatsBase::UpdateNullCount() {
It looks like this will fit into one line in the header. If so, please move it there.


Line 186:   if (cs->has_values_) {
You could also change the interface of Update() to take a min and max parameter and then clean up this method by calling it. You don't have to do it in this change though if you think it's too much.


Line 205:   null_count_ += cs->null_count_;
If you add a "int64_t count" parameter to the UpdateNullCount method, you can call it from here.


http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 427:         ColumnStats('bigint_col', 0, 90, 0),
> Done
Cool :)


http://gerrit.cloudera.org:8080/#/c/7058/4/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 324:       null_count = None
How about moving this to L330, initialize it from stats.null_count and then check it's not None?


Line 480:     """Test that we don't write min/max statistics for null columns. Ensure null_count 
Please remove the trailing whitespace. You can also set-up git in a way that it warns about trailing whitespace during commit time. Even better, you may be able to configure your editor to highlight it for you.


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#6).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by parquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
5 files changed, 135 insertions(+), 124 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 7:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7058/7/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS7, Line 474: Update null_count in page_stat_
page_stat_ doesn't exist. I think it would be easier to understand if you just inline ProcessNullValue here and update the comment accordingly. The extra level of indirection doesn't seem to help much with the readability.


http://gerrit.cloudera.org:8080/#/c/7058/7/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS7, Line 159: value
nit values


PS7, Line 160: values
to either value until...


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 5:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 961:           Status err(Substitute("Corrupt Parquet file '$0': column '$1' "
If possible, please don't rebase changes while they are under review. Otherwise the rebase can clutter the view when comparing recent change sets.


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 197: page_stats_base_
can this ever be nullptr?


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

Line 58:       DCHECK(false) << "Unsupported statistics value field requested";
undo


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 60:   /// Enum to select the statistics field to be read. Values do not
nit: Please undo this change, too.


Line 72:   static bool ReadValueFromThrift(const parquet::ColumnChunk& col_chunk,
Undo


Line 100:   void UpdateNullCount(int64_t count = 1) { null_count_ += count; }
Update sounds like it will set it to a new value. Maybe "AddToNullCount" or "IncrementNullCount" would be better names. When calling this, IncrementNullCount(1) may be more explicit and thus more clear. Then you could drop the default value.


Line 162:   void Update(const T& v);
You can remove the implementation of Update(const T& v) and instead put a single line "Update(v, v);" here.


Line 163:   void Update(const T& min_value, const T& max_value);
Please add a comment to this method, too.


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 61:     Update(cs->min_value_, cs->max_value_);
nit: This could go on a single line now


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics in the parquet table writer

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer
......................................................................


Patch Set 2:

(12 comments)

Thank you for working on this!

http://gerrit.cloudera.org:8080/#/c/7058/2//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer
Nit: Make the first line of the commit message fit in 80 chars. You can end it after "statistics"


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 553:       // We need to get min_value.
Let's only do the write path in this change and then address the read path in a subsequent change.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS2, Line 198: {}
Do you need the default implementation here? Without it, you could also remove the "Implemented in.." comment, since then it'd be clear.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 79:   static bool ReadCountFromThrift(
I think it will be easier to return an int64 here. However we should do the read path in a later change altogether.


Line 115:   bool has_null_count_;
Why do we need has_null_count_ instead of just always counting them? Are there types for which we don't know the number of nulls we wrote?


Line 170:   void UpdateNullCount();
Can this go into the base class? It doesn't depend on T.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 50:   null_count_++;
nit: we use prefix operators, mostly for consistency, but sometimes for performance reasons.


http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 323:       if ((stats.min_value is None and stats.max_value is None) 
With the additional checks below, is this line still needed? What is its intent?

Please remove the trailing whitespace


Line 328:       min_value = max_value = null_count = None
You should have one assignment per line


Line 427:         ColumnStats('id', 0, 7299, None),
Why are these None and not 0? I think we should count 0s even if there are none.


PS2, Line 607: 	
Replace tabs with spaces, here and elsewhere


Line 620: 									"functional_parquet.zipcode_incomes", expected_min_max_values)
The formatting here seems off


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 10:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/736/

-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 961:           Status err(Substitute("Corrupt Parquet file '$0': column '$1' "
> Sorry about that! I won't do it henceforth.
No worries :)


http://gerrit.cloudera.org:8080/#/c/7058/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 553:       // We need min stats.
Where did "to get" go?


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 197: page_stats_base_
> I am not sure, I noticed Line #688 does a DCHECK. Should I change it to a D
I think changing this to a DCHECK is preferable, since it always has to be set.


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 162:   void Update(const T& min_value, const T& max_value);
> Done
My idea was to keep this line, but change it to

void Update(const T& v) { Update(v, v); }

and remove the implementation from the .cc file.


http://gerrit.cloudera.org:8080/#/c/7058/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS6, Line 60:   
nit: double space


PS6, Line 99: appeneded
nit: typo


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7058/8/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS8, Line 96: pdate the 
> is that dead now? if so, how about we delete it?
Done


Line 107: 
> it looks like has_value_ doesn't indicate whether null_count_ is valid or n
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 8:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7058/7/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS7, Line 474: 
> page_stat_ doesn't exist. I think it would be easier to understand if you j
Done


http://gerrit.cloudera.org:8080/#/c/7058/7/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS7, Line 159: value
> nit values
Done


PS7, Line 160: either
> to either value until...
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 4:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7058/2//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5061: Populate null_count in parquet::statistics
> Nit: Make the first line of the commit message fit in 80 chars. You can end
Done


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 553:       // We need to get min_value.
> Let's only do the write path in this change and then address the read path 
Sure, I haven't made a call to read the null_count anywhere. Just defined a function which could be used in the future.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS2, Line 198: Co
> Do you need the default implementation here? Without it, you could also rem
Done


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 79:   static bool ReadCountFromThrift(
> I think it will be easier to return an int64 here. However we should do the
If the return type is int64, what would the return value be in case the scanner is reading a parquet file which does not have it's stats populated? Nos sure if '-1' is consistent with the rest of the code. The other option would be to return a Status object. I was trying to make the function consistent with ReadValueFromThrift. What would you suggest?


Line 115: 
> Why do we need has_null_count_ instead of just always counting them? Are th
Done


Line 170:   virtual void Merge(const ColumnStatsBase& other) override;
> Can this go into the base class? It doesn't depend on T.
Done


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 50:   DCHECK(dynamic_cast<const ColumnStats<T>*>(&other));
> nit: we use prefix operators, mostly for consistency, but sometimes for per
Done


http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 323:       max_value = None
> With the additional checks below, is this line still needed? What is its in
Done


Line 328:         max_value = decode_stats_value(schema, stats.max_value)
> You should have one assignment per line
Done


Line 427:         ColumnStats('bigint_col', 0, 90, 0),
> Why are these None and not 0? I think we should count 0s even if there are 
Done


PS2, Line 607:  
> Replace tabs with spaces, here and elsewhere
Done


Line 620
> The formatting here seems off
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 5:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7058/4//COMMIT_MSG
Commit Message:

Line 10: value is encountered by parquet table writer. The value is written
> nit typo
Done


http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 197:     if (page_stats_base_ != nullptr) page_stats_base_->UpdateNullCount();
> nit: single line
Done


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 79: 
> For frequently called functions it's often better to return a bool instead 
Done


http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS4, Line 63: ;
> remove
Done


PS4, Line 105: ing
> column?
Done


PS4, Line 116: r'. 
> nit: lowercase
Done


http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 32: template <typename T>
> It looks like this will fit into one line in the header. If so, please move
Done


Line 186:     has_values_ = true;
> You could also change the interface of Update() to take a min and max param
Done


Line 205: template <>
> If you add a "int64_t count" parameter to the UpdateNullCount method, you c
Done


http://gerrit.cloudera.org:8080/#/c/7058/4/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 324: 
> How about moving this to L330, initialize it from stats.null_count and then
Done


Line 480:     is set for columns with null values."""
> Please remove the trailing whitespace. You can also set-up git in a way tha
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by parquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Reviewed-on: http://gerrit.cloudera.org:8080/7058
Reviewed-by: Lars Volker <lv...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
4 files changed, 134 insertions(+), 126 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Lars Volker: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 6:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 961:           Status err(Substitute("Corrupt Parquet file '$0': column '$1' "
> If possible, please don't rebase changes while they are under review. Other
Sorry about that! I won't do it henceforth.


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 197: page_stats_base_
> can this ever be nullptr?
I am not sure, I noticed Line #688 does a DCHECK. Should I change it to a DCHECK or entirely remove this line?


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.cc
File be/src/exec/parquet-column-stats.cc:

Line 58:       DCHECK(false) << "Unsupported statistics field requested";
> undo
Done


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 60:   /// Enum to select  whether to read minimum or maximum statistics. Values do not
> nit: Please undo this change, too.
Done


Line 72:   static bool ReadFromThrift(const parquet::ColumnChunk& col_chunk,
> Undo
Done


Line 100:   void IncrementNullCount(int64_t count) { null_count_ += count; }
> Update sounds like it will set it to a new value. Maybe "AddToNullCount" or
Done. Changed it to IncrementNullCount(int64_t count)


Line 162:   void Update(const T& min_value, const T& max_value);
> You can remove the implementation of Update(const T& v) and instead put a s
Done


Line 163: 
> Please add a comment to this method, too.
Done


http://gerrit.cloudera.org:8080/#/c/7058/5/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 61:     std::string min_str;
> nit: This could go on a single line now
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#5).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by parquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
6 files changed, 144 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 9:

Pooja, please rebase on latest master and push the rebased diff and then we can kick off the gvo.

-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 10: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#4).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by paquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
6 files changed, 180 insertions(+), 100 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#7).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by parquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
4 files changed, 134 insertions(+), 119 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Pooja Nilangekar (Code Review)" <ge...@cloudera.org>.
Pooja Nilangekar has uploaded a new patch set (#4).

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................

IMPALA-5061: Populate null_count in parquet::statistics

The null_count in the statistics field is updated each time a null
value is encountered by paquet table writer. The value is written
to the parquet header if it has one or more null values in the
row_group.

Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.

Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M tests/query_test/test_insert_parquet.py
6 files changed, 180 insertions(+), 100 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/7058/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7058/8/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS8, Line 96: has_values
is that dead now? if so, how about we delete it?


Line 107:   /// Stores whether the current object has been initialized with a set of values.
it looks like has_value_ doesn't indicate whether null_count_ is valid or not, right? Let's clarify that, maybe say: ... with min and max values.


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5061: Populate null count in parquet::statistics

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics
......................................................................


Patch Set 10: Code-Review+2

Carrying Dan's +2

-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <po...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <po...@cloudera.com>
Gerrit-HasComments: No