You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tianyi Wang (Code Review)" <ge...@cloudera.org> on 2017/08/23 04:40:58 UTC

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Tianyi Wang has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7776

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch addes collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
> select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items couting into
collection_items_read_counter. Both counters are updated every row batch
read. In the query described above, scanning time is decreased by 10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
6 files changed, 77 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7776/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 478:   /// used to reduce the frequency updating HdfsScanNode counter. It is updated by the
> ... the frequency of updating the corresponding HdfsScanNode counter.
Done


Line 479:   /// callees of AssembleRows() and is merged into HdfsScanNode counter at the end of
> into the HdfsScanNode counter
Done


Line 554:   /// allocated from 'coll_value_builder'. Increases 'this->coll_items_read_counter_' by
> this-> is not necessary because our "_" suffix already indicates this is a 
Done


http://gerrit.cloudera.org:8080/#/c/7776/8/be/src/exec/scan-node.h
File be/src/exec/scan-node.h:

Line 153:   /// # nested collection items read from the scanner. For example, for schema
> # items the scanner read into CollectionValues. For example, ...
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS5, Line 963: int64_t num_rows_read = 
> You could make that argument for all parameters, but it's not a good patter
Done. I agree that non-obvious side-effects are worse than code redundancy. Thanks for the explanation.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS5, Line 963: coll_items_read_counter_
> does that need to be a field in the object, given that it's reset ever time
If we pass it through function calls about 10 function signatures would be affected. Patch 1 would be what it looks like. Is there a better way to do it?


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7776

to look at the new patch set (#7).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 34 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 2:

(3 comments)

Few more small typos

http://gerrit.cloudera.org:8080/#/c/7776/2//COMMIT_MSG
Commit Message:

PS2, Line 9: addes
typo


PS2, Line 26: couting
typo


PS2, Line 27: every
for every...


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7776

to look at the new patch set (#6).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 34 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS5, Line 963: coll_items_read_counter_
> If we pass it through function calls about 10 function signatures would be 
You could make that argument for all parameters, but it's not a good pattern to follow in general.

That said, given the way our counters currently work, I'm okay with having this thread-local counter live in the scanner object. But I think it would help clarify it if we follow a slightly different pattern to accumulate the thread-local counter in the the global counter:

COUNTER_ADD(global_counter, thread_local_counter_);
thread_local_counter_ = 0; // Was accounted into global_counter.

i.e. move this line down to after line 1010.  That way it's clearer that no counts are ever lost or double counted, etc.

It would also be nice to explain concisely why we have this thread local counter in its header comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1210:     CollectionValueBuilder* coll_value_builder, int64_t* num_coll_item_read) {
> Can you remove these from the function signatures (see my comment in the he
Done


Line 1277:   (*num_coll_item_read) += rows_read;
> are the () needed here?
No. Code removed.


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

PS2, Line 505: 
> removing the virtual keyword breaks consistency with the other methods in t
Reverted.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/5/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS5, Line 963: coll_items_read_counter_
does that need to be a field in the object, given that it's reset ever time through?  The counter code is already over complicated, but whatever we can do to not add complexity would be good.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 6: Code-Review+1

(1 comment)

Alex, could you do the +2 after checking the new counter semantics make sense to you.

http://gerrit.cloudera.org:8080/#/c/7776/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS6, Line 962: rows and read
not sure what that means to say


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Reviewed-on: http://gerrit.cloudera.org:8080/7776
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 33 insertions(+), 12 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Alex Behm: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 7: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 962:   // Top-level rows read in this call.
thanks. i think it would also be okay to remove the comment since the variable naming and the fact that it's a local variable makes it obvious.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 964:   auto update_read_count = MakeScopeExitTrigger([&] {
> Done. If RETURN_IF_ERROR or num_tuples_mismatch branch is taken the value o
Yeah, it changes the semantics to something like "Number of successfully committed items/rows" but I think that's ok, given it keeps the code simpler.


PS2, Line 978: continue_execution
> Reverted to original code. But is it good practice to add a dead initializa
You're right, I think it's better to move it into the loop here. If someone tries to use it without initializing it, the compiler would issue a warning, too. Apologies for the back and forth.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1008:     num_rows_read += scratch_batch_->num_tuples;
You could use COUNTER_SET here and get rid of num_rows_read altogether.


PS3, Line 1009:   int num_r
How about removing the initialization at the beginning of the method and using COUNTER_SET here instead? This way it is also more in line with the variable comment in the .h file.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 470:   /// Number of scanners that end up doing no reads because their splits don't overlap
Can you describe the scope of this in the comment, e.g. "Total number of collection items read by this scanner."

If you switch the scope to the scanner's lifetime you should also initialize it.


PS3, Line 551: d of the collection has 
coll_items_read_counter_.


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS2, Line 242: /*num_coll_item_read*/
> The parameter doesn't exist now. Can we add that to the style guide wiki pa
Sure, feel free to add it here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65868536

I think it could also pass as "consistency with the surrounding code", though that's harder to spot.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/scan-node.h
File be/src/exec/scan-node.h:

Line 153:   /// # collection items read from the scanner
Can you explain here that this is across nested collections, so that A = {B, C} would actually result in three items being read?


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has uploaded a new patch set (#3).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 26 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 9: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has uploaded a new patch set (#2).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch addes collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
> select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items couting into
collection_items_read_counter. Both counters are updated every row batch
read. In the query described above, scanning time is decreased by 10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
6 files changed, 77 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7776/2//COMMIT_MSG
Commit Message:

PS2, Line 9: addes
> typo
Done


PS2, Line 26: couting
> typo
Done


PS2, Line 27: every
> for every...
Done


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 963:   int64_t num_rows_read = 0, num_coll_items_read = 0;
> We usually limit declarations to one per line per our style guide.
Done


Line 964:   auto update_read_count = MakeScopeExitTrigger([&] {
> I find that this construct makes the code harder to reason about. Now someo
Done. If RETURN_IF_ERROR or num_tuples_mismatch branch is taken the value of counter would be different, but it could be OK since counter is for profiling purpose.


PS2, Line 978: continue_execution
> initialize
Reverted to original code. But is it good practice to add a dead initialization?


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 475:   RuntimeProfile::Counter* num_dict_filtered_row_groups_counter_;
> Can you do the counting using a member in the scanner instead of passing it
Done. Much better.


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS2, Line 242: /*num_coll_item_read*/
> We usually still name parameters the same, even if some methods don't use t
The parameter doesn't exist now. Can we add that to the style guide wiki page as a difference from google C++ style guide?


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

PS2, Line 546: 4
> ?
Will fix.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 3:

(1 comment)

> (3 comments)
 > 
 > Only a comment left.

Done

http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 470:   /// Number of collection items read
> Ah, I see. I think I misunderstood the scope. It actually looks good to me.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1194/

-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7776

to look at the new patch set (#9).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 33 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 8:

(4 comments)

lgtm after minor comment fixes

http://gerrit.cloudera.org:8080/#/c/7776/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 478:   /// used to reduce the frequency updating HdfsScanNode counter. It is updated by the
... the frequency of updating the corresponding HdfsScanNode counter.


Line 479:   /// callees of AssembleRows() and is merged into HdfsScanNode counter at the end of
into the HdfsScanNode counter


Line 554:   /// allocated from 'coll_value_builder'. Increases 'this->coll_items_read_counter_' by
this-> is not necessary because our "_" suffix already indicates this is a member


http://gerrit.cloudera.org:8080/#/c/7776/8/be/src/exec/scan-node.h
File be/src/exec/scan-node.h:

Line 153:   /// # nested collection items read from the scanner. For example, for schema
# items the scanner read into CollectionValues. For example, ...

(The new sentence is more accurate because a query  like "select oid from mydb.customers.orders" does scan a nested collection, but the values are materialized into top-level tuples)


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/7/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 962:   int64_t num_rows_read = 0;
> thanks. i think it would also be okay to remove the comment since the varia
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 964:   while (!column_readers[0]->RowGroupAtEnd()) {
> Yeah, it changes the semantics to something like "Number of successfully co
Done


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1008:   COUNTER_ADD(scan_node_->rows_read_counter(), num_rows_read);
> You could use COUNTER_SET here and get rid of num_rows_read altogether.
I'm not sure what value should be COUNTER_SET here. Could you elaborate?


PS3, Line 1009: COUNTER_ADD
> How about removing the initialization at the beginning of the method and us
Same as above.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 470:   /// Number of collection items read
> Can you describe the scope of this in the comment, e.g. "Total number of co
Updated. In current code it's total number of collection items read in current row batch. Do you suggest let it count throughout the lifetime of this scanner and update the scannode counter when this scanner is cleaned up?


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/scan-node.h
File be/src/exec/scan-node.h:

Line 153:   /// # collection items read from the scanner
> Can you explain here that this is across nested collections, so that A = {B
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has uploaded a new patch set (#5).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 32 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 4:

(3 comments)

Only a comment left.

http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1008:   row_group_rows_read_ += num_rows_read;
> I'm not sure what value should be COUNTER_SET here. Could you elaborate?
See below.


PS3, Line 1009: COUNTER_ADD
> Same as above.
I tried to get away without having to reset the member variable at the beginning of this function, but I think that won't work. Adding a comment in the header to point out that it gets reset here would be good.


http://gerrit.cloudera.org:8080/#/c/7776/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 470:   /// Number of scanners that end up doing no reads because their splits don't overlap
> Updated. In current code it's total number of collection items read in curr
Ah, I see. I think I misunderstood the scope. It actually looks good to me. Can you add to the comment that the variable is reset in AssembleRows()?


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has uploaded a new patch set (#4).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 30 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7776/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS6, Line 962: rows and read
> not sure what that means to say
Typo. Sorry.


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7776

to look at the new patch set (#8).

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................

IMPALA-5210: Count rows and collection items in parquet scanner separately

This patch adds collection_items_read_counter in scan node, makes
rows_read_counter count top-level rows only, and updates these counters
in a less frequent manner.
When scanning nested columns, current code counts both top-level rows
and nested rows in rows_read_counter, which is inconsistent with
rows_returned_counter. Furthermore, rows_read_counter is updated eagerly
whenever a batch of collection items are read. As a result it spends
around 10% time updating the counter with the following simple query:
>select count(*) from
> customer c,
> c.c_orders o,
> o.o_lineitems l
>where
> c_mktsegment = 'BUILDING'
> and o_orderdate < '1995-03-15'
> and l_shipdate > '1995-03-15' and o_orderkey = 10;

This patch moves collection items counting into
collection_items_read_counter. Both counters are updated for every row
batch read. In the query described above, scanning time is decreased by
10.4%.

Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/scan-node.cc
M be/src/exec/scan-node.h
4 files changed, 33 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/7776/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5210: Count rows and collection items in parquet scanner separately

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change.

Change subject: IMPALA-5210: Count rows and collection items in parquet scanner separately
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 963:   int64_t num_rows_read = 0, num_coll_items_read = 0;
We usually limit declarations to one per line per our style guide.


Line 964:   auto update_read_count = MakeScopeExitTrigger([&] {
I find that this construct makes the code harder to reason about. Now someone reading it has to keep in mind that additional code will be executed when the function returns. Can you instead update the counters when CommitRows gets called below? Alternatively you could change the "return" statement in the while loop to a break; and then update the counters before the final return.


PS2, Line 978: continue_execution
initialize


Line 1210:     CollectionValueBuilder* coll_value_builder, int64_t* num_coll_item_read) {
Can you remove these from the function signatures (see my comment in the header).


Line 1277:   (*num_coll_item_read) += rows_read;
are the () needed here?


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 475:   RuntimeProfile::Counter* num_dict_filtered_row_groups_counter_;
Can you do the counting using a member in the scanner instead of passing it through all ReadValue*() functions? That seems to keep these function signatures more clean.


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS2, Line 242: /*num_coll_item_read*/
We usually still name parameters the same, even if some methods don't use them (e.g. the pool parameter in some of these methods).


http://gerrit.cloudera.org:8080/#/c/7776/2/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

PS2, Line 505: 
removing the virtual keyword breaks consistency with the other methods in this class.


PS2, Line 546: 4
?


-- 
To view, visit http://gerrit.cloudera.org:8080/7776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f6efddaea18507482940f5bdab7326b6482b067
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-HasComments: Yes