Posted to reviews@impala.apache.org by "Sahil Takiar (Code Review)" <ge...@cloudera.org> on 2018/10/22 16:26:50 UTC

[Impala-ASF-CR] IMPALA-6964: Track stats about column and page sizes in Parquet reader

Sahil Takiar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11749 )


Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet reader
......................................................................

IMPALA-6964: Track stats about column and page sizes in Parquet reader

Adds the following new stats:

* ParquetCompressedPageSize - a summary (average, min, max) counter that
tracks the size of compressed pages read; if no compressed pages are
read, this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size
of uncompressed pages read; it is updated in two places: (1) when a
compressed page is decompressed, and (2) when a page that is not
compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks the
amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks
the amount of uncompressed data read per column for a scan node
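
As a rough illustration of the two update points for
ParquetUncompressedPageSize listed above, here is a minimal Python sketch.
It is not the C++ code in this change: SummaryStats and on_page_read are
hypothetical names, and zlib stands in for whatever codec a Parquet file
actually uses.

```python
import zlib


class SummaryStats:
    """Hypothetical stand-in for a summary (avg/min/max) counter."""

    def __init__(self):
        self.samples = []

    def update(self, value):
        self.samples.append(value)

    def summary(self):
        # With no samples the counter is empty, so report nothing.
        if not self.samples:
            return None
        return {
            "avg": sum(self.samples) / len(self.samples),
            "min": min(self.samples),
            "max": max(self.samples),
            "num_samples": len(self.samples),
        }


def on_page_read(page_bytes, is_compressed, compressed_sizes, uncompressed_sizes):
    """Record the sizes for one page and return the uncompressed page body."""
    if is_compressed:
        compressed_sizes.update(len(page_bytes))
        data = zlib.decompress(page_bytes)
        # Update point (1): a compressed page has just been decompressed.
        uncompressed_sizes.update(len(data))
        return data
    # Update point (2): the page was stored uncompressed.
    uncompressed_sizes.update(len(page_bytes))
    return page_bytes
```

In this sketch a scan that reads only uncompressed pages leaves the
compressed counter without samples, which matches the "counter is empty"
case described for ParquetCompressedPageSize.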

The PerColumn counters are calculated by aggregating the number of bytes
read for each column across all scan ranges processed by a scan node.
Each sample in the counter is the total number of bytes read for a
single column.
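
The aggregation described above can be sketched in Python as follows. This
is only an illustration under assumptions: per_column_summary and the
shape of its input (one dict of bytes read per scan range) are
hypothetical, and the real bookkeeping lives in the C++ scan node.

```python
from collections import defaultdict


def per_column_summary(scan_range_reads):
    """Summarize bytes read per column across all scan ranges.

    scan_range_reads: iterable of {column_name: bytes_read} dicts, one
    dict per scan range (a hypothetical input shape).
    """
    totals = defaultdict(int)
    for reads in scan_range_reads:
        for column, num_bytes in reads.items():
            totals[column] += num_bytes

    # One sample per column, not one per scan range or per page.
    samples = list(totals.values())
    if not samples:
        return None
    return {
        "avg": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
        "num_samples": len(samples),
    }


# Made-up split of two columns over two scan ranges.
reads = [
    {"col_a": 100000, "col_b": 135496},
    {"col_a": 130540, "col_b": 100000},
]
summary = per_column_summary(reads)
```

The per-range split here is invented; the resulting per-column totals
(230540 and 235496 bytes) were chosen to equal the Min and Max of
ParquetCompressedDataReadPerColumn in the example profile, so the summary
has two samples even though four reads were recorded.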

Here is an example of what the updated HDFS scan profile looks like:

- ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ;
Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
- ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ;
Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ;
Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetUncompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ;
Max: 5.22 KB (5349) ; Number of samples: 102)

Testing:
* Added new tests to test_scanners.py that do some basic validation of
the new counters above
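
For readers who want to run a similar sanity check by hand, here is a
hedged Python sketch of basic validation against profile text. It is not
the test code added to test_scanners.py: parse_summary_counter and
check_counter are hypothetical helpers, and the line format is assumed
from the example profile quoted in this message.

```python
import re


def parse_summary_counter(profile, name):
    """Extract the raw byte values of one summary counter, or None if absent."""
    pattern = (
        re.escape(name)
        + r":\s*\(Avg:\s*[^()]*\((\d+)\)\s*;\s*Min:\s*[^()]*\((\d+)\)\s*;"
        + r"\s*Max:\s*[^()]*\((\d+)\)\s*;\s*Number of samples:\s*(\d+)\)"
    )
    match = re.search(pattern, profile)
    if match is None:
        return None
    avg, low, high, count = (int(group) for group in match.groups())
    return {"avg": avg, "min": low, "max": high, "num_samples": count}


def check_counter(profile, name):
    """Basic validation: the counter exists, has samples, and is ordered."""
    stats = parse_summary_counter(profile, name)
    assert stats is not None, name + " missing from profile"
    assert stats["num_samples"] > 0
    assert stats["min"] <= stats["avg"] <= stats["max"]
    return stats


profile = (
    "- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ;\n"
    "Max: 5.19 KB (5315) ; Number of samples: 102)"
)
stats = check_counter(profile, "ParquetCompressedPageSize")
```

The pattern tolerates the line wrapping seen in this email; a real profile
may format the counter on a single line.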

Change-Id: I5373b1e2b8157c5b2e3d79d46b60ec44a55f79bc
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M tests/query_test/test_scanners.py
7 files changed, 184 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/11749/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11749
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I5373b1e2b8157c5b2e3d79d46b60ec44a55f79bc
Gerrit-Change-Number: 11749
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-6964: Track stats about column and page sizes in Parquet reader

Posted by "Sahil Takiar (Code Review)" <ge...@cloudera.org>.
Sahil Takiar has abandoned this change. ( http://gerrit.cloudera.org:8080/11749 )

Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet reader
......................................................................


Abandoned

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I5373b1e2b8157c5b2e3d79d46b60ec44a55f79bc
Gerrit-Change-Number: 11749
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>

[Impala-ASF-CR] IMPALA-6964: Track stats about column and page sizes in Parquet reader

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11749 )

Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet reader
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/1123/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5373b1e2b8157c5b2e3d79d46b60ec44a55f79bc
Gerrit-Change-Number: 11749
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Oct 2018 17:02:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6964: Track stats about column and page sizes in Parquet reader

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11749 )

Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet reader
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-parquet-scanner.h@468
PS1, Line 468:   /// to this counter, (2) when a page that is not compressed is read, its size is added to
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-parquet-scanner.h@661
PS1, Line 661:   /// Update the counter parquet_compressed_page_size_counter_ with the given compressed page size
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-parquet-scanner.h@664
PS1, Line 664:   /// Update the counter parquet_uncompressed_page_size_counter_ with the given uncompressed page size
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-scan-node-base.h@538
PS1, Line 538:   /// the size of a single column that is scanned by the scan node. The scan node tracks the
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-scan-node-base.h@539
PS1, Line 539:   /// number of bytes read for each column it processes, and when the scan node is closed, it
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-scan-node-base.h@542
PS1, Line 542:   RuntimeProfile::SummaryStatsCounter* uncompressed_data_read_per_column_counter_ = nullptr;
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/11749/1/be/src/exec/hdfs-scan-node-base.cc@376
PS1, Line 376:   uncompressed_data_read_per_column_counter_ = ADD_SUMMARY_STATS_COUNTER(runtime_profile(),
line too long (91 > 90)
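
The comments above all flag lines over a 90-character limit. A check of
that kind can be reproduced locally with a small Python sketch like the
one below. It is an assumption-laden illustration, not the script the
Jenkins job runs: find_long_lines and format_comment are hypothetical
names, and the limit of 90 is taken from the comments themselves.

```python
def find_long_lines(text, limit=90):
    """Return (line_number, length) for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


def format_comment(length, limit=90):
    """Render a message in the same shape as the review comments."""
    return "line too long ({} > {})".format(length, limit)
```

Running find_long_lines over the contents of a header such as
be/src/exec/hdfs-scan-node-base.h before uploading a patch set would
surface these lines without waiting for the automated review.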




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5373b1e2b8157c5b2e3d79d46b60ec44a55f79bc
Gerrit-Change-Number: 11749
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Oct 2018 17:56:11 +0000
Gerrit-HasComments: Yes