Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/01/17 15:07:00 UTC

[jira] [Commented] (IMPALA-6964) Track stats about column and page sizes in Parquet reader

    [ https://issues.apache.org/jira/browse/IMPALA-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16745174#comment-16745174 ] 

ASF subversion and git services commented on IMPALA-6964:
---------------------------------------------------------

Commit 8da44ce16bb190dadab2ff3d22e5df726d1128e3 in impala's branch refs/heads/master from stakiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8da44ce ]

IMPALA-6964: Track stats about column and page sizes in Parquet reader

Adds the following new stats:

* ParquetCompressedPageSize - a summary (average, min, max) counter that
tracks the size of compressed pages read; if no compressed pages are
read, this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size
of uncompressed pages read; it is updated in two places: (1) when a
compressed page is decompressed, and (2) when a page that is not
compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks the
amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks
the amount of uncompressed data read per column for a scan node

The PerColumn counters are calculated by aggregating the number of bytes
read for each column across all scan ranges processed by a scan node.
Each sample in the counter is the total number of bytes read for a
single column.
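
The aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the Impala backend code: the function name, the dict-based input and the per-range byte splits are all assumptions, and the splits were chosen so that the column totals match the compressed example shown in this message.

```python
from collections import defaultdict


def per_column_summary(scan_ranges):
    """Summarize bytes read per column across scan ranges.

    scan_ranges is a hypothetical input: an iterable of dicts, one per
    scan range, mapping column name -> bytes read in that range.
    """
    totals = defaultdict(int)
    for bytes_by_column in scan_ranges:
        for column, nbytes in bytes_by_column.items():
            totals[column] += nbytes

    # One sample per column: the column's total across all scan ranges.
    samples = list(totals.values())
    if not samples:
        return None  # nothing read, so the counter stays empty
    return {
        "avg": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
        "num_samples": len(samples),
    }


# Made-up per-range splits for two columns read over two scan ranges.
ranges = [
    {"a": 100_000, "b": 120_000},
    {"a": 130_540, "b": 115_496},
]
summary = per_column_summary(ranges)
print(summary)
```

With two columns the counter holds two samples regardless of how many scan ranges were processed, which is why "Number of samples" for the PerColumn counters reflects the column count.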

Here is an example of what the updated HDFS scan profile looks like:

- ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ;
Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
- ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ;
Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ;
Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ;
Max: 5.22 KB (5349) ; Number of samples: 102)
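
The summary lines above can also be read programmatically, for example to estimate the page compression ratio. The following is a rough sketch that assumes the profile text has the shape shown in the example; the regular expression and the helper name are not part of Impala, and the counter names are copied from the sample output as printed.

```python
import re

# Matches one summary counter entry such as:
# - Name: (Avg: 4.46 KB (4568) ; Min: ... ; Max: ... ; Number of samples: 102)
SUMMARY_RE = re.compile(
    r"-\s*(?P<name>\w+):\s*\(Avg:\s*[^()]*\((?P<avg>\d+)\)\s*;"
    r"\s*Min:\s*[^()]*\((?P<min>\d+)\)\s*;"
    r"\s*Max:\s*[^()]*\((?P<max>\d+)\)\s*;"
    r"\s*Number of samples:\s*(?P<samples>\d+)\)"
)


def parse_summary_counters(profile_text):
    """Return {counter name: {"avg", "min", "max", "samples"}} in bytes."""
    # Collapse line wrapping so wrapped entries match as a single line.
    text = " ".join(profile_text.split())
    return {
        m.group("name"): {
            key: int(m.group(key)) for key in ("avg", "min", "max", "samples")
        }
        for m in SUMMARY_RE.finditer(text)
    }


SAMPLE = """
- ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ;
Max: 5.19 KB (5315) ; Number of samples: 102)
- ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950)
 ; Max: 5.22 KB (5349) ; Number of samples: 102)
"""

counters = parse_summary_counters(SAMPLE)
compressed = counters["ParquetCompressedPageSize"]
decompressed = counters["ParquetDecompressedPageSize"]

# Ratio of average page sizes; equal sample counts make this the same
# as the ratio of total bytes for the pages sampled.
ratio = decompressed["avg"] / compressed["avg"]
print(f"approximate page compression ratio: {ratio:.4f}")
```

The ratio is close to 1 for this sample, which suggests the pages in the example were barely compressed.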

Testing:
* Added new tests to test_scanners.py that do some basic validation of
the new counters above

Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Reviewed-on: http://gerrit.cloudera.org:8080/11575
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Track stats about column and page sizes in Parquet reader
> ---------------------------------------------------------
>
>                 Key: IMPALA-6964
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6964
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Sahil Takiar
>            Priority: Major
>              Labels: observability, parquet, ramp-up
>
> It would be good to have stats for scanned parquet data about page sizes. We currently can't tell much about the "shape" of the parquet pages from the profile. Some questions that are interesting:
> * How big is each column? I.e. total compressed and decompressed size read.
> * How big are pages on average? Either compressed or decompressed size.
> * What is the compression ratio for pages? Could be inferred from the above two.
> I think storing all the stats in the profile per-column would be too much data, but we could probably infer most useful things from higher-level aggregates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
