Posted to issues@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2017/08/24 15:20:01 UTC
[jira] [Resolved] (IMPALA-5648) Count star optimisation regressed Parquet memory estimate accuracy
[ https://issues.apache.org/jira/browse/IMPALA-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved IMPALA-5648.
-----------------------------------
Resolution: Fixed
Fix Version/s: Impala 2.10.0
IMPALA-5648: fix count(*) mem estimate regression
The metadata-only scan doesn't allocate I/O buffers, contrary to
an assumption of the memory estimation code in the planner.
This fix also sets a floor on the memory estimate, to avoid
estimating 0 bytes. 1MB seems like a reasonable approximation:
I ran metadata-only scans on a few different data sizes and
saw memory consumption from 128KB to 1MB.
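As a rough illustration of the two changes described above, the logic
can be sketched as follows. This is not Impala's planner code: the
function name, its parameters, and the per-column buffer size are
hypothetical; only the 1MB floor and the "no I/O buffers for
metadata-only scans" rule come from this commit message.

```python
# Hypothetical sketch of the estimate logic; not Impala's planner code.
MIN_SCAN_MEM_ESTIMATE_BYTES = 1024 * 1024  # 1MB floor, per the commit message


def estimate_scan_mem_bytes(num_columns, per_column_buffer_bytes, metadata_only):
    """Return a memory estimate in bytes for a Parquet scan (illustrative only)."""
    if metadata_only:
        # A metadata-only scan allocates no I/O buffers, so the
        # scanned columns contribute nothing to the estimate.
        io_buffer_bytes = 0
    else:
        # Assumed model: one fixed-size I/O buffer per scanned column.
        io_buffer_bytes = num_columns * per_column_buffer_bytes
    # Apply the floor so the estimate is never 0 bytes.
    return max(io_buffer_bytes, MIN_SCAN_MEM_ESTIMATE_BYTES)
```

With this sketch, a metadata-only scan always yields the 1MB floor
regardless of the column count, while a regular scan still scales with
the number of columns.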
The estimate is now much closer to actual consumption
(it was 80MB before):
[localhost:21000] > select count(*) from tpch_parquet.lineitem; summary;
Query: select count(*) from tpch_parquet.lineitem
Query submitted at: 2017-08-23 11:58:29 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=cb4b8d41fc838c9a:c5496ff300000000
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
Fetched 1 row(s) in 0.13s
+--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| 03:AGGREGATE | 1 | 168.49us | 168.49us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 30.11ms | 30.11ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 2.05us | 6.14us | 3 | 1 | 20.00 KB | 10.00 MB | |
| 00:SCAN HDFS | 3 | 4.58ms | 4.72ms | 3 | 6.00M | 128.00 KB | 1.00 MB | tpch_parquet.lineitem |
+--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
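To put "much closer" in numbers, the short calculation below compares
the old and new estimates with the measured peak for 00:SCAN HDFS in
the summary above. It assumes the 80MB figure quoted earlier refers to
the same scan node; the helper name is made up for illustration.

```python
KB = 1024
MB = 1024 * KB

actual_peak = 128 * KB   # Peak Mem of 00:SCAN HDFS in the summary
new_estimate = 1 * MB    # Est. Peak Mem of 00:SCAN HDFS after the fix
old_estimate = 80 * MB   # estimate before the fix (assumed to be for this node)


def overestimate_factor(estimate_bytes, actual_bytes):
    """How many times larger the estimate is than the measured peak."""
    return estimate_bytes / actual_bytes


print(overestimate_factor(new_estimate, actual_peak))  # 8.0
print(overestimate_factor(old_estimate, actual_peak))  # 640.0
```

Under that assumption, the overestimate for this query drops from 640x
to 8x.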
Testing:
Updated affected planner tests.
Change-Id: Iaf5c2316bef2afae54a94245c715534ed294f286
Reviewed-on: http://gerrit.cloudera.org:8080/7783
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
> Count star optimisation regressed Parquet memory estimate accuracy
> ------------------------------------------------------------------
>
> Key: IMPALA-5648
> URL: https://issues.apache.org/jira/browse/IMPALA-5648
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.10.0
> Reporter: Tim Armstrong
> Assignee: Tim Armstrong
> Labels: resource-management
> Fix For: Impala 2.10.0
>
>
> The Parquet memory estimate is based on the number of columns scanned in the file. Before IMPALA-5036, the planner correctly counted zero columns read for count(*) queries because they ran as metadata-only scans. The count-star optimisation is also a metadata-only scan, but the memory estimate is now computed as if the column were actually being scanned.
> This regression is apparent in the planner test changes:
> https://gerrit.cloudera.org/#/c/6812/12/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test