Posted to issues@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2017/08/24 15:20:01 UTC
[jira] [Resolved] (IMPALA-5648) Count star optimisation regressed Parquet memory estimate accuracy

     [ https://issues.apache.org/jira/browse/IMPALA-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-5648.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0


IMPALA-5648: fix count(*) mem estimate regression

The metadata-only scan doesn't allocate I/O buffers, contrary to
an assumption of the memory estimation code in the planner.

This fix also sets a floor on the memory estimate, to avoid
estimating 0 bytes. 1 MB seems like a reasonable approximation:
I ran metadata-only scans on a few different data sizes and
saw peak memory usage ranging from 128 KB to 1 MB.
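
As a rough illustration of the two changes described above (no I/O
buffers counted for a metadata-only scan, plus a 1 MB floor), here is
a minimal sketch. It is not the planner code: the function name, its
parameters, and the per-column buffer size are hypothetical and chosen
only to show the shape of the calculation.

```python
# Floor on the scan memory estimate; the 1 MB value comes from the
# commit message above.
MIN_SCAN_MEM_ESTIMATE_BYTES = 1024 * 1024


def estimate_scan_mem_bytes(num_columns_read, per_column_buffer_bytes,
                            metadata_only):
    """Hypothetical per-scan memory estimate, in bytes.

    num_columns_read and per_column_buffer_bytes are illustrative
    inputs, not names or values taken from Impala.
    """
    if metadata_only:
        # A metadata-only scan allocates no I/O buffers, so columns
        # contribute nothing to the estimate.
        io_bytes = 0
    else:
        io_bytes = num_columns_read * per_column_buffer_bytes
    # Never report less than the floor, to avoid estimating 0 bytes.
    return max(io_bytes, MIN_SCAN_MEM_ESTIMATE_BYTES)


# Assumed buffer size for the example only.
ASSUMED_BUFFER_BYTES = 8 * 1024 * 1024

# A metadata-only count(*) scan falls back to the floor.
print(estimate_scan_mem_bytes(1, ASSUMED_BUFFER_BYTES, metadata_only=True))
# A regular scan still scales with the number of columns read.
print(estimate_scan_mem_bytes(2, ASSUMED_BUFFER_BYTES, metadata_only=False))
```

With the floor applied last, a regular scan that reads zero columns
would also be estimated at 1 MB rather than 0 bytes.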

The estimate for the scan is now much closer to actual consumption
(it was 80 MB before):

  [localhost:21000] > select count(*) from tpch_parquet.lineitem; summary;
  Query: select count(*) from tpch_parquet.lineitem
  Query submitted at: 2017-08-23 11:58:29 (Coordinator: http://tarmstrong-box:25000)
  Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=cb4b8d41fc838c9a:c5496ff300000000
  +----------+
  | count(*) |
  +----------+
  | 6001215  |
  +----------+
  Fetched 1 row(s) in 0.13s
  +--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
  | Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                |
  +--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
  | 03:AGGREGATE | 1      | 168.49us | 168.49us | 1     | 1          | 28.00 KB  | 10.00 MB      | FINALIZE              |
  | 02:EXCHANGE  | 1      | 30.11ms  | 30.11ms  | 3     | 1          | 0 B       | 0 B           | UNPARTITIONED         |
  | 01:AGGREGATE | 3      | 2.05us   | 6.14us   | 3     | 1          | 20.00 KB  | 10.00 MB      |                       |
  | 00:SCAN HDFS | 3      | 4.58ms   | 4.72ms   | 3     | 6.00M      | 128.00 KB | 1.00 MB       | tpch_parquet.lineitem |
  +--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+

Testing:
Updated affected planner tests.

Change-Id: Iaf5c2316bef2afae54a94245c715534ed294f286
Reviewed-on: http://gerrit.cloudera.org:8080/7783
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---

> Count star optimisation regressed Parquet memory estimate accuracy
> ------------------------------------------------------------------
>
>                 Key: IMPALA-5648
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5648
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.10.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>              Labels: resource-management
>             Fix For: Impala 2.10.0
>
>
> The Parquet memory estimate is based on the number of columns scanned in the file. Before IMPALA-5036, count(*) queries were correctly estimated as reading zero columns because they were metadata-only scans. The count-star optimisation is also a metadata-only scan, but the memory estimate is computed as if the column were actually being scanned.
> This regression is apparent in the planner test changes:
> https://gerrit.cloudera.org/#/c/6812/12/testdata/workloads/functional-planner/queries/PlannerTest/resource-requirements.test


