Posted to issues@impala.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2017/08/28 18:21:00 UTC

[jira] [Created] (IMPALA-5851) Estimate number of rows for sum_init_zero scans should be number of files not table cardinality

Mostafa Mokhtar created IMPALA-5851:
---------------------------------------

             Summary: Estimate number of rows for sum_init_zero scans should be number of files not table cardinality
                 Key: IMPALA-5851
                 URL: https://issues.apache.org/jira/browse/IMPALA-5851
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Mostafa Mokhtar
            Priority: Minor


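IMPALA-5036 introduced an optimization that uses the Parquet RowGroup.num_rows field to answer count(*) queries.
However, the estimated cardinality for the scan is still the number of rows in the base table, as opposed to the number of files or row groups.
In the plan and summary below, the scan is estimated at cardinality=179999978268 but actually returns 28.98K rows, which is in line with files=28976.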

{code}
+-------------------------------------------------------------------------------+
| Explain String                                                                |
+-------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=0B                                  |
| Per-Host Resource Estimates: Memory=108.00MB                                  |
|                                                                               |
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                         |
| |  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B                |
| PLAN-ROOT SINK                                                                |
| |  mem-estimate=0B mem-reservation=0B                                         |
| |                                                                             |
| 03:AGGREGATE [FINALIZE]                                                       |
| |  output: count:merge(*)                                                     |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                |
| |  tuple-ids=1 row-size=8B cardinality=1                                      |
| |                                                                             |
| 02:EXCHANGE [UNPARTITIONED]                                                   |
| |  mem-estimate=0B mem-reservation=0B                                         |
| |  tuple-ids=1 row-size=8B cardinality=1                                      |
| |                                                                             |
| F00:PLAN FRAGMENT [RANDOM] hosts=130 instances=130                            |
| Per-Host Resources: mem-estimate=98.00MB mem-reservation=0B                   |
| 01:AGGREGATE                                                                  |
| |  output: sum_init_zero(tpch_30000_parquet.lineitem.parquet-stats: num_rows) |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB                |
| |  tuple-ids=1 row-size=8B cardinality=1                                      |
| |                                                                             |
| 00:SCAN HDFS [tpch_30000_parquet.lineitem, RANDOM]                            |
|    partitions=2526/2526 files=28976 size=6.89TB                               |
|    stats-rows=179999978268 extrapolated-rows=disabled                         |
|    table stats: rows=179999978268 size=unavailable                            |
|    column stats: all                                                          |
|    mem-estimate=88.00MB mem-reservation=0B                                    |
|    tuple-ids=0 row-size=8B cardinality=179999978268                           |
+-------------------------------------------------------------------------------+
{code}

{code}
+--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                      |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
| 03:AGGREGATE | 1      | 1.28ms   | 1.28ms   | 1      | 1          | 532.00 KB | 10.00 MB      | FINALIZE                    |
| 02:EXCHANGE  | 1      | 2.56s    | 2.56s    | 129    | 1          | 0 B       | 0 B           | UNPARTITIONED               |
| 01:AGGREGATE | 129    | 4.89ms   | 62.84ms  | 129    | 1          | 20.00 KB  | 10.00 MB      |                             |
| 00:SCAN HDFS | 129    | 62.44ms  | 341.03ms | 28.98K | 180.00B    | 1.75 MB   | 88.00 MB      | tpch_30000_parquet.lineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+-----------------------------+
{code}
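
To put a number on the mismatch, here is a minimal sketch of the proposed estimate. It is not Impala's planner code: the ScanStats class, the estimate_scan_cardinality function, and the uses_parquet_count_star flag are hypothetical names for illustration, and only the two input figures come from the plan above.

```python
from dataclasses import dataclass


@dataclass
class ScanStats:
    """Hypothetical container for the two figures taken from the plan above."""
    table_rows: int  # stats-rows=179999978268
    num_files: int   # files=28976


def estimate_scan_cardinality(stats: ScanStats, uses_parquet_count_star: bool) -> int:
    """Illustrative estimate only; not the actual frontend implementation.

    When the count(*) optimization applies, the scan is assumed to emit
    roughly one row per file, so the file count is used as the estimate.
    Otherwise the estimate falls back to the table row count.
    """
    if uses_parquet_count_star:
        return stats.num_files
    return stats.table_rows


lineitem = ScanStats(table_rows=179_999_978_268, num_files=28_976)

current = estimate_scan_cardinality(lineitem, uses_parquet_count_star=False)
proposed = estimate_scan_cardinality(lineitem, uses_parquet_count_star=True)

# Factor by which the current estimate exceeds the file-based one.
overestimate = current / proposed

print(f"current estimate:  {current}")
print(f"proposed estimate: {proposed}")
print(f"overestimate factor: {overestimate:.0f}x")
```

With these inputs the table-cardinality estimate is more than six million times larger than the file-based one, while the file-based one is close to the 28.98K rows the scan actually produced.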


