Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/04/05 13:45:00 UTC
[jira] [Commented] (IMPALA-5036) Improve COUNT(*) performance of Parquet scans.

    [ https://issues.apache.org/jira/browse/IMPALA-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517448#comment-17517448 ] 

ASF subversion and git services commented on IMPALA-5036:
---------------------------------------------------------

Commit f932d78ad0a30e322d59fc39072f710f889d2135 in impala's branch refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f932d78ad ]

IMPALA-11123: Optimize count(star) for ORC scans

This patch provides a count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use each stripe's
row-count statistics when computing count(star) instead of materializing
empty rows. The aggregate function is changed from a count to a special
sum function initialized to 0.
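
As a rough illustration of the idea above (not the actual Impala scanner
code), the following Python sketch contrasts counting materialized empty
rows with summing per-stripe row counts. StripeStats and both function
names are hypothetical. Starting the sum at 0 presumably ensures that an
empty input still yields 0, as count(star) would, whereas a plain SQL sum
over no rows yields NULL.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StripeStats:
    """Hypothetical stand-in for the row count stored in file metadata
    for one ORC stripe (or one Parquet row group)."""
    num_rows: int


def count_by_materializing(stripes: Iterable[StripeStats]) -> int:
    """Old approach: emit one empty row per stored row and count them."""
    count = 0
    for stripe in stripes:
        for _ in range(stripe.num_rows):  # one iteration per empty row
            count += 1
    return count


def count_from_stats(stripes: Iterable[StripeStats]) -> int:
    """Optimized approach: sum the row counts, starting from 0."""
    total = 0  # initialized to 0 so that no stripes still gives 0
    for stripe in stripes:
        total += stripe.num_rows
    return total


stripes = [StripeStats(3), StripeStats(5), StripeStats(0)]
same_result = count_by_materializing(stripes) == count_from_stats(stripes)
```

Both functions return the same total; the second does work proportional
to the number of stripes rather than the number of rows.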

This count(star) optimization is disabled for full ACID tables
because the scanner might need to read and validate the
'currentTransaction' column in the table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. The primitive_count_star query targets the tpch10_parquet.lineitem
table (10GB scale TPC-H). The count_star_parq and count_star_orc queries
are modified versions of primitive_count_star that target the
tpch_parquet.lineitem and tpch_orc_def.lineitem tables, respectively.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Improve COUNT(*) performance of Parquet scans.
> ----------------------------------------------
>
>                 Key: IMPALA-5036
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5036
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>    Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
>            Reporter: Alexander Behm
>            Assignee: Taras Bobrovytsky
>            Priority: Major
>              Labels: parquet, performance, ramp-up
>             Fix For: Impala 2.10.0
>
>
> {code}
> select count(*) from parquet_table;
> select count(*) from parquet_table group by partition_col;
> {code}
> Impala already has a special code path for fast Parquet scans when no columns are scanned and materialized, but the performance can be significantly improved with a plan+execution change, as follows:
> *Execution change*
> Instead of returning empty batches until num_rows have been returned, the Parquet scanner can populate a single slot with the num_rows from the Parquet row groups.
> *Plan change*
> The count(*) local aggregation needs to be changed to a sum(num_rows_slot) aggregation.
> The final distributed plan will be:
> scan -> local agg with sum(num_rows_slot) -> merge agg sum(sum(num_rows_slot))
> This optimization is applicable when there is only a count(*) and there are no scan predicates.
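
The distributed plan described in the issue (scan -> local sum -> merge sum)
can be sketched in Python as follows. This is only an illustration under
assumed names: local_agg, merge_agg, and the sample per-node scan output are
hypothetical and do not correspond to Impala's actual classes. Each scan row
carries a partition value and the num_rows of one row group, which also
covers the "group by partition_col" form of the query.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical scan output: (partition value, num_rows of one row group).
ScanRow = Tuple[str, int]


def local_agg(scan_rows: Iterable[ScanRow]) -> Dict[str, int]:
    """Per-node aggregation: sum(num_rows_slot) grouped by partition."""
    sums: Dict[str, int] = defaultdict(int)
    for partition, num_rows in scan_rows:
        sums[partition] += num_rows
    return dict(sums)


def merge_agg(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Final aggregation: sum the partial sums from every node."""
    merged: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for partition, partial_sum in partial.items():
            merged[partition] += partial_sum
    return dict(merged)


# Made-up row-group counts as seen by two nodes.
node_a: List[ScanRow] = [("2022-01", 100), ("2022-01", 50), ("2022-02", 70)]
node_b: List[ScanRow] = [("2022-02", 30), ("2022-03", 10)]

result = merge_agg([local_agg(node_a), local_agg(node_b)])
```

With the sample data, result maps each partition to its total row count,
which is what count(*) with the same grouping would return.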


