Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/05/28 01:49:00 UTC

[jira] [Commented] (IMPALA-12120) Set appropriate output writer parallelism when using new processing cost planner

    [ https://issues.apache.org/jira/browse/IMPALA-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726871#comment-17726871 ] 

ASF subversion and git services commented on IMPALA-12120:
----------------------------------------------------------

Commit dbddb0844713677cd5165c55fe21ef46238d3e24 in impala's branch refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dbddb0844 ]

IMPALA-12120: Limit output writer parallelism based on write volume

The new processing cost-based planner changes (IMPALA-11604,
IMPALA-12091) will impact output writer parallelism for insert queries,
with the potential for more small files if the processing cost-based
planning results in too many writer fragments. It can further exacerbate
a problem introduced by MT_DOP (see IMPALA-8125).

The MAX_FS_WRITERS query option can help mitigate this. But even when
MAX_FS_WRITERS is not set, the default planner behavior should avoid
excessive writer parallelism for both partitioned and unpartitioned
inserts.

This patch implements such a limit when using the cost-based planner. It
limits the number of writer fragments such that each writer fragment
writes at least 256MB of rows. This patch also allows CTAS (a kind of
DDL query) to be eligible for auto-scaling.
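
As a rough illustration of the 256MB rule above, the cap can be sketched as
follows. This is not the code from the patch: the function name, its
parameters, and the rounding choice are hypothetical, and treating
MAX_FS_WRITERS=0 as "no explicit limit" is an assumption.

```python
# Hypothetical sketch of a volume-based writer cap; not Impala's actual code.
MIN_BYTES_PER_WRITER = 256 * 1024 * 1024  # 256MB, the threshold named above


def limit_writer_parallelism(planned_writers, estimated_write_bytes,
                             max_fs_writers=0):
    """Return a writer count where each writer gets at least 256MB.

    planned_writers: parallelism proposed by the cost-based planner.
    estimated_write_bytes: planner's estimate of the total bytes to write.
    max_fs_writers: assumed to mean "no explicit limit" when 0.
    """
    # Largest writer count that still leaves >= 256MB per writer;
    # always allow at least one writer, even for tiny inserts.
    by_volume = max(1, estimated_write_bytes // MIN_BYTES_PER_WRITER)
    limit = min(planned_writers, by_volume)
    # An explicit user limit, when set, can only lower the result further.
    if max_fs_writers > 0:
        limit = min(limit, max_fs_writers)
    return max(1, limit)
```

Under these assumptions, an insert estimated at 1GB with 16 planned writers
would be reduced to 4 writers, and an insert estimated at 100MB would get a
single writer.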

This patch also removes the comments about NUM_SCANNER_THREADS added by
IMPALA-12029, since they no longer apply after IMPALA-12091.

Testing:
- Add test cases in test_query_cpu_count_divisor_default
- Add test_processing_cost_writer_limit in test_insert.py
- Pass test_insert.py::TestInsertHdfsWriterLimit
- Pass test_executor_groups.py

Change-Id: I289c6ffcd6d7b225179cc9fb2f926390325a27e0
Reviewed-on: http://gerrit.cloudera.org:8080/19880
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Set appropriate output writer parallelism when using new processing cost planner
> --------------------------------------------------------------------------------
>
>                 Key: IMPALA-12120
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12120
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: David Rorke
>            Assignee: Riza Suminto
>            Priority: Major
>
> The new processing cost based planner changes (IMPALA-11604, IMPALA-12091) will impact output writer parallelism for insert queries, with the potential for more small files if the processing cost based planning results in too many writer fragments.  This could further exacerbate a problem that was introduced with mt_dop (see IMPALA-8125). 
> There are 2 cases to consider:
>  # Unpartitioned inserts where the output writer is in the same fragment as the scan.  In this case the output parallelism will be determined by the scan parallelism which may increase (vs mt_dop) with the changes in IMPALA-12091.
>  # Partitioned inserts where the output writer fragment typically consists of a sort followed by the writer, and the parallelism under IMPALA-11604 is driven by the estimated sort cost.  Again we have the potential to overparallelize resulting in too many small files.
> The MAX_FS_WRITERS query option (IMPALA-8125) can help mitigate this but we should have better default behavior even when MAX_FS_WRITERS isn't set.  The default output writer parallelism with no query options set should avoid creating excessive writer parallelism for both partitioned and unpartitioned inserts.  We could also consider always including an exchange (even in the unpartitioned case) to decouple the writer from the scan parallelism.


