You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/08/04 07:21:00 UTC

[jira] [Commented] (IMPALA-8125) Limit number of files generated by insert

    [ https://issues.apache.org/jira/browse/IMPALA-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170606#comment-17170606 ] 

ASF subversion and git services commented on IMPALA-8125:
---------------------------------------------------------

Commit 0a13029afccb374763fce7a916ef9da257a8bcde in impala's branch refs/heads/master from Bikramjeet Vig
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0a13029 ]

IMPALA-8125: Add query option to limit number of hdfs writer instances

This patch adds a new query option MAX_FS_WRITERS that limits the
number of HDFS writer instances.

Highlights:
- Depending on the plan, it either restricts the num of instances of
  the root fragment or adds an exchange and then limits the num of
  instances of that.
- Assigns instances evenly across available backends.
- "no-shuffle" query hint is ignored when using query option.
- Change in behavior of plans is only when this query option is used.
- The only exception to the previous point is that the optimization
  logic that decides to add an exchange now looks at the num of
  instances instead of the number of nodes.

Limitation:
A mismatch of cluster state during query planning and scheduling can
result in more or less fragment instances to be scheduled than
expected. Eg. If max_fs_writers in 2 and the planner sees only 2
executors then it might not add an exchange between a scan node and
the table sink, but during scheduling if there are 3 nodes then that
scan+tablesink instance will be scheduled on 3 backends.

Testing:
- Added planner tests to cover all cases where this enforcement kicks
  in and to highlight the behavior.
- Added e2e tests to confirm that the scheduler is enforcing the limit
  and distributing the instance evenly across backends for different
  plan shapes.

Change-Id: I17c8e61b9a32d908eec82c83618ff9caa41078a5
Reviewed-on: http://gerrit.cloudera.org:8080/16204
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Limit number of files generated by insert
> -----------------------------------------
>
>                 Key: IMPALA-8125
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8125
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Tim Armstrong
>            Assignee: Bikramjeet Vig
>            Priority: Major
>              Labels: multithreading
>
> One pitfall of multithreaded execution is that, if implemented naively, the number of files generated by an unpartitioned insert will be multiplied by mt_dop.
> We should provide a mechanism to limit the number of files generated, e.g. limit the number of insert fragment instances (note that there a pre-existing problem with unpartitioned inserts generating too many files).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org