Posted to issues@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2022/09/07 14:54:00 UTC
[jira] [Created] (IMPALA-11561) Improve intra-node scheduling of scan ranges
Csaba Ringhofer created IMPALA-11561:
----------------------------------------
Summary: Improve intra-node scheduling of scan ranges
Key: IMPALA-11561
URL: https://issues.apache.org/jira/browse/IMPALA-11561
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Csaba Ringhofer
This ticket is created as a follow-up to IMPALA-11539 / https://gerrit.cloudera.org/#/c/18929/ , as several improvement ideas came up during the review.
The commit above changes intra-node scan range scheduling in the mt_dop != 0 case to process scan ranges ordered by size (descending), which reduces skew among fragment instances. Before that, the order was random, except that files in the HDFS cache were handled before files not in the HDFS cache.
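To illustrate why size-descending order can reduce skew, here is a minimal simulation sketch. It is not Impala code: it assumes fragment instances pull ranges from a shared queue, that each range goes to whichever instance becomes free first, and that processing cost is proportional to range size. The function name and the example sizes are hypothetical.

```python
import heapq
from typing import List


def simulate_makespan(sizes: List[int], num_instances: int) -> int:
    """Simulate instances pulling scan ranges from a shared queue in order.

    Assumption: each range is taken by the instance that becomes free first,
    and cost is proportional to size. Returns the finish time of the slowest
    instance, which is what skew ultimately inflates.
    """
    finish_times = [0] * num_instances  # min-heap of per-instance busy time
    heapq.heapify(finish_times)
    for size in sizes:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + size)
    return max(finish_times)


# Hypothetical scan range sizes in arbitrary units, with one large range last.
sizes = [2, 2, 2, 2, 8]

arbitrary_order = simulate_makespan(sizes, num_instances=2)
largest_first = simulate_makespan(sorted(sizes, reverse=True), num_instances=2)

print(arbitrary_order, largest_first)
```

In this toy input the large range arrives last in the arbitrary order, so one instance finishes well after the other. Starting with the largest range lets the small ranges fill in the remaining gap.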
The following ideas came up:
1. Take caching into account and process scan ranges with more cached bytes / file handles first. This way we could avoid evicting these from the cache during scanning.
2. Take disk id into account and try to process files from different disks in parallel.
3. Have a more sophisticated estimation of CPU cost than scan size and order by that.
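One possible way to combine the three ideas is sketched below, purely as an illustration. Everything in it is hypothetical: the ScanRange class, the cost_factor field (standing in for a per-format CPU weight), and the choice to sort by cached bytes before estimated cost are assumptions, not a description of Impala's scheduler or of any agreed design.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest
from typing import List


@dataclass(frozen=True)
class ScanRange:
    """Hypothetical stand-in for a scan range, not Impala's actual structure."""
    name: str
    size_bytes: int
    cached_bytes: int
    disk_id: int
    cost_factor: float = 1.0  # assumed CPU weight, e.g. for decompression


def estimated_cost(r: ScanRange) -> float:
    # Idea 3: a cost estimate that is more than raw size (assumed model).
    return r.size_bytes * r.cost_factor


def order_ranges(ranges: List[ScanRange]) -> List[ScanRange]:
    # Ideas 1 and 3: most cached bytes first, then highest estimated cost.
    return sorted(ranges, key=lambda r: (-r.cached_bytes, -estimated_cost(r)))


def interleave_by_disk(ranges: List[ScanRange]) -> List[ScanRange]:
    # Idea 2: round-robin across disks, keeping the given order within a disk.
    per_disk = defaultdict(list)
    for r in ranges:
        per_disk[r.disk_id].append(r)
    result = []
    for group in zip_longest(*per_disk.values()):
        result.extend(r for r in group if r is not None)
    return result


ranges = [
    ScanRange("a", size_bytes=100, cached_bytes=0, disk_id=0),
    ScanRange("b", size_bytes=50, cached_bytes=50, disk_id=0),
    ScanRange("c", size_bytes=80, cached_bytes=0, disk_id=1, cost_factor=2.0),
    ScanRange("d", size_bytes=10, cached_bytes=0, disk_id=1),
    ScanRange("e", size_bytes=90, cached_bytes=0, disk_id=0),
]

ordered = order_ranges(ranges)
interleaved = interleave_by_disk(ordered)
print([r.name for r in ordered])
print([r.name for r in interleaved])
```

Note that the goals can pull against each other: interleaving by disk may move a cheap range ahead of a costlier one, so a real implementation would need to decide which criterion takes priority.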
--
This message was sent by Atlassian Jira
(v8.20.10#820010)