Posted to issues@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2022/09/07 14:54:00 UTC

[jira] [Created] (IMPALA-11561) Improve intra-node scheduling of scan ranges

Csaba Ringhofer created IMPALA-11561:
----------------------------------------

             Summary: Improve intra-node scheduling of scan ranges
                 Key: IMPALA-11561
                 URL: https://issues.apache.org/jira/browse/IMPALA-11561
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Csaba Ringhofer


This ticket was created as a follow-up to IMPALA-11539 ( https://gerrit.cloudera.org/#/c/18929/ ), as several improvement ideas came up during the review.

The commit above changes intra-node scan range scheduling in the mt_dop != 0 case so that scan ranges are processed in descending order of size, which reduces skew among fragment instances. Previously the order was random, except that files in the HDFS cache were handled before files not in the HDFS cache.
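
To illustrate why size-descending order helps, here is a minimal Python sketch of a simplified model. It is not Impala's actual code: it assumes that processing time is proportional to scan range size and that each fragment instance takes the next range as soon as it becomes free. The function names are hypothetical.

```python
import heapq


def simulate_makespan(sizes, num_instances):
    """Return the finish time of the slowest instance in a greedy model.

    Assumption: ranges are handed out in the given order, and each range
    goes to whichever instance becomes free first.
    """
    loads = [0] * num_instances
    heapq.heapify(loads)
    for size in sizes:
        earliest_free = heapq.heappop(loads)
        heapq.heappush(loads, earliest_free + size)
    return max(loads)


def order_by_size_desc(sizes):
    """Largest scan ranges first."""
    return sorted(sizes, reverse=True)


# Hypothetical sizes: one large range that happens to come last.
arrival_order = [1, 1, 1, 1, 4]

print(simulate_makespan(arrival_order, 2))                      # 6
print(simulate_makespan(order_by_size_desc(arrival_order), 2))  # 4
```

In this toy input the large range arriving last leaves one instance idle while the other finishes it. Starting with the large range lets the small ones fill in the gap.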

The following ideas came up:
1. Take caching into account and process scan ranges with more cached bytes / file handles first. This way we could avoid evicting these from the cache during scanning.
2. Take disk id into account and try to process files from different disks in parallel.
3. Have a more sophisticated estimation of CPU cost than scan size and order by that.
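
One possible way to combine the three ideas is sketched below in Python. This is only an illustration under assumptions: the ScanRange fields (cached_bytes, disk_id, est_cpu_cost) and both helper functions are hypothetical and do not correspond to Impala's real data structures, and the relative priority of the criteria is an arbitrary choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import zip_longest
from typing import Optional


@dataclass(frozen=True)
class ScanRange:
    # All fields are hypothetical and exist only for this illustration.
    name: str
    length: int
    cached_bytes: int = 0
    disk_id: int = 0
    est_cpu_cost: Optional[float] = None  # falls back to length if unset


def cost(r):
    """Idea 3: use a CPU cost estimate when available, else the scan size."""
    return r.est_cpu_cost if r.est_cpu_cost is not None else r.length


def priority_order(ranges):
    """Idea 1 then idea 3: most cached bytes first, then highest cost."""
    return sorted(ranges, key=lambda r: (-r.cached_bytes, -cost(r)))


def interleave_by_disk(ranges):
    """Idea 2: round-robin across disks, keeping the order within a disk."""
    per_disk = defaultdict(list)
    for r in ranges:
        per_disk[r.disk_id].append(r)
    result = []
    for group in zip_longest(*per_disk.values()):
        result.extend(r for r in group if r is not None)
    return result


ranges = [
    ScanRange("a", 100, disk_id=0),
    ScanRange("b", 50, cached_bytes=50, disk_id=0),
    ScanRange("c", 80, disk_id=1),
    ScanRange("d", 10, disk_id=0, est_cpu_cost=500.0),
]

ordered = interleave_by_disk(priority_order(ranges))
print([r.name for r in ordered])  # ['b', 'c', 'd', 'a']
```

Note that interleaving by disk can override the cost ordering (here "c" moves ahead of "d"), so a real implementation would have to decide how to weigh these goals against each other.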


