Posted to issues@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/12/24 00:18:00 UTC
[jira] [Resolved] (IMPALA-2842) "SCAN HDFS" "hosts" doesn't account for num_nodes or unsplittable formats

     [ https://issues.apache.org/jira/browse/IMPALA-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-2842.
-----------------------------------
    Resolution: Fixed

This was fixed a while back when hosts= was moved to the FRAGMENT level in the explain plan.


commit 9a29dfc91b1ff8bbae3c94b53bf2b6ac81a271e0
Author: Tim Armstrong <ta...@cloudera.com>
Date:   Wed Jan 25 15:19:35 2017 -0800

    IMPALA-3748: minimum buffer requirements in planner
    
    Compute the minimum buffer requirement for spilling nodes and
    per-host estimates for the entire plan tree.
    
    This builds on top of the existing resource estimation code, which
    computes the sets of plan nodes that can execute concurrently. This is
    cleaned up so that the process of producing resource requirements is
    clearer. It also removes the unused VCore estimates.
    
    Fixes various bugs and other issues:
    * computeCosts() was not called for unpartitioned fragments, so
      the per-operator memory estimate was not visible.
    * Nested loop join was not treated as a blocking join.
    * The TODO comment about union was misleading.
    * Fix the computation for mt_dop > 1 by distinguishing per-instance and
      per-host estimates.
    * Always generate an estimate instead of unpredictably returning
      -1/"unavailable" in many circumstances - there was little rhyme or
      reason to when this happened.
    * Remove the special "trivial plan" estimates. With the rest of the
      cleanup we generate estimates <= 10MB for those trivial plans through
      the normal code path.
    
    I left one bug (IMPALA-4862) unfixed because it is subtle, will affect
    estimates for many plans and will be easier to review once we have the
    test infra in place.
    
    Testing:
    Added basic planner tests for resource requirements in both the MT and
    non-MT cases.
    
    Re-enabled the explain_level tests, which appear to be the only
    coverage for many of these estimates. Removed the complex and
    brittle test cases and replaced with a couple of much simpler
    end-to-end tests.
    
    Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
    Reviewed-on: http://gerrit.cloudera.org:8080/5847
    Reviewed-by: Tim Armstrong <ta...@cloudera.com>
    Tested-by: Impala Public Jenkins


> "SCAN HDFS" "hosts" doesn't account for num_nodes or unsplittable formats
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-2842
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2842
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.5.0
>            Reporter: Juan Yu
>            Priority: Minor
>
> According to the comments, "hosts" should be the "number of nodes on which the plan tree rooted at this node would execute".
> But for "SCAN HDFS", it always equals the number of backends where the data resides.
> For example, for the query "select * from sc;":
> Distributed plan:
> {code}
> Query: explain select * from sc limit 1000
> +--------------------------------------------------------------+
> | Explain String                                               |
> +--------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=32.00MB VCores=1     |
> |                                                              |
> | F01:PLAN FRAGMENT [UNPARTITIONED]                            |
> |   01:EXCHANGE [UNPARTITIONED]                                |
> |      limit: 1000                                             |
> |      hosts=3 per-host-mem=unavailable                        |
> |      tuple-ids=0 row-size=58B cardinality=8                  |
> |                                                              |
> | F00:PLAN FRAGMENT [RANDOM]                                   |
> |   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] |
> |   00:SCAN HDFS [default.sc, RANDOM]                          |
> |      partitions=1/1 files=3 size=163B                        |
> |      table stats: 8 rows total                               |
> |      column stats: all                                       |
> |      limit: 1000                                             |
> |      hosts=3 per-host-mem=32.00MB                            |
> |      tuple-ids=0 row-size=58B cardinality=8                  |
> +--------------------------------------------------------------+
> {code}
> Single-node plan:
> {code}
> Query: explain select * from sc
> +-----------------------------------------------------+
> | Explain String                                      |
> +-----------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=0B VCores=0 |
> |                                                     |
> | F00:PLAN FRAGMENT [UNPARTITIONED]                   |
> |   00:SCAN HDFS [default.sc]                         |
> |      partitions=1/1 files=3 size=163B               |
> |      table stats: 8 rows total                      |
> |      column stats: all                              |
> |      hosts=3 per-host-mem=unavailable               |
> |      tuple-ids=0 row-size=58B cardinality=8         |
> +-----------------------------------------------------+
> {code}
> The query summary and profile do show the correct number of executing nodes.
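
For illustration only, the behavior the report asks for can be sketched as a small Python function. This is not Impala's planner code: the function name, its parameters, and the rule that an unsplittable format caps the host count at the number of files are assumptions drawn from the issue title, and the num_nodes handling assumes that a value of 1 forces single-node execution while 0 means all nodes.

```python
def expected_scan_hosts(hosts_with_data, num_files, num_nodes=0, splittable=True):
    """Hypothetical estimate of how many hosts would execute an HDFS scan.

    hosts_with_data: number of backends that hold data for the table.
    num_files:       number of files selected by the scan.
    num_nodes:       assumed semantics: 1 = single-node plan, 0 = all nodes.
    splittable:      False for formats where one file cannot be split
                     across hosts (assumption taken from the issue title).
    """
    if num_nodes == 1:
        # A single-node plan runs the whole tree on one host.
        return 1
    hosts = hosts_with_data
    if not splittable:
        # Each unsplittable file goes to one host, so no more hosts than files.
        hosts = min(hosts, num_files)
    # A scan always executes on at least one host, even with no data.
    return max(1, hosts)
```

With the table from the report (3 files on 3 backends), this sketch returns 1 for the single-node plan, where the explain output above shows hosts=3, and 3 for the distributed plan.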


