Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2019/09/04 17:58:00 UTC

[jira] [Commented] (IMPALA-8912) Avoid calling computeStats twice on HBaseScanNode

    [ https://issues.apache.org/jira/browse/IMPALA-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922711#comment-16922711 ] 

ASF subversion and git services commented on IMPALA-8912:
---------------------------------------------------------

Commit d21f00ef2f2219551e3501e3cf1cf45ff4afdcef in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d21f00e ]

IMPALA-8912: Avoid sampling hbase table twice for HBaseScanNode

HBaseScanNode#computeStats samples the target HBase table to estimate
cardinality and row size. The sampling can be time-consuming since it
requires HBase RPCs. As mentioned in the JIRA description,
HBaseScanNode#computeStats can be called twice if the node is the root
of the single-node plan. The sampling can be skipped in the second pass
since startKey_, endKey_ and keyConjuncts_ have not changed.

inputCardinality_ will have a valid value if the sampling has been done
without errors. We check this to determine whether we can skip sampling
in the second pass.
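
The check described above can be sketched roughly as follows. This is
an illustration only, not the actual patch: the class ScanNodeSketch,
the LongSupplier standing in for the HBase sampling RPCs, and the use
of -1 as the "not sampled yet" value are all assumptions.

```java
import java.util.function.LongSupplier;

/** Simplified stand-in for a scan node; NOT Impala's HBaseScanNode. */
final class ScanNodeSketch {
  // Assumption: a negative value means "no valid sample yet".
  private long inputCardinality = -1;
  // Counts how often the expensive sampling step actually ran.
  private int sampleCalls = 0;
  // Hypothetical stand-in for the HBase RPCs that sample the table.
  private final LongSupplier sampler;

  ScanNodeSketch(LongSupplier sampler) {
    this.sampler = sampler;
  }

  /** May be called more than once by the planner; samples at most once on success. */
  void computeStats() {
    if (inputCardinality >= 0) {
      // A previous pass sampled without errors and the scan range has
      // not changed, so the earlier estimate is reused.
      return;
    }
    sampleCalls++;
    try {
      inputCardinality = sampler.getAsLong();
    } catch (RuntimeException e) {
      // Sampling failed: keep the value invalid so a later pass retries.
      inputCardinality = -1;
    }
  }

  long getInputCardinality() {
    return inputCardinality;
  }

  int getSampleCalls() {
    return sampleCalls;
  }

  public static void main(String[] args) {
    ScanNodeSketch node = new ScanNodeSketch(() -> 1706076L);
    node.computeStats(); // first pass, e.g. from init()
    node.computeStats(); // second pass, e.g. from createQueryPlan()
    System.out.println("sampling passes: " + node.getSampleCalls());
  }
}
```

Keying the check on the cached result rather than on a separate
"already called" flag means a pass that failed to sample is retried
instead of being silently skipped.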

Tests:
 - Tested in my local env with TRACE log level and observed that the
   sampling is done only once.
 - Ran CORE tests.

Change-Id: I4650483c0504128048630714e993b481737fd1e2
Reviewed-on: http://gerrit.cloudera.org:8080/14167
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Avoid calling computeStats twice on HBaseScanNode
> -------------------------------------------------
>
>                 Key: IMPALA-8912
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8912
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0, Impala 3.1.0, Impala 3.2.0, Impala 3.3.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> For simple queries on HBase tables that have an HBaseScanNode as the root of the SingleNodePlan, HBaseScanNode#computeStats will be called twice.
> Stacktrace for the first call:
> {code:java}
>         at org.apache.impala.planner.HBaseScanNode.computeStats(HBaseScanNode.java:286)
>         at org.apache.impala.planner.HBaseScanNode.init(HBaseScanNode.java:160)
>         at org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1405)
>         at org.apache.impala.planner.SingleNodePlanner.createTableRefNode(SingleNodePlanner.java:1582)
>         at org.apache.impala.planner.SingleNodePlanner.createTableRefsPlan(SingleNodePlanner.java:826)
>         at org.apache.impala.planner.SingleNodePlanner.createSelectPlan(SingleNodePlanner.java:662)
>         at org.apache.impala.planner.SingleNodePlanner.createQueryPlan(SingleNodePlanner.java:261)
>         at org.apache.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:151)
>         at org.apache.impala.planner.Planner.createPlan(Planner.java:117)
>         at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1169)
>         at org.apache.impala.service.Frontend.getPlannedExecRequest(Frontend.java:1495)
>         at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1359)
>         at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1250)
>         at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1220)
> {code}
> Stacktrace for the second call:
> {code:java}
>         at org.apache.impala.planner.HBaseScanNode.computeStats(HBaseScanNode.java:286)
>         at org.apache.impala.planner.SingleNodePlanner.createQueryPlan(SingleNodePlanner.java:307)
>         at org.apache.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:151)
>         at org.apache.impala.planner.Planner.createPlan(Planner.java:117)
>         at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1169)
>         at org.apache.impala.service.Frontend.getPlannedExecRequest(Frontend.java:1495)
>         at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1359)
>         at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1250)
>         at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1220)
>         at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:154)
> {code}
> Code for the second call:
> {code:java}
>   private PlanNode createQueryPlan(QueryStmt stmt, Analyzer analyzer, boolean disableTopN)
>       throws ImpalaException {
>     ......
>     if (stmt.evaluateOrderBy() && sortHasMaterializedSlots) {
>       root = createSortNode(analyzer, root, stmt.getSortInfo(), stmt.getLimit(),
>           stmt.getOffset(), stmt.hasLimit(), disableTopN);
>     } else {
>       root.setLimit(stmt.getLimit());
>       root.computeStats(analyzer);   // <--- May call HBaseScanNode#computeStats here
>     }
>     return root;
>   }
> {code}
> Logs for a simple query on an old version of Impala:
> {code:java}
> I0830 11:52:05.991547 41189 Analyzer.java:1578] new pred: stg.xxx_hbase.key >= 'key1' BinaryPredicate{op=>=, SlotRef{path=key, type=STRING, id=0} StringLiteral{value=key1}}
> I0830 11:52:05.991595 41189 Analyzer.java:1578] new pred: stg.xxx_hbase.key <= 'key2' BinaryPredicate{op=<=, SlotRef{path=key, type=STRING, id=0} StringLiteral{value=key2}}
> # <--------- 2 seconds here
> I0830 11:52:08.114225 41189 HBaseScanNode.java:217] computeStats HbaseScan: cardinality=1706076
> I0830 11:52:08.114341 41189 HBaseScanNode.java:223] computeStats HbaseScan: #nodes=100
> I0830 11:52:08.114452 41189 SingleNodePlanner.java:357] createCheapestJoinPlan
> # <--------- 2 seconds here
> I0830 11:52:10.260190 41189 HBaseScanNode.java:217] computeStats HbaseScan: cardinality=1706076
> I0830 11:52:10.260303 41189 HBaseScanNode.java:223] computeStats HbaseScan: #nodes=100
> I0830 11:52:10.260387 41189 SingleNodePlanner.java:357] createCheapestJoinPlan
> {code}
> Such queries are usually point queries and are expected to return quickly. HBaseScanNode#computeStats is expensive since it requires RPCs to HBase. We should avoid calling it twice.
