Posted to issues@hive.apache.org by "Stamatis Zampetakis (Jira)" <ji...@apache.org> on 2023/04/20 15:29:00 UTC

[jira] [Created] (HIVE-27278) Simplify correlated queries with empty inputs

Stamatis Zampetakis created HIVE-27278:
------------------------------------------

             Summary: Simplify correlated queries with empty inputs
                 Key: HIVE-27278
                 URL: https://issues.apache.org/jira/browse/HIVE-27278
             Project: Hive
          Issue Type: Bug
          Components: CBO
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


The correlated query below will never produce any results, regardless of the contents of the tables.
{code:sql}
create table t1 (id int, val varchar(10)) stored as orc TBLPROPERTIES ('transactional'='true');
create table t2 (id int, val varchar(10)) stored as orc TBLPROPERTIES ('transactional'='true');

EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id = t2.id);
{code}
The CBO is able to derive that part of the query is empty and ends up with the following plan.
{noformat}
CBO PLAN:
HiveProject(id=[$0])
  LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}])
    HiveTableScan(table=[[default, t1]], table:alias=[t1])
    HiveValues(tuples=[[]])
{noformat}
The presence of the LogicalCorrelate is not only redundant but also problematic, since many parts of the optimizer assume that queries are decorrelated and do not know how to handle a LogicalCorrelate.
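For reference, once the empty right side is propagated through the SEMI correlate, one would expect the whole plan to collapse to a single empty values node, along the lines of the following sketch (not the actual output of any current rule):
{noformat}
CBO PLAN:
HiveValues(tuples=[[]])
{noformat}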

In the presence of materialized views, the same query can lead to the following exception during compilation.
{code:sql}
CREATE MATERIALIZED VIEW v1 AS SELECT id FROM t2;
EXPLAIN CBO SELECT id FROM t1 WHERE NULL IN (SELECT NULL FROM t2 where t1.id = t2.id);
{code}
{noformat}
org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not enough rules to produce a node with desired properties: convention=HIVE, sort=[], dist=any. All the inputs have relevant nodes, however the cost is still infinite.
Root: rel#185:RelSubset#3.HIVE.[].any
Original rel:
HiveProject(id=[$0]): rowcount = 4.0, cumulative cost = {20.0 rows, 13.0 cpu, 0.0 io}, id = 178
  LogicalCorrelate(correlation=[$cor0], joinType=[semi], requiredColumns=[{}]): rowcount = 4.0, cumulative cost = {16.0 rows, 9.0 cpu, 0.0 io}, id = 176
    HiveTableScan(table=[[default, t1]], table:alias=[t1]): rowcount = 4.0, cumulative cost = {4.0 rows, 5.0 cpu, 0.0 io}, id = 111
    HiveValues(tuples=[[]]): rowcount = 1.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 139

Sets:
Set#0, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)
	rel#180:RelSubset#0.HIVE.[].any, best=rel#111
		rel#111:HiveTableScan.HIVE.[].any(table=[default, t1],htColumns=[0, 1, 2, 3, 4, 5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null), rowcount=4.0, cumulative cost={4.0 rows, 5.0 cpu, 0.0 io}
Set#1, type: RecordType(NULL _o__c0)
	rel#181:RelSubset#1.HIVE.[].any, best=rel#139
		rel#139:HiveValues.HIVE.[].any(type=RecordType(NULL _o__c0),tuples=[]), rowcount=1.0, cumulative cost={1.0 rows, 1.0 cpu, 0.0 io}
Set#2, type: RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)
	rel#183:RelSubset#2.NONE.[].any, best=null
		rel#182:LogicalCorrelate.NONE.[].any(left=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}), rowcount=4.0, cumulative cost={inf}
Set#3, type: RecordType(INTEGER id)
	rel#185:RelSubset#3.HIVE.[].any, best=null
		rel#184:HiveProject.HIVE.[].any(input=RelSubset#183,inputs=0,synthetic=false), rowcount=4.0, cumulative cost={inf}

Graphviz:
digraph G {
	root [style=filled,label="Root"];
	subgraph cluster0{
		label="Set 0 RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)";
		rel111 [label="rel#111:HiveTableScan\ntable=[default, t1],htColumns=[0, 1, 2, 3, 4, 5],insideView=false,plKey=default.t1;,table:alias=t1,tableScanTrait=null\nrows=4.0, cost={4.0 rows, 5.0 cpu, 0.0 io}",color=blue,shape=box]
		subset180 [label="rel#180:RelSubset#0.HIVE.[].any"]
	}
	subgraph cluster1{
		label="Set 1 RecordType(NULL _o__c0)";
		rel139 [label="rel#139:HiveValues\ntype=RecordType(NULL _o__c0),tuples=[]\nrows=1.0, cost={1.0 rows, 1.0 cpu, 0.0 io}",color=blue,shape=box]
		subset181 [label="rel#181:RelSubset#1.HIVE.[].any"]
	}
	subgraph cluster2{
		label="Set 2 RecordType(INTEGER id, VARCHAR(10) val, BIGINT BLOCK__OFFSET__INSIDE__FILE, VARCHAR(2147483647) INPUT__FILE__NAME, RecordType(BIGINT writeid, INTEGER bucketid, BIGINT rowid) ROW__ID, BOOLEAN ROW__IS__DELETED)";
		rel182 [label="rel#182:LogicalCorrelate\nleft=RelSubset#180,right=RelSubset#181,correlation=$cor0,joinType=semi,requiredColumns={}\nrows=4.0, cost={inf}",shape=box]
		subset183 [label="rel#183:RelSubset#2.NONE.[].any"]
	}
	subgraph cluster3{
		label="Set 3 RecordType(INTEGER id)";
		rel184 [label="rel#184:HiveProject\ninput=RelSubset#183,inputs=0,synthetic=false\nrows=4.0, cost={inf}",shape=box]
		subset185 [label="rel#185:RelSubset#3.HIVE.[].any"]
	}
	root -> subset185;
	subset180 -> rel111[color=blue];
	subset181 -> rel139[color=blue];
	subset183 -> rel182; rel182 -> subset180[label="0"]; rel182 -> subset181[label="1"];
	subset185 -> rel184; rel184 -> subset183;
}
	at org.apache.calcite.plan.volcano.RelSubset$CheapestPlanReplacer.visit(RelSubset.java:742) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.calcite.plan.volcano.RelSubset.buildCheapestPlan(RelSubset.java:365) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:520) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.applyMaterializedViewRewriting(CalcitePlanner.java:2058) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1722) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1591) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1343) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:570) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12820) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:465) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:180) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:326) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:107) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:519) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:471) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:436) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:430) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:121) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:227) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
	at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356) ~[hive-cli-4.0.0-SNAPSHOT.jar:?]
	at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:733) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:703) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:115) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62) ~[test-classes/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_261]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_261]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_261]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_261]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) ~[junit-4.13.2.jar:4.13.2]
	at org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.Suite.runChild(Suite.java:128) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.Suite.runChild(Suite.java:27) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) ~[junit-4.13.2.jar:4.13.2]
	at org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:95) ~[hive-it-util-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
	at org.junit.rules.RunRules.evaluate(RunRules.java:20) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413) ~[junit-4.13.2.jar:4.13.2]
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) ~[surefire-junit4-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:377) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:138) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:465) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:451) ~[surefire-booter-3.0.0-M4.jar:3.0.0-M4]
{noformat}
The goal of this ticket is to get rid of the redundant correlation, not only to avoid compilation failures but also to unlock further simplifications and improve plan readability.

The plan can be simplified further based on the following observations.

If the right side of the correlate is empty, the correlate can be removed for every legal join type (see the SQL sketch after this list):
* INNER/SEMI: the whole correlate is empty.
* LEFT: the correlate can be dropped and replaced by t1, with the columns of the right side padded with nulls.
* ANTI: the result is the entire t1, so the correlate can be dropped.

RIGHT and FULL correlations are invalid and should never appear in the plan.
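For illustration, the rewrites above roughly correspond to the following SQL over the tables from the beginning of the ticket. This is only a sketch of the intended semantics; the actual simplification operates on the CBO plan, and the column alias c0 is just a placeholder name.
{code:sql}
-- Right side (SELECT NULL FROM t2 WHERE t1.id = t2.id) known to be empty:

-- INNER/SEMI: the correlate produces no rows
SELECT id FROM t1 WHERE FALSE;

-- LEFT: keep the rows of t1 and pad the right-side column with NULL
SELECT t1.id, NULL AS c0 FROM t1;

-- ANTI: every row of t1 passes through, the correlate disappears
SELECT id FROM t1;
{code}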

If the left side of the correlate is empty, the result is empty and the correlate can be dropped for every legal join type (INNER/SEMI/ANTI/LEFT), as shown below.
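In SQL terms, the left-side case collapses to an empty result for all four join types (again just a sketch of the semantics):
{code:sql}
-- Left side known to be empty: INNER, SEMI, ANTI and LEFT all yield no rows
SELECT id FROM t1 WHERE FALSE;
{code}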


