Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/12/09 23:08:43 UTC

[GitHub] [hive] jcamachor commented on a change in pull request #1750: HIVE-24388: Enhance swo optimizations to merge EventOperators

jcamachor commented on a change in pull request #1750:
URL: https://github.com/apache/hive/pull/1750#discussion_r539697350



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
##########
@@ -17,6 +17,7 @@
  */
 package org.apache.hadoop.hive.ql.optimizer;
 
+import java.io.File;

Review comment:
       Needed?

##########
File path: ql/src/test/results/clientpositive/llap/swo_event_merge.q.out
##########
@@ -0,0 +1,291 @@
+PREHOOK: query: drop table if exists x1_store_sales
+PREHOOK: type: DROPTABLE
+POSTHOOK: query: drop table if exists x1_store_sales
+POSTHOOK: type: DROPTABLE
+PREHOOK: query: drop table if exists x1_date_dim
+PREHOOK: type: DROPTABLE
+POSTHOOK: query: drop table if exists x1_date_dim
+POSTHOOK: type: DROPTABLE
+PREHOOK: query: drop table if exists x1_item
+PREHOOK: type: DROPTABLE
+POSTHOOK: query: drop table if exists x1_item
+POSTHOOK: type: DROPTABLE
+PREHOOK: query: create table x1_store_sales 
+(
+	ss_item_sk	int
+)
+partitioned by (ss_sold_date_sk int)
+stored as orc
+PREHOOK: type: CREATETABLE
+PREHOOK: Output: database:default
+PREHOOK: Output: default@x1_store_sales
+POSTHOOK: query: create table x1_store_sales 
+(
+	ss_item_sk	int
+)
+partitioned by (ss_sold_date_sk int)
+stored as orc
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@x1_store_sales
+PREHOOK: query: create table x1_date_dim
+(
+	d_date_sk	int,
+	d_month_seq	int,
+	d_year		int,
+	d_moy		int
+)
+stored as orc
+PREHOOK: type: CREATETABLE
+PREHOOK: Output: database:default
+PREHOOK: Output: default@x1_date_dim
+POSTHOOK: query: create table x1_date_dim
+(
+	d_date_sk	int,
+	d_month_seq	int,
+	d_year		int,
+	d_moy		int
+)
+stored as orc
+POSTHOOK: type: CREATETABLE
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@x1_date_dim
+PREHOOK: query: insert into x1_date_dim values	(1,1,2000,2),
+				(2,2,2001,2)
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@x1_date_dim
+POSTHOOK: query: insert into x1_date_dim values	(1,1,2000,2),
+				(2,2,2001,2)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@x1_date_dim
+POSTHOOK: Lineage: x1_date_dim.d_date_sk SCRIPT []
+POSTHOOK: Lineage: x1_date_dim.d_month_seq SCRIPT []
+POSTHOOK: Lineage: x1_date_dim.d_moy SCRIPT []
+POSTHOOK: Lineage: x1_date_dim.d_year SCRIPT []
+PREHOOK: query: insert into x1_store_sales partition (ss_sold_date_sk=1) values (1)
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@x1_store_sales@ss_sold_date_sk=1
+POSTHOOK: query: insert into x1_store_sales partition (ss_sold_date_sk=1) values (1)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@x1_store_sales@ss_sold_date_sk=1
+POSTHOOK: Lineage: x1_store_sales PARTITION(ss_sold_date_sk=1).ss_item_sk SCRIPT []
+PREHOOK: query: insert into x1_store_sales partition (ss_sold_date_sk=2) values (2)
+PREHOOK: type: QUERY
+PREHOOK: Input: _dummy_database@_dummy_table
+PREHOOK: Output: default@x1_store_sales@ss_sold_date_sk=2
+POSTHOOK: query: insert into x1_store_sales partition (ss_sold_date_sk=2) values (2)
+POSTHOOK: type: QUERY
+POSTHOOK: Input: _dummy_database@_dummy_table
+POSTHOOK: Output: default@x1_store_sales@ss_sold_date_sk=2
+POSTHOOK: Lineage: x1_store_sales PARTITION(ss_sold_date_sk=2).ss_item_sk SCRIPT []
+PREHOOK: query: alter table x1_store_sales partition (ss_sold_date_sk=1) update statistics set(
+'numRows'='123456',
+'rawDataSize'='1234567')
+PREHOOK: type: ALTERTABLE_UPDATEPARTSTATS
+PREHOOK: Input: default@x1_store_sales
+PREHOOK: Output: default@x1_store_sales@ss_sold_date_sk=1
+POSTHOOK: query: alter table x1_store_sales partition (ss_sold_date_sk=1) update statistics set(
+'numRows'='123456',
+'rawDataSize'='1234567')
+POSTHOOK: type: ALTERTABLE_UPDATEPARTSTATS
+POSTHOOK: Input: default@x1_store_sales
+POSTHOOK: Input: default@x1_store_sales@ss_sold_date_sk=1
+POSTHOOK: Output: default@x1_store_sales@ss_sold_date_sk=1
+PREHOOK: query: alter table x1_date_dim update statistics set(
+'numRows'='56',
+'rawDataSize'='81449')
+PREHOOK: type: ALTERTABLE_UPDATETABLESTATS
+PREHOOK: Input: default@x1_date_dim
+PREHOOK: Output: default@x1_date_dim
+POSTHOOK: query: alter table x1_date_dim update statistics set(
+'numRows'='56',
+'rawDataSize'='81449')
+POSTHOOK: type: ALTERTABLE_UPDATETABLESTATS
+POSTHOOK: Input: default@x1_date_dim
+POSTHOOK: Output: default@x1_date_dim
+PREHOOK: query: explain 
+select   count(*) cnt
+ from
+     x1_store_sales s
+     ,x1_date_dim d
+ where  
+	1=1
+	and s.ss_sold_date_sk = d.d_date_sk
+	and d.d_year=2000
+union
+select   s.ss_item_sk*d_date_sk
+ from
+     x1_store_sales s
+     ,x1_date_dim d
+ where  
+	1=1
+	and s.ss_sold_date_sk = d.d_date_sk
+	and d.d_year=2001
+	group by s.ss_item_sk*d_date_sk
+PREHOOK: type: QUERY
+PREHOOK: Input: default@x1_date_dim
+PREHOOK: Input: default@x1_store_sales
+PREHOOK: Input: default@x1_store_sales@ss_sold_date_sk=1
+PREHOOK: Input: default@x1_store_sales@ss_sold_date_sk=2
+#### A masked pattern was here ####
+POSTHOOK: query: explain 
+select   count(*) cnt
+ from
+     x1_store_sales s
+     ,x1_date_dim d
+ where  
+	1=1
+	and s.ss_sold_date_sk = d.d_date_sk
+	and d.d_year=2000
+union
+select   s.ss_item_sk*d_date_sk
+ from
+     x1_store_sales s
+     ,x1_date_dim d
+ where  
+	1=1
+	and s.ss_sold_date_sk = d.d_date_sk
+	and d.d_year=2001
+	group by s.ss_item_sk*d_date_sk
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@x1_date_dim
+POSTHOOK: Input: default@x1_store_sales
+POSTHOOK: Input: default@x1_store_sales@ss_sold_date_sk=1
+POSTHOOK: Input: default@x1_store_sales@ss_sold_date_sk=2
+#### A masked pattern was here ####
+Plan optimized by CBO.
+
+Vertex dependency in root stage
+Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
+Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE), Union 4 (CONTAINS)
+Reducer 5 <- Union 4 (SIMPLE_EDGE)
+Reducer 6 <- Map 1 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
+Reducer 7 <- Reducer 6 (SIMPLE_EDGE), Union 4 (CONTAINS)
+
+Stage-0
+  Fetch Operator
+    limit:-1
+    Stage-1
+      Reducer 5 vectorized, llap
+      File Output Operator [FS_89]
+        Group By Operator [GBY_88] (rows=1 width=8)
+          Output:["_col0"],keys:KEY._col0
+        <-Union 4 [SIMPLE_EDGE]
+          <-Reducer 3 [CONTAINS] vectorized, llap
+            Reduce Output Operator [RS_87]
+              PartitionCols:_col0
+              Group By Operator [GBY_86] (rows=1 width=8)
+                Output:["_col0"],keys:_col0
+                Group By Operator [GBY_85] (rows=1 width=8)
+                  Output:["_col0"],aggregations:["count(VALUE._col0)"]
+                <-Reducer 2 [CUSTOM_SIMPLE_EDGE] llap
+                  PARTITION_ONLY_SHUFFLE [RS_11]
+                    Group By Operator [GBY_10] (rows=1 width=8)
+                      Output:["_col0"],aggregations:["count()"]
+                      Merge Join Operator [MERGEJOIN_51] (rows=1728398 width=8)
+                        Conds:RS_71._col0=RS_77._col0(Inner)
+                      <-Map 1 [SIMPLE_EDGE] vectorized, llap
+                        SHUFFLE [RS_71]
+                          PartitionCols:_col0
+                          Select Operator [SEL_69] (rows=123457 width=4)
+                            Output:["_col0"]
+                            Filter Operator [FIL_68]
+                              predicate:ss_sold_date_sk is not null
+                              TableScan [TS_0] (rows=123457 width=14)
+                                default@x1_store_sales,s,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk"]

Review comment:
       Could merging these TS operators effectively make the plan worse?
   
   For instance, in this specific mock query, no partition will be pruned for the *x1_store_sales* table after the merge, whereas before, partition pruning was kicking in on each branch. Thus, in this case, you are scanning the same data whether you have one or two TS operators (two partitions either way); however, after merging the TS operators, the size of the data you are shuffling for the join doubles (the data in both partitions is shuffled twice). Is that analysis correct?
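
   To make the numbers concrete, here is a back-of-envelope illustration (assuming my analysis above is right; the row counts are the mocked stats from this test):

       // Back-of-envelope check of the shuffle-volume concern; row counts
       // come from the mocked partition stats in this test.
       public class ShuffleVolumeEstimate {
         public static void main(String[] args) {
           long rowsPart1 = 123456L; // ss_sold_date_sk=1 (numRows set via ALTER TABLE)
           long rowsPart2 = 1L;      // ss_sold_date_sk=2 (one inserted row)

           // Separate TS operators: partition pruning restricts each branch
           // to one partition, so the joins shuffle roughly the sum.
           long shuffledSeparate = rowsPart1 + rowsPart2;

           // Merged TS: the single scan feeds both join branches, so each
           // branch shuffles both partitions and the total roughly doubles.
           long shuffledMerged = 2 * (rowsPart1 + rowsPart2);

           System.out.println("separate: " + shuffledSeparate); // 123457
           System.out.println("merged:   " + shuffledMerged);   // 246914
         }
       }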
   
   Off the top of my head, this could be beneficial if i) both TS operators select only a small subset of the partitions in the table, or ii) the overlap between the partition lists of the two TS operators is greater than a certain threshold. Should we work in that direction, i.e., introduce some config parameters for this? A rough sketch of what (ii) could look like is below.
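
   Purely as a sketch of heuristic (ii): the class, method, and parameter names are hypothetical, made up for illustration, and not part of this PR or of SharedWorkOptimizer today. A real version would presumably read the threshold from a new HiveConf property.

       // Hypothetical helper illustrating the partition-overlap heuristic.
       import java.util.HashSet;
       import java.util.List;
       import java.util.Set;

       public final class TsMergeOverlapHeuristic {

         // Jaccard overlap of the two pruned partition lists: size of the
         // intersection divided by size of the union.
         static double overlap(List<String> partsA, List<String> partsB) {
           Set<String> union = new HashSet<>(partsA);
           union.addAll(partsB);
           if (union.isEmpty()) {
             return 1.0; // nothing scanned either way; merging adds no extra read
           }
           Set<String> intersection = new HashSet<>(partsA);
           intersection.retainAll(partsB);
           return (double) intersection.size() / union.size();
         }

         // Merge only when the branches mostly read the same partitions,
         // so the extra shuffle is paid for by the shared scan.
         static boolean shouldMergeTableScans(List<String> partsA,
             List<String> partsB, double overlapThreshold) {
           return overlap(partsA, partsB) >= overlapThreshold;
         }
       }

   If I am reading the plans right, the two branches in this test prune to disjoint partitions, so the overlap would be 0 and the merge would be skipped under any positive threshold.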
   
   @rbalamohan , what is your take? It would be helpful to have a second opinion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


