You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2020/03/12 22:52:13 UTC

[GitHub] [hive] pgaref opened a new pull request #952: HIVE-23006 ProbeDecode compiler support

pgaref opened a new pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952
 
 
   This patch adds an extra optimisation step with the goal of finding Table Scan operators that could reduce the number of rows decoded at runtime using extra available information.
   
   It currently looks for all the available MapJoin operators that could use the smaller HashTable on the probing side (where TS is) to filter-out rows that would never match.
   To do so the HashTable information is pushed down to the TS properties and then propagated as part of MapWork.
   If the a single TS is used by multiple operators (shared-word), this rule can not be applied.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r405765078
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1482,18 +1490,131 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
         deque.addAll(op.getChildOperators());
       }
     }
+    //  No need to remove SJ branches when we have semi-join reduction or when semijoins are enabled for parallel mapjoins.
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
+        for (Entry<ReduceSinkOperator, TableScanOperator> semiEntry : semijoins.entrySet()) {
+          SemiJoinBranchInfo sjInfo = procCtx.parseContext.getRsToSemiJoinBranchInfo().get(semiEntry.getKey());
+          if (sjInfo.getIsHint() || !sjInfo.getShouldRemove()) {
+            // Created by hint, skip it
+            continue;
+          }
+          if (LOG.isDebugEnabled()) {
+            LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin " +
+                OperatorUtils.getOpNamePretty(semiEntry.getKey()) + " - " + OperatorUtils.getOpNamePretty(semiEntry.getValue()));
+          }
+          GenTezUtils.removeBranch(semiEntry.getKey());
+          GenTezUtils.removeSemiJoinOperator(procCtx.parseContext, semiEntry.getKey(), semiEntry.getValue());
+        }
+      }
+    }
+    if (procCtx.conf.getBoolVar(ConfVars.HIVE_OPTIMIZE_SCAN_PROBEDECODE)) {
+      if (probeDecodeMJoins.size() > 0) {
 
 Review comment:
   The path for `HIVE_OPTIMIZE_SCAN_PROBEDECODE` seems independent from SJ optimization. Should we add a mechanism to remove the context for the optimization when we think it is not going to be beneficial, e.g., it is not filtering any data? Or you think that the cost of checking is negligible and we should always apply this optimization? What do you experiments show in the worst case scenario? (In any case, this could be tackled in a follow-up but I wanted to ask)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r406148931
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1483,17 +1489,64 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
       }
     }
 
-    if (semijoins.size() > 0) {
-      for (ReduceSinkOperator rs : semijoins.keySet()) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin "
-              + OperatorUtils.getOpNamePretty(rs) + " - " + OperatorUtils.getOpNamePretty(semijoins.get(rs)));
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
 
 Review comment:
   Hey Jesus, SJ removal and MJ probedecode are independent from my point of view, and thats what I was trying to achieve here -- when TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN is not enabled SJ removal is triggered (similar to existing code) and when HIVE_OPTIMIZE_SCAN_PROBEDECODE is enabled we do the extra the MJ to TS selection.
   I am not sure there is a case where these two optimizations could affect each other (?) but currently they are completely independent.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r405762816
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1483,17 +1489,64 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
       }
     }
 
-    if (semijoins.size() > 0) {
-      for (ReduceSinkOperator rs : semijoins.keySet()) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin "
-              + OperatorUtils.getOpNamePretty(rs) + " - " + OperatorUtils.getOpNamePretty(semijoins.get(rs)));
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
 
 Review comment:
   Probably I am missing part of the mechanism to enable your optimization. It seems you are skipping the removal of SJ, so you are adding the context AND keeping those branches in the plan. Is the intention to keep both optimizations in these cases? Or is it because there is some dependency between them? I thought for MJ, the intention was to keep only your new optimization, that's maybe where my misunderstanding is coming from.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor edited a comment on issue #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor edited a comment on issue #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#issuecomment-608201286
 
 
   @pgaref , thanks for changes. Patch looks good. About the name, I think it is fine (probe side for the filtering generated statically in case of expressions or dynamically for joins).
   
   I think main remaining issue is related to selection in case of multiple MJ operators. I was thinking that for the time being, we could make the policy pluggable via config and have two very simple ones: 1) keep the one with lowest `ndv_JOIN_key_column / ndv_TS_target_column` ratio, or 2) keep those with ratio below a certain threshold specified through a config value. Alternatively, in this patch you could maybe go only with option 1 and tackle other in follow-up.
   
   You have the information about the stats in the operators themselves (follow `op.getStatistics()` calls to see how it is retrieved per column basis). Thus, I would make that decision in the TezCompiler (once we gather all the context information) rather than GenTezUtils. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on issue #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on issue #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#issuecomment-609981295
 
 
   > @pgaref , thanks for changes. Patch looks good. About the name, I think it is fine (probe side for the filtering generated statically in case of expressions or dynamically for joins).
   > 
   > I think main remaining issue is related to selection in case of multiple MJ operators. I was thinking that for the time being, we could make the policy pluggable via config and have two very simple ones: 1) keep the one with lowest `ndv_JOIN_probe_key_column / ndv_TS_target_column` ratio, or 2) keep those with ratio below a certain threshold specified through a config value. Alternatively, in this patch you could maybe go only with option 1 and tackle other in follow-up.
   > 
   > You have the information about the stats in the operators themselves (follow `op.getStatistics()` calls to see how it is retrieved per column basis). Thus, I would make that decision in the TezCompiler (once we gather all the context information) rather than GenTezUtils.
   
   @jcamachor thanks for the review, I appreciate it!
   
   Please check the latest changes with:
   
   - More generic probeDecode name for HiveConf
   
   - TezCompiler with configurable logic to select a MJ when we have multiple per TS -- currently picking the one with the lowest MJ/TS key Ratio
   
   - SharedWork using a TS on multiple branches clears probeContext that would result in wrong results
   
   - probeContextDetails as part of the explained plan
   
   - ProbeDecode Test cases with and without table stats
   
   Thoughts?
   
   My last concern is some operators like **VectorPTFEvaluatorDecimalAvg** that do not allow the use of Selected (which is what probedecode is doing) and whether we should take that into account as part of the optimizations.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on issue #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on issue #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#issuecomment-608201286
 
 
   @pgaref , thanks for changes. Patch looks good. About the name, I think it is fine (probe side for the filtering generated statically in case of expressions or dynamically for joins).
   
   I think main remaining issue is related to selection in case of multiple MJ operators. I was thinking that for the time being, we could make the policy pluggable via config and have three very simple ones: 1) keep all of them, 2) keep the one with lowest `ndv_JOIN_key_column / ndv_TS_target_column` ratio, or 3) keep those with ratio below a certain threshold specified through a config value. Alternatively, in this patch you could maybe go only with option 2 and tackle others in follow-up.
   
   You have the information about the stats in the operators themselves (follow `op.getStatistics()` calls to see how it is retrieved per column basis). Thus, I would make that decision in the TezCompiler (once we gather all the context information) rather than GenTezUtils. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r405769389
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
 ##########
 @@ -487,6 +487,15 @@ private static boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptim
             }
             LOG.debug("Input operator removed: {}", op);
           }
+
+          // A shared TSop across branches can not have probeContext that utilizes single branch info
+          // Filtered-out rows from one branch might be needed by another branch sharing a TSop
+          if (retainableTsOp.getProbeDecodeContext() != null) {
 
 Review comment:
   In this case, should we remove the `ProbeDecodeContext` or should we skip merging these two TS operators? It may be that in some cases merging will backfire, i.e., if those two filters were very selective? Just for reference, if I remember correctly, SharedWorkOptimizer only merges TS operators targeted by SJs if at least one of the TS operators do not contain a SJ (since we would incur the full scan cost in any case at least once).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r392609447
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/ProbeDecodeOptimizer.java
 ##########
 @@ -0,0 +1,139 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer;
+
+import org.apache.hadoop.hive.ql.exec.MapJoinOperator;
+import org.apache.hadoop.hive.ql.exec.Operator;
+import org.apache.hadoop.hive.ql.exec.TableScanOperator;
+import org.apache.hadoop.hive.ql.parse.ParseContext;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
+import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
+import org.apache.hadoop.hive.ql.plan.MapJoinDesc;
+import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
+import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.Deque;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * <p> The goal of this rule is to find suitable table scan operators that could reduce the
+ * number of rows decoded at runtime using extra available information.
+ *
+ * <p> First the rule checks all the available MapJoin operators that could propagate the
+ * smaller HashTable on the probing side (where TS is) to filter-out rows that would
+ * never match. To do so the HashTable information is pushed down to the TS properties.
+ * If the a single TS is used by multiple operators (shared-word), this rule
 
 Review comment:
   typo: word -> work

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r405770526
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java
 ##########
 @@ -196,6 +196,11 @@ public MapWork createMapWork(GenTezProcContext context, Operator<?> root,
       mapWork.setIncludedBuckets(ts.getConf().getIncludedBuckets());
     }
 
+    if (ts.getProbeDecodeContext() != null) {
+      // TODO: some operators like VectorPTFEvaluator do not allow the use of Selected take this into account here?
 
 Review comment:
   I think this may be taken into account when we are pushing the SJ predicates down since it would not be valid in that case either (which I believe your logic to create the context is relying on?). Could you verify that?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r406161694
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java
 ##########
 @@ -196,6 +196,11 @@ public MapWork createMapWork(GenTezProcContext context, Operator<?> root,
       mapWork.setIncludedBuckets(ts.getConf().getIncludedBuckets());
     }
 
+    if (ts.getProbeDecodeContext() != null) {
+      // TODO: some operators like VectorPTFEvaluator do not allow the use of Selected take this into account here?
 
 Review comment:
   Just to clarify, I am referring to this Precondition check that is used by all PTF operators.
   As this optimization enables TS filtering by setting selectedInUse down the pipeline it could make these PreConditions to fail.
   
   We could take this into account when adding valid MJ-TS ops indeed, however it could be still beneficial for other operators on the pipeline that do use the selected array. 
   So the real question: is the condition really needed on those operators? 
   It does not look to me that it could cause any issues if it is enabled (even if PTF ops dont use it)
   
   https://github.com/apache/hive/blob/cc38bcc5a993304898ba37b8496f13a15d62bf16/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFEvaluatorDecimalAvg.java#L64

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r404318170
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1483,17 +1489,64 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
       }
     }
 
-    if (semijoins.size() > 0) {
-      for (ReduceSinkOperator rs : semijoins.keySet()) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin "
-              + OperatorUtils.getOpNamePretty(rs) + " - " + OperatorUtils.getOpNamePretty(semijoins.get(rs)));
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
 
 Review comment:
   Hey @jcamachor  -- in this case I just followed the logic [above](https://github.com/apache/hive/pull/952/files#diff-228178321e4c1df7f045fe181de2fffaL1450) where we seems to bail-out when 
   TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN is enabled.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] asfgit closed pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on issue #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on issue #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#issuecomment-603466661
 
 
   > I started taking a look at your patch as well as code in master.
   > I believe following approach would be easier since you could rely on existing SJ logic.
   > You could modify method `removeSemijoinsParallelToMapJoin` in `TezCompiler`. In particular, `findParallelSemiJoinBranch` already finds all semijoins originating in a MapJoin input (note that you may need small modification in [this block](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java#L1380) to not skip adding them to the map). After calling that method, `semijoins` will contain pairs (RS, TS). You will enter [this block](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java#L1486) if `TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN ` is `true` (note that you have to remove this condition from L1450). Else, if `HIVE_MAPJOIN_PROBEDECODE_ENABLED` is `true`, you will need to remove the SJs (same as the block is doing) and create your optimization data structures, i.e., your context in the TS as you are currently doing. You may end up with multiple MJ targeting a single TS; you can generate a context for each of them, store these in the TS, and decide later on a strategy to apply the filtering based on a config parameter. Let me know if this makes sense.
   > 
   > Cc @t3rmin4t0r
   
   Hey @jcamachor , thanks for the comments!
   I updated the patch to capture all (MJ, TS) pairs as part of semijoinRemovalBasedTransformations https://github.com/apache/hive/pull/952/files#diff-228178321e4c1df7f045fe181de2fffaR1453
   
   I believe the TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN flag should be false for semijoin removal to kick in (if true the whole method returned in the clean version)
   https://github.com/apache/hive/pull/952/files#diff-228178321e4c1df7f045fe181de2fffaR1455
   
   The idea of storing multiple MJ targets as part of a TS also makes sense to me -- I guess the decision could be part of GenTezUtils or earlier optimisation rules -- what do you think? https://github.com/apache/hive/pull/952/files#diff-4a039ee6b13e42df6ff51e85db691132R199
   
   PS: Was also thinking that the naming of this is a bit off (HIVE_MAPJOIN_PROBEDECODE) -- as this can also be used for static expression.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor edited a comment on issue #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor edited a comment on issue #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#issuecomment-608201286
 
 
   @pgaref , thanks for changes. Patch looks good. About the name, I think it is fine (probe side for the filtering generated statically in case of expressions or dynamically for joins).
   
   I think main remaining issue is related to selection in case of multiple MJ operators. I was thinking that for the time being, we could make the policy pluggable via config and have two very simple ones: 1) keep the one with lowest `ndv_JOIN_probe_key_column / ndv_TS_target_column` ratio, or 2) keep those with ratio below a certain threshold specified through a config value. Alternatively, in this patch you could maybe go only with option 1 and tackle other in follow-up.
   
   You have the information about the stats in the operators themselves (follow `op.getStatistics()` calls to see how it is retrieved per column basis). Thus, I would make that decision in the TezCompiler (once we gather all the context information) rather than GenTezUtils. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r402708575
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1483,17 +1489,64 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
       }
     }
 
-    if (semijoins.size() > 0) {
-      for (ReduceSinkOperator rs : semijoins.keySet()) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin "
-              + OperatorUtils.getOpNamePretty(rs) + " - " + OperatorUtils.getOpNamePretty(semijoins.get(rs)));
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
 
 Review comment:
   You may still need to get rid of the SJ branches for the map join operators here if `HIVE_MAPJOIN_PROBEDECODE_ENABLED`? Or do you need to keep them?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r406155406
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1482,18 +1490,131 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
         deque.addAll(op.getChildOperators());
       }
     }
+    //  No need to remove SJ branches when we have semi-join reduction or when semijoins are enabled for parallel mapjoins.
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
+        for (Entry<ReduceSinkOperator, TableScanOperator> semiEntry : semijoins.entrySet()) {
+          SemiJoinBranchInfo sjInfo = procCtx.parseContext.getRsToSemiJoinBranchInfo().get(semiEntry.getKey());
+          if (sjInfo.getIsHint() || !sjInfo.getShouldRemove()) {
+            // Created by hint, skip it
+            continue;
+          }
+          if (LOG.isDebugEnabled()) {
+            LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin " +
+                OperatorUtils.getOpNamePretty(semiEntry.getKey()) + " - " + OperatorUtils.getOpNamePretty(semiEntry.getValue()));
+          }
+          GenTezUtils.removeBranch(semiEntry.getKey());
+          GenTezUtils.removeSemiJoinOperator(procCtx.parseContext, semiEntry.getKey(), semiEntry.getValue());
+        }
+      }
+    }
+    if (procCtx.conf.getBoolVar(ConfVars.HIVE_OPTIMIZE_SCAN_PROBEDECODE)) {
+      if (probeDecodeMJoins.size() > 0) {
 
 Review comment:
   When using MJ to add filtering on ProbeDecode (and not static filters) I believe we should always keep the context as we dont really know how effective the filter is going to be right? (as in we dont know how many HT keys are going to match the particular column on the TS side)
   
   In ORC-597 I did some experiments using existing datasets (github, sales, etc) and found that even when we filter-out 20% of elements we dont add extra overhead (due to row-level filtering) -- of course this depends on the data type as well.
   
   To tackle the above, I have have a runtime optimization as part of ORC Data Consumer (part of LLap) that disables the filter when its not effective.
   https://github.com/apache/hive/pull/926/files#diff-4137d272789978e35fd5f489f09da064R343

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r406157569
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java
 ##########
 @@ -487,6 +487,15 @@ private static boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptim
             }
             LOG.debug("Input operator removed: {}", op);
           }
+
+          // A shared TSop across branches can not have probeContext that utilizes single branch info
+          // Filtered-out rows from one branch might be needed by another branch sharing a TSop
+          if (retainableTsOp.getProbeDecodeContext() != null) {
 
 Review comment:
   I am not sure there is a good way to check how selective these filters are especially when using MJ HT probing (available at runtime at non-deterministic times) thus I chose to disable the optimization.
   
   Maybe an actual decision (merging TS or not) could be made when we deal with actual static filters that can be more predictable.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support

Posted by GitBox <gi...@apache.org>.
pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r404318170
 
 

 ##########
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
 ##########
 @@ -1483,17 +1489,64 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
       }
     }
 
-    if (semijoins.size() > 0) {
-      for (ReduceSinkOperator rs : semijoins.keySet()) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin "
-              + OperatorUtils.getOpNamePretty(rs) + " - " + OperatorUtils.getOpNamePretty(semijoins.get(rs)));
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
 
 Review comment:
   Hey @jcamachor  -- in this case I just followed the logic [above](https://github.com/apache/hive/pull/952/files#diff-228178321e4c1df7f045fe181de2fffaL1450) where we  bail-out when 
   TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN is enabled.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org