Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/20 13:39:33 UTC

[GitHub] [spark] wangyum opened a new pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

wangyum opened a new pull request #31908:
URL: https://github.com/apache/spark/pull/31908


   ### What changes were proposed in this pull request?
   
   This PR adds a new rule that removes an outer join if the query only has a distinct on the streamed side. For example:
   ```scala
   spark.range(200L).selectExpr("id AS a").createTempView("t1")
   spark.range(300L).selectExpr("id AS b").createTempView("t2")
   spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
   ```
   
   Before this pr:
   ```
   == Optimized Logical Plan ==
   Aggregate [a#2L], [a#2L]
   +- Project [a#2L]
      +- Join LeftOuter, (a#2L = b#6L)
         :- Project [id#0L AS a#2L]
         :  +- Range (0, 200, step=1, splits=Some(2))
         +- Project [id#4L AS b#6L]
            +- Range (0, 300, step=1, splits=Some(2))
   ```
   
   After this pr:
   ```
   == Optimized Logical Plan ==
   Aggregate [a#2L], [a#2L]
   +- Project [id#0L AS a#2L]
      +- Range (0, 200, step=1, splits=Some(2))
   ```
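
Why the rewrite is safe can be checked outside Spark. The following is a minimal sketch in plain Scala collections, not the rule from this PR: a left outer join keeps every row of the streamed (left) side at least once, so the distinct values of a left-side column are the same with or without the join. The object and method names are hypothetical.

```scala
// Minimal sketch using plain Scala collections (no Spark dependency).
// All names are hypothetical and only model the semantics of the example query.
object OuterJoinDistinctSketch {
  // Models `t1 LEFT JOIN t2 ON a = b`: each left row is paired with every
  // matching right row, or with None when nothing matches.
  def leftOuterJoin(left: Seq[Long], right: Seq[Long]): Seq[(Long, Option[Long])] =
    left.flatMap { a =>
      val matches = right.filter(_ == a)
      if (matches.isEmpty) Seq((a, Option.empty[Long]))
      else matches.map(b => (a, Option(b)))
    }

  // Models `SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b`.
  def distinctWithJoin(left: Seq[Long], right: Seq[Long]): Set[Long] =
    leftOuterJoin(left, right).map(_._1).toSet

  // Models `SELECT DISTINCT a FROM t1`, the plan after the join is removed.
  def distinctWithoutJoin(left: Seq[Long]): Set[Long] = left.toSet
}
```

Duplicate matches on the right side multiply left rows in the join output, but the distinct collapses them again, which is why only a distinct-like aggregate qualifies for this rewrite.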
   
   ### Why are the changes needed?
   
   Improve query performance by skipping a join that cannot affect the result of the query.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598379146



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.

Review comment:
       It would be very helpful for maintenance if you itemize the supported cases one by one here.






[GitHub] [spark] maropu commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598269749



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -659,6 +659,12 @@ case class Aggregate(
     val nonAgg = aggregateExpressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isEmpty)
     getAllValidConstraints(nonAgg)
   }
+
+  // Whether this Aggregate operator is equally the Distinct operator.
+  private[sql] def isEquallyDistinct: Boolean = {

Review comment:
       Why did you put this method here instead of in `RemoveOuterJoin`?






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811877966


   **[Test build #136806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136806/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598379146



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.

Review comment:
       Could you itemize the supported cases one by one here?






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803346243


   **[Test build #136284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136284/testReport)** for PR 31908 at commit [`618d9d3`](https://github.com/apache/spark/commit/618d9d3869c35f322c555d8474a765e72183aefa).




[GitHub] [spark] wangyum closed pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #31908:
URL: https://github.com/apache/spark/pull/31908


   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598682821



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -190,7 +190,8 @@ abstract class Optimizer(catalogManager: CatalogManager)
       ReplaceDeduplicateWithAggregate) ::
     Batch("Aggregate", fixedPoint,
       RemoveLiteralFromGroupExpressions,
-      RemoveRepetitionFromGroupExpressions) :: Nil ++
+      RemoveRepetitionFromGroupExpressions,
+      EliminateUnnecessaryOuterJoin) :: Nil ++

Review comment:
       ok






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809382525


   **[Test build #136656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136656/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824522507


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42296/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811797610


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41389/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598565769



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.

Review comment:
       +1






[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r603145189



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       ping @wangyum 






[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811919611


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41396/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811756696


   **[Test build #136806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136806/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848774793


   **[Test build #138984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138984/testReport)** for PR 31908 at commit [`6990abf`](https://github.com/apache/spark/commit/6990abf141f590fa9c0d18b5e01208e54fa56c3a).




[GitHub] [spark] wangyum edited a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848464935


   cc @sigmod WDYT? It seems DB2 has an additional limitation:
   ![image](https://user-images.githubusercontent.com/5399861/119606544-5ee40000-be25-11eb-9b46-41da4e4ce480.png)
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598566142



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       what if there is no Project between Aggregate and Join?






[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598282486



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       There must be a `Project` between `Aggregate` and `Join`, because their outputs differ whenever the `Join` can be removed.
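
The shape being discussed can be illustrated with a toy plan model. This is only a sketch under stated assumptions: the case classes below are hypothetical stand-ins for Catalyst operators that carry nothing but column names, and `removeJoin` mirrors just the `Aggregate` over `Project` over `LeftOuter` join pattern quoted in the diff, not the full rule.

```scala
// Toy model of a logical plan. These case classes are hypothetical stand-ins
// for Catalyst's operators and only track output column names.
object ToyOuterJoinRule {
  sealed trait Plan { def output: Set[String] }
  case class Relation(output: Set[String]) extends Plan
  case class Project(columns: Set[String], child: Plan) extends Plan {
    def output: Set[String] = columns
  }
  case class LeftOuterJoin(left: Plan, right: Plan) extends Plan {
    def output: Set[String] = left.output ++ right.output
  }
  // A DISTINCT-like aggregate: it groups by `grouping` and outputs only those columns.
  case class DistinctAggregate(grouping: Set[String], child: Plan) extends Plan {
    def output: Set[String] = grouping
  }

  // Drops the join only when the Project above it reads the streamed (left) side alone.
  def removeJoin(plan: Plan): Plan = plan match {
    case DistinctAggregate(g, Project(cols, LeftOuterJoin(left, _)))
        if cols.subsetOf(left.output) =>
      DistinctAggregate(g, Project(cols, left))
    case other => other // any other shape, including no Project, is left untouched
  }
}
```

In this toy version a plan with no `Project` between the aggregate and the join is simply not matched, which is the case the review question above is about.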








[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848728180


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138982/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811791018


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41385/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849832989


   **[Test build #139022 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139022/testReport)** for PR 31908 at commit [`aae4efe`](https://github.com/apache/spark/commit/aae4efea461af0da9f6ddfec4fb7a073ce191a67).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803588325


   **[Test build #136312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136312/testReport)** for PR 31908 at commit [`5d48ebc`](https://github.com/apache/spark/commit/5d48ebcb202b419a0e2ab78dc1282a251eb35b90).




[GitHub] [spark] wangyum commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811885819


   retest this please




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811756696


   **[Test build #136806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136806/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809654496


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41238/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848818117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43503/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811919611


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41396/
   




[GitHub] [spark] wangyum commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811713982


   retest this please




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849679436


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43539/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803438731


   **[Test build #136284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136284/testReport)** for PR 31908 at commit [`618d9d3`](https://github.com/apache/spark/commit/618d9d3869c35f322c555d8474a765e72183aefa).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849851907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139022/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811791018


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41385/
   




[GitHub] [spark] wangyum closed pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #31908:
URL: https://github.com/apache/spark/pull/31908


   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r638114676



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -165,6 +170,19 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _, _)) =>
       val newJoinType = buildNewJoinType(f, j)
       if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
+
+    case a @ Aggregate(_, _, Join(left, _, LeftOuter, _, _))
+        if a.isEquallyDistinct && a.references.subsetOf(AttributeSet(left.output)) =>

Review comment:
       `isEquallyDistinct` looks weird, how about `isDistinct`?
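
The predicate being renamed is not defined in this hunk. As a rough sketch only, assuming it means "the aggregate merely deduplicates its grouping columns", the new case can be modelled with toy plan nodes. Every name below (`Plan`, `Relation`, `LeftOuterJoin`, `ToyRule`, and this simplified `Aggregate`) is a hypothetical stand-in, not the Catalyst API:

```scala
// Toy stand-ins for logical plan nodes; these are NOT the real Catalyst classes.
sealed trait Plan { def output: Set[String] }

case class Relation(output: Set[String]) extends Plan

case class LeftOuterJoin(left: Plan, right: Plan) extends Plan {
  def output: Set[String] = left.output ++ right.output
}

case class Aggregate(grouping: Seq[String], aggregates: Seq[String], child: Plan)
    extends Plan {
  def output: Set[String] = aggregates.toSet

  // Assumed meaning of the predicate: the node only deduplicates its grouping columns.
  def isDistinct: Boolean = grouping.nonEmpty && grouping.toSet == aggregates.toSet

  // Columns the node reads from its child.
  def references: Set[String] = grouping.toSet ++ aggregates.toSet
}

object ToyRule {
  // Drop the left outer join when the distinct aggregate reads only left-side columns.
  def apply(plan: Plan): Plan = plan match {
    case a @ Aggregate(_, _, LeftOuterJoin(left, _))
        if a.isDistinct && a.references.subsetOf(left.output) =>
      a.copy(child = left)
    case other => other
  }
}
```

With `t1 = Relation(Set("a"))` and `t2 = Relation(Set("b"))`, the toy rule rewrites a distinct on `a` over the join into a distinct on `a` over `t1` alone, and leaves a distinct on both `a` and `b` untouched.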






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811886851


   **[Test build #136814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136814/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848805631


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43503/
   




[GitHub] [spark] wangyum commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-851385826


   Merged to master.




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848728180


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138982/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r640567454



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala
##########
@@ -71,4 +74,65 @@ class AggregateOptimizeSuite extends AnalysisTest {
 
     comparePlans(optimized, correctAnswer)
   }
+
+  test("SPARK-34808: Remove left join if it only has distinct on left side") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, LeftOuter, Some("x.a".attr === "y.a".attr)).select("x.b".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = if (autoBroadcastJoinThreshold < 0) {
+          x.select("x.b".attr).groupBy("x.b".attr)("x.b".attr)
+        } else {
+          Aggregate(query.child.output, query.child.output, query.child)
+        }
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: Remove right join if it only has distinct on right side") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, RightOuter, Some("x.a".attr === "y.a".attr)).select("y.b".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = if (autoBroadcastJoinThreshold < 0) {
+          y.select("y.b".attr).groupBy("y.b".attr)("y.b".attr)
+        } else {
+          Aggregate(query.child.output, query.child.output, query.child)
+        }
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: Should not remove left join if select 2 join sides") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, RightOuter, Some("x.a".attr === "y.a".attr))
+      .select("x.b".attr, "y.c".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = Aggregate(query.child.output, query.child.output, query.child)
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: EliminateOuterJoin must before RemoveRepetitionFromGroupExpressions") {

Review comment:
       let's address https://github.com/apache/spark/pull/31908#discussion_r640359116 then






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803370152


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40866/
   




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598375609



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {

Review comment:
       Just a question: is there a difference in our optimizer between `EliminateXXX` and `RemoveXXX`?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))
+      if a.isEquallyDistinct && a.references.subsetOf(AttributeSet(left.output)) =>

Review comment:
       Two more spaces?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))
+      if a.isEquallyDistinct && a.references.subsetOf(AttributeSet(left.output)) =>
+      a.copy(child = p.copy(child = left))
+    case a @ Aggregate(_, _, p @ Project(_, Join(_, right, RightOuter, _, _)))
+      if a.isEquallyDistinct && a.references.subsetOf(AttributeSet(right.output)) =>

Review comment:
       Two more spaces?






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824505429


   **[Test build #137768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137768/testReport)** for PR 31908 at commit [`9e953da`](https://github.com/apache/spark/commit/9e953da06bcd2f9375d8e9263b25186ceb1e335e).




[GitHub] [spark] cloud-fan commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848605741


   @wangyum do you have a TPCDS result?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811822939


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41389/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824674159


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137768/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849020751


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138984/
   




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598375917



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -190,7 +190,8 @@ abstract class Optimizer(catalogManager: CatalogManager)
       ReplaceDeduplicateWithAggregate) ::
     Batch("Aggregate", fixedPoint,
       RemoveLiteralFromGroupExpressions,
-      RemoveRepetitionFromGroupExpressions) :: Nil ++
+      RemoveRepetitionFromGroupExpressions,
+      EliminateUnnecessaryOuterJoin) :: Nil ++

Review comment:
       Maybe `RemoveUnnecessaryOuterJoin` would be more consistent with the other rule names here? It is also shorter.






[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r643622282



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -165,6 +170,23 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _, _)) =>
       val newJoinType = buildNewJoinType(f, j)
       if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
+
+    case a @ Aggregate(_, _, join @ Join(left, _, LeftOuter, _, _))
+        if a.isDistinct && a.references.subsetOf(AttributeSet(left.output)) &&
+          !canPlanAsBroadcastHashJoin(join, conf) =>

Review comment:
       Sorry, I missed this. It's an outer join, so it will never reduce the data volume of the left side. So we can always remove the join, no matter whether it's a broadcast join or not.
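
The point about the left side can be checked on a toy model. The following is a minimal sketch in plain Scala collections, not Catalyst code; `leftOuterJoin` and `LeftOuterDistinctSketch` are hypothetical names written only for illustration. It suggests that a left outer join keeps every left row at least once, so a DISTINCT over left-side columns gives the same result with or without the join.

```scala
// Plain-Scala model of a left outer join followed by DISTINCT; not Spark code.
object LeftOuterDistinctSketch {
  // Hypothetical helper: every left row is kept; unmatched rows get None on the right.
  def leftOuterJoin[A, B](left: Seq[A], right: Seq[B])(on: (A, B) => Boolean): Seq[(A, Option[B])] =
    left.flatMap { l =>
      val matches = right.filter(r => on(l, r))
      if (matches.isEmpty) Seq((l, Option.empty[B]))
      else matches.map(r => (l, Option(r)))
    }

  def main(args: Array[String]): Unit = {
    val t1 = Seq(1L, 2L, 2L, 3L) // plays the role of the left (streamed) side
    val t2 = Seq(2L, 2L, 4L)     // plays the role of the right side

    val joined = leftOuterJoin(t1, t2)(_ == _)

    // The join can only keep or duplicate left rows, never drop them.
    assert(joined.size >= t1.size)

    // DISTINCT over left-side columns is unchanged by the join.
    val withJoin = joined.map(_._1).distinct
    val withoutJoin = t1.distinct
    assert(withJoin == withoutJoin)
  }
}
```

Because the equality holds regardless of how the join would be executed, the sketch is consistent with dropping the `canPlanAsBroadcastHashJoin` guard; it does not model Catalyst's planner itself.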






[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811822939


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41389/
   




[GitHub] [spark] maropu commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598270070



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {

Review comment:
       nit: how about following the name of the existing rule `EliminateUnnecessaryJoin`, e.g., `EliminateUnnecessaryOuterJoin`?






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809637447


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41238/
   




[GitHub] [spark] cloud-fan commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811752818


   retest this please




[GitHub] [spark] maropu commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598359099



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       Ah, I got it. 






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804137744


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40932/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847824762


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138914/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804337045


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136348/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803499228


   **[Test build #136291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136291/testReport)** for PR 31908 at commit [`7aa7e69`](https://github.com/apache/spark/commit/7aa7e69087460820a11ef4b0d4224ab8d463daa7).




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811719082


   **[Test build #136802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136802/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804326399


   **[Test build #136348 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136348/testReport)** for PR 31908 at commit [`c4f1847`](https://github.com/apache/spark/commit/c4f1847842648af315f46be497bea1c64d7f82d5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803640495


   **[Test build #136312 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136312/testReport)** for PR 31908 at commit [`5d48ebc`](https://github.com/apache/spark/commit/5d48ebcb202b419a0e2ab78dc1282a251eb35b90).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847708694


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43439/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811882397


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136806/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848620366


   **[Test build #138982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138982/testReport)** for PR 31908 at commit [`6990abf`](https://github.com/apache/spark/commit/6990abf141f590fa9c0d18b5e01208e54fa56c3a).




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-812048132


   **[Test build #136814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136814/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r640358612



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala
##########
@@ -71,4 +74,65 @@ class AggregateOptimizeSuite extends AnalysisTest {
 
     comparePlans(optimized, correctAnswer)
   }
+
+  test("SPARK-34808: Remove left join if it only has distinct on left side") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, LeftOuter, Some("x.a".attr === "y.a".attr)).select("x.b".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = if (autoBroadcastJoinThreshold < 0) {
+          x.select("x.b".attr).groupBy("x.b".attr)("x.b".attr)
+        } else {
+          Aggregate(query.child.output, query.child.output, query.child)
+        }
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: Remove right join if it only has distinct on right side") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, RightOuter, Some("x.a".attr === "y.a".attr)).select("y.b".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = if (autoBroadcastJoinThreshold < 0) {
+          y.select("y.b".attr).groupBy("y.b".attr)("y.b".attr)
+        } else {
+          Aggregate(query.child.output, query.child.output, query.child)
+        }
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: Should not remove left join if select 2 join sides") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, RightOuter, Some("x.a".attr === "y.a".attr))
+      .select("x.b".attr, "y.c".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = Aggregate(query.child.output, query.child.output, query.child)
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: EliminateOuterJoin must before RemoveRepetitionFromGroupExpressions") {

Review comment:
       why?






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598376728



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -659,6 +659,12 @@ case class Aggregate(
     val nonAgg = aggregateExpressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isEmpty)
     getAllValidConstraints(nonAgg)
   }
+
+  // Whether this Aggregate operator is equally the Distinct operator.

Review comment:
       Hmm, is this description correct? Technically, `semanticEquals` returns `false` for non-deterministic expressions, so it seems that we theoretically miss some `Distinct` operators. It's not a problem for this optimizer's functionality, but could you revise this description a little?
   ```scala
     def semanticEquals(other: Expression): Boolean =
       deterministic && other.deterministic && canonicalized == other.canonicalized
   ```
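
To make the concern concrete, here is a minimal sketch in plain Scala that mirrors only the shape of the quoted `semanticEquals` definition. `Expr` and `SemanticEqualsSketch` are hypothetical stand-ins, not Catalyst's `Expression` class; a canonical form is modelled as a plain string. Under that assumption, a non-deterministic expression is not semantically equal even to itself.

```scala
// Toy model of the quoted semanticEquals logic; not Catalyst code.
object SemanticEqualsSketch {
  // Hypothetical stand-in for an expression: a canonical form plus a determinism flag.
  final case class Expr(canonicalized: String, deterministic: Boolean) {
    def semanticEquals(other: Expr): Boolean =
      deterministic && other.deterministic && canonicalized == other.canonicalized
  }

  def main(args: Array[String]): Unit = {
    val column = Expr("a#2L", deterministic = true)
    val random = Expr("rand(42)", deterministic = false)

    // A deterministic expression equals itself.
    assert(column.semanticEquals(column))

    // A non-deterministic expression fails the check even against itself,
    // so a comparison built on semanticEquals would treat it as "not the same".
    assert(!random.semanticEquals(random))
  }
}
```

Any check built on this comparison would therefore be conservative: it may miss some Distinct-like aggregates, but it should not report a false positive because of non-determinism.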






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848771254


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43501/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r640469134



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala
##########
@@ -71,4 +74,65 @@ class AggregateOptimizeSuite extends AnalysisTest {
 
     comparePlans(optimized, correctAnswer)
   }
+
+  test("SPARK-34808: Remove left join if it only has distinct on left side") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, LeftOuter, Some("x.a".attr === "y.a".attr)).select("x.b".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = if (autoBroadcastJoinThreshold < 0) {
+          x.select("x.b".attr).groupBy("x.b".attr)("x.b".attr)
+        } else {
+          Aggregate(query.child.output, query.child.output, query.child)
+        }
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: Remove right join if it only has distinct on right side") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, RightOuter, Some("x.a".attr === "y.a".attr)).select("y.b".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = if (autoBroadcastJoinThreshold < 0) {
+          y.select("y.b".attr).groupBy("y.b".attr)("y.b".attr)
+        } else {
+          Aggregate(query.child.output, query.child.output, query.child)
+        }
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: Should not remove left join if select 2 join sides") {
+    val x = testRelation.subquery('x)
+    val y = testRelation.subquery('y)
+    val query = Distinct(x.join(y, RightOuter, Some("x.a".attr === "y.a".attr))
+      .select("x.b".attr, "y.c".attr))
+
+    Seq(-1, 10000).foreach { autoBroadcastJoinThreshold =>
+      withSQLConf(AUTO_BROADCASTJOIN_THRESHOLD.key -> s"$autoBroadcastJoinThreshold") {
+        val correctAnswer = Aggregate(query.child.output, query.child.output, query.child)
+        comparePlans(Optimize.execute(query.analyze), correctAnswer.analyze)
+      }
+    }
+  }
+
+  test("SPARK-34808: EliminateOuterJoin must before RemoveRepetitionFromGroupExpressions") {

Review comment:
       `RemoveRepetitionFromGroupExpressions` will remove repetition from group expressions:
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.RemoveRepetitionFromGroupExpressions ===
   !Aggregate [a#2L, a#2L], [a#2L, a#2L]                 Aggregate [a#2L], [a#2L, a#2L]
    +- Project [a#2L, a#2L]                              +- Project [a#2L, a#2L]
       +- Join LeftOuter, (a#2L = b#6L)                     +- Join LeftOuter, (a#2L = b#6L)
          :- Project [id#0L AS a#2L]                           :- Project [id#0L AS a#2L]
          :  +- Range (0, 200, step=1, splits=Some(2))         :  +- Range (0, 200, step=1, splits=Some(2))
          +- Project [id#4L AS b#6L]                           +- Project [id#4L AS b#6L]
             +- Range (0, 300, step=1, splits=Some(2))            +- Range (0, 300, step=1, splits=Some(2))
   ```
   
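   After that rewrite the grouping expressions (`[a#2L]`) no longer line up with the aggregate expressions (`[a#2L, a#2L]`), which is why the ordering matters. We can remove this limitation if we refine `isDistinct`.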
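
The ordering issue can be illustrated with a toy model. The sketch below is plain Scala, not Catalyst code: `Agg`, `isDistinctStrict`, and `isDistinctRefined` are hypothetical names, expressions are modelled as strings, and the "strict" check is only an assumption about how a pairwise `isDistinct` might behave. The "refined" check is one possible relaxation, not necessarily what the PR would adopt.

```scala
// Toy model of an isDistinct-style check on an Aggregate; not Catalyst code.
object IsDistinctSketch {
  // Hypothetical stand-in for Aggregate(groupingExpressions, aggregateExpressions, child).
  final case class Agg(grouping: Seq[String], output: Seq[String])

  // Assumed strict check: same length and pairwise-equal expressions.
  def isDistinctStrict(a: Agg): Boolean =
    a.grouping.size == a.output.size &&
      a.grouping.zip(a.output).forall { case (g, o) => g == o }

  // One possible refinement: compare as sets, so de-duplicated grouping keys still qualify.
  def isDistinctRefined(a: Agg): Boolean =
    a.grouping.nonEmpty && a.grouping.toSet == a.output.toSet

  def main(args: Array[String]): Unit = {
    // Shape before RemoveRepetitionFromGroupExpressions: Aggregate [a, a], [a, a]
    val before = Agg(Seq("a#2L", "a#2L"), Seq("a#2L", "a#2L"))
    // Shape after the rule: Aggregate [a], [a, a]
    val after = Agg(Seq("a#2L"), Seq("a#2L", "a#2L"))

    assert(isDistinctStrict(before))
    assert(!isDistinctStrict(after)) // the strict check no longer matches
    assert(isDistinctRefined(before))
    assert(isDistinctRefined(after)) // the set-based check is order-independent
  }
}
```

With a set-based comparison, the result would not depend on whether the repetition-removal rule has already run, which is the limitation the comment proposes to lift.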






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848717738


   **[Test build #138982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138982/testReport)** for PR 31908 at commit [`6990abf`](https://github.com/apache/spark/commit/6990abf141f590fa9c0d18b5e01208e54fa56c3a).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824522507


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42296/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811768548


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41385/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849020751


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138984/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803588325


   **[Test build #136312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136312/testReport)** for PR 31908 at commit [`5d48ebc`](https://github.com/apache/spark/commit/5d48ebcb202b419a0e2ab78dc1282a251eb35b90).




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847868394


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138918/
   




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598377125



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala
##########
@@ -71,4 +74,33 @@ class AggregateOptimizeSuite extends AnalysisTest {
 
     comparePlans(optimized, correctAnswer)
   }
+
+  test("Remove left join if it only has distinct on left side") {

Review comment:
       Although it's not required for new features, could you add a test prefix, `SPARK-34808:`, for reviewers, please?






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848997261


   **[Test build #138984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138984/testReport)** for PR 31908 at commit [`6990abf`](https://github.com/apache/spark/commit/6990abf141f590fa9c0d18b5e01208e54fa56c3a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848620366


   **[Test build #138982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138982/testReport)** for PR 31908 at commit [`6990abf`](https://github.com/apache/spark/commit/6990abf141f590fa9c0d18b5e01208e54fa56c3a).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r603256332



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ * {{{
+ *   SELECT DISTINCT f1 FROM t1 LEFT JOIN t2 ON t1.id = t2.id  ==>  SELECT DISTINCT f1 FROM t1
+ * }}}
+ */
+object RemoveUnnecessaryOuterJoin extends Rule[LogicalPlan] {

Review comment:
       Can we add this into `EliminateOuterJoin`?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809479809


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136656/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804145735


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40932/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803503801


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40873/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r603280467



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       done






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803519703


   **[Test build #136291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136291/testReport)** for PR 31908 at commit [`7aa7e69`](https://github.com/apache/spark/commit/7aa7e69087460820a11ef4b0d4224ab8d463daa7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811802077


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41389/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r638585140



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -165,6 +170,19 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _, _)) =>
       val newJoinType = buildNewJoinType(f, j)
       if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
+
+    case a @ Aggregate(_, _, Join(left, _, LeftOuter, _, _))

Review comment:
       If we treat the left join as a filter, could removing it lose the chance to reduce data volume and cause a perf regression?






[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598283231



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       For example, we cannot remove the `Join` in the following cases:
   ```sql
   SELECT DISTINCT * FROM t1 LEFT JOIN t2 ON a = b
   
    Aggregate [a#2L, b#6L], [a#2L, b#6L]              
    +- Join LeftOuter, (a#2L = b#6L)                  
       :- Project [id#0L AS a#2L]                     
       :  +- Range (0, 200, step=1, splits=Some(2))   
       +- Project [id#4L AS b#6L]                     
          +- Range (0, 300, step=1, splits=Some(2))   
   ```
   
   ```sql
   SELECT DISTINCT b FROM t1 LEFT JOIN t2 ON a = b
   
   Aggregate [b#6L], [b#6L]                             
   +- Project [b#6L]                                    
      +- Join LeftOuter, (a#2L = b#6L)                  
         :- Project [id#0L AS a#2L]                     
         :  +- Range (0, 200, step=1, splits=Some(2))   
         +- Project [id#4L AS b#6L]                     
            +- Range (0, 300, step=1, splits=Some(2))   
   ```
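
   A minimal sketch of why these cases differ from `SELECT DISTINCT a`, using plain Scala collections rather than Spark (the object, the `leftOuterJoin` helper, and the sample rows are hypothetical and only for illustration). A left outer join keeps every left row, so the distinct values of `a` are the same with or without the join, while the distinct values of `b` depend on it:

   ```scala
   // Toy model: tables are plain Scala sequences, not Spark relations.
   object LeftJoinDistinctDemo {
     // Left outer equi-join: every left row survives; unmatched rows get None on the right.
     def leftOuterJoin(left: Seq[Long], right: Seq[Long]): Seq[(Long, Option[Long])] =
       left.flatMap { a =>
         val matches = right.filter(_ == a)
         if (matches.isEmpty) Seq((a, Option.empty[Long]))
         else matches.map(b => (a, Option(b)))
       }

     val t1: Seq[Long] = Seq(1L, 2L, 2L, 3L)
     val t2: Seq[Long] = Seq(2L, 2L, 4L)
     val joined: Seq[(Long, Option[Long])] = leftOuterJoin(t1, t2)

     // SELECT DISTINCT a: the join can only duplicate left rows, never drop them.
     val distinctAWithJoin: Set[Long] = joined.map(_._1).toSet
     val distinctAWithoutJoin: Set[Long] = t1.toSet

     // SELECT DISTINCT b: the result is produced by the join (NULLs and matches only).
     val distinctBWithJoin: Set[Option[Long]] = joined.map(_._2).toSet
     val distinctBWithoutJoin: Set[Option[Long]] = t2.map(Option(_)).toSet
   }
   ```

   In this toy data the two `a` sets are equal, whereas the two `b` sets are not, which matches the reasoning that the join is only removable when the query reads nothing from the non-streamed side.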






[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803346243


   **[Test build #136284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136284/testReport)** for PR 31908 at commit [`618d9d3`](https://github.com/apache/spark/commit/618d9d3869c35f322c555d8474a765e72183aefa).




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809479809


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136656/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598570422



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       There must be a `Project` between `Aggregate` and `Join`, because their outputs differ whenever the `Join` can be removed.
   
   
   For example, we cannot remove the `Join` in the following case:
   ```sql
   SELECT DISTINCT * FROM t1 LEFT JOIN t2 ON a = b
   
    Aggregate [a#2L, b#6L], [a#2L, b#6L]              
    +- Join LeftOuter, (a#2L = b#6L)                  
       :- Project [id#0L AS a#2L]                     
       :  +- Range (0, 200, step=1, splits=Some(2))   
       +- Project [id#4L AS b#6L]                     
          +- Range (0, 300, step=1, splits=Some(2))   
   ```
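
   A rough sketch of that condition, assuming hypothetical toy classes in place of plan nodes (`ToyJoin` and `ToyAggregate` are not Spark's `Join` or `Aggregate`, and column names are plain strings): the join is a removal candidate only when everything the aggregate references comes from the left side.

   ```scala
   // Hypothetical, simplified stand-ins for plan nodes; not Spark's classes.
   object JoinRemovalCheck {
     final case class ToyJoin(leftOutput: Set[String], rightOutput: Set[String]) {
       // A join outputs the columns of both sides.
       def output: Set[String] = leftOutput ++ rightOutput
     }

     final case class ToyAggregate(references: Set[String], child: ToyJoin)

     // Candidate for removal only if no right-side column is referenced.
     def canDropLeftOuterJoin(agg: ToyAggregate): Boolean =
       agg.references.subsetOf(agg.child.leftOutput)

     val join: ToyJoin = ToyJoin(leftOutput = Set("a"), rightOutput = Set("b"))

     // SELECT DISTINCT * references the full join output, including b.
     val distinctStar: ToyAggregate = ToyAggregate(join.output, join)

     // SELECT DISTINCT a references the left side only.
     val distinctA: ToyAggregate = ToyAggregate(Set("a"), join)
   }
   ```

   With `SELECT DISTINCT *` the aggregate reads the join's whole output, so the check fails; a query that narrows the output to left-side columns is the only shape that passes.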
   
   






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804145340


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40932/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849649213


   **[Test build #139022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139022/testReport)** for PR 31908 at commit [`aae4efe`](https://github.com/apache/spark/commit/aae4efea461af0da9f6ddfec4fb7a073ce191a67).




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849686701


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43539/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849649213


   **[Test build #139022 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139022/testReport)** for PR 31908 at commit [`aae4efe`](https://github.com/apache/spark/commit/aae4efea461af0da9f6ddfec4fb7a073ce191a67).




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598378931



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -659,6 +659,12 @@ case class Aggregate(
     val nonAgg = aggregateExpressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isEmpty)
     getAllValidConstraints(nonAgg)
   }
+
+  // Whether this Aggregate operator is equally the Distinct operator.
+  private[sql] def isEquallyDistinct: Boolean = {
+    groupingExpressions.size == aggregateExpressions.size &&

Review comment:
       It seems that we don't support the following case. Shall we have a test case for the following?
   ```scala
   scala> sql("select distinct a, a from t1 left join t2 on false").explain
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- HashAggregate(keys=[a#31], functions=[])
      +- Exchange hashpartitioning(a#31, 200), ENSURE_REQUIREMENTS, [id=#194]
         +- HashAggregate(keys=[a#31], functions=[])
            +- Project [a#31, a#31]
               +- BroadcastNestedLoopJoin BuildRight, LeftOuter, false
                  :- Scan hive default.t1 [a#31], HiveTableRelation [`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#31], Partition Cols: []]
                  +- BroadcastExchange IdentityBroadcastMode, [id=#189]
                     +- Scan hive default.t2 HiveTableRelation [`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [b#32], Partition Cols: []]
   ```






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803642404


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136312/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r603281294



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ * {{{
+ *   SELECT DISTINCT f1 FROM t1 LEFT JOIN t2 ON t1.id = t2.id  ==>  SELECT DISTINCT f1 FROM t1
+ * }}}
+ */
+object RemoveUnnecessaryOuterJoin extends Rule[LogicalPlan] {

Review comment:
       Seems reasonable.






[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803519860


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136291/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803445827


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136284/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847633072


   **[Test build #138914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138914/testReport)** for PR 31908 at commit [`9e953da`](https://github.com/apache/spark/commit/9e953da06bcd2f9375d8e9263b25186ceb1e335e).




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r643781246



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -165,6 +170,23 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _, _)) =>
       val newJoinType = buildNewJoinType(f, j)
       if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
+
+    case a @ Aggregate(_, _, join @ Join(left, _, LeftOuter, _, _))
+        if a.isDistinct && a.references.subsetOf(AttributeSet(left.output)) &&
+          !canPlanAsBroadcastHashJoin(join, conf) =>

Review comment:
       https://github.com/apache/spark/pull/32744






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847820609


   **[Test build #138914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138914/testReport)** for PR 31908 at commit [`9e953da`](https://github.com/apache/spark/commit/9e953da06bcd2f9375d8e9263b25186ceb1e335e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r640359116



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -659,6 +659,12 @@ case class Aggregate(
     val nonAgg = aggregateExpressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isEmpty)
     getAllValidConstraints(nonAgg)
   }
+
+  // Whether this Aggregate operator is equally the Distinct operator.
+  private[sql] def isEquallyDistinct: Boolean = {
+    groupingExpressions.size == aggregateExpressions.size &&

Review comment:
       I think we only need `aggregateExpressions` to contain only grouping columns. @wangyum can you refine it?
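
To make the suggestion above concrete, here is a minimal sketch in plain Scala. It is a toy model, not the Catalyst API: `Agg`, `sameSizeAndOrder`, and `outputsAreGroupingKeys` are hypothetical names, and column names stand in for expressions. It assumes that `select distinct a, a` ends up with one grouping key and two output columns, as the physical plan quoted in the earlier review comment suggests.

```scala
// Toy model only: strings stand in for Catalyst expressions.
object DistinctCheckSketch {
  final case class Agg(groupingKeys: Seq[String], outputs: Seq[String])

  // Position-by-position comparison, in the spirit of the diff under review.
  def sameSizeAndOrder(agg: Agg): Boolean =
    agg.groupingKeys.size == agg.outputs.size &&
      agg.groupingKeys.zip(agg.outputs).forall { case (k, o) => k == o }

  // Relaxed form: every output column is one of the grouping keys.
  def outputsAreGroupingKeys(agg: Agg): Boolean =
    agg.outputs.forall(o => agg.groupingKeys.contains(o))

  // Assumed shape of `select distinct a, a` after duplicate keys are removed.
  val distinctAA: Agg = Agg(groupingKeys = Seq("a"), outputs = Seq("a", "a"))
}
```

Under these assumptions the size-based check rejects `distinctAA`, while the relaxed check accepts it.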






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847667035


   **[Test build #138918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138918/testReport)** for PR 31908 at commit [`3ae9488`](https://github.com/apache/spark/commit/3ae9488d833ec6f36ea0636ca03e703d2b3a5956).




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847706584


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43439/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848742635


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43501/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598740161



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       Because in the supported case the references must be a subset of the join's output (either all on the left side or all on the right side):
   ![image](https://user-images.githubusercontent.com/5399861/112001045-871b5e00-8b59-11eb-9494-10a2cf035b0b.png)
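
The subset condition can be illustrated with a small self-contained sketch. This is a toy plan model in plain Scala, not Catalyst: `Relation`, `LeftOuterJoin`, `Aggregate`, and `removeJoin` are hypothetical stand-ins, the join condition is omitted, and only the left-outer case is shown.

```scala
// Toy logical plan: sets of column names stand in for Catalyst attribute sets.
object OuterJoinRemovalSketch {
  sealed trait Plan { def output: Set[String] }

  final case class Relation(name: String, output: Set[String]) extends Plan

  final case class LeftOuterJoin(left: Plan, right: Plan) extends Plan {
    def output: Set[String] = left.output ++ right.output
  }

  final case class Aggregate(groupBy: Seq[String], select: Seq[String], child: Plan)
      extends Plan {
    def output: Set[String] = select.toSet
    // DISTINCT-like in this toy: every selected column is a grouping key.
    def isDistinct: Boolean = select.forall(c => groupBy.contains(c))
    def references: Set[String] = (groupBy ++ select).toSet
  }

  // Drop the join only when the aggregate is DISTINCT-like and reads
  // nothing but columns produced by the left (streamed) side.
  def removeJoin(plan: Plan): Plan = plan match {
    case a @ Aggregate(_, _, LeftOuterJoin(left, _))
        if a.isDistinct && a.references.subsetOf(left.output) =>
      a.copy(child = left)
    case other => other
  }
}
```

A query that also reads a column from the right side fails the `subsetOf` test and is left unchanged.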
   






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811754098


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136802/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r638114867



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -844,6 +844,12 @@ case class Aggregate(
 
   override protected def withNewChildInternal(newChild: LogicalPlan): Aggregate =
     copy(child = newChild)
+
+  // Whether this Aggregate operator is equally the Distinct operator.
+  private[sql] def isEquallyDistinct: Boolean = {
+    groupingExpressions.size == aggregateExpressions.size &&
+      groupingExpressions.zip(aggregateExpressions).forall(e => e._1.fastEquals(e._2))

Review comment:
       I'd use `semanticEquals` here.
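
For readers unfamiliar with the distinction, the following toy sketch shows why a semantic comparison can accept pairs that a structural one rejects. It is plain Scala, not Catalyst: `Attr`, `Add`, `canonical`, and this `semanticEquals` are hypothetical and only mimic the general idea of comparing canonicalized forms, on the assumption that cosmetic names and operand order of a commutative operator should not matter.

```scala
// Toy expressions: structural vs. canonicalized comparison.
object SemanticEqualsSketch {
  sealed trait Expr
  final case class Attr(name: String, exprId: Long) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Toy canonical form: drop the cosmetic name and put the operands
  // of the commutative Add into a deterministic order.
  def canonical(e: Expr): Expr = e match {
    case Attr(_, id) => Attr("none", id)
    case Add(l, r) =>
      val Seq(x, y) = Seq(canonical(l), canonical(r)).sortBy(_.toString)
      Add(x, y)
  }

  def semanticEquals(a: Expr, b: Expr): Boolean = canonical(a) == canonical(b)

  val e1: Expr = Add(Attr("a", 1L), Attr("b", 2L))
  val e2: Expr = Add(Attr("B", 2L), Attr("A", 1L))
}
```

Here `e1 == e2` is false because names and operand order differ, while the canonicalized comparison treats them as the same expression.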




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809654496


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41238/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811754098


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136802/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598672944



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {

Review comment:
       No difference.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804145735


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40932/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848771254


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43501/
   




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598285144



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -659,6 +659,12 @@ case class Aggregate(
     val nonAgg = aggregateExpressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isEmpty)
     getAllValidConstraints(nonAgg)
   }
+
+  // Whether this Aggregate operator is equally the Distinct operator.
+  private[sql] def isEquallyDistinct: Boolean = {

Review comment:
       This method can be used elsewhere. For example:
   
   https://github.com/apache/spark/blob/e226e2ceea7b648a1864c7ff1ba1e67eee83789b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1809-L1813






[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r643628381



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -165,6 +170,23 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _, _)) =>
       val newJoinType = buildNewJoinType(f, j)
       if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
+
+    case a @ Aggregate(_, _, join @ Join(left, _, LeftOuter, _, _))
+        if a.isDistinct && a.references.subsetOf(AttributeSet(left.output)) &&
+          !canPlanAsBroadcastHashJoin(join, conf) =>

Review comment:
       The result may be incorrect if we always remove the join. For example:
   ```
   0: jdbc:hive2://hdc49-mcc10-01-0510-2005-006-> create table test11.t1 using parquet as select id % 3 as a, id as b from range(10);
   +---------+--+
   | Result  |
   +---------+--+
   +---------+--+
   No rows selected (1.611 seconds)
   0: jdbc:hive2://hdc49-mcc10-01-0510-2005-006-> create table test11.t2 using parquet as select id % 3 as x, id as y from range(5);
   +---------+--+
   | Result  |
   +---------+--+
   +---------+--+
   No rows selected (1.043 seconds)
   0: jdbc:hive2://hdc49-mcc10-01-0510-2005-006-> select t1.a from t1 left join t2 on a = x;
   +----+--+
   | a  |
   +----+--+
   | 0  |
   | 0  |
   | 1  |
   | 1  |
   | 0  |
   | 0  |
   | 1  |
   | 1  |
   | 2  |
   | 0  |
   | 0  |
   | 1  |
   | 1  |
   | 2  |
   | 0  |
   | 0  |
   | 2  |
   +----+--+
   17 rows selected (1.409 seconds)
   ```
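
The row counts in the session above can be reproduced without Spark. The sketch below uses plain Scala collections with the same data (`id % 3` over 10 and 5 rows); `leftOuterJoinKeys` is a hypothetical helper that returns only the left key of each joined row.

```scala
// Plain-collections model of `select t1.a from t1 left join t2 on a = x`.
object JoinDuplicatesSketch {
  val t1a: Seq[Long] = (0L until 10L).map(_ % 3) // column a of t1
  val t2x: Seq[Long] = (0L until 5L).map(_ % 3)  // column x of t2

  // A left outer join emits each left row once per matching right row,
  // or once (padded with NULLs) when nothing matches.
  def leftOuterJoinKeys(left: Seq[Long], right: Seq[Long]): Seq[Long] =
    left.flatMap { a =>
      val matches = right.count(_ == a)
      Seq.fill(math.max(matches, 1))(a)
    }

  val joined: Seq[Long] = leftOuterJoinKeys(t1a, t2x)
}
```

The join yields 17 rows against 10 in `t1` alone, so dropping it changes a plain `SELECT`. Only after `DISTINCT` do both sides collapse to the same set of values, which is why the rule requires a distinct-like aggregate on top.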






[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803499228


   **[Test build #136291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136291/testReport)** for PR 31908 at commit [`7aa7e69`](https://github.com/apache/spark/commit/7aa7e69087460820a11ef4b0d4224ab8d463daa7).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598285144



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##########
@@ -659,6 +659,12 @@ case class Aggregate(
     val nonAgg = aggregateExpressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isEmpty)
     getAllValidConstraints(nonAgg)
   }
+
+  // Whether this Aggregate operator is equally the Distinct operator.
+  private[sql] def isEquallyDistinct: Boolean = {

Review comment:
       This method can be used elsewhere. For example:
   
   https://github.com/apache/spark/blob/e226e2ceea7b648a1864c7ff1ba1e67eee83789b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1809-L1813 in this pr: https://github.com/apache/spark/pull/31113






[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804094310


   **[Test build #136348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136348/testReport)** for PR 31908 at commit [`c4f1847`](https://github.com/apache/spark/commit/c4f1847842648af315f46be497bea1c64d7f82d5).




[GitHub] [spark] wangyum commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598206411



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1998,6 +1999,20 @@ object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {

Review comment:
       Done






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803519860


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136291/
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r643628995



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala
##########
@@ -165,6 +170,23 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper {
     case f @ Filter(condition, j @ Join(_, _, RightOuter | LeftOuter | FullOuter, _, _)) =>
       val newJoinType = buildNewJoinType(f, j)
       if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType))
+
+    case a @ Aggregate(_, _, join @ Join(left, _, LeftOuter, _, _))
+        if a.isDistinct && a.references.subsetOf(AttributeSet(left.output)) &&
+          !canPlanAsBroadcastHashJoin(join, conf) =>

Review comment:
       The aggregate should still be there. I mean we can remove this `canPlanAsBroadcastHashJoin` check.
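
As background for this suggestion, the rewrite can be checked on plain collections. The sketch below is a minimal illustration, not Spark code: `leftOuterJoin` is a hypothetical helper over `Seq[Long]`, and the data is made up. A left outer join keeps every left row at least once, so the distinct left-side values are the same with or without the join. That is a property of the logical plan and does not depend on how the join would be executed.

```scala
// Made-up data: t1 plays the streamed (left) side, t2 the other side.
val t1 = Seq(1L, 2L, 2L, 3L)
val t2 = Seq(2L, 2L, 4L)

// Hypothetical helper: a left outer join on equality of the two values.
// Unmatched left rows are kept once, paired with None.
def leftOuterJoin(left: Seq[Long], right: Seq[Long]): Seq[(Long, Option[Long])] =
  left.flatMap { a =>
    val matches = right.filter(_ == a)
    if (matches.isEmpty) Seq((a, Option.empty[Long]))
    else matches.map(b => (a, Option(b)))
  }

// SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b
val withJoin = leftOuterJoin(t1, t2).map(_._1).distinct

// SELECT DISTINCT a FROM t1
val withoutJoin = t1.distinct

assert(withJoin == withoutJoin)
```

The distinct step is still required after the join is dropped, because `t1` itself may contain duplicates; only the join becomes redundant.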






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803503109


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40873/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849686701


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43539/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848685677


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43501/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847824762


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138914/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803600452


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40894/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847863732


   **[Test build #138918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138918/testReport)** for PR 31908 at commit [`3ae9488`](https://github.com/apache/spark/commit/3ae9488d833ec6f36ea0636ca03e703d2b3a5956).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847708694


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43439/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811719082


   **[Test build #136802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136802/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803598557


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40894/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811886851


   **[Test build #136814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136814/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803600452


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40894/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811762746


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41385/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803503797


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40873/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803368657


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40866/
   




[GitHub] [spark] cloud-fan commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847186319


   Sorry for the delay. This looks like a straightforward optimization. @wangyum would you like to reopen it?




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803642404


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136312/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803370483


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40866/
   




[GitHub] [spark] maropu commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803567724


   The optimization itself looks fine.




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-812068243


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136814/
   




[GitHub] [spark] wangyum commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848464935


   cc @sigmod 




[GitHub] [spark] maropu commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598269687



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       ```
   == Optimized Logical Plan ==
   Aggregate [a#2L], [a#2L]
   +- Project [a#2L]
      +- Join LeftOuter, (a#2L = b#6L)
         :- Project [id#0L AS a#2L]
         :  +- Range (0, 200, step=1, splits=Some(2))
         +- Project [id#4L AS b#6L]
            +- Range (0, 300, step=1, splits=Some(2))
   ```
   btw, why do we need `Project` between `Aggregate` and `Join` here?
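
Whatever the reason for the extra `Project`, one way to make a rule of this kind independent of it is to match both plan shapes. The sketch below uses a toy plan model to illustrate that idea; `Plan`, `Relation`, `LeftOuterJoin`, `Project`, `DistinctAggregate`, and `removeOuterJoin` are all hypothetical names over sets of column names, not Catalyst classes, and this is not the rule proposed in the PR.

```scala
// Toy logical-plan model: every node only tracks its output column names.
sealed trait Plan { def output: Set[String] }

final case class Relation(output: Set[String]) extends Plan

final case class LeftOuterJoin(left: Plan, right: Plan) extends Plan {
  def output: Set[String] = left.output ++ right.output
}

final case class Project(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
}

// Stands in for an Aggregate whose grouping and output columns coincide.
final case class DistinctAggregate(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
}

// Drops a left outer join under a distinct when only left-side columns are used,
// with or without a Project between the two operators.
def removeOuterJoin(plan: Plan): Plan = plan match {
  case DistinctAggregate(cols, Project(projected, LeftOuterJoin(left, _)))
      if projected.subsetOf(left.output) =>
    DistinctAggregate(cols, Project(projected, left))
  case DistinctAggregate(cols, LeftOuterJoin(left, _))
      if cols.subsetOf(left.output) =>
    DistinctAggregate(cols, left)
  case other => other
}

val t1 = Relation(Set("a"))
val t2 = Relation(Set("b"))

val withProject = DistinctAggregate(Set("a"), Project(Set("a"), LeftOuterJoin(t1, t2)))
val withoutProject = DistinctAggregate(Set("a"), LeftOuterJoin(t1, t2))
val usesRightSide = DistinctAggregate(Set("b"), LeftOuterJoin(t1, t2))

println(removeOuterJoin(withProject))
println(removeOuterJoin(withoutProject))
println(removeOuterJoin(usesRightSide))
```

The third plan is left unchanged because it reads column `b` from the right side, which is the situation the subset check guards against.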






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803600440


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40894/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848818117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43503/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-849851907


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139022/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848774793


   **[Test build #138984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138984/testReport)** for PR 31908 at commit [`6990abf`](https://github.com/apache/spark/commit/6990abf141f590fa9c0d18b5e01208e54fa56c3a).




[GitHub] [spark] wangyum edited a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum edited a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848612415


   >  do you have a TPCDS result?
   
    TPCDS does not have this pattern.
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811745461


   **[Test build #136802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136802/testReport)** for PR 31908 at commit [`d53ce3b`](https://github.com/apache/spark/commit/d53ce3b11fec4d45ec1e83f6ea789154ac6ab369).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803503801


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40873/
   




[GitHub] [spark] wangyum commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848771118


   retest this please.




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824505429


   **[Test build #137768 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137768/testReport)** for PR 31908 at commit [`9e953da`](https://github.com/apache/spark/commit/9e953da06bcd2f9375d8e9263b25186ceb1e335e).




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598714293



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       Why not? `Aggregate` defines its own output, which does not change with its child.






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804094310


   **[Test build #136348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136348/testReport)** for PR 31908 at commit [`c4f1847`](https://github.com/apache/spark/commit/c4f1847842648af315f46be497bea1c64d7f82d5).




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824520323








[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811882397


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136806/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824655268


   **[Test build #137768 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137768/testReport)** for PR 31908 at commit [`9e953da`](https://github.com/apache/spark/commit/9e953da06bcd2f9375d8e9263b25186ceb1e335e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-812068243


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136814/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-811911181


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41396/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-824674159


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137768/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847667035


   **[Test build #138918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138918/testReport)** for PR 31908 at commit [`3ae9488`](https://github.com/apache/spark/commit/3ae9488d833ec6f36ea0636ca03e703d2b3a5956).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803370483


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40866/
   




[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-809643173


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41238/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-803445827


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136284/
   




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has distinct on streamed side

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598203637



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -1998,6 +1999,20 @@ object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object RemoveOuterJoin extends Rule[LogicalPlan] {

Review comment:
       If you don't mind, shall we put this new optimizer rule in a separate file, @wangyum?






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847868394


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138918/
   




[GitHub] [spark] wangyum commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-848612415


   >  do you have a TPCDS result?
   
   No. TPCDS does not have this pattern.
   




[GitHub] [spark] cloud-fan commented on a change in pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31908:
URL: https://github.com/apache/spark/pull/31908#discussion_r598809303



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateUnnecessaryOuterJoin.scala
##########
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.AttributeSet
+import org.apache.spark.sql.catalyst.plans.{LeftOuter, RightOuter}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Join, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * Removes outer join if it only has distinct on streamed side.
+ */
+object EliminateUnnecessaryOuterJoin extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case a @ Aggregate(_, _, p @ Project(_, Join(left, _, LeftOuter, _, _)))

Review comment:
       When this happens, column pruning will insert a `Project` between them, so I agree that the code here is correct in practice. However, it assumes that column pruning runs first, and the rule does not need to rely on that assumption.
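
To make the point about rule ordering concrete, here is a minimal sketch of a rule that matches both plan shapes, with and without an intermediate `Project`. It is a toy model, not Catalyst: `Plan`, `Relation`, `LeftOuterJoin`, `Distinct`, and `RemoveLeftOuterJoinUnderDistinct` are hypothetical names introduced only for illustration, the join condition is omitted, and a DISTINCT is modelled as a node whose grouping keys equal its output.

```scala
// Toy model only: these types are hypothetical stand-ins, not Catalyst classes.
sealed trait Plan { def output: Set[String] }

case class Relation(name: String, output: Set[String]) extends Plan

case class Project(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
}

// The join condition is left out; only the streamed (left) side matters here.
case class LeftOuterJoin(left: Plan, right: Plan) extends Plan {
  def output: Set[String] = left.output ++ right.output
}

// DISTINCT modelled as an aggregate whose grouping keys equal its output.
case class Distinct(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
}

object RemoveLeftOuterJoinUnderDistinct {
  def apply(plan: Plan): Plan = plan match {
    // Shape after column pruning has run: Distinct -> Project -> Join.
    case Distinct(cols, Project(projCols, LeftOuterJoin(left, _)))
        if projCols.subsetOf(left.output) =>
      Distinct(cols, Project(projCols, left))
    // Shape before column pruning: Distinct directly over the Join.
    case Distinct(cols, LeftOuterJoin(left, _)) if cols.subsetOf(left.output) =>
      Distinct(cols, left)
    // Anything else, including a DISTINCT that reads right-side columns, is kept.
    case other => other
  }
}
```

The rewrite is safe in this model because a left outer join keeps every left row at least once, and the DISTINCT removes any duplicates the join introduces. Handling the second case explicitly means the rule no longer depends on column pruning having run first.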






[GitHub] [spark] SparkQA commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-847633072


   **[Test build #138914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138914/testReport)** for PR 31908 at commit [`9e953da`](https://github.com/apache/spark/commit/9e953da06bcd2f9375d8e9263b25186ceb1e335e).




[GitHub] [spark] AmplabJenkins commented on pull request #31908: [SPARK-34808][SQL] Removes outer join if it only has DISTINCT on streamed side

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31908:
URL: https://github.com/apache/spark/pull/31908#issuecomment-804337045


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136348/
   

