Posted to reviews@spark.apache.org by "maytasm (via GitHub)" <gi...@apache.org> on 2023/10/21 06:21:48 UTC

[PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

maytasm opened a new pull request, #43471:
URL: https://github.com/apache/spark/pull/43471

   ### What changes were proposed in this pull request?
   
   This PR adds a new feature (disabled by default to preserve current behavior) that evaluates scalar subqueries in the Optimizer before the filter push-down rule.
   
   ### Why are the changes needed?
   Some queries can benefit from having their scalar subquery in the filter evaluated during planning, so that the scalar result (from the subquery) can be pushed down.
   
   For example, a query like 
   
   `select * from t2 where b > (select max(a) from t1) `
   
   where t1 is a small table but t2 is a very large table, can benefit if we first evaluate the subquery and then push the result down as part of the pushed filter (instead of leaving the subquery in the post-scan filter).
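
   A quick spark-shell sketch of the intended effect (hypothetical data; the conf name is the one this PR adds):

```scala
// Hypothetical spark-shell session: t1 is small, t2 is very large.
spark.conf.set("spark.sql.subquery.eval.enabled", "true") // conf added by this PR

spark.sql("SELECT * FROM t2 WHERE b > (SELECT max(a) FROM t1)").explain()
// With the rule active, the scalar subquery is evaluated during optimization,
// so the scan should report a concrete pushed filter such as
// PushedFilters: [GreaterThan(b, 42)] instead of a post-scan subquery filter.
```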
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. A new conf is added (disabled by default).
   
   ### How was this patch tested?
   Unit tested and manually tested
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "holdenk (via GitHub)" <gi...@apache.org>.
holdenk commented on code in PR #43471:
URL: https://github.com/apache/spark/pull/43471#discussion_r1369374345


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/ExecuteUncorrelatedScalarSubquery.scala:
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.SubExprUtils.hasOuterReferences
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+class ExecuteUncorrelatedScalarSubquery(spark: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case subquery: expressions.ScalarSubquery if !hasOuterReferences(subquery.plan) =>
+      val result = SubqueryEvaluation.ifEnabled(spark) {
+        evaluate(subquery)
+      }
+      result.getOrElse(subquery)
+  }
+
+  private def evaluate(subquery: expressions.ScalarSubquery): Literal = {
+    val qe = new QueryExecution(spark, subquery.plan)
+    val (resultType, rows) = SQLExecution.withNewExecutionId(qe) {
+      val physicalPlan = qe.executedPlan
+      (physicalPlan.schema.fields.head.dataType, physicalPlan.executeCollect())
+    }
+
+    if (rows.length > 1) {
+      throw new AnalysisException(
+        s"More than one row returned by a subquery used as an expression:\n${subquery.plan}")
+    }
+
+    if (rows.length == 1) {
+      assert(rows(0).numFields == 1,
+        s"Expects 1 field, but got ${rows(0).numFields}; something went wrong in analysis")

Review Comment:
   Should we throw an AnalysisException here too instead of the assert?
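
   A sketch of that suggestion, reusing the names from the diff above:

```scala
// Replace the assert with an AnalysisException so malformed subquery results
// surface as a user-facing analysis error rather than an internal assertion.
if (rows.length == 1 && rows(0).numFields != 1) {
  throw new AnalysisException(
    s"Expects 1 field, but got ${rows(0).numFields}; " +
      s"something went wrong in analysis:\n${subquery.plan}")
}
```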



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala:
##########
@@ -31,12 +32,14 @@ import org.apache.spark.sql.execution.python.{ExtractGroupingPythonUDFFromAggreg
 class SparkOptimizer(
     catalogManager: CatalogManager,
     catalog: SessionCatalog,
-    experimentalMethods: ExperimentalMethods)
+    experimentalMethods: ExperimentalMethods,
+    spark: SparkSession)
   extends Optimizer(catalogManager) {
 
   override def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] =
     // TODO: move SchemaPruning into catalyst
     Seq(SchemaPruning) :+
+      new ExecuteUncorrelatedScalarSubquery(spark) :+

Review Comment:
   If there is a filter to push down inside of the subquery, would it still get pushed down? I'm thinking a limit 1 case might be especially common.
   
   I'm thinking it will, since we launch another query, but I just want to make sure.





Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "holdenk (via GitHub)" <gi...@apache.org>.
holdenk commented on PR #43471:
URL: https://github.com/apache/spark/pull/43471#issuecomment-1775686906

   CC @dongjoon-hyun, we found this helps with some of our read paths that you might share. If you have some cycles, a review would be appreciated :)




Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "maytasm (via GitHub)" <gi...@apache.org>.
maytasm commented on code in PR #43471:
URL: https://github.com/apache/spark/pull/43471#discussion_r1373912397


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala:
##########
@@ -31,12 +32,14 @@ import org.apache.spark.sql.execution.python.{ExtractGroupingPythonUDFFromAggreg
 class SparkOptimizer(
     catalogManager: CatalogManager,
     catalog: SessionCatalog,
-    experimentalMethods: ExperimentalMethods)
+    experimentalMethods: ExperimentalMethods,
+    spark: SparkSession)

Review Comment:
   Any advice on how to get around this?





Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43471:
URL: https://github.com/apache/spark/pull/43471#discussion_r1371235949


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/ExecuteUncorrelatedScalarSubquery.scala:
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.SubExprUtils.hasOuterReferences
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+class ExecuteUncorrelatedScalarSubquery(spark: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case subquery: expressions.ScalarSubquery if !hasOuterReferences(subquery.plan) =>

Review Comment:
   Please ensure the subquery is small enough. AFAIK, you can use `subquery.plan.stats.sizeInBytes < some threshold`.
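
   That guard could be sketched as follows (the threshold choice is illustrative; `autoBroadcastJoinThreshold` is just one plausible knob):

```scala
// Only evaluate eagerly when the subquery's estimated input is small,
// so the optimizer doesn't block on an expensive subquery execution.
case subquery: expressions.ScalarSubquery
    if !hasOuterReferences(subquery.plan) &&
      subquery.plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold =>
```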



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala:
##########
@@ -31,12 +32,14 @@ import org.apache.spark.sql.execution.python.{ExtractGroupingPythonUDFFromAggreg
 class SparkOptimizer(
     catalogManager: CatalogManager,
     catalog: SessionCatalog,
-    experimentalMethods: ExperimentalMethods)
+    experimentalMethods: ExperimentalMethods,
+    spark: SparkSession)

Review Comment:
   We can't add `SparkSession`.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/ExecuteUncorrelatedScalarSubquery.scala:
##########
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.SubExprUtils.hasOuterReferences
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+
+class ExecuteUncorrelatedScalarSubquery(spark: SparkSession) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+    case subquery: expressions.ScalarSubquery if !hasOuterReferences(subquery.plan) =>
+      val result = SubqueryEvaluation.ifEnabled(spark) {

Review Comment:
   You can use `conf.enableSubqueryEvaluation` directly.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4516,6 +4516,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+    val ENABLE_SUBQUERY_EVALUATION =
+    buildConf("spark.sql.subquery.eval.enabled")

Review Comment:
   `spark.sql.subquery.eagerEval.enabled`
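
   With that rename, the entry might read (a sketch; the doc text and version are illustrative):

```scala
val ENABLE_SUBQUERY_EVALUATION =
  buildConf("spark.sql.subquery.eagerEval.enabled")
    .doc("When true, uncorrelated scalar subqueries are evaluated eagerly " +
      "during optimization so their results can be pushed down as literals.")
    .version("4.0.0") // hypothetical target version
    .booleanConf
    .createWithDefault(false)
```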





Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43471:
URL: https://github.com/apache/spark/pull/43471#issuecomment-1926008115

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "maytasm (via GitHub)" <gi...@apache.org>.
maytasm commented on PR #43471:
URL: https://github.com/apache/spark/pull/43471#issuecomment-1773690865

   CC: @holdenk 




Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43471:
URL: https://github.com/apache/spark/pull/43471#discussion_r1374127546


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala:
##########
@@ -31,12 +32,14 @@ import org.apache.spark.sql.execution.python.{ExtractGroupingPythonUDFFromAggreg
 class SparkOptimizer(
     catalogManager: CatalogManager,
     catalog: SessionCatalog,
-    experimentalMethods: ExperimentalMethods)
+    experimentalMethods: ExperimentalMethods,
+    spark: SparkSession)

Review Comment:
   I think you should put `ExecuteUncorrelatedScalarSubquery` into AQE, so we can get the session from `AdaptiveExecutionContext`.





Re: [PR] [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43471: [SPARK-45621] Add feature to evaluate subquery before push down filter Optimizer rule
URL: https://github.com/apache/spark/pull/43471

