You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2024/03/21 20:30:31 UTC

[PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

dtenedor opened a new pull request, #45652:
URL: https://github.com/apache/spark/pull/45652

   ### What changes were proposed in this pull request?
   
   This PR blocks lambda and higher-order functions in subquery expression plans, since these can generate incorrect results.
   
   ### Why are the changes needed?
   
   It is safer to block these query patterns which can generate incorrect results for now. We can consider re-enabling these patterns later after fixing the correctness problem.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   This PR adds unit test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2013674975

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.
jchen5 commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1536238049


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala:
##########
@@ -138,4 +141,16 @@ object ResolveLambdaVariables extends Rule[LogicalPlan] {
     case _ =>
       e.mapChildren(resolve(_, parentLambdaMap))
   }
+
+  /**
+   * SPARK-47509: There is a correctness bug when lambdas or higher-order functions appear within

Review Comment:
   Comment should be updated too: appear within -> contain



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1538082343


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala:
##########
@@ -138,4 +141,16 @@ object ResolveLambdaVariables extends Rule[LogicalPlan] {
     case _ =>
       e.mapChildren(resolve(_, parentLambdaMap))
   }
+
+  /**
+   * SPARK-47509: There is a correctness bug when subquery expressions appear within lambdas or
+   * higher-order functions. Here we check for that case and return an error if the corresponding
+   * configuration indicates to do so.
+   */
+  private def checkForSubqueryExpressions(expression: Expression): Unit = {
+    if (expression.containsPattern(PLAN_EXPRESSION) &&
+      !conf.getConf(SQLConf.ALLOW_SUBQUERY_EXPRESSIONS_IN_LAMBDAS_AND_HIGHER_ORDER_FUNCTIONS)) {
+      throw QueryCompilationErrors.subqueryExpressionInLambdaOrHigherOrderFunctionNotAllowedError()

Review Comment:
   This is done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.
jchen5 commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2015929621

   I suspect it has existed as long as lambdas have. So at least like 5 years?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1537003305


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -4457,6 +4457,11 @@
           "Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses: <sqlExprs>."
         ]
       },
+      "LAMBDA_OR_HIGHER_ORDER_FUNCTION" : {

Review Comment:
   Reading the code comment
   ```
   /**
    * A higher order function takes one or more (lambda) functions and applies these to some objects.
    * The function produces a number of variables which can be consumed by some lambda function.
    */
   ```
   
   I think for use-facing error message, we can just say "higher order functions". For the code change, we can check both `HigherOrderFunction` and `LambdaFunction` to  be safe.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1538086998


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -4457,6 +4457,11 @@
           "Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses: <sqlExprs>."
         ]
       },
+      "LAMBDA_OR_HIGHER_ORDER_FUNCTION" : {

Review Comment:
   Sounds good, this is simpler. Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1538466697


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -4496,6 +4496,11 @@
           "Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses: <sqlExprs>."
         ]
       },
+      "LAMBDA_OR_HIGHER_ORDER_FUNCTION" : {

Review Comment:
   ```suggestion
         "HIGHER_ORDER_FUNCTION" : {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1538470659


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveSubquerySuite.scala:
##########
@@ -270,4 +271,32 @@ class ResolveSubquerySuite extends AnalysisTest {
       ), Seq(a, b)).as("sub") :: Nil, t1)
     )
   }
+
+  test("SPARK-47509: Incorrect results for subquery expressions in LambdaFunctions") {
+    val data = LocalRelation(Seq(
+      $"key".int,
+      $"values1".array(IntegerType),
+      $"values2".array(ArrayType(ArrayType(IntegerType)))))
+
+    def plan(e: Expression): LogicalPlan = data.select(e.as("res"))
+
+    def lv(s: Symbol): UnresolvedNamedLambdaVariable =
+      UnresolvedNamedLambdaVariable(Seq(s.name))
+
+    val lambdaPlanScanFromTable: LogicalPlan = plan(
+      LambdaFunction(
+        function = lv(Symbol("x")) + lv(Symbol("X")),
+        arguments = Alias(
+          child = ScalarSubquery(
+            plan(lv(Symbol("x")))),
+          name = "alias")()
+          :: lv(Symbol("X"))
+          :: Nil))
+
+    assertAnalysisErrorClass(
+      inputPlan = lambdaPlanScanFromTable,
+      expectedErrorClass =
+        "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.LAMBDA_OR_HIGHER_ORDER_FUNCTION",

Review Comment:
   ```suggestion
           "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.HIGHER_ORDER_FUNCTION",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2018690058

   Thanks @cloud-fan for your review! Responded to your suggestions and updated the error classes documentation and fixed the conflict with the `master` branch, please take another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1538470515


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##########
@@ -620,6 +620,12 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
       messageParameters = Map("prettyName" -> toSQLId(prettyName), "syntax" -> toSQLStmt(syntax)))
   }
 
+  def subqueryExpressionInLambdaOrHigherOrderFunctionNotAllowedError(): Throwable = {
+    new AnalysisException(
+      errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.LAMBDA_OR_HIGHER_ORDER_FUNCTION",

Review Comment:
   ```suggestion
         errorClass = "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.HIGHER_ORDER_FUNCTION",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1538470307


##########
docs/sql-error-conditions-unsupported-subquery-expression-category-error-class.md:
##########
@@ -50,6 +50,10 @@ A correlated outer name reference within a subquery expression body was not foun
 
 Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses: `<sqlExprs>`.
 
+## LAMBDA_OR_HIGHER_ORDER_FUNCTION

Review Comment:
   ```suggestion
   ## HIGHER_ORDER_FUNCTION
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1537001958


##########
common/utils/src/main/resources/error/error-classes.json:
##########
@@ -4457,6 +4457,11 @@
           "Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses: <sqlExprs>."
         ]
       },
+      "LAMBDA_OR_HIGHER_ORDER_FUNCTION" : {

Review Comment:
   hmmm, are these two the same thing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #45652: [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions
URL: https://github.com/apache/spark/pull/45652


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block subquery expressions in lambda and higher-order functions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2019321016

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2015952790

   @jchen5 @dongjoon-hyun I updated it to switch to checking for subqueries with lambdas inside, instead of vice versa, per request. Please take another look :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45652:
URL: https://github.com/apache/spark/pull/45652#discussion_r1537002460


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala:
##########
@@ -138,4 +141,16 @@ object ResolveLambdaVariables extends Rule[LogicalPlan] {
     case _ =>
       e.mapChildren(resolve(_, parentLambdaMap))
   }
+
+  /**
+   * SPARK-47509: There is a correctness bug when subquery expressions appear within lambdas or
+   * higher-order functions. Here we check for that case and return an error if the corresponding
+   * configuration indicates to do so.
+   */
+  private def checkForSubqueryExpressions(expression: Expression): Unit = {
+    if (expression.containsPattern(PLAN_EXPRESSION) &&
+      !conf.getConf(SQLConf.ALLOW_SUBQUERY_EXPRESSIONS_IN_LAMBDAS_AND_HIGHER_ORDER_FUNCTIONS)) {
+      throw QueryCompilationErrors.subqueryExpressionInLambdaOrHigherOrderFunctionNotAllowedError()

Review Comment:
   shall we update the PR title to match the new code?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2015864799

   > According to the reported JIRA Affected Version, is this Apache Spark 4.0.0 only correctness bug? Do you happen to know when this correctness bug is introduced?
   
   @jchen5 do you know?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47509][SQL] Block lambda and higher-order functions in subquery expression plans [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.
jchen5 commented on PR #45652:
URL: https://github.com/apache/spark/pull/45652#issuecomment-2013722213

   CC @agubichev also


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org