Posted to reviews@spark.apache.org by "peter-toth (via GitHub)" <gi...@apache.org> on 2023/09/10 12:52:36 UTC

[GitHub] [spark] peter-toth opened a new pull request, #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

peter-toth opened a new pull request, #42864:
URL: https://github.com/apache/spark/pull/42864

   ### What changes were proposed in this pull request?
   This is the first cleanup PR of the ticket: it switches SQL Dataset functions to `UnresolvedFunction`/`FunctionRegistry` based resolution, similar to what Spark Connect does.
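   
   For illustration, a minimal sketch of the pattern this moves toward (`abs` is a real Spark function, but `Column.fn` and the exact signatures below are assumptions, not necessarily the committed code):
   ```scala
   // Before: construct the typed Catalyst expression directly.
   def abs(e: Column): Column = withExpr { Abs(e.expr) }
   
   // After: emit an UnresolvedFunction and let the analyzer resolve it
   // through the FunctionRegistry, mirroring Spark Connect.
   def abs(e: Column): Column = Column.fn("abs", e)
   ```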
   
   ### Why are the changes needed?
   If we can make the SQL and Connect Dataset functions similar, then we can move the functions to the `sql-api` module.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing UTs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   




[GitHub] [spark] zhengruifeng commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331147608


##########
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)

Review Comment:
   No, Spark Connect doesn't hit such an issue.
   
   





[GitHub] [spark] peter-toth commented on pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1722941731

   > @peter-toth I don't get the idea. Why do we need to let the Dataset API follow Connect? It seems this PR makes the plan get parsed later.
   
   The ultimate goal is to move the Dataset and Connect functions into the `sql-api` module and have a common implementation where possible.
   




[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331079422


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
+  def this(valueAndList: Seq[Expression]) = {

Review Comment:
   I'm not sure if this is useful. `in(a, b, c)` looks pretty weird to me. Can we revert it and treat `def in` as a special case? The SQL parser also treats `IN` as a special case and has dedicated syntax for it.
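   
   For illustration, keeping it as a special case could look like `Column.isin`'s long-standing shape (a sketch, not the PR's code):
   ```scala
   // Build the In expression directly instead of resolving it by name.
   def isin(list: Any*): Column = withExpr(In(expr, list.map(lit(_).expr)))
   ```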





[GitHub] [spark] cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
URL: https://github.com/apache/spark/pull/42864




[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331134534


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
+  def this(valueAndList: Seq[Expression]) = {

Review Comment:
   All right, reverted in https://github.com/apache/spark/pull/42864/commits/8bf64a7f934a537ae00421c54a3b115a25c21168.





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331077927


##########
python/pyspark/sql/column.py:
##########
@@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column":
         --------
         >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"])
         >>> df.select(df.l[slice(1, 3)], df.d['key']).show()
-        +------------------+------+
-        |substring(l, 1, 3)|d[key]|
-        +------------------+------+
-        |               abc| value|
-        +------------------+------+
+        +---------------+------+
+        |substr(l, 1, 3)|d[key]|

Review Comment:
   Ping @peter-toth 





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328209728


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
+  def this(valueAndList: Seq[Expression]) = {

Review Comment:
   does it mean we support `select in(a, b, c)` now?





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328209431


##########
python/pyspark/sql/column.py:
##########
@@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column":
         --------
         >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"])
         >>> df.select(df.l[slice(1, 3)], df.d['key']).show()
-        +------------------+------+
-        |substring(l, 1, 3)|d[key]|
-        +------------------+------+
-        |               abc| value|
-        +------------------+------+
+        +---------------+------+
+        |substr(l, 1, 3)|d[key]|

Review Comment:
   To avoid this change, I think we can call `self.substring` in this method. `df.l[slice(1, 3)]` does not require a certain function name, so keeping it as it was is better.





[GitHub] [spark] zhengruifeng commented on pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1722709137

   also cc @beliefer @panbingkun 




[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328220027


##########
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -6843,9 +6562,8 @@ object functions {
    * @since 3.0.0
    */
   // scalastyle:on line.size.limit
-  def schema_of_json(json: Column, options: java.util.Map[String, String]): Column = {
-    withExpr(SchemaOfJson(json.expr, options.asScala.toMap))
-  }
+  def schema_of_json(json: Column, options: java.util.Map[String, String]): Column =

Review Comment:
   I think these should be among the few exceptions that we keep as they were, because the expression (`SchemaOfJson` in this case) takes non-expression inputs, and it's tricky to define a protocol that can pass non-expression inputs as expressions.





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328213801


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -234,7 +260,7 @@ class Column(val expr: Expression) extends Logging {
    * @group expr_ops
    * @since 1.3.0
    */
-  def unary_! : Column = withExpr { Not(expr) }
+  def unary_! : Column = fn("!")

Review Comment:
   It's weird to use `!` as the function name; how about `fn("not")`?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331149367


##########
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)

Review Comment:
   https://github.com/apache/spark/blob/89041a4a8c7b7787fa10f090d4324f20447c4dd3/python/pyspark/sql/connect/functions.py#L1014-L1015
   
   https://github.com/apache/spark/blob/89041a4a8c7b7787fa10f090d4324f20447c4dd3/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala#L391
   
   





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328218485


##########
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)

Review Comment:
   Unrelated issue: this is not the right place to do this conversion; it should be done in the analyzer. cc @zhengruifeng, does Spark Connect hit the same issue?
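   
   For context, a sketch of the pre-PR conversion under discussion (paraphrased; details may differ from the actual code):
   ```scala
   def count(e: Column): Column = withAggregateFunction {
     e.expr match {
       // Turn count(*) into count(1) before resolution.
       case _: Star => Count(Literal(1))
       case expr => Count(expr)
     }
   }
   ```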





[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331110271


##########
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:
##########
@@ -708,7 +708,7 @@ private[sql] object RelationalGroupedDataset {
     case expr: NamedExpression => expr
     case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
       UnresolvedAlias(a, Some(Column.generateAlias))
-    case u: UnresolvedFunction => UnresolvedAlias(expr, None)
+    case e if !e.resolved => UnresolvedAlias(expr, None)

Review Comment:
   I changed the `AggregateExpression` path to `case a: AggregateExpression => UnresolvedAlias(a, Some(Column.generateAlias))` but still kept the `case _ if !expr.resolved => UnresolvedAlias(expr, None)` because in the `plus_one(sum_udf(col("v1")))` case an unresolved `PythonUDF` is passed here.
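   
   In other words, the helper ends up shaped roughly like this (a sketch; the method name and the final fallback case are assumptions):
   ```scala
   private def alias(expr: Expression): NamedExpression = expr match {
     case ne: NamedExpression => ne
     // Let the analyzer name aggregates, including PythonUDAF-based ones.
     case a: AggregateExpression => UnresolvedAlias(a, Some(Column.generateAlias))
     // plus_one(sum_udf(col("v1"))) arrives here as an unresolved PythonUDF.
     case e if !e.resolved => UnresolvedAlias(e, None)
     case e => Alias(e, toPrettySQL(e))()
   }
   ```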





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328894416


##########
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -6843,9 +6562,8 @@ object functions {
    * @since 3.0.0
    */
   // scalastyle:on line.size.limit
-  def schema_of_json(json: Column, options: java.util.Map[String, String]): Column = {
-    withExpr(SchemaOfJson(json.expr, options.asScala.toMap))
-  }
+  def schema_of_json(json: Column, options: java.util.Map[String, String]): Column =

Review Comment:
   Oh I see, then I'm fine with it. It's inevitable that we need to create a map expression from the language-native (Scala or Python) map values.





[GitHub] [spark] cloud-fan commented on pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1723688982

   To add more color to @peter-toth's comment: think of a base trait that defines these function APIs:
   ```
   trait SQLFunctionsBase {
     // api doc ...
     def substring(...) = fn("substring", ...)
     
     protected def fn(...)
   }
   
   ```
   Then on the Spark Connect side, we override `fn` to create a proto message for the SQL function; in Spark SQL, we override `fn` to create an `UnresolvedFunction`.
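   
   A hedged sketch of how the two sides could fill that in (the `fn` signature and object names below are illustrative assumptions):
   ```scala
   trait SQLFunctionsBase {
     protected def fn(name: String, args: Column*): Column
     def substring(str: Column, pos: Column, len: Column): Column =
       fn("substring", str, pos, len)
   }
   
   // Spark SQL: defer resolution to the analyzer.
   object SparkSQLFunctions extends SQLFunctionsBase {
     override protected def fn(name: String, args: Column*): Column =
       Column(UnresolvedFunction(Seq(name), args.map(_.expr), isDistinct = false))
   }
   // Spark Connect would instead override fn to emit the
   // UnresolvedFunction proto message for the same call.
   ```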




[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331145071


##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSymmetricHashJoinHelperSuite.scala:
##########
@@ -49,7 +44,12 @@ class StreamingSymmetricHashJoinHelperSuite extends StreamTest {
   test("only literals") {
     // Literal-only conjuncts end up on the left side because that's the first bucket they fit in.
     // There's no semantic reason they couldn't be in any bucket.
-    val predicate = (lit(1) < lit(5) && lit(6) < lit(7) && lit(0) === lit(-1)).expr
+    val predicate =
+      And(
+        And(
+          LessThan(lit(1).expr, lit(5).expr),
+          LessThan(lit(6).expr, lit(7).expr)),
+        EqualTo(lit(0).expr, lit(-1).expr))

Review Comment:
   shall we use dsl to build expressions?
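   
   For reference, the Catalyst DSL spelling of the same predicate (a sketch):
   ```scala
   import org.apache.spark.sql.catalyst.dsl.expressions._
   import org.apache.spark.sql.catalyst.expressions.Literal
   
   // Same And(And(LessThan, LessThan), EqualTo) tree, built via the DSL.
   val predicate =
     Literal(1) < Literal(5) && Literal(6) < Literal(7) && Literal(0) === Literal(-1)
   ```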





[GitHub] [spark] cloud-fan commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1727813318

   thanks, merging to master!






[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331311772


##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -723,11 +728,14 @@ object IntegratedUDFTestUtils extends SQLHelper {
       override def builder(e: Seq[Expression]): Expression = {
         assert(e.length == 1, "Defined UDF only has one column")
         val expr = e.head
-        assert(expr.resolved, "column should be resolved to use the same type " +

Review Comment:
   In https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonUDFSuite.scala#L41 the `+` is unresolved now.
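   
   A sketch of the defensive shape this implies (assumed; `createPythonUDF` is a hypothetical helper, not the real test utility):
   ```scala
   override def builder(e: Seq[Expression]): Expression = {
     assert(e.length == 1, "Defined UDF only has one column")
     val expr = e.head
     // col("a") + col("b") may now arrive as an unresolved "+", so only
     // inspect the input type once the expression is resolved.
     val inputType = if (expr.resolved) Some(expr.dataType) else None
     createPythonUDF(expr, inputType) // hypothetical helper
   }
   ```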





[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328419782


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:
##########
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
+  def this(valueAndList: Seq[Expression]) = {

Review Comment:
   Yes, but it was not a goal; it is just a useful side effect.





[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331301998


##########
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSymmetricHashJoinHelperSuite.scala:
##########
@@ -49,7 +44,12 @@ class StreamingSymmetricHashJoinHelperSuite extends StreamTest {
   test("only literals") {
     // Literal-only conjuncts end up on the left side because that's the first bucket they fit in.
     // There's no semantic reason they couldn't be in any bucket.
-    val predicate = (lit(1) < lit(5) && lit(6) < lit(7) && lit(0) === lit(-1)).expr
+    val predicate =
+      And(
+        And(
+          LessThan(lit(1).expr, lit(5).expr),
+          LessThan(lit(6).expr, lit(7).expr)),
+        EqualTo(lit(0).expr, lit(-1).expr))

Review Comment:
   fixed in https://github.com/apache/spark/pull/42864/commits/580f97b0e458a87509ccf6882075cf7330062d54





[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331113924


##########
python/pyspark/sql/column.py:
##########
@@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column":
         --------
         >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"])
         >>> df.select(df.l[slice(1, 3)], df.d['key']).show()
-        +------------------+------+
-        |substring(l, 1, 3)|d[key]|
-        +------------------+------+
-        |               abc| value|
-        +------------------+------+
+        +---------------+------+
+        |substr(l, 1, 3)|d[key]|

Review Comment:
   I tried `self.substring` (see https://github.com/apache/spark/pull/42864/commits/d611fd5a8da8f0f24d54c3933b3cf7e4dbdabe2c), but there is no `self.substring`, so I reverted it in https://github.com/apache/spark/pull/42864/commits/6bd03d11b4d345ef0d571976ab80a99002a7ca75. If the column name doesn't matter, then let's use `self.substr`.





[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328960631


##########
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:
##########
@@ -708,7 +708,7 @@ private[sql] object RelationalGroupedDataset {
     case expr: NamedExpression => expr
     case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
       UnresolvedAlias(a, Some(Column.generateAlias))
-    case u: UnresolvedFunction => UnresolvedAlias(expr, None)
+    case e if !e.resolved => UnresolvedAlias(expr, None)

Review Comment:
   Without this change, some Python tests fail, like `sum_udf(col("v2")) + 5` here: https://github.com/apache/spark/blob/master/python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py#L404C17-L404C40.
   I debugged this today, and it seems these are `AggregateExpression`s, but the aggregate function is a `PythonUDAF`, so they don't match the previous `case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>` case. Shall we remove the if condition? `Column.generateAlias` seems to take care of it: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L44-L50
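   
   For reference, `Column.generateAlias` as linked above (paraphrased; treat the details as approximate):
   ```scala
   private[sql] def generateAlias(e: Expression): String = e match {
     case a: AggregateExpression
         if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
       a.aggregateFunction.toString
     case expr => toPrettySQL(expr)
   }
   ```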





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1331143291


##########
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:
##########
@@ -723,11 +728,14 @@ object IntegratedUDFTestUtils extends SQLHelper {
       override def builder(e: Seq[Expression]): Expression = {
         assert(e.length == 1, "Defined UDF only has one column")
         val expr = e.head
-        assert(expr.resolved, "column should be resolved to use the same type " +

Review Comment:
   I'm a bit confused about this change. IIUC we always call `builder` with resolved expressions.





[GitHub] [spark] peter-toth commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1728048096

   Thanks for the review!




[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328386268


##########
sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -6843,9 +6562,8 @@ object functions {
    * @since 3.0.0
    */
   // scalastyle:on line.size.limit
-  def schema_of_json(json: Column, options: java.util.Map[String, String]): Column = {
-    withExpr(SchemaOfJson(json.expr, options.asScala.toMap))
-  }
+  def schema_of_json(json: Column, options: java.util.Map[String, String]): Column =

Review Comment:
   Not sure I get your point. `SchemaOfJson` seems to accept options as expressions: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L776-L778. We just follow Connect here.
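   
   The linked constructor, roughly (paraphrased from the referenced lines; the exact shape may differ):
   ```scala
   def this(child: Expression, options: Expression) =
     this(child, options = ExprUtils.convertToMapData(options))
   ```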





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328221713


##########
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:
##########
@@ -708,7 +708,7 @@ private[sql] object RelationalGroupedDataset {
     case expr: NamedExpression => expr
     case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
       UnresolvedAlias(a, Some(Column.generateAlias))
-    case u: UnresolvedFunction => UnresolvedAlias(expr, None)
+    case e if !e.resolved => UnresolvedAlias(expr, None)

Review Comment:
   why is this change?





[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42864:
URL: https://github.com/apache/spark/pull/42864#discussion_r1328217592


##########
sql/core/src/main/scala/org/apache/spark/sql/Column.scala:
##########
@@ -234,7 +260,7 @@ class Column(val expr: Expression) extends Logging {
    * @group expr_ops
    * @since 1.3.0
    */
-  def unary_! : Column = withExpr { Not(expr) }
+  def unary_! : Column = fn("!")

Review Comment:
   nvm, other symbols do not have a corresponding function name.





[GitHub] [spark] beliefer commented on pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #42864:
URL: https://github.com/apache/spark/pull/42864#issuecomment-1722880180

   @peter-toth I don't get the idea. Why do we need to let the Dataset API follow Connect? It seems this PR makes the plan get parsed later.

