You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/13 08:30:53 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

zhengruifeng opened a new pull request, #39047:
URL: https://github.com/apache/spark/pull/39047

   ### What changes were proposed in this pull request?
   1, existing `LiteralExpression` is a mixture of `Literal`, `CreateArray`, `CreateStruct` and `CreateMap`, since we have added collection functions `array`, `struct` and `create_map`, the `CreateXXX` expressions can be replaced with `UnresolvedFunction`; 
   2, add field `dataType` in `LiteralExpression`, so we can specify the DataType if needed, a special case is the typed null;
   3, it is up to the `lit` function to infer the DataType, not `LiteralExpression` itself;
   
   ### Why are the changes needed?
   Refactor LiteralExpression to support DataType
   
   ### Does this PR introduce _any_ user-facing change?
   No, `LiteralExpression` is a internal class, should not expose to end users
   
   ### How was this patch tested?
   added UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047897612


##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
       int32 year_month_interval = 20;
       int64 day_time_interval = 21;
 
-      Array array = 22;
-      Struct struct = 23;
-      Map map = 24;

Review Comment:
   We do support array literals but that's done at the client side (combination of array of literals) to have the consistent type coercion rule.
   
   Map and structs are not supported yet in Python side because it's arguable for `tuple` and `dict` to be `struct` and `struct` as both can be read as `array` and `map` in `createDataFrame`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047048460


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
 def lit(col: Any) -> Column:

Review Comment:
   Yeah, let's probably remote them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047913096


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
 def lit(col: Any) -> Column:

Review Comment:
   Ok, I will send a followup to remove them soon



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1350189933

   Thanks for the review @grundprinzip. Yes, these questions should be answered, and why they are not there in PySpark. I inlined the answers in the review comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1048005530


##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
       int32 year_month_interval = 20;
       int64 day_time_interval = 21;
 
-      Array array = 22;
-      Struct struct = 23;
-      Map map = 24;
+      DataType typed_null = 22;

Review Comment:
   where do we read this field at the server side?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1348158046

   cc @grundprinzip @HyukjinKwon @cloud-fan @amaliujia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1348646261

   I really should learn to read :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1048108012


##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
       int32 year_month_interval = 20;
       int64 day_time_interval = 21;
 
-      Array array = 22;
-      Struct struct = 23;
-      Map map = 24;
+      DataType typed_null = 22;

Review Comment:
   fix in https://github.com/apache/spark/pull/39060



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047201064


##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
       int32 year_month_interval = 20;
       int64 day_time_interval = 21;
 
-      Array array = 22;
-      Struct struct = 23;
-      Map map = 24;

Review Comment:
   So we don't support array literals? Don't we have some use cases for that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047732183


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
 def lit(col: Any) -> Column:

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1350189157

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1048030050


##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
       int32 year_month_interval = 20;
       int64 day_time_interval = 21;
 
-      Array array = 22;
-      Struct struct = 23;
-      Map map = 24;
+      DataType typed_null = 22;

Review Comment:
   oops, I forgot to add it. Will send a followup for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047199951


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##########
@@ -99,20 +96,6 @@ object LiteralValueProtoConverter {
       case proto.Expression.Literal.LiteralTypeCase.DAY_TIME_INTERVAL =>
         expressions.Literal(lit.getDayTimeInterval, DayTimeIntervalType())
 
-      case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
-        val literals = lit.getArray.getValuesList.asScala.toArray.map(toCatalystExpression)
-        CreateArray(literals)
-
-      case proto.Expression.Literal.LiteralTypeCase.STRUCT =>

Review Comment:
   Where do these go?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1350215606

   Thank you all for the reviews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1348638637

   @zhengruifeng Sorry I did not see this PR when I commented on your other PR, sorry for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1046801784


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
 def lit(col: Any) -> Column:

Review Comment:
   actually, PySpark's [`lit` function](https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L168-L179) do not support `tuple` and `dict`, and it just start to support `list` since 3.4.0.
   
   ```
   In [6]: from pyspark.sql import functions as SF
   
   In [7]: SF.lit({1 : 2})
   ---------------------------------------------------------------------------
   Py4JJavaError                             Traceback (most recent call last)
   Cell In[7], line 1
   ----> 1 SF.lit({1 : 2})
   ...
   
   Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
   : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '{1=2}' of class java.util.HashMap.
           at 
   ...
   
   In [8]: SF.lit(("a", "b"))
   ---------------------------------------------------------------------------
   Py4JJavaError                             Traceback (most recent call last)
   Cell In[8], line 1
   ----> 1 SF.lit(("a", "b"))
   ...
   
   Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
   : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[a, b]' of class java.util.ArrayList.
   
   ...
   
   In [9]: SF.lit(["a", "b"])
   Out[9]: Column<'array(a, b)'>
   ```
   
   I'm thinking of removing `struct` and `map` here.
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047897785


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##########
@@ -99,20 +96,6 @@ object LiteralValueProtoConverter {
       case proto.Expression.Literal.LiteralTypeCase.DAY_TIME_INTERVAL =>
         expressions.Literal(lit.getDayTimeInterval, DayTimeIntervalType())
 
-      case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
-        val literals = lit.getArray.getValuesList.asScala.toArray.map(toCatalystExpression)
-        CreateArray(literals)
-
-      case proto.Expression.Literal.LiteralTypeCase.STRUCT =>

Review Comment:
   This was removed per https://github.com/apache/spark/pull/39047#discussion_r1046801784. See also https://github.com/apache/spark/pull/39047#discussion_r1047897612



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
URL: https://github.com/apache/spark/pull/39047


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org