You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/13 08:30:53 UTC
[GitHub] [spark] zhengruifeng opened a new pull request, #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
zhengruifeng opened a new pull request, #39047:
URL: https://github.com/apache/spark/pull/39047
### What changes were proposed in this pull request?
1, existing `LiteralExpression` is a mixture of `Literal`, `CreateArray`, `CreateStruct` and `CreateMap`, since we have added collection functions `array`, `struct` and `create_map`, the `CreateXXX` expressions can be replaced with `UnresolvedFunction`;
2, add field `dataType` in `LiteralExpression`, so we can specify the DataType if needed, a special case is the typed null;
3, it is up to the `lit` function to infer the DataType, not `LiteralExpression` itself;
### Why are the changes needed?
Refactor LiteralExpression to support DataType
### Does this PR introduce _any_ user-facing change?
No, `LiteralExpression` is a internal class, should not expose to end users
### How was this patch tested?
added UT
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047897612
##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
int32 year_month_interval = 20;
int64 day_time_interval = 21;
- Array array = 22;
- Struct struct = 23;
- Map map = 24;
Review Comment:
We do support array literals but that's done at the client side (combination of array of literals) to have the consistent type coercion rule.
Map and structs are not supported yet in Python side because it's arguable for `tuple` and `dict` to be `struct` and `struct` as both can be read as `array` and `map` in `createDataFrame`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047048460
##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
def lit(col: Any) -> Column:
Review Comment:
Yeah, let's probably remote them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047913096
##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
def lit(col: Any) -> Column:
Review Comment:
Ok, I will send a followup to remove them soon
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1350189933
Thanks for the review @grundprinzip. Yes, these questions should be answered, and why they are not there in PySpark. I inlined the answers in the review comments.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1048005530
##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
int32 year_month_interval = 20;
int64 day_time_interval = 21;
- Array array = 22;
- Struct struct = 23;
- Map map = 24;
+ DataType typed_null = 22;
Review Comment:
where do we read this field at the server side?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1348158046
cc @grundprinzip @HyukjinKwon @cloud-fan @amaliujia
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] grundprinzip commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1348646261
I really should learn to read :(
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1048108012
##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
int32 year_month_interval = 20;
int64 day_time_interval = 21;
- Array array = 22;
- Struct struct = 23;
- Map map = 24;
+ DataType typed_null = 22;
Review Comment:
fix in https://github.com/apache/spark/pull/39060
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047201064
##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
int32 year_month_interval = 20;
int64 day_time_interval = 21;
- Array array = 22;
- Struct struct = 23;
- Map map = 24;
Review Comment:
So we don't support array literals? Don't we have some use cases for that?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047732183
##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
def lit(col: Any) -> Column:
Review Comment:
+1
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1350189157
Merged to master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1048030050
##########
connector/connect/common/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -77,9 +77,7 @@ message Expression {
int32 year_month_interval = 20;
int64 day_time_interval = 21;
- Array array = 22;
- Struct struct = 23;
- Map map = 24;
+ DataType typed_null = 22;
Review Comment:
oops, I forgot to add it. Will send a followup for it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047199951
##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##########
@@ -99,20 +96,6 @@ object LiteralValueProtoConverter {
case proto.Expression.Literal.LiteralTypeCase.DAY_TIME_INTERVAL =>
expressions.Literal(lit.getDayTimeInterval, DayTimeIntervalType())
- case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
- val literals = lit.getArray.getValuesList.asScala.toArray.map(toCatalystExpression)
- CreateArray(literals)
-
- case proto.Expression.Literal.LiteralTypeCase.STRUCT =>
Review Comment:
Where do these go?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1350215606
Thank you all for the reviews
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] grundprinzip commented on pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #39047:
URL: https://github.com/apache/spark/pull/39047#issuecomment-1348638637
@zhengruifeng Sorry I did not see this PR when I commented on your other PR, sorry for that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1046801784
##########
python/pyspark/sql/connect/functions.py:
##########
@@ -92,8 +92,19 @@ def col(col: str) -> Column:
def lit(col: Any) -> Column:
Review Comment:
actually, PySpark's [`lit` function](https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L168-L179) do not support `tuple` and `dict`, and it just start to support `list` since 3.4.0.
```
In [6]: from pyspark.sql import functions as SF
In [7]: SF.lit({1 : 2})
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Cell In[7], line 1
----> 1 SF.lit({1 : 2})
...
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '{1=2}' of class java.util.HashMap.
at
...
In [8]: SF.lit(("a", "b"))
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Cell In[8], line 1
----> 1 SF.lit(("a", "b"))
...
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[a, b]' of class java.util.ArrayList.
...
In [9]: SF.lit(["a", "b"])
Out[9]: Column<'array(a, b)'>
```
I'm thinking of removing `struct` and `map` here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39047:
URL: https://github.com/apache/spark/pull/39047#discussion_r1047897785
##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/LiteralValueProtoConverter.scala:
##########
@@ -99,20 +96,6 @@ object LiteralValueProtoConverter {
case proto.Expression.Literal.LiteralTypeCase.DAY_TIME_INTERVAL =>
expressions.Literal(lit.getDayTimeInterval, DayTimeIntervalType())
- case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
- val literals = lit.getArray.getValuesList.asScala.toArray.map(toCatalystExpression)
- CreateArray(literals)
-
- case proto.Expression.Literal.LiteralTypeCase.STRUCT =>
Review Comment:
This was removed per https://github.com/apache/spark/pull/39047#discussion_r1046801784. See also https://github.com/apache/spark/pull/39047#discussion_r1047897612
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #39047: [SPARK-41506][CONNECT][PYTHON] Refactor LiteralExpression to support DataType
URL: https://github.com/apache/spark/pull/39047
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org