You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/22 08:45:27 UTC

[GitHub] [spark] itholic opened a new pull request, #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`.

itholic opened a new pull request, #37966:
URL: https://github.com/apache/spark/pull/37966

   ### What changes were proposed in this pull request?
   
   This PR proposes to support `np.ndarray` type for `pyspark.sql.functions.lit`.
   
   ### Why are the changes needed?
   
   To improve the API usability.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, now the `np.ndarray` type is available for `pyspark.sql.functions.list` as below:
   
   - Before
   ```python
   >>> spark.range(1).select(lit(np.ndarray([1, 2]))).show(truncate=False)
   Traceback (most recent call last):
   ...
   py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
   : org.apache.spark.SparkRuntimeException: [UNSUPPORTED_FEATURE.LITERAL_TYPE] The feature is not supported: Literal for '[[D@596a58b7]' of class java.util.ArrayList.
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:295)
   	at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:100)
   	at org.apache.spark.sql.functions$.lit(functions.scala:125)
   	at org.apache.spark.sql.functions.lit(functions.scala)
   	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   	at java.base/java.lang.Thread.run(Thread.java:833)
   ```
   
   - After
   ```python
   >>> spark.range(1).select(lit(np.ndarray([1, 2]))).show(truncate=False)
   +------------------------------------------------------+
   |array(array(2.058335917824E-312, 2.334195370625E-312))|
   +------------------------------------------------------+
   |[[2.058335917824E-312, 2.334195370625E-312]]          |
   +------------------------------------------------------+
   ```
   
   
   ### How was this patch tested?
   
   Added doctest.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #37966:
URL: https://github.com/apache/spark/pull/37966#discussion_r978243957


##########
python/pyspark/sql/functions.py:
##########
@@ -164,13 +166,24 @@ def lit(col: Any) -> Column:
     +--------------+
     |     [1, 2, 3]|
     +--------------+
+
+    Create a literal from numpy.ndarray.
+
+    >>> spark.range(1).select(lit(np.ndarray([1, 2]))).show(truncate=False)
+    +------------------------------------------------------+
+    |array(array(2.058335917824E-312, 2.334195370625E-312))|
+    +------------------------------------------------------+
+    |[[2.058335917824E-312, 2.334195370625E-312]]          |

Review Comment:
   Seems like the type in this example is correct since `np.ndarray([1, 2])` returns `np.float64` as below:
   
   e.g.
   ```python
   >>> np.ndarray([1, 2])
   array([[2.05833592e-312, 2.33419537e-312]])
   >>> np.ndarray([1, 2]).dtype
   dtype('float64')
   >>> spark.range(1).select(lit(np.ndarray([1, 2]))).dtypes
   [('array(array(6.94998724E-310, 6.95001394E-310))', 'array<array<double>>')]
   ```
   
   However, they do not respect NumPy types in  other cases as below:
   
   ```python
   >>> spark.range(1).select(lit(np.array([[np.int8(1), np.int8(2)]]))).dtypes
   [('array(array(1, 2))', 'array<array<int>>')]
   ```
   
   This should return `smallint` rather than `int` as below:
   
   ```python
   >>> spark.range(1).select(lit(np.array([[np.int8(1), np.int8(2)]]))).dtypes
   [('ARRAY(ARRAY(1S, 2S))', 'array<array<smallint>>')]
   ```
   
   So, I think the PR title should be addressed such as "Support np.ndarray for `functions.lit` for multi dimensions"
   
   Now we only respect the NumPy type only for the single dimension.
   
   Let me address it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic closed pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`

Posted by GitBox <gi...@apache.org>.

itholic closed pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`
URL: https://github.com/apache/spark/pull/37966


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`

Posted by GitBox <gi...@apache.org>.

itholic commented on PR #37966:
URL: https://github.com/apache/spark/pull/37966#issuecomment-1255781028

   Just noticed that we already support this. Let me just close this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #37966:
URL: https://github.com/apache/spark/pull/37966#discussion_r978243957


##########
python/pyspark/sql/functions.py:
##########
@@ -164,13 +166,24 @@ def lit(col: Any) -> Column:
     +--------------+
     |     [1, 2, 3]|
     +--------------+
+
+    Create a literal from numpy.ndarray.
+
+    >>> spark.range(1).select(lit(np.ndarray([1, 2]))).show(truncate=False)
+    +------------------------------------------------------+
+    |array(array(2.058335917824E-312, 2.334195370625E-312))|
+    +------------------------------------------------------+
+    |[[2.058335917824E-312, 2.334195370625E-312]]          |

Review Comment:
   Seems like the type in this example is correct since `np.ndarray([1, 2])` returns `np.float64` as below:
   
   e.g.
   ```python
   >>> np.ndarray([1, 2])
   array([[2.05833592e-312, 2.33419537e-312]])
   >>> np.ndarray([1, 2]).dtype
   dtype('float64')
   >>> spark.range(1).select(lit(np.ndarray([1, 2]))).dtypes
   [('array(array(6.94998724E-310, 6.95001394E-310))', 'array<array<double>>')]
   ```
   
   However, they do not respect NumPy types in  other cases as below:
   
   ```python
   >>> spark.range(1).select(lit(np.array([[np.int8(1), np.int8(2)]]))).dtypes
   [('array(array(1, 2))', 'array<array<int>>')]
   ```
   
   This should return `smallint` rather than `int` when the given type is `np.int8`:
   
   ```python
   >>> spark.range(1).select(lit(np.array([[np.int8(1), np.int8(2)]]))).dtypes
   [('ARRAY(ARRAY(1S, 2S))', 'array<array<smallint>>')]
   ```
   
   So, I think the PR title also should be addressed such as "Support np.ndarray for `functions.lit` for multi dimensions"
   
   Now we only respect the NumPy type only for the single dimension.
   
   Let me address it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a diff in pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`.

Posted by GitBox <gi...@apache.org>.

itholic commented on code in PR #37966:
URL: https://github.com/apache/spark/pull/37966#discussion_r978243957


##########
python/pyspark/sql/functions.py:
##########
@@ -164,13 +166,24 @@ def lit(col: Any) -> Column:
     +--------------+
     |     [1, 2, 3]|
     +--------------+
+
+    Create a literal from numpy.ndarray.
+
+    >>> spark.range(1).select(lit(np.ndarray([1, 2]))).show(truncate=False)
+    +------------------------------------------------------+
+    |array(array(2.058335917824E-312, 2.334195370625E-312))|
+    +------------------------------------------------------+
+    |[[2.058335917824E-312, 2.334195370625E-312]]          |

Review Comment:
   Seems like the type in this example is correct since `np.ndarray([1, 2])` returns `np.float64` as below:
   
   e.g.
   ```python
   >>> np.ndarray([1, 2])
   array([[2.05833592e-312, 2.33419537e-312]])
   >>> np.ndarray([1, 2]).dtype
   dtype('float64')
   >>> spark.range(1).select(lit(np.ndarray([1, 2]))).dtypes
   [('array(array(6.94998724E-310, 6.95001394E-310))', 'array<array<double>>')]
   ```
   
   However, they do not respect NumPy types in  other cases as below:
   
   ```python
   >>> spark.range(1).select(lit(np.array([[np.int8(1), np.int8(2)]]))).dtypes
   [('array(array(1, 2))', 'array<array<int>>')]
   ```
   
   This should return `smallint` rather than `int` when the given type is `np.int8`:
   
   ```python
   >>> spark.range(1).select(lit(np.array([[np.int8(1), np.int8(2)]]))).dtypes
   [('ARRAY(ARRAY(1S, 2S))', 'array<array<smallint>>')]
   ```
   
   So, I think the PR title should be addressed such as "Support np.ndarray for `functions.lit` for multi dimensions"
   
   Now we only respect the NumPy type only for the single dimension.
   
   Let me address it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37966: [SPARK-40462][PYTHON] Support np.ndarray for `functions.lit`.

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #37966:
URL: https://github.com/apache/spark/pull/37966#discussion_r977380636


##########
python/pyspark/sql/functions.py:
##########
@@ -164,13 +166,24 @@ def lit(col: Any) -> Column:
     +--------------+
     |     [1, 2, 3]|
     +--------------+
+
+    Create a literal from numpy.ndarray.
+
+    >>> spark.range(1).select(lit(np.ndarray([1, 2]))).show(truncate=False)
+    +------------------------------------------------------+
+    |array(array(2.058335917824E-312, 2.334195370625E-312))|
+    +------------------------------------------------------+
+    |[[2.058335917824E-312, 2.334195370625E-312]]          |

Review Comment:
   Hm, shouldn't they be integers?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org