Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/10 05:48:45 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

zhengruifeng opened a new pull request, #39016:
URL: https://github.com/apache/spark/pull/39016

   ### What changes were proposed in this pull request?
Check the bounds of an integer literal and choose the correct datatype: a 32-bit `IntegerType` when the value fits, otherwise a 64-bit `LongType`.
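A minimal sketch of the inference, using the `JVM_INT_MIN`/`JVM_INT_MAX`/`JVM_LONG_MIN`/`JVM_LONG_MAX` constants introduced later in this PR; the helper `infer_integer_type` is illustrative only, not the actual API:

```python
# Signed 32-bit and 64-bit bounds, matching the JVM's Integer/Long ranges.
JVM_INT_MIN, JVM_INT_MAX = -(1 << 31), (1 << 31) - 1
JVM_LONG_MIN, JVM_LONG_MAX = -(1 << 63), (1 << 63) - 1

def infer_integer_type(value: int) -> str:
    """Return the Spark SQL type name inferred for a Python int literal."""
    if JVM_INT_MIN <= value <= JVM_INT_MAX:
        return "int"      # fits in a 32-bit IntegerType
    elif JVM_LONG_MIN <= value <= JVM_LONG_MAX:
        return "bigint"   # fits in a 64-bit LongType
    else:
        # Python ints are unbounded; the JVM's Long is not.
        raise ValueError(f"integer {value} out of range for LongType")

print(infer_integer_type(2147483647))   # int
print(infer_integer_type(2147483648))   # bigint
```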
   
   
   ### Why are the changes needed?
To match PySpark's behavior:
   
   ```
   
   In [15]: int_max = 2147483647
   
   In [16]: long_max = 9223372036854775807
   
   In [17]: sdf = spark.range(0, 1)
   
   In [18]: sdf.select(lit(0), lit(int_max), lit(int_max + 1), lit(long_max))
   Out[18]: DataFrame[0: int, 2147483647: int, 2147483648: bigint, 9223372036854775807: bigint]
   
   In [19]: sdf.select(lit(0), lit(int_max), lit(int_max + 1), lit(long_max), lit(long_max + 1))
   ---------------------------------------------------------------------------
   Py4JError                                 Traceback (most recent call last)
   Cell In[19], line 1
   ----> 1 sdf.select(lit(0), lit(int_max), lit(int_max + 1), lit(long_max), lit(long_max + 1))
   
   File ~/Dev/spark/python/pyspark/sql/functions.py:179, in lit(col)
       177     if dt is not None:
       178         return _invoke_function("lit", col).astype(dt)
   --> 179 return _invoke_function("lit", col)
   
   File ~/Dev/spark/python/pyspark/sql/functions.py:88, in _invoke_function(name, *args)
        86 assert SparkContext._active_spark_context is not None
        87 jf = _get_jvm_function(name, SparkContext._active_spark_context)
   ---> 88 return Column(jf(*args))
   
   ...
   
   File ~/Dev/spark/python/pyspark/sql/utils.py:199, in capture_sql_exception.<locals>.deco(*a, **kw)
       197 def deco(*a: Any, **kw: Any) -> Any:
       198     try:
   --> 199         return f(*a, **kw)
       200     except Py4JJavaError as e:
       201         converted = convert_exception(e.java_exception)
   
   File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:330, in get_return_value(answer, gateway_client, target_id, name)
       326         raise Py4JJavaError(
       327             "An error occurred while calling {0}{1}{2}.\n".
       328             format(target_id, ".", name), value)
       329     else:
   --> 330         raise Py4JError(
       331             "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
       332             format(target_id, ".", name, value))
       333 else:
       334     raise Py4JError(
       335         "An error occurred while calling {0}{1}{2}".
       336         format(target_id, ".", name))
   
   Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lit. Trace:
   java.lang.NumberFormatException: For input string: "9223372036854775808"
           at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
   
   ```
   
   
   
   ### Does this PR introduce _any_ user-facing change?
No.
   
   
   ### How was this patch tested?
Added unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39016:
URL: https://github.com/apache/spark/pull/39016#discussion_r1045163077


##########
python/pyspark/sql/connect/column.py:
##########
@@ -180,7 +186,12 @@ def to_plan(self, session: "SparkConnectClient") -> "proto.Expression":
         elif isinstance(self._value, bool):
             expr.literal.boolean = bool(self._value)
         elif isinstance(self._value, int):
-            expr.literal.long = int(self._value)
+            if JVM_INT_MIN <= self._value <= JVM_INT_MAX:

Review Comment:
   I think the conversion [happens here](https://github.com/py4j/py4j/blob/master/py4j-java/src/main/java/py4j/Protocol.java#L281-L316)
   
   It seems py4j doesn't deal with `Byte` and `Short`.
   
   If there is an expression that only accepts a `ByteType`, then I think it will hit the same problem.
   
   Also cc @HyukjinKwon @xinrong-meng, who know py4j better than me.





[GitHub] [spark] zhengruifeng commented on pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39016:
URL: https://github.com/apache/spark/pull/39016#issuecomment-1345706119

   Thanks for the reviews, merged into master.




[GitHub] [spark] grundprinzip commented on a diff in pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39016:
URL: https://github.com/apache/spark/pull/39016#discussion_r1045017515


##########
python/pyspark/sql/connect/column.py:
##########
@@ -180,7 +180,12 @@ def to_plan(self, session: "SparkConnectClient") -> "proto.Expression":
         elif isinstance(self._value, bool):
             expr.literal.boolean = bool(self._value)
         elif isinstance(self._value, int):
-            expr.literal.long = int(self._value)
+            if -2147483648 <= self._value <= 2147483647:

Review Comment:
   Can we extract these magic values into named constants?
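   One way to name the bounds (these are the `JVM_INT_*`/`JVM_LONG_*` names the PR adopted in later revisions), written as bit shifts so the magic numbers appear only once:
   
   ```python
   # JVM Integer.MIN_VALUE / MAX_VALUE and Long.MIN_VALUE / MAX_VALUE.
   JVM_INT_MIN: int = -(1 << 31)      # -2147483648
   JVM_INT_MAX: int = (1 << 31) - 1   # 2147483647
   JVM_LONG_MIN: int = -(1 << 63)     # -9223372036854775808
   JVM_LONG_MAX: int = (1 << 63) - 1  # 9223372036854775807
   ```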





[GitHub] [spark] zhengruifeng closed pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers
URL: https://github.com/apache/spark/pull/39016




[GitHub] [spark] zhengruifeng commented on a diff in pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39016:
URL: https://github.com/apache/spark/pull/39016#discussion_r1045163182


##########
python/pyspark/sql/connect/column.py:
##########
@@ -180,7 +186,12 @@ def to_plan(self, session: "SparkConnectClient") -> "proto.Expression":
         elif isinstance(self._value, bool):
             expr.literal.boolean = bool(self._value)
         elif isinstance(self._value, int):
-            expr.literal.long = int(self._value)
+            if JVM_INT_MIN <= self._value <= JVM_INT_MAX:

Review Comment:
   ```
   In [1]: from pyspark.sql.connect.column import JVM_INT_MIN, JVM_INT_MAX, JVM_LONG_MIN, JVM_LONG_MAX
   
   In [2]: sdf = spark.range(0, 1)
   
   In [3]: sdf.select(lit(0), lit(JVM_LONG_MAX + 1))
   ---------------------------------------------------------------------------
   NameError                                 Traceback (most recent call last)
   Cell In[3], line 1
   ----> 1 sdf.select(lit(0), lit(JVM_LONG_MAX + 1))
   
   NameError: name 'lit' is not defined
   
   In [4]: from pyspark.sql.functions import *
   
   In [5]: sdf.select(lit(0), lit(JVM_LONG_MAX + 1))
   ---------------------------------------------------------------------------
   Py4JError                                 Traceback (most recent call last)
   Cell In[5], line 1
   ----> 1 sdf.select(lit(0), lit(JVM_LONG_MAX + 1))
   
   File ~/Dev/spark/python/pyspark/sql/functions.py:179, in lit(col)
       177     if dt is not None:
       178         return _invoke_function("lit", col).astype(dt)
   --> 179 return _invoke_function("lit", col)
   
   File ~/Dev/spark/python/pyspark/sql/functions.py:88, in _invoke_function(name, *args)
        86 assert SparkContext._active_spark_context is not None
        87 jf = _get_jvm_function(name, SparkContext._active_spark_context)
   ---> 88 return Column(jf(*args))
   
   File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
      1316 command = proto.CALL_COMMAND_NAME +\
      1317     self.command_header +\
      1318     args_command +\
      1319     proto.END_COMMAND_PART
      1321 answer = self.gateway_client.send_command(command)
   -> 1322 return_value = get_return_value(
      1323     answer, self.gateway_client, self.target_id, self.name)
      1325 for temp_arg in temp_args:
      1326     if hasattr(temp_arg, "_detach"):
   
   File ~/Dev/spark/python/pyspark/sql/utils.py:199, in capture_sql_exception.<locals>.deco(*a, **kw)
       197 def deco(*a: Any, **kw: Any) -> Any:
       198     try:
   --> 199         return f(*a, **kw)
       200     except Py4JJavaError as e:
       201         converted = convert_exception(e.java_exception)
   
   File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:330, in get_return_value(answer, gateway_client, target_id, name)
       326         raise Py4JJavaError(
       327             "An error occurred while calling {0}{1}{2}.\n".
       328             format(target_id, ".", name), value)
       329     else:
   --> 330         raise Py4JError(
       331             "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
       332             format(target_id, ".", name, value))
       333 else:
       334     raise Py4JError(
       335         "An error occurred while calling {0}{1}{2}".
       336         format(target_id, ".", name))
   
   Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.lit. Trace:
   java.lang.NumberFormatException: For input string: "9223372036854775808"
   	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
   	at java.base/java.lang.Long.parseLong(Long.java:696)
   	at java.base/java.lang.Long.parseLong(Long.java:817)
   	at py4j.Protocol.getLong(Protocol.java:239)
   	at py4j.Protocol.getObject(Protocol.java:291)
   	at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:82)
   	at py4j.commands.CallCommand.execute(CallCommand.java:77)
   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   	at java.base/java.lang.Thread.run(Thread.java:829)
   ```





[GitHub] [spark] grundprinzip commented on a diff in pull request #39016: [SPARK-41477][CONNECT][PYTHON] Correctly infer the datatype of literal integers

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39016:
URL: https://github.com/apache/spark/pull/39016#discussion_r1045160705


##########
python/pyspark/sql/connect/column.py:
##########
@@ -180,7 +186,12 @@ def to_plan(self, session: "SparkConnectClient") -> "proto.Expression":
         elif isinstance(self._value, bool):
             expr.literal.boolean = bool(self._value)
         elif isinstance(self._value, int):
-            expr.literal.long = int(self._value)
+            if JVM_INT_MIN <= self._value <= JVM_INT_MAX:

Review Comment:
   Sorry for commenting on this so late; it's not blocking the PR.
   
   Can we quickly check what Py4J's behavior is when sending numeric values to the JVM? Will a Python int in the byte or short range be automatically converted to that type?
   
   The reason I'm wondering: if we have expressions that take a short argument, will we have the same problem?
   
   Thanks


