Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2024/01/19 03:38:58 UTC

[PR] [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed` [spark]

zhengruifeng opened a new pull request, #44793:
URL: https://github.com/apache/spark/pull/44793

   ### What changes were proposed in this pull request?
   Make `shuffle` specify the datatype of `seed`
   
   
   ### Why are the changes needed?
   The `shuffle` function may fail with an extremely low probability (~2e-10):
   `shuffle` is an unregistered function that requires a Long-typed `seed`. In the Scala client,
   `SparkClassUtils.random.nextLong` guarantees the type, while in Python, `lit(random.randint(0, sys.maxsize))` may return a Literal Integer instead of a Literal Long.
   
   ```
   In [26]: from pyspark.sql import functions as sf
   
   In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
   
   In [28]: df.select(sf.shuffle(df.data)).show()
   +-------------+
   |shuffle(data)|
   +-------------+
   |[1, 3, 5, 20]|
   +-------------+
   
   
   In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
   +-------------+
   |shuffle(data)|
   +-------------+
   |[20, 1, 5, 3]|
   +-------------+
   
   
   In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
   ...
   SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345
   
   ```
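
The failure condition and its likelihood can be sketched without a Spark session. This is a minimal illustration, not the client's actual code: `inferred_literal_type` is a hypothetical helper that assumes an integer literal is typed as Integer when it fits in 32 bits and as Long otherwise, which matches the behavior shown in the session above.

```python
import sys

# Bounds of a 32-bit signed integer.
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def inferred_literal_type(value: int) -> str:
    """Hypothetical stand-in for the client's literal type inference.

    Assumption: values that fit in 32 bits become Integer literals,
    larger ones become Long literals.
    """
    if INT32_MIN <= value <= INT32_MAX:
        return "integer"
    return "long"


# The seed is drawn uniformly from [0, sys.maxsize].
seed_space = sys.maxsize + 1

# Fraction of that range that fits in 32 bits, i.e. the chance of
# producing an Integer literal instead of a Long literal.
p_integer = (INT32_MAX + 1) / seed_space

print(inferred_literal_type(12345))
print(inferred_literal_type(123456789000000))
print(f"{p_integer:.2e}")
```

On a 64-bit Python build, `sys.maxsize` is `2**63 - 1`, so the fraction works out to `2**-32`, in line with the estimate given above. Passing the type explicitly removes the dependence on the sampled value.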
   
   Another case is `uuid`, but it is not supported in Python due to namespace conflicts.
   I did not find any other similar cases.
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   Manually checked.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed` [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #44793: [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`
URL: https://github.com/apache/spark/pull/44793




Re: [PR] [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed` [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #44793:
URL: https://github.com/apache/spark/pull/44793#issuecomment-1899643472

   ci: https://github.com/zhengruifeng/spark/actions/runs/7578671965/job/20641673985




Re: [PR] [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed` [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #44793:
URL: https://github.com/apache/spark/pull/44793#issuecomment-1899825827

   thanks, merged to master

