You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "peter-toth (via GitHub)" <gi...@apache.org> on 2023/09/19 13:47:17 UTC

[GitHub] [spark] peter-toth opened a new pull request, #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

peter-toth opened a new pull request, #42997:
URL: https://github.com/apache/spark/pull/42997

   ### What changes were proposed in this pull request?
   This PR fixes a bug regarding non-deterministic seeded Dataset functions.
   
   If we run the following example the result is the expected equal 2 columns:
   ```
   val c = rand()
   df.select(c, c)
   
   +--------------------------+--------------------------+
   |rand(-4522010140232537566)|rand(-4522010140232537566)|
   +--------------------------+--------------------------+
   |        0.4520819282997137|        0.4520819282997137|
   +--------------------------+--------------------------+
   ```
   
   But if we run use other similar APIs their result is incorrect:
   ```
   val r1 = random()
   val r2 = uuid()
   val r3 = shuffle(col("x"))
   val x = df.select(r1, r1, r2, r2, r3, r3)
   
   +------------------+------------------+--------------------+--------------------+----------+----------+
   |            rand()|            rand()|              uuid()|              uuid()|shuffle(x)|shuffle(x)|
   +------------------+------------------+--------------------+--------------------+----------+----------+
   |0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]|
   +------------------+------------------+--------------------+--------------------+----------+----------+
   ```
   
   This is because the current implementation of `rand()` passes a random seed to `Rand`, but other functions like `random()`, `uuid()` and `shuffle()` don’t. Later the `ResolveRandomSeed` rule is adds the necessary seeds but since the resolution rules don’t track expression object identities they can’t map an expression object 2 times to the same transformed object. I.e. in case of `random()` the `UnresolvedFunction("random", Seq.empty, ...)` object is transformed to 2 different `Rand(UnresolvedSeed)` objects and then 2 different random seeds are chosen.
   
   This PR explicitely adds the seeds.
   
   ### Why are the changes needed?
   To fix the above bug.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, fixes the above bug.
   
   ### How was this patch tested?
   Added new UT.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #42997:
URL: https://github.com/apache/spark/pull/42997#discussion_r1331943027


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
     if seed is not None:
         return _invoke_function("rand", lit(seed))
     else:
-        return _invoke_function("rand")
+        return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
   These are spark connect functions. Do we also need to update `python/pyspark/sql/functions.py`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1728596271

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1826172661

   This is a pre-existing correctness issue in DataFrame API with/without Spark Connect, that we might need to backport to other branches. I did not backport because it at least changes the output values but I don't mind we backport as it's technically a correctness issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #42997:
URL: https://github.com/apache/spark/pull/42997#discussion_r1331955820


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
     if seed is not None:
         return _invoke_function("rand", lit(seed))
     else:
-        return _invoke_function("rand")
+        return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
   No, we don't. The sql Python functions just invoke the Scala implementations, which has been fixed.
   
   BTW, I covered sql Python functions with a test: https://github.com/apache/spark/pull/42997/files#diff-a4984706554eae562eb0c7a66d284b1855d9bc463b03474f73493c7694244b66R1356-R1366



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs [spark]

Posted by "JoshRosen (via GitHub)" <gi...@apache.org>.

JoshRosen commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1826115510

   I have a question about the scope of the bug fixed in this PR:
   
   Is this issue specific to Connect? Or is there a pre-existing correctness issue that we might want to fix in other branches as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1727800532

   cc @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #42997:
URL: https://github.com/apache/spark/pull/42997#issuecomment-1729012140

   Thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] peter-toth commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #42997:
URL: https://github.com/apache/spark/pull/42997#discussion_r1331955820


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
     if seed is not None:
         return _invoke_function("rand", lit(seed))
     else:
-        return _invoke_function("rand")
+        return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
   No, we don't. The sql Python functions just invoke the Scala implementation of the functions, which have been fixed.
   
   BTW, I covered sql Python functions with a test: https://github.com/apache/spark/pull/42997/files#diff-a4984706554eae562eb0c7a66d284b1855d9bc463b03474f73493c7694244b66R1356-R1366



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
URL: https://github.com/apache/spark/pull/42997


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org