You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/21 15:44:16 UTC

[GitHub] [spark] peter-toth opened a new pull request, #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

peter-toth opened a new pull request, #38334:
URL: https://github.com/apache/spark/pull/38334

   ### What changes were proposed in this pull request?
   This PR fixes a bug in broadcast handling `PythonRunner` when encryption is enabed. Due to this bug the following pyspark script:
   ```
   bin/pyspark --conf spark.io.encryption.enabled=true
   
   ...
   
   bar = {"a": "aa", "b": "bb"}
   foo = spark.sparkContext.broadcast(bar)
   spark.udf.register("MYUDF", lambda x: foo.value[x] if x else "")
   spark.sql("SELECT MYUDF('a') AS a, MYUDF('b') AS b").collect()
   ```
   fails with:
   ```
   22/10/21 17:14:32 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
   org.apache.spark.api.python.PythonException: Traceback (most recent call last):
     File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 811, in main
       func, profiler, deserializer, serializer = read_command(pickleSer, infile)
     File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/worker.py", line 87, in read_command
       command = serializer._read_with_length(file)
     File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 173, in _read_with_length
       return self.loads(obj)
     File "/Users/petertoth/git/apache/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 471, in loads
       return cloudpickle.loads(obj, encoding=encoding)
   EOFError: Ran out of input
   ```
   The reason for this failure is that we have multiple Python UDF referencing the same broadcast and in the current code:
   https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L385-L420
   the number of broadcasts (`cnt`) is correct (1) but the broadcast id is serialized 2 times from JVM to Python ruining the next item that Python expects from JVM side.
   
   Please note that the example above works in Spark 3.3 without this fix. That is because https://github.com/apache/spark/pull/36121 in Spark 3.4 modified `ExpressionSet` and so `udfs` in `ExtractPythonUDFs`:
   https://github.com/apache/spark/blob/748fa2792e488a6b923b32e2898d9bb6e16fb4ca/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala#L239-L242
   changed from `Stream` to `Vector`. When `broadcastVars` (and so `idsAndFiles`) is a `Stream` the example accidentaly works as the broadcast id is written to `dataOut` once (`oldBids.add(id)` in `idsAndFiles.foreach` is called before the 2nd item is calculated in `broadcastVars.flatMap`). But that doesn't mean that https://github.com/apache/spark/pull/36121 introduced the regression as `EncryptedPythonBroadcastServer` shouldn't serve the broadcast data 2 times (which `EncryptedPythonBroadcastServer` does currently, but it is not noticed) as it could fail other cases when there are more than 1 broadcast used in UDFs). 
   
   ### Why are the changes needed?
   To fix a bug.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added new UT.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled
URL: https://github.com/apache/spark/pull/38334


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #38334:
URL: https://github.com/apache/spark/pull/38334#issuecomment-1287665617

   Thanks @HyukjinKwon for the quick review!
   
   The bug was introduced in https://github.com/apache/spark/commit/58419b92673c46911c25bc6c6b13397f880c6424#diff-ed4fb5ce30273e8eefcc7d4b0152ea7a60fb4f8f709d4da8ea1ab56aeda26001R307-R323 in Spark 3.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38334:
URL: https://github.com/apache/spark/pull/38334#issuecomment-1287575193

   I am merging this to master but is the issue only in master branch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38334:
URL: https://github.com/apache/spark/pull/38334#issuecomment-1288285305

   Merged to branch-3.3, branch-3.2 and branch-3.1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #38334: [SPARK-40874][PYTHON] Fix broadcasts in Python UDFs when encryption enabled

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38334:
URL: https://github.com/apache/spark/pull/38334#issuecomment-1288260588

   @peter-toth I wll backport to the all active branches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org