Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/07/12 15:37:47 UTC

[GitHub] [spark] WeichenXu123 opened a new pull request #25138: [SPARK-26175][PYSPARK] Closing stdin of the worker process right after fork

URL: https://github.com/apache/spark/pull/25138
 
 
   ## What changes were proposed in this pull request?
   
   The PySpark worker daemon reads the PIDs of the workers to kill from stdin: https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127
   
   However, the worker process is forked from the worker daemon process, and stdin was not closed in the child after the fork. This means the child, and any user program it runs, can also read from stdin, which can prevent the daemon from receiving the PID to kill. As a result, the task reaper might detect that the task was not terminated and eventually kill the JVM.
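
To illustrate the problem in isolation, here is a minimal sketch. It is not code from `daemon.py`: a pipe stands in for the daemon's stdin, and the function name `child_steals_parent_input` is hypothetical. It only shows that a forked child inherits the parent's descriptor and can consume bytes that were meant for the parent.

```python
import os


def child_steals_parent_input(payload: bytes) -> bytes:
    """Return what is left for the parent after a forked child reads first.

    Assumption for illustration: a pipe plays the role of the daemon's stdin.
    """
    read_end, write_end = os.pipe()
    os.write(write_end, payload)
    os.close(write_end)

    pid = os.fork()
    if pid == 0:
        # Child: it inherited read_end, so it can drain the shared input.
        try:
            os.read(read_end, len(payload))
        finally:
            # Leave without running the parent's cleanup handlers.
            os._exit(0)

    # Parent: wait for the child, then try to read its own input.
    os.waitpid(pid, 0)
    remaining = os.read(read_end, len(payload))
    os.close(read_end)
    return remaining
```

For a small payload, the parent is left with nothing to read (the call returns empty bytes), which is the situation the daemon is in when a worker or a user subprocess reads stdin before it does.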
   
   This PR fixes this by closing stdin of the worker process right after the fork.
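
The idea can be sketched as follows. This is not the actual change to `daemon.py`; `fork_worker_with_closed_stdin` and `stdin_is_unreadable` are hypothetical names, and the sketch only shows the general pattern of closing file descriptor 0 in the child immediately after `os.fork()`.

```python
import os


def fork_worker_with_closed_stdin(worker_main):
    """Fork a child, close its stdin, run worker_main() there, return the child's PID.

    Hypothetical helper for illustration; worker_main must return an int exit code.
    """
    pid = os.fork()
    if pid == 0:
        # Child: drop the inherited stdin so that neither this process nor
        # anything it spawns can consume bytes meant for the parent.
        exit_code = 1
        try:
            try:
                os.close(0)
            except OSError:
                pass  # stdin was already closed
            exit_code = worker_main()
        finally:
            # Leave without running the parent's cleanup handlers.
            os._exit(exit_code)
    return pid


def stdin_is_unreadable():
    """Return 0 if reading from file descriptor 0 fails, 1 otherwise."""
    try:
        os.read(0, 1)
    except OSError:
        return 0
    return 1


if __name__ == "__main__":
    child_pid = fork_worker_with_closed_stdin(stdin_is_unreadable)
    _, status = os.waitpid(child_pid, 0)
    print("child could not read stdin:", os.WEXITSTATUS(status) == 0)
```

With stdin closed in the child, a read on descriptor 0 fails right away instead of blocking or taking the parent's input.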
   
   ## How was this patch tested?
   
   Manually tested.
   
   In `pyspark`, run:
   ```
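   import subprocess

   # `sc` is the SparkContext that the `pyspark` shell provides.
   def task(_):
       # `cat` with no arguments reads from the stdin it inherits from the Python worker.
       subprocess.check_output(["cat"])

   sc.parallelize(range(1), 1).mapPartitions(task).count()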
   ```
   
   Before:
   The job hangs. Pressing Ctrl+C ends the job, but the Python worker process does not exit.
   After:
   The job fails immediately with an error like `subprocess.CalledProcessError: Command '['cat']' returned non-zero exit status 1.`, and the Python worker exits immediately.
   
   
