Posted to issues@spark.apache.org by "Peter Parente (JIRA)" <ji...@apache.org> on 2017/06/14 13:03:00 UTC

[jira] [Created] (SPARK-21094) Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway

Peter Parente created SPARK-21094:
-------------------------------------

             Summary: Allow stdout/stderr pipes in pyspark.java_gateway.launch_gateway
                 Key: SPARK-21094
                 URL: https://issues.apache.org/jira/browse/SPARK-21094
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.1.1
            Reporter: Peter Parente


The Popen call that launches the py4j gateway passes no stdout or stderr options, so the JVM inherits the parent process's streams and its logging always goes to the parent process's terminal.

https://github.com/apache/spark/blob/v2.1.1/python/pyspark/java_gateway.py#L77

It would be super handy if the launch_gateway function took an additional dict parameter called popen_kwargs that is passed through to the Popen calls. This API enhancement would, for example, allow Python applications to capture all stdout and stderr coming from Spark and process it programmatically without resorting to reading log files or other hijinks.

Example use:


{code:python}
import pyspark
import subprocess
from pyspark.java_gateway import launch_gateway

# Make the py4j JVM stdout and stderr available without buffering
popen_kwargs = {
  'stdout': subprocess.PIPE,
  'stderr': subprocess.PIPE,
  'bufsize': 0
}

# Launch the gateway with our custom settings
gateway = launch_gateway(popen_kwargs=popen_kwargs)
# Use the gateway we launched
sc = pyspark.SparkContext(gateway=gateway)

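# This could be done in a thread or event loop or ...
# Written briefly / poorly here only as a demo
# readline() returns b'' at EOF, which ends the loop.
# stderr should be drained as well (e.g. in another thread),
# otherwise the JVM can block once that pipe fills up.
for line in iter(gateway.proc.stdout.readline, b''):
  print(line.decode('utf-8').rstrip('\n'))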
{code}
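
For discussion, here is a minimal sketch of the "done in a thread" variant mentioned in the comment above. It does not use Spark at all: a child Python process stands in for the JVM, and the names drain and run_and_capture are hypothetical helpers made up for this illustration. The point is only that each pipe gets its own reader thread so that neither stream can fill up and stall the child.

```python
import subprocess
import sys
import threading


def drain(pipe, sink):
    """Read lines from a binary pipe until EOF, appending decoded text to sink."""
    for line in iter(pipe.readline, b''):
        sink.append(line.decode('utf-8').rstrip('\r\n'))
    pipe.close()


def run_and_capture(command):
    """Run command and return (stdout_lines, stderr_lines, return_code)."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, bufsize=0)
    out_lines, err_lines = [], []
    # One reader thread per pipe, so both streams are consumed concurrently
    threads = [
        threading.Thread(target=drain, args=(proc.stdout, out_lines)),
        threading.Thread(target=drain, args=(proc.stderr, err_lines)),
    ]
    for t in threads:
        t.daemon = True
        t.start()
    code = proc.wait()
    for t in threads:
        t.join()
    return out_lines, err_lines, code


# Stand-in for the JVM: a child Python process that writes to both streams
child = [sys.executable, '-c',
         'import sys; print("to stdout"); sys.stderr.write("to stderr\\n")']
out_lines, err_lines, code = run_and_capture(child)
```

With the proposed change, the same drain function could presumably be pointed at gateway.proc.stdout and gateway.proc.stderr instead of a stand-in child.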

To get access to the stdout and stderr pipes, the "proc" instance created in launch_gateway also needs to be exposed to the application. I'm thinking that stashing it on the JavaGateway instance that the function already returns is the cleanest option from the client's perspective, but it means hanging an extra attribute off the py4j.JavaGateway object.
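
To make the shape of the proposal concrete, here is a rough sketch under stated assumptions. It is not the actual launch_gateway code: FakeGateway is a hypothetical stand-in for py4j.JavaGateway, launch_child is a made-up name, and the default Popen options shown are illustrative only. It shows caller-supplied popen_kwargs being merged over the defaults and the resulting process being stashed on the returned object.

```python
import subprocess
import sys


class FakeGateway(object):
    """Hypothetical stand-in for py4j.JavaGateway; not the real class."""
    proc = None


def launch_child(command, popen_kwargs=None):
    """Start command, letting the caller override the Popen options."""
    kwargs = {'stdin': subprocess.PIPE}  # illustrative default, an assumption
    kwargs.update(popen_kwargs or {})    # caller-supplied options win
    proc = subprocess.Popen(command, **kwargs)
    gateway = FakeGateway()
    gateway.proc = proc                  # expose the process to the caller
    return gateway


# Usage: request pipes, then read the child's output via the stashed proc
gateway = launch_child(
    [sys.executable, '-c', 'print("hello from child")'],
    popen_kwargs={'stdout': subprocess.PIPE, 'stderr': subprocess.PIPE})
stdout_data, stderr_data = gateway.proc.communicate()
```

When popen_kwargs is omitted, the child is started with the defaults only, so existing callers would see no change in behavior.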

I can submit a PR with this addition for further discussion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
