Posted to issues@spark.apache.org by "Matthew Farrellee (JIRA)" <ji...@apache.org> on 2014/06/28 14:26:24 UTC

[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN

    [ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046843#comment-14046843 ] 

Matthew Farrellee commented on SPARK-2313:
------------------------------------------

components involved -
 0. pyspark - python program that initiates a py4j setup when constructing the SparkContext (calls launch_gateway from java_gateway.py)
 1. launch_gateway - invokes "o.a.s.d.SparkSubmit pyspark-shell" via spark-class via spark-submit, which invokes py4j.GatewayServer
 2. py4j.GatewayServer - py4j specific code that listens on a port and prints it to stdout (see GatewayServer.java#L610)
 3. launch_gateway - reads the port from the child's stdout (a pipe held by the parent) and constructs the client side of the py4j channel
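
a minimal sketch of the handshake in steps 0-3, not the actual pyspark or py4j code. the child here is a hypothetical stand-in for py4j.GatewayServer, and CHILD_SOURCE and launch_and_connect are names made up for illustration. it only shows the shape of the exchange: the child binds an ephemeral port and prints it, the parent reads that line from the pipe and connects.

```python
import socket
import subprocess
import sys
import textwrap

# Hypothetical stand-in for the py4j GatewayServer: bind an ephemeral port,
# print it on stdout, then serve exactly one connection.
CHILD_SOURCE = textwrap.dedent("""
    import socket, sys
    server = socket.socket()
    server.bind(("127.0.0.1", 0))   # port 0 asks the OS for any free port
    server.listen(1)
    print(server.getsockname()[1])  # the port is the only line on stdout
    sys.stdout.flush()
    conn, _ = server.accept()
    conn.sendall(b"hello")
    conn.close()
    server.close()
""")


def launch_and_connect():
    """Start the child, read its port from the stdout pipe, and connect."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE],
                            stdout=subprocess.PIPE)
    try:
        # The parent assumes the first line on the pipe is the port number.
        port = int(proc.stdout.readline())
        with socket.create_connection(("127.0.0.1", port)) as client:
            with client.makefile("rb") as stream:
                greeting = stream.read()  # read until the child closes
    finally:
        proc.stdout.close()
        proc.wait()
    return port, greeting
```

calling launch_and_connect() returns the port the child chose and the bytes it sent, so the parent never has to guess a port up front.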

comments -
 a. by allowing the child to pick an ephemeral port there's a guarantee of success (except for the case of no available ports)
 b. having the parent pick a port and pass it to the child introduces a risk that the port will no longer be available by the time the child tries to use it. thus, passing the port is not strictly simpler if we want to keep the guarantees that currently exist.
 c. printing the port to stdout from the child (py4j gatewayserver) is the intended method for discovery, see https://github.com/bartdag/py4j/blob/master/py4j-java/src/py4j/GatewayServer.java#L610
 d. any data on stdout from spark-submit, spark-class or o.a.s.d.SparkSubmit can interfere with the py4j setup
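
to make (a) and (b) concrete, here is a small sketch of the race, using plain sockets rather than anything from spark or py4j. pick_free_port, child_can_bind and demonstrate_race are hypothetical helpers, and the "interloper" socket stands in for any unrelated process that grabs the port between the parent's choice and the child's bind.

```python
import socket


def pick_free_port():
    """Parent-side choice: bind to port 0, note the port, then release it."""
    with socket.socket() as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def child_can_bind(port):
    """Return True if a (simulated) child can still bind the given port."""
    with socket.socket() as child:
        try:
            child.bind(("127.0.0.1", port))
        except OSError:
            return False
        return True


def demonstrate_race():
    """Simulate another process taking the port before the child binds it."""
    port = pick_free_port()
    # Window between the parent's choice and the child's bind.
    with socket.socket() as interloper:
        interloper.bind(("127.0.0.1", port))
        interloper.listen(1)
        return child_can_bind(port)
```

in this simulation demonstrate_race() reports that the child cannot bind, which is the failure mode (b) describes. when the child binds port 0 itself, as in (a), there is no such window.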

because of (d), i consider this fragile: well-meaning but unrelated changes are likely to break it.
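
a rough illustration of (d), assuming the parent simply parses the first line of the child's stdout as an integer. the child programs and the read_port helper are hypothetical, and 25333 is just a placeholder value, not a port the real launcher is known to print.

```python
import subprocess
import sys

# Hypothetical children: one prints only the port, the other prints a
# banner first, as an unrelated change to a launcher script might.
CLEAN_CHILD = "print(25333)"
NOISY_CHILD = "print('some startup banner'); print(25333)"


def read_port(child_source):
    """Run the child and parse the first line of its stdout as a port."""
    result = subprocess.run([sys.executable, "-c", child_source],
                            stdout=subprocess.PIPE, check=True)
    first_line = result.stdout.decode().splitlines()[0]
    return int(first_line)  # raises ValueError if the line is not a number
```

read_port(CLEAN_CHILD) parses the port, while read_port(NOISY_CHILD) raises ValueError, because the banner line arrives where the port was expected.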

i'll take a look at this

> PySpark should accept port via a command line argument rather than STDIN
> ------------------------------------------------------------------------
>
>                 Key: SPARK-2313
>                 URL: https://issues.apache.org/jira/browse/SPARK-2313
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Patrick Wendell
>
> Relying on stdin is a brittle mechanism and has broken several times in the past. From what I can tell this is used only to bootstrap worker.py one time. It would be strictly simpler to just pass it as a command line argument.


