You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/02/14 05:57:38 UTC

[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/4603

    [SPARK-2313] Use socket to communicate GatewayServer port back to Python driver

    This patch changes PySpark so that the GatewayServer's port is communicated back to the Python process that launches it over a local socket instead of a pipe.  The old pipe-based approach was brittle and could fail if `spark-submit` printed unexpected to stdout.
    
    To accomplish this, I wrote a custom `PythonGatewayServer.main()` function to use in place of Py4J's `GatewayServer.main()`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-2313

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4603
    
----
commit 8bf956ea3ac7d9af481acea3a14b9e48dc0ba2fa
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-02-14T01:01:09Z

    Initial cut at passing Py4J gateway port back to driver via socket

commit 2f70689aebed4dcee67d2dbc9ee42255f6324b5f
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-02-14T04:51:01Z

    Use stdin PIPE to share fate with driver

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74361979
  
      [Test build #27475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27475/consoleFull) for   PR 4603 at commit [`2f70689`](https://github.com/apache/spark/commit/2f70689aebed4dcee67d2dbc9ee42255f6324b5f).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24715643
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -41,36 +43,33 @@ def launch_gateway():
             submit_args = submit_args if submit_args is not None else ""
             submit_args = shlex.split(submit_args)
             command = [os.path.join(SPARK_HOME, script)] + submit_args + ["pyspark-shell"]
    +
    +        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
    +        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    +        callback_socket.bind(('127.0.0.1', 0))
    +        callback_socket.listen(1)
    +        callback_host, callback_port = callback_socket.getsockname()
    +        env = dict(os.environ)
    +        env['PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
    +        env['PYSPARK_DRIVER_CALLBACK_PORT'] = str(callback_port)
    +
    +        # Launch the Java gateway.
    +        # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
             if not on_windows:
                 # Don't send ctrl-c / SIGINT to the Java gateway:
                 def preexec_func():
                     signal.signal(signal.SIGINT, signal.SIG_IGN)
    -            env = dict(os.environ)
                 env["IS_SUBPROCESS"] = "1"  # tell JVM to exit after python exits
                 proc = Popen(command, stdout=PIPE, stdin=PIPE, preexec_fn=preexec_func, env=env)
             else:
                 # preexec_fn not supported on Windows
    -            proc = Popen(command, stdout=PIPE, stdin=PIPE)
    +            proc = Popen(command, stdout=PIPE, stdin=PIPE, env=env)
    --- End diff --
    
    Since we will not read any from stdout of JVM, so `stdout=PIPE` is not necessary now. Which will give us a chance to have colorful progress bar in console.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24715623
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala ---
    @@ -0,0 +1,59 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.api.python
    +
    +import java.io.DataOutputStream
    +import java.net.Socket
    +
    +import py4j.GatewayServer
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Process that starts a Py4J GatewayServer on an ephemeral port and communicates the bound port
    + * back to its caller via a callback port specified by the caller.
    + *
    + * This process is launched (via SparkSubmit) by the PySpark driver (see java_gateway.py).
    + */
    +private[spark] object PythonGatewayServer extends Logging {
    +  def main(args: Array[String]): Unit = Utils.tryOrExit {
    +    // Start a GatewayServer on an ephemeral port
    +    val gatewayServer: GatewayServer = new GatewayServer(null, 0)
    +    gatewayServer.start()
    +    val boundPort: Int = gatewayServer.getListeningPort
    --- End diff --
    
    boundPort could be -1, if failed to bind. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74383158
  
    @JoshRosen This looks good to me, just two minor comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74364177
  
      [Test build #27477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27477/consoleFull) for   PR 4603 at commit [`e5f9730`](https://github.com/apache/spark/commit/e5f973076038173b301cecadc11893305eabf700).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74361899
  
    @brkyvz, can you confirm whether this fixes the packages-related issue that you saw ([SPARK-5810](https://issues.apache.org/jira/browse/SPARK-5810))?
    
    @davies, this is based on the approach described in [your comment](https://github.com/apache/spark/pull/3424#issuecomment-64270813) on @lvsoft's PR  for this same issue (#3424).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74364179
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27477/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24717434
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -41,36 +43,33 @@ def launch_gateway():
             submit_args = submit_args if submit_args is not None else ""
             submit_args = shlex.split(submit_args)
             command = [os.path.join(SPARK_HOME, script)] + submit_args + ["pyspark-shell"]
    +
    +        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
    +        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    +        callback_socket.bind(('127.0.0.1', 0))
    +        callback_socket.listen(1)
    +        callback_host, callback_port = callback_socket.getsockname()
    +        env = dict(os.environ)
    +        env['PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
    +        env['PYSPARK_DRIVER_CALLBACK_PORT'] = str(callback_port)
    +
    +        # Launch the Java gateway.
    +        # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
             if not on_windows:
                 # Don't send ctrl-c / SIGINT to the Java gateway:
                 def preexec_func():
                     signal.signal(signal.SIGINT, signal.SIG_IGN)
    -            env = dict(os.environ)
                 env["IS_SUBPROCESS"] = "1"  # tell JVM to exit after python exits
                 proc = Popen(command, stdout=PIPE, stdin=PIPE, preexec_fn=preexec_func, env=env)
             else:
                 # preexec_fn not supported on Windows
    -            proc = Popen(command, stdout=PIPE, stdin=PIPE)
    +            proc = Popen(command, stdout=PIPE, stdin=PIPE, env=env)
     
    -        try:
    -            # Determine which ephemeral port the server started on:
    -            gateway_port = proc.stdout.readline()
    -            gateway_port = int(gateway_port)
    -        except ValueError:
    -            # Grab the remaining lines of stdout
    -            (stdout, _) = proc.communicate()
    -            exit_code = proc.poll()
    -            error_msg = "Launching GatewayServer failed"
    -            error_msg += " with exit code %d!\n" % exit_code if exit_code else "!\n"
    -            error_msg += "Warning: Expected GatewayServer to output a port, but found "
    -            if gateway_port == "" and stdout == "":
    -                error_msg += "no output.\n"
    -            else:
    -                error_msg += "the following:\n\n"
    -                error_msg += "--------------------------------------------------------------\n"
    -                error_msg += gateway_port + stdout
    -                error_msg += "--------------------------------------------------------------\n"
    -            raise Exception(error_msg)
    +        gateway_connection = callback_socket.accept()[0]
    --- End diff --
    
    This could be prone to hanging indefinitely if the JavaGateway process fails to start.  I think I'd better add a timeout + loop here with a check for whether the subprocess has died.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74363881
  
      [Test build #27475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27475/consoleFull) for   PR 4603 at commit [`2f70689`](https://github.com/apache/spark/commit/2f70689aebed4dcee67d2dbc9ee42255f6324b5f).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24712728
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -41,36 +43,33 @@ def launch_gateway():
             submit_args = submit_args if submit_args is not None else ""
             submit_args = shlex.split(submit_args)
             command = [os.path.join(SPARK_HOME, script)] + submit_args + ["pyspark-shell"]
    +
    +        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
    +        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    +        callback_socket.bind(('127.0.0.1', 0))
    +        callback_socket.listen(1)
    +        callback_host, callback_port = callback_socket.getsockname()
    +        env = dict(os.environ)
    +        env['PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
    --- End diff --
    
    I don't mind, but can we add a prefix: `_PYSPARK_DRIVER_CALLBACK_HOST`? Otherwise people sometimes try to set these externally (it sounds crazy, but I've seen it happen with other internal vars).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74585897
  
    @davies, want to give this a final look over?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74364396
  
      [Test build #27478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27478/consoleFull) for   PR 4603 at commit [`d12c95d`](https://github.com/apache/spark/commit/d12c95d4faf8ec9f946671e540c909bb916676e6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4603


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74446042
  
      [Test build #27533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27533/consoleFull) for   PR 4603 at commit [`6a7740b`](https://github.com/apache/spark/commit/6a7740b23ecaa20dcb7e59f9468bfb24a6a224ed).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74588555
  
    @JoshRosen LGTM, ship it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74388179
  
    This fixes the problem! looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74364400
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27478/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24712323
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala ---
    @@ -0,0 +1,54 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.api.python
    +
    +import java.io.DataOutputStream
    +import java.net.Socket
    +
    +import py4j.GatewayServer
    +
    +/**
    + * Process that starts a Py4J GatewayServer on an ephemeral port and communicates the bound port
    + * back to its caller via a callback port specified by the caller.
    + *
    + * This process is launched (via SparkSubmit) by the PySpark driver (see java_gateway.py).
    + */
    +private[spark] object PythonGatewayServer {
    --- End diff --
    
    When reviewing this, compare against Py4J's GatewayServer.main() method: https://github.com/bartdag/py4j/blob/master/py4j-java/src/py4j/GatewayServer.java#L596


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74363883
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27475/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74449870
  
      [Test build #27533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27533/consoleFull) for   PR 4603 at commit [`6a7740b`](https://github.com/apache/spark/commit/6a7740b23ecaa20dcb7e59f9468bfb24a6a224ed).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24712357
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala ---
    @@ -0,0 +1,62 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.api.python
    +
    +import java.io.DataOutputStream
    +import java.net.Socket
    +
    +import py4j.GatewayServer
    +
    +/**
    + * Process that starts a Py4J GatewayServer on an ephemeral port and communicates the bound port
    + * back to its caller via a callback port specified by the caller.
    + *
    + * This process is launched (via SparkSubmit) by the PySpark driver (see java_gateway.py).
    + */
    +private[spark] object PythonGatewayServer {
    +  def main(args: Array[String]): Unit = {
    +    try {
    +      // Start a GatewayServer on an ephemeral port
    +      val gatewayServer: GatewayServer = new GatewayServer(null, 0)
    +      gatewayServer.start()
    +      val boundPort: Int = gatewayServer.getListeningPort
    +
    +      // Communicate the bound port back to the caller via the caller-specified callback port
    +      val callbackHost = sys.env("PYSPARK_DRIVER_CALLBACK_HOST")
    +      val callbackPort = sys.env("PYSPARK_DRIVER_CALLBACK_PORT").toInt
    +      val callbackSocket = new Socket(callbackHost, callbackPort)
    +      val dos = new DataOutputStream(callbackSocket.getOutputStream)
    +      dos.writeInt(boundPort)
    +      dos.close()
    +      callbackSocket.close()
    +
    +      // Exit on EOF or broken pipe to ensure that this process dies when the Python driver dies:
    +      while (System.in.read() != -1) {
    +        // Do nothing
    +      }
    +      System.exit(0)
    +    } catch {
    +      case e: Exception =>
    +        try {
    +          throw e
    --- End diff --
    
    Initially, I wasn't sure whether we wanted to add a dependency on Spark's `Logging` trait here, since I was worried that having logging initialization performed here might cause problems, but on reflection I don't think that's an issue, so I'll re-work this to do proper logging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74449873
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27533/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74445894
  
    I've updated this based on the review feedback; see the individual commits for a summary of the changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24728054
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonGatewayServer.scala ---
    @@ -0,0 +1,59 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.api.python
    +
    +import java.io.DataOutputStream
    +import java.net.Socket
    +
    +import py4j.GatewayServer
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Process that starts a Py4J GatewayServer on an ephemeral port and communicates the bound port
    + * back to its caller via a callback port specified by the caller.
    + *
    + * This process is launched (via SparkSubmit) by the PySpark driver (see java_gateway.py).
    + */
    +private[spark] object PythonGatewayServer extends Logging {
    +  def main(args: Array[String]): Unit = Utils.tryOrExit {
    +    // Start a GatewayServer on an ephemeral port
    +    val gatewayServer: GatewayServer = new GatewayServer(null, 0)
    +    gatewayServer.start()
    +    val boundPort: Int = gatewayServer.getListeningPort
    --- End diff --
    
    I've added error-checking for this case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74362558
  
      [Test build #27478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27478/consoleFull) for   PR 4603 at commit [`d12c95d`](https://github.com/apache/spark/commit/d12c95d4faf8ec9f946671e540c909bb916676e6).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4603#discussion_r24712317
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -41,36 +43,33 @@ def launch_gateway():
             submit_args = submit_args if submit_args is not None else ""
             submit_args = shlex.split(submit_args)
             command = [os.path.join(SPARK_HOME, script)] + submit_args + ["pyspark-shell"]
    +
    +        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
    +        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    +        callback_socket.bind(('127.0.0.1', 0))
    +        callback_socket.listen(1)
    +        callback_host, callback_port = callback_socket.getsockname()
    +        env = dict(os.environ)
    +        env['PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
    --- End diff --
    
    Admittedly, the introduction of new internal environment variables isn't great, but this let me avoid some pain in figuring out how to pipe a hostname / port pair from the Python driver through `spark-submit` all the way to the `PythonGatewayServer`.  If there's a strong feeling that we shouldn't do this, I don't mind figuring out how to implement SparkSubmit-argument-based plumbing instead, but I decided to not do that for the initial cut at fixing this issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74362277
  
      [Test build #27477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27477/consoleFull) for   PR 4603 at commit [`e5f9730`](https://github.com/apache/spark/commit/e5f973076038173b301cecadc11893305eabf700).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2313] Use socket to communicate Gateway...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4603#issuecomment-74591167
  
    I've merged this to `master` (1.4.0) and `branch-1.3` (1.3.0).  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org