Posted to reviews@spark.apache.org by mateiz <gi...@git.apache.org> on 2014/05/06 10:52:07 UTC

[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/664

    [SPARK-1549] Add Python support to spark-submit

    This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that is all that was supported before) and updates the PySpark code to properly locate the various paths it needs. One significant change is that we assume the Python files can always be found either in the Spark assembly JAR (which is the case with the Maven assembly build in make-distribution.sh) or under SPARK_HOME (which exists in local mode even with an sbt assembly build, and should be enough for testing). This means we no longer need the odd hack that modified the environment for YARN.
    
    This patch also updates the Python worker manager to run python with -u, meaning unbuffered output: output reaches our logs right away instead of some time after it was written, which should simplify debugging.
    
    In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709 by setting the main class from a JAR's Main-Class manifest attribute when the user does not specify one, and cleans up a few help strings and style issues in spark-submit.
    
    In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
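
    For illustration, a Python submission with this patch would look something like
    the following (the application and dependency file names here are hypothetical):

        $ ./bin/spark-submit --master local[4] \
            --py-files extra_lib.py \
            my_script.py arg1 arg2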

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark py-submit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/664.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #664
    
----
commit d4375bddd41542bdf94bd08a40b073f7046720e3
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-05-04T22:49:12Z

    Clean up description of spark-submit args a bit and add Python ones

commit 47c0655da0ef53fc37b2c41a4cfa0030fc85d84b
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-05-05T07:52:22Z

    More work to make spark-submit work with Python:
    
    - Launch Py4J gateway server in-process and execute Python main class
    - Redirect its output to PythonRunner
    - Various misc fixes to messages and error reporting in SparkSubmit

commit 15f8e1ef7eead23922aaec5e63e439c054355911
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-05-05T23:59:47Z

    Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside

commit 4650412c7794a0002a80b82831e29c8b095c206f
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-05-06T08:43:02Z

    Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests

----



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42366510
  
    Great, thanks. Going to merge this then.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42284622
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42344663
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12342018
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala ---
    @@ -0,0 +1,44 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.api.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.SparkContext
    +
    +private[spark] object PythonUtils {
    +  private val pathSeparator = System.getProperty("path.separator")
    --- End diff --
    
    Fixed



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12352548
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -37,6 +37,9 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
       val daemonHost = InetAddress.getByAddress(Array(127, 0, 0, 1))
       var daemonPort: Int = 0
     
    +  val pythonPath = PythonUtils.mergePythonPaths(
    +    PythonUtils.sparkPythonPath, envVars.getOrElse("PYTHONPATH", ""))
    --- End diff --
    
    Do we want to include `sys.env.getOrElse("PYTHONPATH", "")` here as well? Or are we ignoring all PYTHONPATH environment variables altogether?
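
    For concreteness, a sketch of the suggestion (not the code in the PR), which
    would merge the worker's own PYTHONPATH as well:

        val pythonPath = PythonUtils.mergePythonPaths(
          PythonUtils.sparkPythonPath,
          envVars.getOrElse("PYTHONPATH", ""),   // PYTHONPATH handed to us by the driver
          sys.env.getOrElse("PYTHONPATH", ""))   // PYTHONPATH of the worker's own environment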



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42334020
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12337792
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
    @@ -260,9 +296,10 @@ private[spark] class SparkSubmitArguments(args: Seq[String]) {
                   val errMessage = s"Unrecognized option '$value'."
                   SparkSubmit.printErrorAndExit(errMessage)
                 case v =>
    -             primaryResource = v
    -             inSparkOpts = false
    -             parse(tail)
    +              primaryResource = v
    +              inSparkOpts = false
    +              isPython = v.endsWith(".py")
    --- End diff --
    
    Is it very uncommon to run python files with a `#!/usr/bin/env python` header?



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12337504
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala ---
    @@ -0,0 +1,44 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.api.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.SparkContext
    +
    +private[spark] object PythonUtils {
    +  private val pathSeparator = System.getProperty("path.separator")
    --- End diff --
    
    File.pathSeparator?
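
    That is, since java.io.File is already imported in this file, the line could be
    simply:

        private val pathSeparator = File.pathSeparator  // same value as System.getProperty("path.separator")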



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42351665
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42360644
  
    I tested running a Python file through spark-submit on a YARN cluster and verified that it works.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42279724
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12315419
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
    @@ -130,50 +145,51 @@ object SparkSubmit {
           childArgs += ("--class", appArgs.mainClass)
         }
     
    +    // Make sure YARN is included in our build if we're trying to use it
         if (clusterManager == YARN) {
    -      // The choice of class is arbitrary, could use any spark-yarn class
           if (!Utils.classIsLoadable("org.apache.spark.deploy.yarn.Client") && !Utils.isTesting) {
    -        val msg = "Could not load YARN classes. This copy of Spark may not have been compiled " +
    -          "with YARN support."
    -        throw new Exception(msg)
    +        printErrorAndExit("Could not load YARN classes. " +
    +          "This copy of Spark may not have been compiled with YARN support.")
           }
         }
     
         // Special flag to avoid deprecation warnings at the client
         sysProps("SPARK_SUBMIT") = "true"
     
    +    // A list of rules to map each argument to system properties or command-line options in
    +    // each deploy mode; we iterate through these below
         val options = List[OptionAssigner](
    -      new OptionAssigner(appArgs.master, ALL_CLUSTER_MGRS, false, sysProp = "spark.master"),
    -      new OptionAssigner(appArgs.driverExtraClassPath, STANDALONE | YARN, true,
    +      OptionAssigner(appArgs.master, ALL_CLUSTER_MGRS, false, sysProp = "spark.master"),
    +      OptionAssigner(appArgs.driverExtraClassPath, STANDALONE | YARN, true,
             sysProp = "spark.driver.extraClassPath"),
    -      new OptionAssigner(appArgs.driverExtraJavaOptions, STANDALONE | YARN, true,
    +      OptionAssigner(appArgs.driverExtraJavaOptions, STANDALONE | YARN, true,
             sysProp = "spark.driver.extraJavaOptions"),
    -      new OptionAssigner(appArgs.driverExtraLibraryPath, STANDALONE | YARN, true,
    +      OptionAssigner(appArgs.driverExtraLibraryPath, STANDALONE | YARN, true,
             sysProp = "spark.driver.extraLibraryPath"),
    -      new OptionAssigner(appArgs.driverMemory, YARN, true, clOption = "--driver-memory"),
    -      new OptionAssigner(appArgs.name, YARN, true, clOption = "--name"),
    -      new OptionAssigner(appArgs.queue, YARN, true, clOption = "--queue"),
    -      new OptionAssigner(appArgs.queue, YARN, false, sysProp = "spark.yarn.queue"),
    -      new OptionAssigner(appArgs.numExecutors, YARN, true, clOption = "--num-executors"),
    -      new OptionAssigner(appArgs.numExecutors, YARN, false, sysProp = "spark.executor.instances"),
    -      new OptionAssigner(appArgs.executorMemory, YARN, true, clOption = "--executor-memory"),
    -      new OptionAssigner(appArgs.executorMemory, STANDALONE | MESOS | YARN, false,
    +      OptionAssigner(appArgs.driverMemory, YARN, true, clOption = "--driver-memory"),
    +      OptionAssigner(appArgs.name, YARN, true, clOption = "--name"),
    +      OptionAssigner(appArgs.queue, YARN, true, clOption = "--queue"),
    +      OptionAssigner(appArgs.queue, YARN, false, sysProp = "spark.yarn.queue"),
    +      OptionAssigner(appArgs.numExecutors, YARN, true, clOption = "--num-executors"),
    +      OptionAssigner(appArgs.numExecutors, YARN, false, sysProp = "spark.executor.instances"),
    +      OptionAssigner(appArgs.executorMemory, YARN, true, clOption = "--executor-memory"),
    +      OptionAssigner(appArgs.executorMemory, STANDALONE | MESOS | YARN, false,
             sysProp = "spark.executor.memory"),
    -      new OptionAssigner(appArgs.driverMemory, STANDALONE, true, clOption = "--memory"),
    -      new OptionAssigner(appArgs.driverCores, STANDALONE, true, clOption = "--cores"),
    -      new OptionAssigner(appArgs.executorCores, YARN, true, clOption = "--executor-cores"),
    -      new OptionAssigner(appArgs.executorCores, YARN, false, sysProp = "spark.executor.cores"),
    -      new OptionAssigner(appArgs.totalExecutorCores, STANDALONE | MESOS, false,
    +      OptionAssigner(appArgs.driverMemory, STANDALONE, true, clOption = "--memory"),
    +      OptionAssigner(appArgs.driverCores, STANDALONE, true, clOption = "--cores"),
    +      OptionAssigner(appArgs.executorCores, YARN, true, clOption = "--executor-cores"),
    +      OptionAssigner(appArgs.executorCores, YARN, false, sysProp = "spark.executor.cores"),
    +      OptionAssigner(appArgs.totalExecutorCores, STANDALONE | MESOS, false,
             sysProp = "spark.cores.max"),
    -      new OptionAssigner(appArgs.files, YARN, false, sysProp = "spark.yarn.dist.files"),
    -      new OptionAssigner(appArgs.files, YARN, true, clOption = "--files"),
    -      new OptionAssigner(appArgs.archives, YARN, false, sysProp = "spark.yarn.dist.archives"),
    -      new OptionAssigner(appArgs.archives, YARN, true, clOption = "--archives"),
    -      new OptionAssigner(appArgs.jars, YARN, true, clOption = "--addJars"),
    -      new OptionAssigner(appArgs.files, LOCAL | STANDALONE | MESOS, true, sysProp = "spark.files"),
    -      new OptionAssigner(appArgs.jars, LOCAL | STANDALONE | MESOS, false, sysProp = "spark.jars"),
    -      new OptionAssigner(appArgs.name, LOCAL | STANDALONE | MESOS, false,
    -        sysProp = "spark.app.name")
    +      OptionAssigner(appArgs.files, YARN, false, sysProp = "spark.yarn.dist.files"),
    +      OptionAssigner(appArgs.files, YARN, true, clOption = "--files"),
    +      OptionAssigner(appArgs.archives, YARN, false, sysProp = "spark.yarn.dist.archives"),
    +      OptionAssigner(appArgs.archives, YARN, true, clOption = "--archives"),
    +      OptionAssigner(appArgs.jars, YARN, true, clOption = "--addJars"),
    +      OptionAssigner(appArgs.files, LOCAL | STANDALONE | MESOS, false, sysProp = "spark.files"),
    --- End diff --
    
    I added this line to set spark.files even when running in client mode -- not sure why it wasn't set before, but it's the correct thing to do.
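
    For context, files listed in `spark.files` are added through the SparkContext
    when it starts, so in client mode the effect is roughly that of the following
    sketch (not the actual SparkContext code):

        sc.getConf.get("spark.files", "").split(",").filter(_.nonEmpty).foreach(sc.addFile)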



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42344760
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14725/



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42279714
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42280473
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42279779
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12341753
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
    @@ -0,0 +1,84 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.deploy
    +
    +import java.io.{IOException, File, InputStream, OutputStream}
    +
    +import scala.collection.mutable.ArrayBuffer
    +import scala.collection.JavaConversions._
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.api.python.PythonUtils
    +
    +/**
    + * A main class used by spark-submit to launch Python applications. It executes python as a
    + * subprocess and then has it connect back to the JVM to access system properties, etc.
    + */
    +object PythonRunner {
    +  def main(args: Array[String]) {
    +    val primaryResource = args(0)
    +    val pyFiles = args(1)
    +    val otherArgs = args.slice(2, args.length)
    +
    +    val pythonExec = sys.env.get("PYSPARK_PYTHON").getOrElse("python") // TODO: get this from conf
    +
    +    // Launch a Py4J gateway server for the process to connect to; this will let it see our
    +    // Java system properties and such
    +    val gatewayServer = new py4j.GatewayServer(null, 0)
    +    gatewayServer.start()
    +
    +    // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the
    +    // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
    +    val pathElements = new ArrayBuffer[String]
    +    pathElements ++= pyFiles.split(",")
    +    pathElements += PythonUtils.sparkPythonPath
    +    pathElements += sys.env.getOrElse("PYTHONPATH", "")
    +    val pythonPath = PythonUtils.mergePythonPaths(pathElements: _*)
    +
    +    // Launch Python process
    +    val builder = new ProcessBuilder(Seq(pythonExec, "-u", primaryResource) ++ otherArgs)
    +    val env = builder.environment()
    +    env.put("PYTHONPATH", pythonPath)
    +    env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
    +    builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
    +    val process = builder.start()
    +
    +    new RedirectThread(process.getInputStream, System.out, "redirect output").start()
    +
    +    System.exit(process.waitFor())
    +  }
    +
    +  /**
    +   * A utility class to redirect the child process's stdout or stderr
    +   */
    +  class RedirectThread(in: InputStream, out: OutputStream, name: String) extends Thread(name) {
    +    setDaemon(true)
    +    override def run() {
    +      scala.util.control.Exception.ignoring(classOf[IOException]) {
    +        // FIXME: We copy the stream on the level of bytes to avoid encoding problems.
    +        val buf = new Array[Byte](1024)
    --- End diff --
    
    That might be good, but I'm not sure it flushes after each write, which we need so that the user sees this output quickly.
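
    For reference, a sketch of the hand-rolled loop with an explicit flush (the
    actual buffer-copying code is elided from the diff above):

        val buf = new Array[Byte](1024)
        var len = in.read(buf)
        while (len != -1) {
          out.write(buf, 0, len)
          out.flush()             // flush each chunk so the user sees output promptly
          len = in.read(buf)
        }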



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42344648
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/664



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42280464
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42344755
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12342052
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
    @@ -0,0 +1,84 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.deploy
    +
    +import java.io.{IOException, File, InputStream, OutputStream}
    +
    +import scala.collection.mutable.ArrayBuffer
    +import scala.collection.JavaConversions._
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.api.python.PythonUtils
    +
    +/**
    + * A main class used by spark-submit to launch Python applications. It executes python as a
    + * subprocess and then has it connect back to the JVM to access system properties, etc.
    + */
    +object PythonRunner {
    +  def main(args: Array[String]) {
    +    val primaryResource = args(0)
    +    val pyFiles = args(1)
    +    val otherArgs = args.slice(2, args.length)
    +
    +    val pythonExec = sys.env.get("PYSPARK_PYTHON").getOrElse("python") // TODO: get this from conf
    +
    +    // Launch a Py4J gateway server for the process to connect to; this will let it see our
    +    // Java system properties and such
    +    val gatewayServer = new py4j.GatewayServer(null, 0)
    +    gatewayServer.start()
    +
    +    // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the
    +    // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
    +    val pathElements = new ArrayBuffer[String]
    +    pathElements ++= pyFiles.split(",")
    +    pathElements += PythonUtils.sparkPythonPath
    +    pathElements += sys.env.getOrElse("PYTHONPATH", "")
    +    val pythonPath = PythonUtils.mergePythonPaths(pathElements: _*)
    +
    +    // Launch Python process
    +    val builder = new ProcessBuilder(Seq(pythonExec, "-u", primaryResource) ++ otherArgs)
    +    val env = builder.environment()
    +    env.put("PYTHONPATH", pythonPath)
    +    env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
    +    builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
    +    val process = builder.start()
    +
    +    new RedirectThread(process.getInputStream, System.out, "redirect output").start()
    +
    +    System.exit(process.waitFor())
    +  }
    +
    +  /**
    +   * A utility class to redirect the child process's stdout or stderr
    +   */
    +  class RedirectThread(in: InputStream, out: OutputStream, name: String) extends Thread(name) {
    +    setDaemon(true)
    +    override def run() {
    +      scala.util.control.Exception.ignoring(classOf[IOException]) {
    +        // FIXME: We copy the stream on the level of bytes to avoid encoding problems.
    +        val buf = new Array[Byte](1024)
    --- End diff --
    
    Ah, good point.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42284626
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14712/



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12337782
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
    @@ -0,0 +1,84 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.deploy
    +
    +import java.io.{IOException, File, InputStream, OutputStream}
    +
    +import scala.collection.mutable.ArrayBuffer
    +import scala.collection.JavaConversions._
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.api.python.PythonUtils
    +
    +/**
    + * A main class used by spark-submit to launch Python applications. It executes python as a
    + * subprocess and then has it connect back to the JVM to access system properties, etc.
    + */
    +object PythonRunner {
    +  def main(args: Array[String]) {
    +    val primaryResource = args(0)
    +    val pyFiles = args(1)
    +    val otherArgs = args.slice(2, args.length)
    +
    +    val pythonExec = sys.env.get("PYSPARK_PYTHON").getOrElse("python") // TODO: get this from conf
    +
    +    // Launch a Py4J gateway server for the process to connect to; this will let it see our
    +    // Java system properties and such
    +    val gatewayServer = new py4j.GatewayServer(null, 0)
    +    gatewayServer.start()
    +
    +    // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the
    +    // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
    +    val pathElements = new ArrayBuffer[String]
    +    pathElements ++= pyFiles.split(",")
    +    pathElements += PythonUtils.sparkPythonPath
    +    pathElements += sys.env.getOrElse("PYTHONPATH", "")
    +    val pythonPath = PythonUtils.mergePythonPaths(pathElements: _*)
    +
    +    // Launch Python process
    +    val builder = new ProcessBuilder(Seq(pythonExec, "-u", primaryResource) ++ otherArgs)
    +    val env = builder.environment()
    +    env.put("PYTHONPATH", pythonPath)
    +    env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
    +    builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
    +    val process = builder.start()
    +
    +    new RedirectThread(process.getInputStream, System.out, "redirect output").start()
    +
    +    System.exit(process.waitFor())
    +  }
    +
    +  /**
    +   * A utility class to redirect the child process's stdout or stderr
    +   */
    +  class RedirectThread(in: InputStream, out: OutputStream, name: String) extends Thread(name) {
    +    setDaemon(true)
    +    override def run() {
    +      scala.util.control.Exception.ignoring(classOf[IOException]) {
    +        // FIXME: We copy the stream on the level of bytes to avoid encoding problems.
    --- End diff --
    
    I'm not sure what needs to be fixed here; this looks like the right approach.



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12340115
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
    @@ -260,9 +296,10 @@ private[spark] class SparkSubmitArguments(args: Seq[String]) {
                   val errMessage = s"Unrecognized option '$value'."
                   SparkSubmit.printErrorAndExit(errMessage)
                 case v =>
    -             primaryResource = v
    -             inSparkOpts = false
    -             parse(tail)
    +              primaryResource = v
    +              inSparkOpts = false
    +              isPython = v.endsWith(".py")
    --- End diff --
    
    I'd rather disallow that for now, because as far as I've seen, they still tend to have the .py extension. Later on we will also want to support zips and eggs here, and to have an argument for specifying the entry-point module (similar to python -m).
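
    In other words, detection stays extension-based for now; a hypothetical future
    version might broaden the check to something like:

        isPython = v.endsWith(".py") || v.endsWith(".zip") || v.endsWith(".egg")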



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42351670
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14727/



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/664#discussion_r12337978
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
    @@ -0,0 +1,84 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.deploy
    +
    +import java.io.{IOException, File, InputStream, OutputStream}
    +
    +import scala.collection.mutable.ArrayBuffer
    +import scala.collection.JavaConversions._
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.api.python.PythonUtils
    +
    +/**
    + * A main class used by spark-submit to launch Python applications. It executes python as a
    + * subprocess and then has it connect back to the JVM to access system properties, etc.
    + */
    +object PythonRunner {
    +  def main(args: Array[String]) {
    +    val primaryResource = args(0)
    +    val pyFiles = args(1)
    +    val otherArgs = args.slice(2, args.length)
    +
    +    val pythonExec = sys.env.get("PYSPARK_PYTHON").getOrElse("python") // TODO: get this from conf
    +
    +    // Launch a Py4J gateway server for the process to connect to; this will let it see our
    +    // Java system properties and such
    +    val gatewayServer = new py4j.GatewayServer(null, 0)
    +    gatewayServer.start()
    +
    +    // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the
    +    // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
    +    val pathElements = new ArrayBuffer[String]
    +    pathElements ++= pyFiles.split(",")
    +    pathElements += PythonUtils.sparkPythonPath
    +    pathElements += sys.env.getOrElse("PYTHONPATH", "")
    +    val pythonPath = PythonUtils.mergePythonPaths(pathElements: _*)
    +
    +    // Launch Python process
    +    val builder = new ProcessBuilder(Seq(pythonExec, "-u", primaryResource) ++ otherArgs)
    +    val env = builder.environment()
    +    env.put("PYTHONPATH", pythonPath)
    +    env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
    +    builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
    +    val process = builder.start()
    +
    +    new RedirectThread(process.getInputStream, System.out, "redirect output").start()
    +
    +    System.exit(process.waitFor())
    +  }
    +
    +  /**
    +   * A utility class to redirect the child process's stdout or stderr
    +   */
    +  class RedirectThread(in: InputStream, out: OutputStream, name: String) extends Thread(name) {
    +    setDaemon(true)
    +    override def run() {
    +      scala.util.control.Exception.ignoring(classOf[IOException]) {
    +        // FIXME: We copy the stream on the level of bytes to avoid encoding problems.
    +        val buf = new Array[Byte](1024)
    --- End diff --
    
    Unless you really want the 1k buffer, you could just use ByteStreams.copy() (from Guava).
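
    That is, the loop could be replaced with something like the following (Guava is
    already on Spark's classpath):

        import com.google.common.io.ByteStreams
        ByteStreams.copy(in, out)  // copies until EOF using an internal buffer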



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42279780
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14711/



[GitHub] spark pull request: [SPARK-1549] Add Python support to spark-submit

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/664#issuecomment-42334001
  
     Merged build triggered. 

