Posted to issues@spark.apache.org by "thom neale (JIRA)" <ji...@apache.org> on 2015/07/24 17:46:05 UTC

[jira] [Created] (SPARK-9313) Enable a "docker run" invocation in place of PYSPARK_PYTHON

thom neale created SPARK-9313:
---------------------------------

             Summary: Enable a "docker run" invocation in place of PYSPARK_PYTHON
                 Key: SPARK-9313
                 URL: https://issues.apache.org/jira/browse/SPARK-9313
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
         Environment: Linux
            Reporter: thom neale
            Priority: Minor


There's a potentially high-yield improvement that might be possible by enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a "docker run" invocation of a specific docker image. I'm interested in taking a shot at this, but could use some pointers on the overall pyspark architecture in order to avoid hurting myself or trying something stupid that won't work.
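To make that concrete, here's a usage sketch of what this could look like from the driver side. Nothing below works today; the image name, tag, and docker flags are purely illustrative, and in practice the variable would probably be exported before spark-submit rather than set in the script:

    import os
    from pyspark import SparkConf, SparkContext

    # Hypothetical: the "python" the workers exec is a "docker run" wrapper
    # rather than a local interpreter path. --rm cleans up the container and
    # -i keeps stdin/stdout attached, so the launcher could still read the
    # daemon's port from the interpreter's output.
    os.environ["PYSPARK_PYTHON"] = (
        "docker run --rm -i my-registry/ml-base:feature-x python"
    )

    sc = SparkContext(conf=SparkConf().setAppName("dockerized-workers-demo"))
    print(sc.parallelize(range(100)).map(lambda x: x * x).sum())
    sc.stop()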

History of this idea: I handle most of the spark infrastructure for MassMutual's data science team, and we currently push code updates out to spark workers with a combination of git post-receive hooks and ansible playbooks, all glued together with jenkins. It works well, but every time someone wants a specific PYSPARK_PYTHON environment with precise branch checkouts, for example, it has to be exquisitely configured in advance. What would be amazing is if we could run a docker image in place of PYSPARK_PYTHON, so people could build an image with whatever they want on it and push it to a docker registry; then, as long as the spark worker nodes had a docker daemon running, they wouldn't need the images in advance--they would just pull the built images from the registry on the fly once someone submitted their job and specified the appropriate docker fu in place of PYSPARK_PYTHON. This would basically make the distribution of code to the workers self-service, as long as users were savvy with docker. A lesser benefit is that docker's layered filesystem would keep the profusion of python virtualenvs (each loaded with a huge ML stack plus other deps) from gobbling up gigs of space on the smaller code partitions on our workers--a minor annoyance rather than a real problem. Each new combination of branch checkouts for our application code could use the same huge ML base image, and things would just be faster and simpler.

What I Speculate This Would Require 
--------------------------------------------------- 
Based on a reading of pyspark/daemon.py, I think this would require: 
- Somehow making the os.setpgid call inside manager() optional. The pyspark.daemon process isn't allowed to call setpgid, I think because it has pid 1 in the container. In my hacked branch I'm doing this by checking whether a new environment variable is set (see the sketch after this list).
- Instead of binding to a random port, if the worker is dockerized, bind to a predetermined port.
- When the dockerized worker is invoked, query docker for the exposed port on the host, and print that instead.
- Possibly do the same with ports opened by forked workers?
- Forward stdin/out to/from the container where appropriate.

My initial tinkering has done the first three points on 1.3.1, and I get the InvalidArgumentException with an out-of-range port number, probably indicating something is hitting an error and printing something else instead of the actual port.
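The daemon-side half of the hack could look roughly like this. It's only a sketch against my reading of pyspark/daemon.py, not the real code: the env var names (SPARK_WORKER_IN_DOCKER, PYSPARK_DAEMON_PORT) are made up, and the real manager() does a lot more than this.

    import os
    import socket
    import struct
    import sys

    # Made-up env var names for this sketch; not real Spark configuration.
    IN_DOCKER = os.environ.get("SPARK_WORKER_IN_DOCKER") == "1"
    FIXED_PORT = int(os.environ.get("PYSPARK_DAEMON_PORT", "0"))

    def manager():
        # Point 1: skip the new-process-group call when running as pid 1
        # inside a container, where setpgid is refused.
        if not IN_DOCKER:
            os.setpgid(0, 0)

        # Point 2: bind to a predetermined port (on all interfaces) when
        # dockerized, so the host knows which container port to map;
        # otherwise let the OS pick a random free port as daemon.py does now.
        listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if IN_DOCKER:
            listen_sock.bind(("0.0.0.0", FIXED_PORT))
        else:
            listen_sock.bind(("127.0.0.1", 0))
        listen_sock.listen(128)

        # The daemon reports its port on stdout as a packed int (the real
        # code uses its own helper; this is the same idea). Point 3 would
        # happen on the host side instead: ask docker which host port maps
        # to FIXED_PORT ("docker port <container> <port>") and use that
        # number rather than the one printed from inside the container.
        port = listen_sock.getsockname()[1]
        out = getattr(sys.stdout, "buffer", sys.stdout)
        out.write(struct.pack("!i", port))
        out.flush()
        return listen_sock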

Any pointers people can supply would most welcome; I'm really interested in at least succeeding in a demonstration of this hack, if not getting it merged any time soon.



