Posted to user@spark.apache.org by Benjamin Zaitlen <qu...@gmail.com> on 2014/11/14 20:40:43 UTC

Submitting Python Applications from Remote to Master

Hi All,

I'm not quite clear on whether submitting a python application to spark
standalone on ec2 is possible.

Am I reading this correctly:

A common deployment strategy is to submit your application from a gateway
machine that is physically co-located with your worker machines (e.g.
Master node in a standalone EC2 cluster). In this setup, client mode is
appropriate. In client mode, the driver is launched directly within the
client spark-submit process, with the input and output of the application
attached to the console. Thus, this mode is especially suitable for
applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the
worker machines (e.g. locally on your laptop), it is common to use cluster
mode to minimize network latency between the drivers and the executors. Note
that cluster mode is currently not supported for standalone clusters, Mesos
clusters, or python applications.


So I shouldn't be able to do something like:

./bin/spark-submit --master spark://xxxxx.compute-1.amazonaws.com:7077 \
  examples/src/main/python/pi.py


From a laptop connecting to a previously launched Spark cluster using the
default spark-ec2 script, correct?


If I am not mistaken about this, then the docs are slightly confusing -- the
above example is more or less the example here:
https://spark.apache.org/docs/1.1.0/submitting-applications.html


If I am mistaken, apologies -- can you help me figure out where I went wrong?

I've also taken to opening port 7077 to 0.0.0.0/0

--Ben

RE: Submitting Python Applications from Remote to Master

Posted by Ashic Mahtab <as...@live.com>.
Hi Ognen,

Currently, the docs say: "Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications."
So it seems like YARN + Scala is the only option for fire-and-forget submission. It shouldn't be too hard to create a "proxy" submitter, but yes, that does involve another process (potentially a server) on that side. I've heard good things about Ooyala's job server, but I haven't got around to trying to set it up, so I can't really comment.
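From a quick look at its README, the interaction seems to be roughly the following (untested on my end, so treat the port, paths and class name below as illustrative rather than exact):

    # upload an application jar once
    curl --data-binary @target/my-job.jar localhost:8090/jars/my-app

    # then kick off jobs against it over REST
    curl -d "" 'localhost:8090/jobs?appName=my-app&classPath=com.example.MyJob'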
Regards,
Ashic.
> Date: Sat, 15 Nov 2014 09:50:14 -0600
> From: ognen.duzlevski@gmail.com
> To: ashic@live.com
> CC: quasiben@gmail.com; user@spark.apache.org
> Subject: Re: Submitting Python Applications from Remote to Master
> 
> Ashic,
> 
> Thanks for your email.
> 
> Two things:
> 
> 1. I think a whole lot of data scientists and other people would love
> it if they could just fire off jobs from their laptops. It is, in my
> opinion, a common desired use case.
> 
> 2. Did anyone actually get the Ooyala job server to work? I asked that
> question 6 months ago and never got a straight answer. I ended up
> writing a middle-layer using Scalatra and actors to submit jobs via an
> API and receive results back in JSON. In doing that, I ran into the
> inability to share the SparkContext (a "feature"), and it took a lot of
> finagling to make things work (but it never felt "production ready").
> 
> Ognen
> 
> On Sat, Nov 15, 2014 at 03:36:43PM +0000, Ashic Mahtab wrote:
> > Hi Ben, I haven't tried it with Python, but the instructions are the same as for compiled Scala (jar) apps. What the docs are saying is that it's not possible to offload the entire job to the master (a la Hadoop) in a fire-and-forget (or rather submit-and-forget) manner when running standalone. There are two deployment modes - client and cluster. For standalone, only client is supported. What this means is that the "submitting process" will be the driver process (not to be confused with the "master"). It should very well be possible to submit from your laptop to a standalone cluster, but the process running spark-submit will stay alive until the job finishes. If you terminate that process (via kill -9 or otherwise), the job will be terminated as well. The driver process submits the work to the Spark master, which does the usual divvying up of tasks, distribution, fault tolerance, etc., and the results get reported back to the driver process.
> > Often it's not possible to have arbitrary access to the Spark master, and if jobs take hours to complete, it's not feasible to keep the process running on the laptop without interruptions, disconnects, etc. As such, a "gateway" machine closer to the Spark master is used to submit jobs from. That way, the process on the gateway machine lives for the duration of the job, and no connection from the laptop, etc. is needed. It's not uncommon to actually have an API on the gateway machine. For example, Ooyala's job server (https://github.com/ooyala/spark-jobserver) provides a RESTful interface for submitting jobs.
> > Does that help?
> > Regards,
> > Ashic.
> > Date: Fri, 14 Nov 2014 13:40:43 -0600
> > Subject: Submitting Python Applications from Remote to Master
> > From: quasiben@gmail.com
> > To: user@spark.apache.org
> > 
> > Hi All,
> > I'm not quite clear on whether submitting a python application to spark standalone on ec2 is possible. 
> > Am I reading this correctly:
> > A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell). Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.
> > So I shouldn't be able to do something like: ./bin/spark-submit --master spark://xxxxx.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py
> > From a laptop connecting to a previously launched spark cluster using the default spark-ec2 script, correct?
> > If I am not mistaken about this, then the docs are slightly confusing -- the above example is more or less the example here: https://spark.apache.org/docs/1.1.0/submitting-applications.html
> > If I am mistaken, apologies -- can you help me figure out where I went wrong? I've also taken to opening port 7077 to 0.0.0.0/0
> > --Ben
> > 
> > 
> 
> -- 
> "Convictions are more dangerous enemies of truth than lies." - Friedrich Nietzsche
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 

Re: Submitting Python Applications from Remote to Master

Posted by Ognen Duzlevski <og...@gmail.com>.
Ashic,

Thanks for your email.

Two things:

1. I think a whole lot of data scientists and other people would love
it if they could just fire off jobs from their laptops. It is, in my
opinion, a common desired use case.

2. Did anyone actually get the Ooyala job server to work? I asked that
question 6 months ago and never got a straight answer. I ended up
writing a middle-layer using Scalatra and actors to submit jobs via an
API and receive results back in JSON. In doing that, I ran into the
inability to share the SparkContext (a "feature"), and it took a lot of
finagling to make things work (but it never felt "production ready").

Ognen

On Sat, Nov 15, 2014 at 03:36:43PM +0000, Ashic Mahtab wrote:
> Hi Ben, I haven't tried it with Python, but the instructions are the same as for compiled Scala (jar) apps. What the docs are saying is that it's not possible to offload the entire job to the master (a la Hadoop) in a fire-and-forget (or rather submit-and-forget) manner when running standalone. There are two deployment modes - client and cluster. For standalone, only client is supported. What this means is that the "submitting process" will be the driver process (not to be confused with the "master"). It should very well be possible to submit from your laptop to a standalone cluster, but the process running spark-submit will stay alive until the job finishes. If you terminate that process (via kill -9 or otherwise), the job will be terminated as well. The driver process submits the work to the Spark master, which does the usual divvying up of tasks, distribution, fault tolerance, etc., and the results get reported back to the driver process.
> Often it's not possible to have arbitrary access to the Spark master, and if jobs take hours to complete, it's not feasible to keep the process running on the laptop without interruptions, disconnects, etc. As such, a "gateway" machine closer to the Spark master is used to submit jobs from. That way, the process on the gateway machine lives for the duration of the job, and no connection from the laptop, etc. is needed. It's not uncommon to actually have an API on the gateway machine. For example, Ooyala's job server (https://github.com/ooyala/spark-jobserver) provides a RESTful interface for submitting jobs.
> Does that help?
> Regards,
> Ashic.
> Date: Fri, 14 Nov 2014 13:40:43 -0600
> Subject: Submitting Python Applications from Remote to Master
> From: quasiben@gmail.com
> To: user@spark.apache.org
> 
> Hi All,
> I'm not quite clear on whether submitting a python application to spark standalone on ec2 is possible. 
> Am I reading this correctly:
> A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell). Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.
> So I shouldn't be able to do something like: ./bin/spark-submit --master spark://xxxxx.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py
> From a laptop connecting to a previously launched spark cluster using the default spark-ec2 script, correct?
> If I am not mistaken about this, then the docs are slightly confusing -- the above example is more or less the example here: https://spark.apache.org/docs/1.1.0/submitting-applications.html
> If I am mistaken, apologies -- can you help me figure out where I went wrong? I've also taken to opening port 7077 to 0.0.0.0/0
> --Ben
> 
> 

-- 
"Convictions are more dangerous enemies of truth than lies." - Friedrich Nietzsche

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


RE: Submitting Python Applications from Remote to Master

Posted by Ashic Mahtab <as...@live.com>.
Hi Ben,

I haven't tried it with Python, but the instructions are the same as for compiled Scala (jar) apps. What the docs are saying is that it's not possible to offload the entire job to the master (a la Hadoop) in a fire-and-forget (or rather submit-and-forget) manner when running standalone. There are two deployment modes - client and cluster. For standalone, only client is supported. What this means is that the "submitting process" will be the driver process (not to be confused with the "master").

It should very well be possible to submit from your laptop to a standalone cluster, but the process running spark-submit will stay alive until the job finishes. If you terminate that process (via kill -9 or otherwise), the job will be terminated as well. The driver process submits the work to the Spark master, which does the usual divvying up of tasks, distribution, fault tolerance, etc., and the results get reported back to the driver process.
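For example, submitting from your laptop in client mode would look roughly like this (I haven't tried this exact command; the master URL is a placeholder for whatever spark-ec2 printed for your cluster, and port 7077 has to be reachable from the laptop):

    ./bin/spark-submit \
      --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077 \
      --deploy-mode client \
      examples/src/main/python/pi.py

The shell session running spark-submit is the driver, so it has to stay up (and stay connected) until the job finishes.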
Often it's not possible to have arbitrary access to the Spark master, and if jobs take hours to complete, it's not feasible to keep the process running on the laptop without interruptions, disconnects, etc. As such, a "gateway" machine closer to the Spark master is used to submit jobs from. That way, the process on the gateway machine lives for the duration of the job, and no connection from the laptop, etc. is needed. It's not uncommon to actually have an API on the gateway machine. For example, Ooyala's job server (https://github.com/ooyala/spark-jobserver) provides a RESTful interface for submitting jobs.
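Concretely, the gateway pattern can be as simple as ssh-ing into the master (or another box in the same security group) and running spark-submit there against a script you've copied over. Something along these lines, assuming the default spark-ec2 layout where Spark lives under /root/spark (adjust paths and hostnames to your setup):

    ssh -i my-key.pem root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    nohup /root/spark/bin/spark-submit \
      --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077 \
      pi.py > pi.log 2>&1 &

The driver then lives on the gateway for the duration of the job, and your laptop is free to disconnect.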
Does that help?
Regards,
Ashic.
Date: Fri, 14 Nov 2014 13:40:43 -0600
Subject: Submitting Python Applications from Remote to Master
From: quasiben@gmail.com
To: user@spark.apache.org

Hi All,
I'm not quite clear on whether submitting a python application to spark standalone on ec2 is possible. 
Am I reading this correctly:
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell). Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications.
So I shouldn't be able to do something like: ./bin/spark-submit --master spark://xxxxx.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py
From a laptop connecting to a previously launched spark cluster using the default spark-ec2 script, correct?
If I am not mistaken about this, then the docs are slightly confusing -- the above example is more or less the example here: https://spark.apache.org/docs/1.1.0/submitting-applications.html
If I am mistaken, apologies -- can you help me figure out where I went wrong? I've also taken to opening port 7077 to 0.0.0.0/0
--Ben