Posted to dev@spark.apache.org by Ashwin Shankar <as...@gmail.com> on 2015/06/10 22:43:04 UTC

Problem with pyspark on Docker talking to YARN cluster

All,
I was wondering if any of you have solved this problem:

I have pyspark (ipython mode) running in Docker, talking to
a YARN cluster (the AM/executors are NOT running in Docker).

When I start pyspark in the Docker container, it binds to port *49460*.

Once the app is submitted to YARN, the app (AM) on the cluster side fails
with the following error message:
*ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460*

This makes sense: the AM is trying to talk to the container directly and
it cannot; it should be talking to the Docker host instead.

*Question*:
How do we make the Spark AM talk to host1:port1 on the Docker host (not the
container), which would then
route it to the container running pyspark on host2:port2?

One solution I can think of is: after starting the driver (say on
hostA:portA), and before submitting the app to YARN, we could
reset the driver's host/port to the host machine's ip/port. The AM could then
talk to the host machine's ip/port, which would be mapped
to the container.
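
For concreteness, a rough sketch of the kind of configuration I have in mind
on the pyspark side (the hostname and ports below are placeholders, and I
have not verified this end to end):

    # Sketch only: pin the driver endpoint before the SparkContext is created.
    # "docker-host.example.com" and port 49460 are illustrative placeholders.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            .setAppName("pyspark-on-docker")
            # address/port the AM would use to reach the driver (the Docker host):
            .set("spark.driver.host", "docker-host.example.com")
            .set("spark.driver.port", "49460"))
    # the container would then publish that port, e.g. "docker run -p 49460:49460 ..."

    sc = SparkContext(conf=conf)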

Thoughts?
-- 
Thanks,
Ashwin

Re: Problem with pyspark on Docker talking to YARN cluster

Posted by John Omernik <jo...@omernik.com>.
Were there any other creative solutions for this? I am running into the same
issue submitting to YARN from a Docker container, and the solutions provided
don't work:

1. Setting the host doesn't work, even if I use the hostname of the physical
node, because when Spark tries to bind to that hostname in bridged mode it
doesn't see it and errors out... as stated, we need both a bind address and
an advertise address for this to work.
2. Same restrictions apply.
3. Cluster mode doesn't work for the pyspark shell.

Any other thoughts?

John

On Thu, Jun 11, 2015 at 12:09 AM, Ashwin Shankar <as...@gmail.com>
wrote:

> Hi Eron, Thanks for your reply, but none of these options works for us.
>>
>>
>>    1. use 'spark.driver.host' and 'spark.driver.port' setting to
>>    stabilize the driver-side endpoint.  (ref
>>    <https://spark.apache.org/docs/latest/configuration.html#networking>)
>>
>> This unfortunately won't help: if we set spark.driver.port to
> something, it's going to be used to bind on the client
> side, and the same value will be passed to the AM. We need two variables: a) one
> to bind to on the client side, and b) another port which is opened up on the
> Docker host and will be used by the AM to talk back to the driver.
>
> 2. use host networking for your container, i.e. "docker run --net=host
>> ..."
>
> We run containers in a shared environment, and this option makes the host
> network stack accessible to all
> containers in it, which could lead to security issues.
>
> 3. use yarn-cluster mode
>
> The pyspark interactive shell (ipython) doesn't have cluster mode. SPARK-5162
> <https://issues.apache.org/jira/browse/SPARK-5162> is for spark-submit of
> python apps in cluster mode.
>
> Thanks,
> Ashwin
>
>
> On Wed, Jun 10, 2015 at 3:55 PM, Eron Wright <ew...@live.com> wrote:
>
>> Options include:
>>
>>    1. use 'spark.driver.host' and 'spark.driver.port' setting to
>>    stabilize the driver-side endpoint.  (ref
>>    <https://spark.apache.org/docs/latest/configuration.html#networking>)
>>    2. use host networking for your container, i.e. "docker run
>>    --net=host ..."
>>    3. use yarn-cluster mode (see SPARK-5162
>>    <https://issues.apache.org/jira/browse/SPARK-5162>)
>>
>>
>> Hope this helps,
>> Eron
>>
>>
>> ------------------------------
>> Date: Wed, 10 Jun 2015 13:43:04 -0700
>> Subject: Problem with pyspark on Docker talking to YARN cluster
>> From: ashwinshankar77@gmail.com
>> To: dev@spark.apache.org; user@spark.apache.org
>>
>>
>> All,
>> I was wondering if any of you have solved this problem:
>>
>> I have pyspark (ipython mode) running in Docker, talking to
>> a YARN cluster (the AM/executors are NOT running in Docker).
>>
>> When I start pyspark in the Docker container, it binds to port *49460*.
>>
>> Once the app is submitted to YARN, the app (AM) on the cluster side fails
>> with the following error message:
>> *ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460*
>>
>> This makes sense: the AM is trying to talk to the container directly and
>> it cannot; it should be talking to the Docker host instead.
>>
>> *Question*:
>> How do we make the Spark AM talk to host1:port1 on the Docker host (not the
>> container), which would then
>> route it to the container running pyspark on host2:port2?
>>
>> One solution I can think of is: after starting the driver (say on
>> hostA:portA), and before submitting the app to YARN, we could
>> reset the driver's host/port to the host machine's ip/port. The AM could
>> then talk to the host machine's ip/port, which would be mapped
>> to the container.
>>
>> Thoughts?
>> --
>> Thanks,
>> Ashwin
>>
>>
>>
>
>
> --
> Thanks,
> Ashwin
>
>
>

Re: Problem with pyspark on Docker talking to YARN cluster

Posted by Ashwin Shankar <as...@gmail.com>.
Hi Eron, Thanks for your reply, but none of these options works for us.
>
>
>    1. use 'spark.driver.host' and 'spark.driver.port' setting to
>    stabilize the driver-side endpoint.  (ref
>    <https://spark.apache.org/docs/latest/configuration.html#networking>)
>
This unfortunately won't help: if we set spark.driver.port to
something, it's going to be used to bind on the client
side, and the same value will be passed to the AM. We need two variables: a) one
to bind to on the client side, and b) another port which is opened up on the
Docker host and will be used by the AM to talk back to the driver (see the
hypothetical sketch at the end of this message).

2. use host networking for your container, i.e. "docker run --net=host ..."

We run containers in a shared environment, and this option makes the host
network stack accessible to all
containers in it, which could lead to security issues.

3. use yarn-cluster mode

The pyspark interactive shell (ipython) doesn't have cluster mode. SPARK-5162
<https://issues.apache.org/jira/browse/SPARK-5162> is for spark-submit of
python apps in cluster mode.
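
To make point 1 above concrete, here is a purely hypothetical sketch of the
two settings we would need; the bind-side keys below are NOT existing Spark
settings and are only there to illustrate the split between the bind
endpoint and the advertised endpoint:

    # Hypothetical sketch only: "spark.driver.bindAddress" and
    # "spark.driver.bindPort" are not settings Spark provides today; they
    # just illustrate the bind/advertise split described under point 1.
    from pyspark import SparkConf

    conf = (SparkConf()
            # what the AM is told to connect back to (Docker host + published port):
            .set("spark.driver.host", "docker-host.example.com")
            .set("spark.driver.port", "49460")
            # what the driver would actually bind to inside the container:
            .set("spark.driver.bindAddress", "0.0.0.0")
            .set("spark.driver.bindPort", "49460"))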

Thanks,
Ashwin


On Wed, Jun 10, 2015 at 3:55 PM, Eron Wright <ew...@live.com> wrote:

> Options include:
>
>    1. use 'spark.driver.host' and 'spark.driver.port' setting to
>    stabilize the driver-side endpoint.  (ref
>    <https://spark.apache.org/docs/latest/configuration.html#networking>)
>    2. use host networking for your container, i.e. "docker run --net=host
>    ..."
>    3. use yarn-cluster mode (see SPARK-5162
>    <https://issues.apache.org/jira/browse/SPARK-5162>)
>
>
> Hope this helps,
> Eron
>
>
> ------------------------------
> Date: Wed, 10 Jun 2015 13:43:04 -0700
> Subject: Problem with pyspark on Docker talking to YARN cluster
> From: ashwinshankar77@gmail.com
> To: dev@spark.apache.org; user@spark.apache.org
>
>
> All,
> I was wondering if any of you have solved this problem:
>
> I have pyspark (ipython mode) running in Docker, talking to
> a YARN cluster (the AM/executors are NOT running in Docker).
>
> When I start pyspark in the Docker container, it binds to port *49460*.
>
> Once the app is submitted to YARN, the app (AM) on the cluster side fails
> with the following error message:
> *ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460*
>
> This makes sense: the AM is trying to talk to the container directly and
> it cannot; it should be talking to the Docker host instead.
>
> *Question*:
> How do we make the Spark AM talk to host1:port1 on the Docker host (not the
> container), which would then
> route it to the container running pyspark on host2:port2?
>
> One solution I can think of is: after starting the driver (say on
> hostA:portA), and before submitting the app to YARN, we could
> reset the driver's host/port to the host machine's ip/port. The AM could then
> talk to the host machine's ip/port, which would be mapped
> to the container.
>
> Thoughts?
> --
> Thanks,
> Ashwin
>
>
>


-- 
Thanks,
Ashwin

RE: Problem with pyspark on Docker talking to YARN cluster

Posted by Eron Wright <ew...@live.com>.
Options include:

   1. use the 'spark.driver.host' and 'spark.driver.port' settings to
   stabilize the driver-side endpoint (ref
   <https://spark.apache.org/docs/latest/configuration.html#networking>)
   2. use host networking for your container, i.e. "docker run
   --net=host ..."
   3. use yarn-cluster mode (see SPARK-5162
   <https://issues.apache.org/jira/browse/SPARK-5162>)
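
As a rough illustration of option 3 (non-interactive jobs only; the file name
and paths below are placeholders), a standalone pyspark script submitted in
yarn-cluster mode keeps the driver on the cluster, so nothing ever has to
connect back into the container:

    # word_count.py -- placeholder standalone job for yarn-cluster mode.
    # Submit with something like: spark-submit --master yarn-cluster word_count.py
    # In yarn-cluster mode the driver runs inside the YARN application master,
    # so the cluster never needs to reach back into the Docker container.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("word-count-example")
    sc = SparkContext(conf=conf)

    counts = (sc.textFile("hdfs:///tmp/input.txt")      # placeholder input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///tmp/word_counts")     # placeholder output path
    sc.stop()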

Hope this helps,
Eron

Date: Wed, 10 Jun 2015 13:43:04 -0700
Subject: Problem with pyspark on Docker talking to YARN cluster
From: ashwinshankar77@gmail.com
To: dev@spark.apache.org; user@spark.apache.org

All,
I was wondering if any of you have solved this problem:

I have pyspark (ipython mode) running in Docker, talking to
a YARN cluster (the AM/executors are NOT running in Docker).

When I start pyspark in the Docker container, it binds to port 49460.

Once the app is submitted to YARN, the app (AM) on the cluster side fails
with the following error message:
ERROR yarn.ApplicationMaster: Failed to connect to driver at :49460

This makes sense: the AM is trying to talk to the container directly and
it cannot; it should be talking to the Docker host instead.

Question:
How do we make the Spark AM talk to host1:port1 on the Docker host (not the
container), which would then route it to the container running pyspark on
host2:port2?

One solution I can think of is: after starting the driver (say on
hostA:portA), and before submitting the app to YARN, we could reset the
driver's host/port to the host machine's ip/port. The AM could then talk to
the host machine's ip/port, which would be mapped to the container.

Thoughts?
-- 
Thanks,
Ashwin