Posted to user@mesos.apache.org by James Vanns <jv...@gmail.com> on 2015/06/10 19:10:50 UTC

Debugging framework registration from inside docker

Hi. When attempting to run my scheduler inside a docker container in
--net=bridge mode it never receives acknowledgement or a reply to that
request. However, it works fine in --net=host mode. It does not listen on
any port as a service so does not expose any.

The scheduler receives the mesos master (leader) from zookeeper fine but
fails to register the framework with that master. It just loops trying to
do so - the master sees the registration but deactivates it immediately as
apparently it disconnects. It doesn't disconnect but is obviously
unreachable. I see the reason for this in the sendto() and the master log
file -- because the internal docker bridge IP is included in the POST and
perhaps that is how the master is trying to talk back
to the requesting framework??

Inside the container is this;
tcp        0      0 0.0.0.0:44431           0.0.0.0:*               LISTEN      1/scheduler
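The LISTEN socket above is opened by libprocess itself rather than the scheduler code; the master needs it to call back with registration replies. A minimal Python sketch of the equivalent wildcard bind (assuming nothing beyond the standard library):

```python
import socket

# libprocess binds a listening socket on all interfaces, much like the
# 0.0.0.0:44431 entry in the netstat output above. A wildcard bind with
# port 0 lets the kernel pick an ephemeral port, as libprocess does when
# no port is configured.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 0))
print(s.getsockname())  # e.g. ('0.0.0.0', 44431); the port varies
s.close()
```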

This is not my code! I'm at a loss where to go from here. Anyone got any
further suggestions
to fix this?

Cheers,

Jim

--
Senior Code Pig
Industrial Light & Magic

Re: Debugging framework registration from inside docker

Posted by Tom Arnfeld <to...@duedil.com>.
I believe you're correct Jim, if you set LIBPROCESS_IP=$HOST_IP libprocess will try to bind to that address as well as announce it, which won't work inside a bridged container.
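A minimal sketch of that failure mode, assuming the host IP is not a local interface inside a bridged container; 203.0.113.7 (a documentation-range address) stands in for such a non-local IP:

```python
import errno
import socket

# Binding to an address that is not configured on any local interface
# fails -- this is roughly what libprocess hits inside a bridged
# container when LIBPROCESS_IP is set to the host's IP.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("203.0.113.7", 0))
except OSError as e:
    print("bind failed:", errno.errorcode.get(e.errno, e.errno))
finally:
    s.close()
```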

We've been having a similar discussion on https://github.com/wickman/pesos/issues/25.

--
Tom Arnfeld
Developer // DueDil

On Thursday, Jun 11, 2015 at 10:00 am, James Vanns <jv...@gmail.com>, wrote:

Looks like I share the same symptoms as this 'marathon inside container' problem;

https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion

I guess that sheds some light on the subject ;)


On 11 June 2015 at 09:43, James Vanns <jv...@gmail.com> wrote:

For what exactly? I thought that was for slave<->master communication? There is no problem there. Or are you suggesting that from inside the running container I set at least LIBPROCESS_IP to the host IP rather than the IP of eth0 the container sees? Won't that screw with the docker bridge routing?

This doesn't quite make sense. I have other network connections inside this container and those channels are established and communicating fine. It's just with the mesos master for some reason. Just to be clear;

* The running process is a scheduling framework
* It does not listen for any inbound connection requests
* It, of course, does attempt an outbound connection to the zookeeper to get the MM
  (this works)
* It then attempts to establish a connection with the MM
  (this also works)
* When the MM sends a response, it fails - it effectively tries to send the response back to the private/internal docker IP where my scheduler is running.
* This problem disappears when run with --net=host

TCPDump never shows any inbound traffic;

IP 172.17.1.197.55182 > 172.20.121.193.5050
...

Therefore there is never any ACK# that corresponds with the SEQ# and these are just re-transmissions. I think!

Jim


On 10 June 2015 at 18:16, Steven Schlansker <ss...@opentable.com> wrote:

On Jun 10, 2015, at 10:10 AM, James Vanns <jv...@gmail.com> wrote:

> Hi. When attempting to run my scheduler inside a docker container in --net=bridge mode it never receives acknowledgement or a reply to that request. However, it works fine in --net=host mode. It does not listen on any port as a service so does not expose any.
>
> The scheduler receives the mesos master (leader) from zookeeper fine but fails to register the framework with that master. It just loops trying to do so - the master sees the registration but deactivates it immediately as apparently it disconnects. It doesn't disconnect but is obviously unreachable. I see the reason for this in the sendto() and the master log file -- because the internal docker bridge IP is included in the POST and perhaps that is how the master is trying to talk back to the requesting framework??
>
> Inside the container is this;
> tcp        0      0 0.0.0.0:44431           0.0.0.0:*               LISTEN      1/scheduler
>
> This is not my code! I'm at a loss where to go from here. Anyone got any further suggestions to fix this?

You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the fact that you are on a virtual Docker interface.


--
--
Senior Code Pig
Industrial Light & Magic

Re: Debugging framework registration from inside docker

Posted by James Vanns <jv...@gmail.com>.
Hi Vinod - this is good news! Just the fact that I'm not barking up the
wrong tree and that indeed it is a known issue.

Cheers

Jim


On 11 June 2015 at 18:16, Vinod Kone <vi...@gmail.com> wrote:

>
> On Thu, Jun 11, 2015 at 4:00 AM, James Vanns <jv...@gmail.com> wrote:
>
>> I think I can conclude then that this just won't work; one cannot run a
>> framework as a docker container using bridged networking. This is because a
>> POST to the MM that libprocess does on your framework's behalf, includes
>> the non-routable private docker IP and that is what the MM will then try
>> to communicate with? Setting LIBPROCESS_IP to the host IP will of course
>> not work because then libprocess or somewhere in the mesos framework code
>> an attempt at bind()ing to that interface is made and fails.... because it
>> does not exist in bridge mode.
>>
>
> You are right on track. This is a known issue:
> https://issues.apache.org/jira/browse/MESOS-809. Anindya has submitted a
> short term fix, which unfortunately never landed. I'll shepherd and commit
> this.
>
>
>
>> *If* the above is correct then the question I suppose is why does the
>> communication channel get established in that way? Why off the back of some
>> data in a POST rather than the connected endpoint (that presumably docker
>> would manage/forward as it would with a regular web service, for example)?
>> Is this some caveat of using zookeeper?
>>
>
> Longer term, the plan is for the master to reuse the connection opened
> by the scheduler and not open a new one, as you mentioned. See
> https://issues.apache.org/jira/browse/MESOS-2289
>
>
>
>
>> I'm sure someone will correct me where I'm wrong ;)
>>
>
> You are not!
>
>


-- 
--
Senior Code Pig
Industrial Light & Magic

Re: Debugging framework registration from inside docker

Posted by Vinod Kone <vi...@gmail.com>.
On Thu, Jun 11, 2015 at 4:00 AM, James Vanns <jv...@gmail.com> wrote:

> I think I can conclude then that this just won't work; one cannot run a
> framework as a docker container using bridged networking. This is because a
> POST to the MM that libprocess does on your framework's behalf, includes
> the non-routable private docker IP and that is what the MM will then try
> to communicate with? Setting LIBPROCESS_IP to the host IP will of course
> not work because then libprocess or somewhere in the mesos framework code
> an attempt at bind()ing to that interface is made and fails.... because it
> does not exist in bridge mode.
>

You are right on track. This is a known issue:
https://issues.apache.org/jira/browse/MESOS-809. Anindya has submitted a
short-term fix, which unfortunately never landed. I'll shepherd and commit
this.
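
For illustration, a sketch of the bind-vs-advertise split such a fix implies. The LIBPROCESS_ADVERTISE_* names below are an assumption here, since this thread predates any fix landing; the idea is simply to bind inside the container while announcing the routable host address that docker maps back in:

```python
import os

# Assumed/illustrative variable names: bind to the container's own
# interfaces, but advertise the host's routable address and a port
# published with e.g. `docker run -p 44431:44431`.
os.environ["LIBPROCESS_IP"] = "0.0.0.0"                  # bind inside the container
os.environ["LIBPROCESS_PORT"] = "44431"
os.environ["LIBPROCESS_ADVERTISE_IP"] = "172.20.121.10"  # hypothetical host IP
os.environ["LIBPROCESS_ADVERTISE_PORT"] = "44431"        # the published host port
print(os.environ["LIBPROCESS_ADVERTISE_IP"])
```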



> *If* the above is correct then the question I suppose is why does the
> communication channel get established in that way? Why off the back of some
> data in a POST rather than the connected endpoint (that presumably docker
> would manage/forward as it would with a regular web service, for example)?
> Is this some caveat of using zookeeper?
>

Longer term, the plan is for the master to reuse the connection opened
by the scheduler and not open a new one, as you mentioned. See
https://issues.apache.org/jira/browse/MESOS-2289




> I'm sure someone will correct me where I'm wrong ;)
>

You are not!

Re: Debugging framework registration from inside docker

Posted by James Vanns <jv...@gmail.com>.
I think I can conclude then that this just won't work; one cannot run a
framework as a docker container using bridged networking. This is because a
POST to the MM that libprocess does on your framework's behalf, includes
the non-routable private docker IP and that is what the MM will then try
to communicate with? Setting LIBPROCESS_IP to the host IP will of course
not work because then libprocess or somewhere in the mesos framework code
an attempt at bind()ing to that interface is made and fails.... because it
does not exist in bridge mode.

*If* the above is correct then the question I suppose is why does the
communication channel get established in that way? Why off the back of some
data in a POST rather than the connected endpoint (that presumably docker
would manage/forward as it would with a regular web service, for example)?
Is this some caveat of using zookeeper?

I'm sure someone will correct me where I'm wrong ;)

Cheers,

Jim


On 11 June 2015 at 10:00, James Vanns <jv...@gmail.com> wrote:

> Looks like I share the same symptoms as this 'marathon inside container'
> problem;
>
> https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion
>
> I guess that sheds some light on the subject ;)
>
>
> On 11 June 2015 at 09:43, James Vanns <jv...@gmail.com> wrote:
>
>> For what exactly? I thought that was for slave<->master communication?
>> There is no problem there. Or are you suggesting that from inside the
>> running container I set at least LIBPROCESS_IP to the host IP rather than
>> the IP of eth0 the container sees? Won't that screw with the docker bridge
>> routing?
>>
>> This doesn't quite make sense. I have other network connections inside
>> this container and those channels are established and communicating fine.
>> It's just with the mesos master for some reason. Just to be clear;
>>
>> * The running process is a scheduling framework
>> * It does not listen for any inbound connection requests
>> * It, of course, does attempt an outbound connection to the zookeeper to
>> get the MM
>>   (this works)
>> * It then attempts to establish a connection with the MM
>>   (this also works)
>> * When the MM sends a response, it fails - it effectively tries to send
>> the
>> response back to the private/internal docker IP where my scheduler is
>> running.
>> * This problem disappears when run with --net=host
>>
>> TCPDump never shows any inbound traffic;
>>
>> IP 172.17.1.197.55182 > 172.20.121.193.5050
>> ...
>>
>> Therefore there is never any ACK# that corresponds with the SEQ# and
>> these are just re-transmissions. I think!
>>
>> Jim
>>
>>
>> On 10 June 2015 at 18:16, Steven Schlansker <ss...@opentable.com>
>> wrote:
>>
>>> On Jun 10, 2015, at 10:10 AM, James Vanns <jv...@gmail.com> wrote:
>>>
>>> > Hi. When attempting to run my scheduler inside a docker container in
>>> --net=bridge mode it never receives acknowledgement or a reply to that
>>> request. However, it works fine in --net=host mode. It does not listen on
>>> any port as a service so does not expose any.
>>> >
>>> > The scheduler receives the mesos master (leader) from zookeeper fine
>>> but fails to register the framework with that master. It just loops trying
>>> to do so - the master sees the registration but deactivates it immediately
>>> as apparently it disconnects. It doesn't disconnect but is obviously
>>> unreachable. I see the reason for this in the sendto() and the master log
>>> file -- because the internal docker bridge IP is included in the POST and
>>> perhaps that is how the master is trying to talk back
>>> > to the requesting framework??
>>> >
>>> > Inside the container is this;
>>> > tcp        0      0 0.0.0.0:44431           0.0.0.0:*
>>>  LISTEN      1/scheduler
>>> >
>>> > This is not my code! I'm at a loss where to go from here. Anyone got
>>> any further suggestions
>>> > to fix this?
>>>
>>> You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide
>>> the fact that you are on a virtual Docker interface.
>>>
>>>
>>>
>>
>>
>> --
>> --
>> Senior Code Pig
>> Industrial Light & Magic
>>
>
>
>
> --
> --
> Senior Code Pig
> Industrial Light & Magic
>



-- 
--
Senior Code Pig
Industrial Light & Magic

Re: Debugging framework registration from inside docker

Posted by James Vanns <jv...@gmail.com>.
Looks like I share the same symptoms as this 'marathon inside container'
problem;

https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion

I guess that sheds some light on the subject ;)


On 11 June 2015 at 09:43, James Vanns <jv...@gmail.com> wrote:

> For what exactly? I thought that was for slave<->master communication?
> There is no problem there. Or are you suggesting that from inside the
> running container I set at least LIBPROCESS_IP to the host IP rather than
> the IP of eth0 the container sees? Won't that screw with the docker bridge
> routing?
>
> This doesn't quite make sense. I have other network connections inside
> this container and those channels are established and communicating fine.
> It's just with the mesos master for some reason. Just to be clear;
>
> * The running process is a scheduling framework
> * It does not listen for any inbound connection requests
> * It, of course, does attempt an outbound connection to the zookeeper to
> get the MM
>   (this works)
> * It then attempts to establish a connection with the MM
>   (this also works)
> * When the MM sends a response, it fails - it effectively tries to send
> the
> response back to the private/internal docker IP where my scheduler is
> running.
> * This problem disappears when run with --net=host
>
> TCPDump never shows any inbound traffic;
>
> IP 172.17.1.197.55182 > 172.20.121.193.5050
> ...
>
> Therefore there is never any ACK# that corresponds with the SEQ# and these
> are just re-transmissions. I think!
>
> Jim
>
>
> On 10 June 2015 at 18:16, Steven Schlansker <ss...@opentable.com>
> wrote:
>
>> On Jun 10, 2015, at 10:10 AM, James Vanns <jv...@gmail.com> wrote:
>>
>> > Hi. When attempting to run my scheduler inside a docker container in
>> --net=bridge mode it never receives acknowledgement or a reply to that
>> request. However, it works fine in --net=host mode. It does not listen on
>> any port as a service so does not expose any.
>> >
>> > The scheduler receives the mesos master (leader) from zookeeper fine
>> but fails to register the framework with that master. It just loops trying
>> to do so - the master sees the registration but deactivates it immediately
>> as apparently it disconnects. It doesn't disconnect but is obviously
>> unreachable. I see the reason for this in the sendto() and the master log
>> file -- because the internal docker bridge IP is included in the POST and
>> perhaps that is how the master is trying to talk back
>> > to the requesting framework??
>> >
>> > Inside the container is this;
>> > tcp        0      0 0.0.0.0:44431           0.0.0.0:*
>>  LISTEN      1/scheduler
>> >
>> > This is not my code! I'm at a loss where to go from here. Anyone got
>> any further suggestions
>> > to fix this?
>>
>> You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the
>> fact that you are on a virtual Docker interface.
>>
>>
>>
>
>
> --
> --
> Senior Code Pig
> Industrial Light & Magic
>



-- 
--
Senior Code Pig
Industrial Light & Magic

Re: Debugging framework registration from inside docker

Posted by James Vanns <jv...@gmail.com>.
For what exactly? I thought that was for slave<->master communication?
There is no problem there. Or are you suggesting that from inside the
running container I set at least LIBPROCESS_IP to the host IP rather than
the IP of eth0 the container sees? Won't that screw with the docker bridge
routing?

This doesn't quite make sense. I have other network connections inside this
container and those channels are established and communicating fine. It's
just with the mesos master for some reason. Just to be clear;

* The running process is a scheduling framework
* It does not listen for any inbound connection requests
* It, of course, does attempt an outbound connection to the zookeeper to
get the MM
  (this works)
* It then attempts to establish a connection with the MM
  (this also works)
* When the MM sends a response, it fails - it effectively tries to send the
response back to the private/internal docker IP where my scheduler is
running.
* This problem disappears when run with --net=host

TCPDump never shows any inbound traffic;

IP 172.17.1.197.55182 > 172.20.121.193.5050
...

Therefore there is never any ACK# that corresponds with the SEQ# and these
are just re-transmissions. I think!
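
A sketch of the master-side connect-back this implies, reusing the container-private address from the tcpdump line above (the port is illustrative): the advertised 172.17.x.x address is private to the Docker bridge, so from another host the SYNs go unanswered.

```python
import socket

# The master tries to open a connection back to the address the framework
# advertised. A container-private 172.17.x.x address is unroutable from
# outside the bridge, so the attempt errors or times out -- matching the
# unanswered SYN re-transmissions seen in tcpdump.
advertised = ("172.17.1.197", 44431)  # address from the thread; port illustrative
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(1.0)
try:
    s.connect(advertised)
except OSError as e:
    print("connect-back failed:", e)
finally:
    s.close()
```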

Jim


On 10 June 2015 at 18:16, Steven Schlansker <ss...@opentable.com>
wrote:

> On Jun 10, 2015, at 10:10 AM, James Vanns <jv...@gmail.com> wrote:
>
> > Hi. When attempting to run my scheduler inside a docker container in
> --net=bridge mode it never receives acknowledgement or a reply to that
> request. However, it works fine in --net=host mode. It does not listen on
> any port as a service so does not expose any.
> >
> > The scheduler receives the mesos master (leader) from zookeeper fine but
> fails to register the framework with that master. It just loops trying to
> do so - the master sees the registration but deactivates it immediately as
> apparently it disconnects. It doesn't disconnect but is obviously
> unreachable. I see the reason for this in the sendto() and the master log
> file -- because the internal docker bridge IP is included in the POST and
> perhaps that is how the master is trying to talk back
> > to the requesting framework??
> >
> > Inside the container is this;
> > tcp        0      0 0.0.0.0:44431           0.0.0.0:*
>  LISTEN      1/scheduler
> >
> > This is not my code! I'm at a loss where to go from here. Anyone got any
> further suggestions
> > to fix this?
>
> You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the
> fact that you are on a virtual Docker interface.
>
>
>


-- 
--
Senior Code Pig
Industrial Light & Magic

Re: Debugging framework registration from inside docker

Posted by Steven Schlansker <ss...@opentable.com>.
On Jun 10, 2015, at 10:10 AM, James Vanns <jv...@gmail.com> wrote:

> Hi. When attempting to run my scheduler inside a docker container in --net=bridge mode it never receives acknowledgement or a reply to that request. However, it works fine in --net=host mode. It does not listen on any port as a service so does not expose any.
> 
> The scheduler receives the mesos master (leader) from zookeeper fine but fails to register the framework with that master. It just loops trying to do so - the master sees the registration but deactivates it immediately as apparently it disconnects. It doesn't disconnect but is obviously unreachable. I see the reason for this in the sendto() and the master log file -- because the internal docker bridge IP is included in the POST and perhaps that is how the master is trying to talk back
> to the requesting framework?? 
> 
> Inside the container is this;
> tcp        0      0 0.0.0.0:44431           0.0.0.0:*               LISTEN      1/scheduler
> 
> This is not my code! I'm at a loss where to go from here. Anyone got any further suggestions
> to fix this?

You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the fact that you are on a virtual Docker interface.
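
A minimal sketch of that suggestion, assuming a Python scheduler: libprocess reads these variables from the environment at startup, so they must be exported before the driver is constructed. The address is a placeholder for a routable host IP, and the commented driver call stands in for whatever scheduler driver is in use.

```python
import os

# Export before the scheduler driver is created; libprocess reads these
# once at startup. Both values must be reachable from the master.
os.environ["LIBPROCESS_IP"] = "172.20.121.10"  # hypothetical routable host IP
os.environ["LIBPROCESS_PORT"] = "44431"
# driver = MesosSchedulerDriver(my_scheduler, framework_info, master_url)
# driver.run()
print(os.environ["LIBPROCESS_IP"], os.environ["LIBPROCESS_PORT"])
```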