Posted to dev@aurora.apache.org by Renan DelValle <re...@apache.org> on 2018/10/16 01:46:51 UTC

[DISCUSSION] Potential breaking change in Mesos 1.6 upgrade - docker thermos based tasks and aurora task ssh

All,

As you may know, Mesos has changed the default permissions for the sandbox
from 755 (-rwxr-xr-x) to 750 (-rwxr-x---) (
https://issues.apache.org/jira/browse/MESOS-8332).
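
To make the two modes concrete, here is a minimal sketch that renders them
the way `ls -l` would for a directory. The helper name `describe` is
hypothetical and only serves this illustration; it is not part of Mesos or
Aurora.

```python
import stat


def describe(mode):
    """Render a directory permission mode the way `ls -l` would."""
    return stat.filemode(stat.S_IFDIR | mode)


old_default = describe(0o755)  # sandbox default before the change
new_default = describe(0o750)  # sandbox default after the change
print(old_default, new_default)
```

The only difference is the last triplet: users that are neither the owner
nor in the owning group lose read and execute access.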

Stephan Erb fixed most of the breakage caused by this change with his
recent patch
https://github.com/apache/aurora/commit/32776792d273b36afbf4a1bab69a66fb06163ffd

Unfortunately, when it comes to docker based containers, the issue is a bit
more complicated.

Stephan and I have both looked into this and have been posting our findings
here:
https://github.com/apache/aurora/pull/42

Unfortunately, and I speak for myself here, I don't think there is an easy
way to keep our promise to allow users to aurora task ssh into the sandbox
of a docker container based task.

Problem:

When a docker container is launched, it is launched in its own namespace
and every command is run as root (uid=0) by default. This means two things:

A) None of the users of the host exist inside the container and therefore
we don't know the uid of the role inside the job key.

B) The sandbox for the dockerized task is owned by uid=0 and gid=0 in both
the container and the host.

Before Mesos 1.6, the permissions were open enough to allow aurora task ssh
to see the sandbox of a docker based task on the host.

From Mesos 1.6 on, aurora task ssh will not be able to see anything inside
of the sandbox of a docker based task, since aurora task ssh runs as
user=role by default and that user is neither root nor in root's group.

tl;dr: default aurora task ssh lacks the permissions to see docker
container based thermos sandboxes.
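
The effect can be sketched with a simplified model of the classic Unix
permission check. This is an approximation under stated assumptions: it
ignores ACLs, capabilities, and the permissions of parent directories, the
function name `can_enter` is hypothetical, and uid/gid 1000 stands in for
whatever the role maps to on the host.

```python
def can_enter(mode, owner_uid, owner_gid, uid, gids):
    """Approximate the check for listing and traversing a directory.

    Only the owner/group/other permission bits are considered.
    """
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        bits = (mode >> 6) & 0o7
    elif owner_gid in gids:
        bits = (mode >> 3) & 0o7
    else:
        bits = mode & 0o7
    # Listing needs read (4), traversing needs execute (1).
    return (bits & 0o5) == 0o5


# Sandbox owned by uid=0/gid=0, accessed by a role user with uid=gid=1000.
before_mesos_1_6 = can_enter(0o755, 0, 0, 1000, {1000})
from_mesos_1_6 = can_enter(0o750, 0, 0, 1000, {1000})
print(before_mesos_1_6, from_mesos_1_6)
```

With 755 the role user falls into the "other" class and still gets r-x;
with 750 the same class gets nothing.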

Solutions:

1. Find a way to mirror host users in container. (Not partial to this as it
adds a lot of complexity)

2. Allow users to provide images with uids that match the local boxes.
(Messy and error prone)

4. Leave as is (broken aurora task ssh for docker container based thermos
sandboxes) and leave it to operators to provide access to these sandboxes.
Users should still be able to see these files in the sandbox through the
Aurora observer UI and Mesos UI (Sane but potentially burdensome on
operators).

I'd love to hear other solutions if anyone else has thought of this problem.

-Renan

Re: [DISCUSSION] Potential breaking change in Mesos 1.6 upgrade - docker thermos based tasks and aurora task ssh

Posted by Renan DelValle <re...@gmail.com>.
+1 to this idea.

It's a good stopgap solution while we explore better options as well as
the possible corner cases the change to 750 brings.

I know that you're busy, so thanks for looking into this as well!

-Renan

On Mon, Oct 22, 2018 at 9:26 PM Stephan Erb <st...@blue-yonder.com>
wrote:

> Hi Renan,
>
> Unfortunately, it might even be a bit more complicated: The executor is
> normally launched as root and then drops the privileges for each Thermos
> process once it got forked successfully. If the Mesos filesystem
> permissions are too narrow, then subsequent operations managed by those
> processes will fail. Most notably, the executor will crash whenever it
> tries to rotate log files. At least this is the behavior of the Mesos
> containerizer before the fix you have referenced.
>
> In the Docker case, the executor always runs as root. However, there might
> even be other similar issues that only show up for long running containers.
> I therefore see the broken SSH as a symptom of an underlying issue that we
> need to address.
>
> Given that this is currently blocking our progress: Should we consider a
> chmod in
> https://github.com/apache/aurora/blob/32776792d273b36afbf4a1bab69a66fb06163ffd/src/main/python/apache/aurora/executor/common/sandbox.py#L173
> to restore the previous permissions of 755 for the sandbox directory?
>
> Best regards,
> Stephan

Re: [DISCUSSION] Potential breaking change in Mesos 1.6 upgrade - docker thermos based tasks and aurora task ssh

Posted by Stephan Erb <st...@blue-yonder.com>.
Hi Renan,

Unfortunately, it might even be a bit more complicated: The executor is normally launched as root and then drops the privileges for each Thermos process once it has been forked successfully. If the Mesos filesystem permissions are too narrow, then subsequent operations managed by those processes will fail. Most notably, the executor will crash whenever it tries to rotate log files. At least this is the behavior of the Mesos containerizer before the fix you have referenced.

In the Docker case, the executor always runs as root. However, there might even be other similar issues that only show up for long running containers. I therefore see the broken SSH as a symptom of an underlying issue that we need to address. 

Given that this is currently blocking our progress: Should we consider a chmod in https://github.com/apache/aurora/blob/32776792d273b36afbf4a1bab69a66fb06163ffd/src/main/python/apache/aurora/executor/common/sandbox.py#L173 to restore the previous permissions of 755 for the sandbox directory?
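
A minimal sketch of what such a chmod could look like, under stated assumptions: the function name `open_up_sandbox` and its parameters are hypothetical, this is not the code at the linked line of sandbox.py, and it only touches the sandbox directory itself, not the files inside it.

```python
import os
import stat


def open_up_sandbox(sandbox_dir, mode=0o755):
    """Set the sandbox directory to `mode` and return the resulting bits."""
    # Assumption: the caller owns the directory or runs as root,
    # otherwise os.chmod raises PermissionError.
    os.chmod(sandbox_dir, mode)
    return stat.S_IMODE(os.stat(sandbox_dir).st_mode)
```

Calling `open_up_sandbox(path)` on a directory created with 750 would leave it at 755 again, which is the pre-1.6 behavior described in this thread.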

Best regards,
Stephan
