Posted to issues@mesos.apache.org by "Arun M J (Jira)" <ji...@apache.org> on 2020/12/13 06:25:00 UTC

[jira] [Created] (MESOS-10203) Agent process crashes on newer Linux kernels if 'linux/capabilities' isolation is enabled

Arun M J created MESOS-10203:
--------------------------------

             Summary: Agent process crashes on newer Linux kernels if 'linux/capabilities' isolation is enabled
                 Key: MESOS-10203
                 URL: https://issues.apache.org/jira/browse/MESOS-10203
             Project: Mesos
          Issue Type: Bug
          Components: agent
            Reporter: Arun M J


The Mesos agent crashes with the following stack trace on newer Linux kernels (>= 5.8.x) when started with MESOS_ISOLATION=linux/capabilities.
It ran fine on 5.7.19, but fails on 5.8.18, 5.9.11 and 5.10.

{quote}{{Dec 13 05:08:28 mesosbox mesos-agent[465]: sh: hadoop: command not found}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: I1213 05:08:28.234824 458 fetcher.cpp:66] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Hadoop client is not available, exit status: 32512}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: Reached unreachable statement at linux/capabilities.cpp:497}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: *** Aborted at 1607836108 (unix time) try "date -d @1607836108" if you are using GNU date ***}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: PC: @ 0x7f875bd62387 __GI_raise}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: *** SIGABRT (@0x1ca) received by PID 458 (TID 0x7f8760ddca00) from PID 458; stack trace: ***}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875c626630 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875bd62387 __GI_raise}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875bd63a78 __GI_abort}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875e60f237 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ef6e7c1 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ef723cc (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ef70c96 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875f05389d (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ed837fc (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ed72332 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ecf54c6 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x55f5d9c1a256 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875bd4e555 __libc_start_main}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x55f5d9c1d10f (unknown)}}
{{Dec 13 05:08:28 mesosbox kernel: audit: type=1701 audit(1607836108.250:274): auid=4294967295 uid=0 gid=0 ses=4294967295 subj==unconfined pid=4772 comm="mesos-agent" exe="/usr/sbin/mesos-agent" sig=6 res=1}}
{quote}
 

Looking further, I found that the abort is raised from [linux/capabilities.cpp|https://github.com/apache/mesos/blob/206da612c0aada0b1d86beb63660d9083b774894/src/linux/capabilities.cpp#L495-L502], in the stream operator that converts capability enum values to human-readable names.
{code:cpp}
ostream& operator<<(ostream& stream, const Capability& capability)
{
  switch (capability) {
    case CHOWN:             return stream << "CHOWN";
    case DAC_OVERRIDE:      return stream << "DAC_OVERRIDE";
    case AUDIT_READ:        return stream << "AUDIT_READ";
    ...
    ...
    case MAX_CAPABILITY:    UNREACHABLE(); // !!! Crash site
  }
  UNREACHABLE();
}
{code}
[MAX_CAPABILITY|https://github.com/apache/mesos/blob/206da612c0aada0b1d86beb63660d9083b774894/src/linux/capabilities.hpp#L75] is defined as *38*. However, newer Linux kernels have introduced capabilities at and beyond that value, namely:


 * *CAP_PERFMON*=38 // (since Linux 5.8) Employ various performance-monitoring mechanisms
 * *CAP_BPF*=39 // (since Linux 5.8) Employ privileged BPF operations
 * *CAP_CHECKPOINT_RESTORE*=40 // (since Linux 5.9) Allow checkpoint/restore related operations

ref: [https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h]


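The Mesos code above does not account for this kind of kernel evolution. On a 5.8+ kernel the value 38 now denotes a real capability (CAP_PERFMON), so it presumably reaches the MAX_CAPABILITY case and aborts. Any new capability added to the kernel will therefore break the isolator.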
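One possible direction, sketched here under assumptions rather than as a tested patch, is to let the stream operator print a numeric placeholder for values it does not know instead of aborting. The enum below is a trimmed-down, hypothetical stand-in for the Mesos one, and capabilityName is a helper added only for illustration.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical, trimmed-down stand-in for the Mesos capability enum.
// Only a few values are listed; the numbers follow <linux/capability.h>.
enum Capability : int {
  CHOWN = 0,
  DAC_OVERRIDE = 1,
  AUDIT_READ = 37,
  MAX_CAPABILITY = 38  // Compile-time upper bound, as in the report.
};

// Prints a known name, or a numeric placeholder for any value this build
// does not know about, instead of calling UNREACHABLE() and aborting.
std::ostream& operator<<(std::ostream& stream, const Capability& capability)
{
  switch (static_cast<int>(capability)) {
    case CHOWN:        return stream << "CHOWN";
    case DAC_OVERRIDE: return stream << "DAC_OVERRIDE";
    case AUDIT_READ:   return stream << "AUDIT_READ";
    default:
      return stream << "UNKNOWN(" << static_cast<int>(capability) << ")";
  }
}

// Hypothetical helper: formats a raw capability number via the operator above.
std::string capabilityName(int value)
{
  std::ostringstream out;
  out << static_cast<Capability>(value);
  return out.str();
}
```

With this shape, a kernel that reports capability 38 or higher would produce a log entry such as UNKNOWN(38) rather than a SIGABRT. Whether the isolator should also skip or reject such capabilities is a separate design question that this sketch does not address.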


