Posted to issues@mesos.apache.org by "Gavin (JIRA)" <ji...@apache.org> on 2019/04/29 09:27:27 UTC
[jira] [Issue Comment Deleted] (MESOS-8756) Missing reasons for early task failures

     [ https://issues.apache.org/jira/browse/MESOS-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gavin updated MESOS-8756:
-------------------------
    Comment: was deleted

(was: www.rtat.net)

> Missing reasons for early task failures
> ---------------------------------------
>
>                 Key: MESOS-8756
>                 URL: https://issues.apache.org/jira/browse/MESOS-8756
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor, master, scheduler api
>    Affects Versions: 1.6.0
>            Reporter: A. Dukhovniy
>            Priority: Major
>              Labels: integration, observability
>
> Some early task failures are not propagated to the framework. Here is an example of a Marathon pod (Mesos containerizer) definition with *a non-existent image*:
> {code:java}
> {
>   "id": "/fail",
>   "containers": [
>     {
>       "name": "container-1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 128
>       },
>       "image": {
>         "id": "non-existing-image-56789",
>         "kind": "DOCKER"
>       }
>     }
>   ],
>   "scaling": {
>     "instances": 1,
>     "kind": "fixed"
>   },
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [],
>   "fetch": [],
>   "scheduling": {
>     "placement": {
>       "constraints": []
>     }
>   }
> }
> {code}
> Here the status update the framework receives is {{TASK_FAILED (Executor terminated)}}.
> Here is another example, where *a non-existent artifact* is being fetched:
> {code:java}
> {
>   "id": "/fail2",
>   "containers": [
>     {
>       "name": "container-1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 128
>       },
>       "image": {
>         "id": "nginx",
>         "kind": "DOCKER",
>         "forcePull": false
>       },
>       "artifacts": [
>         {
>           "uri": "http://example.com/smth-non-existing-12345.tar.gz"
>         }
>       ]
>     }
>   ],
>   "scaling": {
>     "instances": 1,
>     "kind": "fixed"
>   },
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [],
>   "fetch": [],
>   "scheduling": {
>     "placement": {
>       "constraints": []
>     }
>   }
> }
> {code}
> This results in the same status update as above.
> This is not an exhaustive list of such cases. I'm sure there are more failures along the fork chain that are not properly propagated.
> Frameworks (and their users) should always receive meaningful task failure reasons, no matter where those failures happen. Otherwise, the only way to find out what happened is to grep the agent logs.
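
To illustrate the symptom from the framework's side, here is a minimal sketch of how a scheduler might flag failed tasks whose status update carries nothing beyond the generic "Executor terminated" text quoted above. The dict layout, the helper name `needs_log_inspection`, and the task IDs are assumptions made for illustration; this is not the Mesos scheduler API.

```python
# Hypothetical framework-side helper. Status updates are modelled as plain
# dicts for illustration; this is NOT the Mesos scheduler API.

# Messages that say nothing about the actual cause of the failure.
# "Executor terminated" is the text quoted in the report above.
GENERIC_MESSAGES = {"", "Executor terminated"}


def needs_log_inspection(status: dict) -> bool:
    """Return True when a failed task's update carries no specific cause."""
    if status.get("state") != "TASK_FAILED":
        return False
    message = (status.get("message") or "").strip()
    return message in GENERIC_MESSAGES


# Example updates; the task IDs are made up.
updates = [
    {"task_id": "fail.container-1", "state": "TASK_FAILED",
     "message": "Executor terminated"},
    {"task_id": "fail2.container-1", "state": "TASK_FAILED",
     "message": "Executor terminated"},
    {"task_id": "ok.container-1", "state": "TASK_RUNNING", "message": ""},
]

opaque = [u["task_id"] for u in updates if needs_log_inspection(u)]
print(opaque)  # ['fail.container-1', 'fail2.container-1']
```

Both failure cases from the report land in the same bucket, which is the problem being described: the framework cannot tell a missing image from a failed artifact fetch without reading the agent logs.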



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)