Posted to issues@mesos.apache.org by "Steven Schlansker (JIRA)" <ji...@apache.org> on 2015/01/14 19:34:34 UTC

[jira] [Comment Edited] (MESOS-1949) All log messages from master, slave, executor, etc. should be collected on a per-task basis

    [ https://issues.apache.org/jira/browse/MESOS-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277399#comment-14277399 ] 

Steven Schlansker edited comment on MESOS-1949 at 1/14/15 6:33 PM:
-------------------------------------------------------------------

Well, it's not quite as urgent as I thought.  But there's still a lot of information that is hidden in log files and is very hard to correlate.  For example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container '78065406-449e-4103-85c1-bbfab09d7372' for task 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' (and executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a') of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container '78065406-449e-4103-85c1-bbfab09d7372' for executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed to start: Port [4111] not included in resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed: Unknown container: 78065406-449e-4103-85c1-bbfab09d7372
{code}
but the "message" field contains only "Abnormal executor termination".

Whenever something like this happens, application developers come to me -- they don't have the knowledge to trawl through Mesos logs (arguably a developer education problem, but the tools could help much more!).  You can find the Mesos slave logs through the UI, but you have to do a lot of correlation yourself -- you have to find the right slave, dig through the messages looking only for the ones relevant to your task, etc.

If all of the logs relevant to one task were collected in one place, this would be much easier.  Does that make sense?
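
To make the manual correlation concrete, here is a rough sketch of what I end up doing by hand today. It is not anything Mesos provides. It assumes the glog-style line prefix shown in the excerpt above, and the names (parse_line, collect_for_task, LogEntry) are made up for illustration. It takes log lines from one slave, finds the container IDs mentioned next to a given task ID, and keeps only the lines that mention the task or one of those containers.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# glog-style prefix as it appears in the excerpt above (assumed layout):
# <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)

# Container IDs in the excerpt look like UUIDs; that is an assumption here.
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


class LogEntry(NamedTuple):
    severity: str
    timestamp: str
    source: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the assumed prefix."""
    m = GLOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["severity"], f"{m['date']} {m['time']}", m["source"], m["message"])


def collect_for_task(lines: Iterable[str], task_id: str) -> List[LogEntry]:
    """Return, in original order, the entries that mention the task or a related container."""
    entries = [e for e in map(parse_line, lines) if e is not None]

    # Pass 1: any UUID that shows up on a line naming the task is treated as related.
    related = {task_id}
    for entry in entries:
        if task_id in entry.message:
            related.update(UUID.findall(entry.message))

    # Pass 2: keep every line that mentions the task ID or a related ID.
    # Plain substring matching is crude: a task ID that is a prefix of another
    # task ID would also match the longer one.
    return [e for e in entries if any(i in e.message for i in related)]
```

Fed the three lines above plus the rest of the slave log, collect_for_task(lines, task_id) would return just those lines for that task. It still only covers a single slave's log, so the master, executor, and framework logs would have to be fetched and merged separately, which is the part I'd like Mesos to help with.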


was (Author: stevenschlansker):
Well, it's not quite as urgent as I thought.  But there's still a lot of information that is hidden in log files and is very hard to correlate.  For example, I had a task die with 
{code}
I0106 20:08:04.998108  1625 docker.cpp:928] Starting container '78065406-449e-4103-85c1-bbfab09d7372' for task 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' (and executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a') of framework 'Singularity'
E0106 20:08:05.221181  1624 slave.cpp:2787] Container '78065406-449e-4103-85c1-bbfab09d7372' for executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed to start: Port [4111] not included in resources
E0106 20:08:05.277864  1622 slave.cpp:2882] Termination of executor 'ci-http-redirector-bhardy.2015.01.06T20.08.03-1420574884931-1-10.70.6.233-us_west_2a' of framework 'Singularity' failed: Unknown container: 78065406-449e-4103-85c1-bbfab09d7372
{code}
but the "message" field only has "Abnormal executor termination"

Whenever something like this happens, application developers come to me -- they don't have any way to see the Mesos slave logs (no login permissions in general).  You can find the Mesos slave logs through the UI, but you have to do a lot of correlation yourself -- you have to find the right slave, dig through the messages, etc.

If all of the relevant logs to one task were collected in one place, this would be much easier.  Makes sense?

> All log messages from master, slave, executor, etc. should be collected on a per-task basis
> -------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1949
>                 URL: https://issues.apache.org/jira/browse/MESOS-1949
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>    Affects Versions: 0.20.1
>            Reporter: Steven Schlansker
>
> Currently, throughout a task's lifecycle, debugging information is created at different layers of the Mesos ecosystem.  The framework logs task information, the master deals with resource allocation, the slave actually allocates those resources, and the executor does the work of launching the task.
> If anything in that pipeline fails, the end user is left with little but a "TASK_FAILED" or "TASK_LOST" -- the actually useful information (for example, "Docker pull failed because repository didn't exist") is hidden in one of four or five different places, potentially spread across as many different machines.  This leads to unpleasant and repetitive searching through logs for a clue to what went wrong.
> Collating logs on a per-task basis would give the end user a much friendlier way of figuring out exactly where in this process something went wrong, and would likely lead to much faster resolution.
