Posted to issues@mesos.apache.org by "Cody Maloney (JIRA)" <ji...@apache.org> on 2015/12/21 23:06:46 UTC
[jira] [Created] (MESOS-4233) Logging is too verbose for sysadmins / syslog

Cody Maloney created MESOS-4233:
-----------------------------------

             Summary: Logging is too verbose for sysadmins / syslog
                 Key: MESOS-4233
                 URL: https://issues.apache.org/jira/browse/MESOS-4233
             Project: Mesos
          Issue Type: Epic
            Reporter: Cody Maloney


Currently mesos logs a lot. Launching a thousand tasks in the space of 10 seconds produces tens of thousands of log lines. That overwhelms syslog (there is a maximum rate at which a process can send data over a unix socket) and gives no useful information to a sysadmin who cares only about the high-level activity and about when something goes wrong.

Note that mesos also blocks when writing to its log destinations. When it writes a lot of log messages it can fill up the write buffer in the kernel and be suspended until the syslog agent catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big mutex around logging, so only one thing logs at a time.

While for "internal debugging" it is useful to see things like "message went from internal component x to internal component y", from a sysadmin perspective I only care about the high-level actions taken: launched task for framework x, sent offer to framework y, got task failed from host z. Those are what I'd expect at the "INFO" level. At the "WARNING" level I'd expect very little to be logged, almost nothing in normal operation: just things like "WARN: Replicated log write took longer than expected". WARN would also get things like backtraces on crashes and abnormal exits / aborts.

When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to use syslog to monitor basic events in their system. This is too much.

We can keep logging the messages to files, but the logging to stderr needs to be reduced significantly (stderr gets picked up and forwarded to syslog / central aggregation).

What I would like is to be able to set the stderr logging level independently of the file logging level (syslog giving the "sysadmin" aggregated overview, files being useful for debugging in depth what happened in a cluster). A lot of what mesos currently logs at info is really debugging information and should show up at the debug log level.
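To illustrate the request for independent levels, here is a minimal sketch using Python's standard `logging` module. It is not mesos or GLOG code: `build_logger`, `demo`, the in-memory streams standing in for stderr and the log file, and the default thresholds are all assumptions made for the example.

```python
import io
import logging


def build_logger(name, stderr_stream, file_stream,
                 stderr_level=logging.WARNING, file_level=logging.DEBUG):
    """Return a logger with two sinks that filter at independent levels."""
    logger = logging.getLogger(name)
    # The logger itself must pass everything either sink may want;
    # each handler then applies its own threshold.
    logger.setLevel(min(stderr_level, file_level))
    logger.propagate = False
    logger.handlers.clear()

    formatter = logging.Formatter("%(levelname)s %(message)s")
    for stream, level in ((stderr_stream, stderr_level),
                          (file_stream, file_level)):
        handler = logging.StreamHandler(stream)
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def demo():
    syslog_view = io.StringIO()  # stands in for stderr forwarded to syslog
    debug_view = io.StringIO()   # stands in for the on-disk log file
    log = build_logger("sketch", syslog_view, debug_view)
    log.debug("message went from component x to component y")
    log.info("launched task for framework x")
    log.warning("replicated log write took longer than expected")
    return syslog_view.getvalue(), debug_view.getvalue()
```

With these assumed defaults the "syslog" stream receives only the warning, while the "file" stream receives all three messages. Passing `stderr_level=logging.INFO` would give the sysadmin view the high-level actions as well.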

Some samples of mesos logging a lot more than a sysadmin would want / expect are attached, and some are below:

Every task gets printed multiple times for a basic launch:
{noformat}
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382644  1315 master.cpp:3248] Launching task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 5178f46d-71d6-422f-922c-5bbe82dff9cc-0000 (marathon)
Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382925  1315 master.hpp:176] Adding task envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; mem(*):16; ports(*):[14047-14047]
{noformat}

Every task status update prints many log lines. Successful updates are part of normal operation and could be logged at the info / debug levels, but should not be shown to a sysadmin. (Just show when things fail, and maybe aggregate counters to give a sense of the volume of work being done.)
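One way the "show failures, aggregate the rest" idea could look, as a rough Python sketch. The class `StatusUpdateSummarizer`, its `emit` callback, and the choice of which states count as failures are hypothetical and only for illustration; nothing like this exists in mesos as far as this issue states.

```python
from collections import Counter


class StatusUpdateSummarizer:
    """Log failures individually; fold everything else into counters."""

    def __init__(self, emit, failure_states=("TASK_FAILED", "TASK_LOST")):
        self.emit = emit                        # callable taking one log line
        self.failure_states = set(failure_states)
        self.counts = Counter()

    def record(self, task_id, state):
        if state in self.failure_states:
            # Failures are what a sysadmin wants to see right away.
            self.emit(f"task {task_id} reported {state}")
        else:
            # Normal operation: only count it.
            self.counts[state] += 1

    def flush(self):
        """Emit one summary line for the interval, then reset the counters."""
        if self.counts:
            summary = ", ".join(
                f"{state}={n}" for state, n in sorted(self.counts.items()))
            self.emit(f"status updates since last summary: {summary}")
            self.counts.clear()
```

Calling `flush()` on a timer (say, once every few seconds) would turn a thousand successful updates into a single line, while each failure still gets its own line.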

No log message should be really big / more than 1k characters. (A limit would prevent the giant port list attached, and would make such cases easy to discover, file bugs for, and fix.)
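A size cap of this kind can be sketched in a few lines of Python. The names `MAX_LOG_CHARS`, `TRUNCATION_MARKER`, and `cap_message` are made up for the example, and reading "1k" as 1024 characters is an assumption.

```python
MAX_LOG_CHARS = 1024              # assumption: "1k characters" read as 1024
TRUNCATION_MARKER = "... [truncated]"


def cap_message(message, limit=MAX_LOG_CHARS):
    """Return message unchanged if it fits, else cut it to exactly `limit` chars."""
    if limit < len(TRUNCATION_MARKER):
        raise ValueError("limit is too small to hold the truncation marker")
    if len(message) <= limit:
        return message
    # Reserve room for the marker so the result never exceeds the limit.
    return message[:limit - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER
```

The visible marker is the point of the design: an oversized message is still cut short, but it leaves a trace that can be searched for and turned into a bug report.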


