You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Steven Schlansker (JIRA)" <ji...@apache.org> on 2017/02/08 19:17:41 UTC

[jira] [Updated] (MESOS-7085) Consider reducing processing of DECLINE calls log from info to debug

     [ https://issues.apache.org/jira/browse/MESOS-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Schlansker updated MESOS-7085:
-------------------------------------
    Description: 
The Mesos master gets resource decline messages as a normal matter of course.

It repeatedly logs the offers declined from schedulers.  This is critical diagnostics information, but unless your scheduler is broken or buggy, usually uninteresting.

In our production environment this ended up being a significant fraction of all logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by by these ( + we are also outputting this to syslog ) :

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770

It seems that this line looks scary ("DECLINE" is a scary word to an operator), is a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos print it out in a time based manner?  ("654 offers declined in last 1 minute")

  was:
The Mesos master gets resource decline messages as a normal matter of course.

It repeatedly logs the offers declined from schedulers.  This is critical diagnostics information, but unless your scheduler is broken or buggy, usually uninteresting.

In our production environment this ended up being a significant fraction of all logging.  One of our operators got paged:


> Checking to see what I can delete.
> 90% of the 1.6GB mesos log file is taken up by by these ( + we are also outputting this to syslog ) :

> I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844

➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
5846770

It seems that this line looks scary ("DECLINE") is a scary word to an operator, is a huge percentage of log output, and is part of normal operation.
Should it be reduced to DEBUG?  Or could Mesos print it out in a time based manner?  ("654 offers declined in last 1 minute")


> Consider reducing processing of DECLINE calls log from info to debug
> --------------------------------------------------------------------
>
>                 Key: MESOS-7085
>                 URL: https://issues.apache.org/jira/browse/MESOS-7085
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.0.1
>            Reporter: Steven Schlansker
>
> The Mesos master gets resource decline messages as a normal matter of course.
> It repeatedly logs the offers declined from schedulers.  This is critical diagnostics information, but unless your scheduler is broken or buggy, usually uninteresting.
> In our production environment this ended up being a significant fraction of all logging.  One of our operators got paged:
> > Checking to see what I can delete.
> > 90% of the 1.6GB mesos log file is taken up by by these ( + we are also outputting this to syslog ) :
> > I0208 15:54:41.032714 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488245 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> > I0208 15:54:41.032871 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488246 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> > I0208 15:54:41.033025 10833 master.cpp:3951] Processing DECLINE call for offers: [ 68809dc9-6d79-467c-a20b-b3b7d50dc415-O12488247 ] for framework Singularity (Singularity) at scheduler-c355fd25-4a89-40e1-9128-6f452518f038@10.20.16.235:38844
> ➢  wc -l mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
> 6796024 mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
> ➢ grep -c DECLINE mesos-master.mesos3-prod-sc.invalid-user.log.INFO.20170130-014425.10812
> 5846770
> It seems that this line looks scary ("DECLINE" is a scary word to an operator), is a huge percentage of log output, and is part of normal operation.
> Should it be reduced to DEBUG?  Or could Mesos print it out in a time based manner?  ("654 offers declined in last 1 minute")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)