Posted to user@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2016/06/06 23:59:48 UTC

Re: Consequences of health-check timeouts?

I'll make sure this gets fixed for 1.0. Apologies for the pain; it looks
like there is a significant amount of debt in the Docker containerizer /
executor.

On Wed, May 18, 2016 at 10:51 AM, Steven Schlansker <
sschlansker@opentable.com> wrote:

>
> > On May 18, 2016, at 10:44 AM, haosdent <ha...@gmail.com> wrote:
> >
> > >In re executor_shutdown_grace_period: how would this enable the task
> (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
> STDOUT as captured by Mesos shows that it received signal 15 just before it
> said good-bye). My naive understanding of this grace period is that it
> simply delays the termination of the executor.
>
> I'm not 100% sure this is related or helpful, but be aware that we believe
> there is a bug in the Docker
> containerizer's handling of logs during shutdown:
>
> https://issues.apache.org/jira/browse/MESOS-5195
>
> We spent a lot of time debugging why our application was not shutting down
> as we expected,
> only to find that the real problem was that Mesos was losing all logs sent
> during
> shutdown.
>
> >
> > If you use the DockerContainerizer, Mesos uses executor_shutdown_grace_period
> as the graceful shutdown timeout for the task as well. If you use the
> MesosContainerizer, it sends SIGTERM (15) first; if the task is still alive
> after 3 seconds, Mesos sends SIGKILL (9) to the task.
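
A minimal sketch of a container entrypoint that forwards SIGTERM to mongod so
it can flush and exit cleanly before the escalation to SIGKILL; the mongod
path and flags here are illustrative assumptions, not taken from this setup:

    #!/bin/sh
    # Start mongod in the background (dbpath is an assumed example).
    mongod --dbpath /data/db &
    MONGO_PID=$!

    # On SIGTERM (what the containerizer sends first), pass it on to
    # mongod and wait for a clean shutdown instead of letting the shell
    # exit and mongod be SIGKILLed mid-write.
    term_handler() {
        kill -TERM "$MONGO_PID"
        wait "$MONGO_PID"
        exit 0
    }
    trap term_handler TERM INT

    # Block until mongod exits or a signal arrives.
    wait "$MONGO_PID"

With the DockerContainerizer, the grace period discussed above must be long
enough for this handler to finish.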
> >
> > >I'm not sure what the java task is. This took place on the mesos-master
> node and none of our applications runs there. It runs master, Marathon, and
> ZK. Maybe the java task is Marathon or ZK?
> >
> > Not sure about this; maybe others have had a similar experience. Were
> Marathon or ZooKeeper behaving abnormally at that time? Could you also provide
> the mesos-master/mesos-slave logs from when the incident happened?
> >
> >
> > On Wed, May 18, 2016 at 7:11 PM, Paul Bell <ar...@gmail.com> wrote:
> > Hi Haosdent,
> >
> > Thanks for your reply.
> >
> > In re executor_shutdown_grace_period: how would this enable the task
> (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
> STDOUT as captured by Mesos shows that it received signal 15 just before it
> said good-bye). My naive understanding of this grace period is that it
> simply delays the termination of the executor.
> >
> > The following snippet is from /var/log/syslog. I believe it shows the
> stack trace (largely in the kernel) that led to mesos-master being blocked
> for more than 120 seconds. Please note that immediately above (before) the
> blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the
> blocked mesos-master is a blocked java task. I'm not sure what the java
> task is. This took place on the mesos-master node and none of our
> applications runs there. It runs master, Marathon, and ZK. Maybe the java
> task is Marathon or ZK?
> >
> > Thanks again.
> >
> > -Paul
> > May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds.
> > May 16 20:06:53 71 kernel: [193339.890873]       Not tainted 3.13.0-32-generic #57-Ubuntu
> > May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D ffff88013fd94440     0  4013      1 0x00000000
> > May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28 0000000000000002 ffff880135778000 ffff880137429fd8
> > May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440 0000000000014440 ffff880135778000 ffff88013fd94cd8
> > May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0 0000000000000002 ffffffff81284630 ffff880137429aa0
> > May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
> > May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> > May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>] io_schedule+0x9d/0x140
> > May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>] sleep_on_shadow_bh+0xe/0x20
> > May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>] __wait_on_bit+0x62/0x90
> > May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> > May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>] out_of_line_wait_on_bit+0x77/0x90
> > May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ? autoremove_wake_function+0x40/0x40
> > May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ? wake_up_bit+0x25/0x30
> > May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>] do_get_write_access+0x2ad/0x4f0
> > May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ? __getblk+0x2d/0x2e0
> > May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>] jbd2_journal_get_write_access+0x27/0x40
> > May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>] __ext4_journal_get_write_access+0x3b/0x80
> > May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>] ext4_reserve_inode_write+0x70/0xa0
> > May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ? ext4_dirty_inode+0x40/0x60
> > May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>] ext4_mark_inode_dirty+0x44/0x1f0
> > May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>] ext4_dirty_inode+0x40/0x60
> > May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>] __mark_inode_dirty+0x10a/0x2d0
> > May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>] update_time+0x81/0xd0
> > May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>] file_update_time+0x80/0xd0
> > May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>] __generic_file_aio_write+0x180/0x3d0
> > May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>] generic_file_aio_write+0x58/0xa0
> > May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>] ext4_file_write+0x99/0x400
> > May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ? wake_up_state+0x10/0x20
> > May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ? wake_futex+0x66/0x90
> > May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ? futex_wake+0x1b1/0x1d0
> > May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>] do_sync_write+0x5a/0x90
> > May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>] vfs_write+0xb4/0x1f0
> > May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>] SyS_write+0x49/0xa0
> > May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>] tracesys+0xe1/0xe6
> >
> >
> >
> >
> > On Wed, May 18, 2016 at 2:33 AM, haosdent <ha...@gmail.com> wrote:
> > >Is there some way to be given control (a callback, or an "exit"
> routine) so that the container about to be nuked can be given a chance to
> exit gracefully?
> > The default value of executor_shutdown_grace_period is 5 seconds; you
> can change it by specifying the `--executor_shutdown_grace_period` flag when
> launching the Mesos agent.
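
For example, raising it when starting the agent could look like this sketch
(the master address and containerizer list are placeholders for whatever the
deployment actually uses):

    # Default is 5secs; give executors more time to stop tasks cleanly.
    mesos-slave --master=zk://<zk-host>:2181/mesos \
                --containerizers=docker,mesos \
                --executor_shutdown_grace_period=60secs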
> >
> > >Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> > >mesos-slaves get shut down
> > Do you know where your mesos-master was stuck when this happened? Is there
> any error log or other related log about it? In addition, is there any log
> from when the mesos-slave shut down?
> >
> > On Wed, May 18, 2016 at 6:12 AM, Paul Bell <ar...@gmail.com> wrote:
> > Hi All,
> >
> > I probably have the following account partly wrong, but let me present
> it just the same and those who know better can correct me as needed.
> >
> > I've an application that runs several MongoDB shards, each a Dockerized
> container, each on a distinct node (VM); in fact, some of the VMs are on
> separate ESXi hosts.
> >
> > I've lately seen situations where, because of very slow disks for the
> database, the following sequence occurs (I think):
> >       • Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
> defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both;
> see the snippet after this list)
> >       • Synchronous flushing of many, many pages occurs, writing to a
> slow disk
> >       • (Around this time one might see in /var/log/syslog "task X
> blocked for more than 120 seconds" for all kinds of tasks, including
> mesos-master)
> >       • mesos-slaves get shut down (this is the part I'm unclear about,
> but I am quite certain that on 2 nodes the executors and their in-flight
> MongoDB tasks got zapped, because I can see that Marathon restarted them).
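
For step 1 above, the writeback pressure can be watched building on the node
with standard Linux interfaces (nothing Mesos-specific):

    # Current writeback thresholds (percent of reclaimable memory)
    sysctl vm.dirty_background_ratio vm.dirty_ratio

    # Dirty/Writeback pages waiting to hit disk; large, growing values
    # here usually precede the "blocked for more than 120 seconds" messages
    grep -E 'Dirty|Writeback' /proc/meminfo

    # The hung-task threshold the kernel message refers to (120s by default)
    cat /proc/sys/kernel/hung_task_timeout_secs
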
> > The consequences of this are a corrupt MongoDB database. In the case at
> hand, the job had run for over 50 hours, processing close to 120 million
> files.
> >
> > Steps I've taken so far to remedy include:
> >       • tune vm.dirty_background_ratio and vm.dirty_ratio down,
> respectively, to 5 and 10 (from 10 and 20). The intent here is to trigger
> more frequent, smaller flushes and thus avoid the less frequent, massive
> flushes that suspend threads for very long periods.
> >       • increase the agent ping timeout to 10 minutes (ping every 30
> seconds, 20 times; see the sketch after this list)
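
Concretely, those two steps look roughly like the sketch below. The master
flag names are my reading of "agent ping timeout" (they control how many
pings the master can miss before declaring an agent lost), so verify them
against the Mesos version in use:

    # Flush dirty pages earlier and in smaller batches.
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10

    # Let the master tolerate ~10 minutes of agent unresponsiveness
    # (ping every 30 seconds, up to 20 missed pings) before it marks
    # the agent lost and its tasks are rescheduled.
    mesos-master --zk=zk://<zk-host>:2181/mesos --quorum=1 \
                 --slave_ping_timeout=30secs \
                 --max_slave_ping_timeouts=20
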
> > So the questions are:
> >       • Is there some way to be given control (a callback, or an "exit"
> routine) so that the container about to be nuked can be given a chance to
> exit gracefully?
> >       • Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> >       • (Also, I'd be grateful for more clarity on anything in steps 1-4
> above that is a bit hand-wavy!)
> > As always, thanks.
> >
> > -Paul
> >
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
>