Posted to user@mesos.apache.org by Paul Bell <ar...@gmail.com> on 2016/05/17 22:12:25 UTC

Consequences of health-check timeouts?

Hi All,

I probably have the following account partly wrong, but let me present it
just the same and those who know better can correct me as needed.

I've an application that runs several MongoDB shards, each a Dockerized
container, each on a distinct node (VM); in fact, some of the VMs are on
separate ESXi hosts.

I've lately seen situations where, because of very slow disks for the
database, the following sequence occurs (I think):

   1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits the thresholds
   defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both;
   a way to watch this is sketched just after this list)
   2. Synchronous flushing of many, many pages occurs, writing to a slow
   disk
   3. (Around this time one might see in /var/log/syslog "task X blocked
   for more than 120 seconds" for all kinds of tasks, including mesos-master)
   4. mesos-slaves get shut down (this is the part I'm unclear about, but I
   am quite certain that on 2 nodes the executors and their in-flight MongoDB
   tasks got zapped, because I can see that Marathon restarted them).
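
For anyone who wants to confirm steps 1 and 2 while a heavy job is running, a
minimal diagnostic sketch (standard Ubuntu paths, nothing Mesos-specific) is
to watch the kernel's dirty-page counters against the configured thresholds:

    # Current thresholds (percent of reclaimable memory).
    sysctl vm.dirty_background_ratio vm.dirty_ratio

    # Watch how much dirty data is waiting for writeback; a large and
    # fast-growing Dirty value on a slow disk is what precedes the stalls.
    watch -n 5 "grep -E 'Dirty|Writeback' /proc/meminfo"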

The consequence of this is a corrupt MongoDB database. In the case at
hand, the job had run for over 50 hours, processing close to 120 million
files.

Steps I've taken so far to remedy include:

   - tune vm.dirty_background_ratio and vm.dirty_ratio down to 5 and 10,
   respectively (from 10 and 20). The intent is to trigger more frequent,
   smaller flushes and thus avoid the infrequent, massive flushes that
   suspend threads for very long periods.
   - increase the agent ping timeout to 10 minutes (every 30 seconds, 20
   times); a sketch of both changes follows this list.
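
Concretely, the two changes above amount to something like the following (a
sketch only: the ping-timeout flags are the standard mesos-master ones, but
the rest of the master command line and the sysctl file location are
assumptions about a typical Ubuntu install):

    # Persist the lower writeback thresholds and apply them now.
    echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf
    echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf
    sysctl -p

    # Tolerate roughly 10 minutes of agent unresponsiveness before the
    # master removes the agent: 20 pings at 30-second intervals.
    mesos-master --slave_ping_timeout=30secs --max_slave_ping_timeouts=20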

So the questions are:

   - Is there some way to be given control (a callback, or an "exit"
   routine) so that the container about to be nuked can be given a chance to
   exit gracefully?
   - Are there other steps I can take to avoid this mildly calamitous
   occurrence?
   - (Also, I'd be grateful for more clarity on anything in steps 1-4 above
   that is a bit hand-wavy!)

As always, thanks.

-Paul

Re: Consequences of health-check timeouts?

Posted by Benjamin Mahler <bm...@apache.org>.
I'll make sure this gets fixed for 1.0. Apologies for the pain; it looks
like there is a significant amount of debt in the Docker containerizer /
executor.

On Wed, May 18, 2016 at 10:51 AM, Steven Schlansker <
sschlansker@opentable.com> wrote:

>
> > On May 18, 2016, at 10:44 AM, haosdent <ha...@gmail.com> wrote:
> >
> > >In re executor_shutdown_grace_period: how would this enable the task
> (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
> STDOUT as captured by Mesos shows that it received signal 15 just before it
> said good-bye). My naive understanding of this grace period is that it
> simply delays the termination of the executor.
>
> I'm not 100% sure this is related or helpful, but be aware that we believe
> there is a bug in the Docker
> containerizer's handling of logs during shutdown:
>
> https://issues.apache.org/jira/browse/MESOS-5195
>
> We spent a lot of time debugging why our application was not shutting down
> as we expected,
> only to find that the real problem was that Mesos was losing all logs sent
> during
> shutdown.
>
> >
> > If you use DockerContainerizer, mesos use executor_shutdown_grace_period
> as the shutdown gracefully timeout for task as well. If you use
> MesosContainerizer, it would send SIGTERM(15) first. After 3 seconds, if
> the task is still alive, Mesos would send SIGKILL(9) to the task again.
> >
> > >I'm not sure what the java task is. This took place on the mesos-master
> node and none of our applications runs there. It runs master, Marathon, and
> ZK. Maybe the java task is Marathon or ZK?
> >
> > Not sure about this, maybe others have similar experience on this, do
> Marathon or Zookeeper abnormal at that time? Could you provide the log of
> mesos-master/mesos-slave when accident happened as well?
> >
> >
> > On Wed, May 18, 2016 at 7:11 PM, Paul Bell <ar...@gmail.com> wrote:
> > Hi Hasodent,
> >
> > Thanks for your reply.
> >
> > In re executor_shutdown_grace_period: how would this enable the task
> (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
> STDOUT as captured by Mesos shows that it received signal 15 just before it
> said good-bye). My naive understanding of this grace period is that it
> simply delays the termination of the executor.
> >
> > The following snippet is rom /var/log/syslog. I believe it shows the
> stack trace (largely in the kernel) that led to mesos-master being blocked
> for more than 120 seconds. Please note that immediately above (before) the
> blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the
> blocked mesos-master is a blocked java task. I'm not sure what the java
> task is. This took place on the mesos-master node and none of our
> applications runs there. It runs master, Marathon, and ZK. Maybe the java
> task is Marathon or ZK?
> >
> > Thanks again.
> >
> > -Paul
> > May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013
> blocked for more than 120 seconds.
> >
> > May 16 20:06:53 71 kernel: [193339.890873]       Not tainted
> 3.13.0-32-generic #57-Ubuntu
> >
> > May 16 20:06:53 71 kernel: [193339.890889] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >
> > May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D
> ffff88013fd94440     0  4013      1 0x00000000
> >
> > May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28
> 0000000000000002 ffff880135778000 ffff880137429fd8
> >
> > May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440
> 0000000000014440 ffff880135778000 ffff88013fd94cd8
> >
> > May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0
> 0000000000000002 ffffffff81284630 ffff880137429aa0
> >
> > May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
> >
> > May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ?
> start_this_handle+0x590/0x590
> >
> > May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>]
> io_schedule+0x9d/0x140
> >
> > May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>]
> sleep_on_shadow_bh+0xe/0x20
> >
> > May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>]
> __wait_on_bit+0x62/0x90
> >
> > May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ?
> start_this_handle+0x590/0x590
> >
> > May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>]
> out_of_line_wait_on_bit+0x77/0x90
> >
> > May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ?
> autoremove_wake_function+0x40/0x40
> >
> > May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ?
> wake_up_bit+0x25/0x30
> >
> > May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>]
> do_get_write_access+0x2ad/0x4f0
> >
> > May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ?
> __getblk+0x2d/0x2e0
> >
> > May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>]
> jbd2_journal_get_write_access+0x27/0x40
> >
> > May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>]
> __ext4_journal_get_write_access+0x3b/0x80
> >
> > May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>]
> ext4_reserve_inode_write+0x70/0xa0
> >
> > May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ?
> ext4_dirty_inode+0x40/0x60
> >
> > May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>]
> ext4_mark_inode_dirty+0x44/0x1f0
> >
> > May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>]
> ext4_dirty_inode+0x40/0x60
> >
> > May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>]
> __mark_inode_dirty+0x10a/0x2d0
> >
> > May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>]
> update_time+0x81/0xd0
> >
> > May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>]
> file_update_time+0x80/0xd0
> >
> > May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>]
> __generic_file_aio_write+0x180/0x3d0
> >
> > May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>]
> generic_file_aio_write+0x58/0xa0
> >
> > May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>]
> ext4_file_write+0x99/0x400
> >
> > May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ?
> wake_up_state+0x10/0x20
> >
> > May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ?
> wake_futex+0x66/0x90
> >
> > May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ?
> futex_wake+0x1b1/0x1d0
> >
> > May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>]
> do_sync_write+0x5a/0x90
> >
> > May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>]
> vfs_write+0xb4/0x1f0
> >
> > May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>]
> SyS_write+0x49/0xa0
> >
> > May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>]
> tracesys+0xe1/0xe6
> >
> >
> >
> >
> > On Wed, May 18, 2016 at 2:33 AM, haosdent <ha...@gmail.com> wrote:
> > >Is there some way to be given control (a callback, or an "exit"
> routine) so that the container about to be nuked can be given a chance to
> exit gracefully?
> > The default value of executor_shutdown_grace_period is 5 seconds, you
> could change it by specify the `--executor_shutdown_grace_period` flag when
> launch mesos agent.
> >
> > >Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> > >mesos-slaves get shutdown
> > Do you know where your mesos-master stuck when it happens? Any error log
> or related log about this? In addition, is there any log when mesos-slave
> shut down?
> >
> > On Wed, May 18, 2016 at 6:12 AM, Paul Bell <ar...@gmail.com> wrote:
> > Hi All,
> >
> > I probably have the following account partly wrong, but let me present
> it just the same and those who know better can correct me as needed.
> >
> > I've an application that runs several MongoDB shards, each a Dockerized
> container, each on a distinct node (VM); in fact, some of the VMs are on
> separate ESXi hosts.
> >
> > I've lately seen situations where, because of very slow disks for the
> database, the following sequence occurs (I think):
> >       • Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
> defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
> >       • Synchronous flushing of many, many pages occurs, writing to a
> slow disk
> >       • (Around this time one might see in /var/log/syslog "task X
> blocked for more than 120 seconds" for all kinds of tasks, including
> mesos-master)
> >       • mesos-slaves get shutdown (this is the part I'm unclear about;
> but I am quite certain that on 2 nodes the executors and their in-flight
> MongoDB tasks got zapped because I can see that Marathon restarted them).
> > The consequences of this are a corrupt MongoDB database. In the case at
> hand, the job had run for over 50 hours, processing close to 120 million
> files.
> >
> > Steps I've taken so far to remedy include:
> >       • tune vm.dirty_background_ratio and vm.dirty_ratio down,
> respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate
> more frequent, smaller flushes and thus avoid less frequent massive flushes
> that suspend threads for very long periods.
> >       • increase agent ping timeout to 10 minutes (every 30 seconds, 20
> times)
> > So the questions are:
> >       • Is there some way to be given control (a callback, or an "exit"
> routine) so that the container about to be nuked can be given a chance to
> exit gracefully?
> >       • Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> >       • (Also, I'd be grateful for more clarity on anything in steps 1-4
> above that is a bit hand-wavy!)
> > As always, thanks.
> >
> > -Paul
> >
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
>

Re: Consequences of health-check timeouts?

Posted by Steven Schlansker <ss...@opentable.com>.
> On May 18, 2016, at 10:44 AM, haosdent <ha...@gmail.com> wrote:
> 
> >In re executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye). My naive understanding of this grace period is that it simply delays the termination of the executor.

I'm not 100% sure this is related or helpful, but be aware that we believe there is a bug in the Docker
containerizer's handling of logs during shutdown:

https://issues.apache.org/jira/browse/MESOS-5195

We spent a lot of time debugging why our application was not shutting down as we expected,
only to find that the real problem was that Mesos was losing all logs sent during
shutdown.

> 
> If you use DockerContainerizer, mesos use executor_shutdown_grace_period as the shutdown gracefully timeout for task as well. If you use MesosContainerizer, it would send SIGTERM(15) first. After 3 seconds, if the task is still alive, Mesos would send SIGKILL(9) to the task again.
> 
> >I'm not sure what the java task is. This took place on the mesos-master node and none of our applications runs there. It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK?
> 
> Not sure about this, maybe others have similar experience on this, do Marathon or Zookeeper abnormal at that time? Could you provide the log of mesos-master/mesos-slave when accident happened as well?
> 
> 
> On Wed, May 18, 2016 at 7:11 PM, Paul Bell <ar...@gmail.com> wrote:
> Hi Hasodent,
> 
> Thanks for your reply.
> 
> In re executor_shutdown_grace_period: how would this enable the task (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo STDOUT as captured by Mesos shows that it received signal 15 just before it said good-bye). My naive understanding of this grace period is that it simply delays the termination of the executor.
> 
> The following snippet is rom /var/log/syslog. I believe it shows the stack trace (largely in the kernel) that led to mesos-master being blocked for more than 120 seconds. Please note that immediately above (before) the blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the blocked mesos-master is a blocked java task. I'm not sure what the java task is. This took place on the mesos-master node and none of our applications runs there. It runs master, Marathon, and ZK. Maybe the java task is Marathon or ZK?
> 
> Thanks again.
> 
> -Paul
> May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013 blocked for more than 120 seconds.
> 
> May 16 20:06:53 71 kernel: [193339.890873]       Not tainted 3.13.0-32-generic #57-Ubuntu
> 
> May 16 20:06:53 71 kernel: [193339.890889] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D ffff88013fd94440     0  4013      1 0x00000000
> 
> May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28 0000000000000002 ffff880135778000 ffff880137429fd8
> 
> May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440 0000000000014440 ffff880135778000 ffff88013fd94cd8
> 
> May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0 0000000000000002 ffffffff81284630 ffff880137429aa0
> 
> May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
> 
> May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> 
> May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>] io_schedule+0x9d/0x140
> 
> May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>] sleep_on_shadow_bh+0xe/0x20
> 
> May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>] __wait_on_bit+0x62/0x90
> 
> May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ? start_this_handle+0x590/0x590
> 
> May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>] out_of_line_wait_on_bit+0x77/0x90
> 
> May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ? autoremove_wake_function+0x40/0x40
> 
> May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ? wake_up_bit+0x25/0x30
> 
> May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>] do_get_write_access+0x2ad/0x4f0
> 
> May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ? __getblk+0x2d/0x2e0
> 
> May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>] jbd2_journal_get_write_access+0x27/0x40
> 
> May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>] __ext4_journal_get_write_access+0x3b/0x80
> 
> May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>] ext4_reserve_inode_write+0x70/0xa0
> 
> May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ? ext4_dirty_inode+0x40/0x60
> 
> May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>] ext4_mark_inode_dirty+0x44/0x1f0
> 
> May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>] ext4_dirty_inode+0x40/0x60
> 
> May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>] __mark_inode_dirty+0x10a/0x2d0
> 
> May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>] update_time+0x81/0xd0
> 
> May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>] file_update_time+0x80/0xd0
> 
> May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>] __generic_file_aio_write+0x180/0x3d0
> 
> May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>] generic_file_aio_write+0x58/0xa0
> 
> May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>] ext4_file_write+0x99/0x400
> 
> May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ? wake_up_state+0x10/0x20
> 
> May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ? wake_futex+0x66/0x90
> 
> May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ? futex_wake+0x1b1/0x1d0
> 
> May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>] do_sync_write+0x5a/0x90
> 
> May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>] vfs_write+0xb4/0x1f0
> 
> May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>] SyS_write+0x49/0xa0
> 
> May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>] tracesys+0xe1/0xe6
> 
> 
> 
> 
> On Wed, May 18, 2016 at 2:33 AM, haosdent <ha...@gmail.com> wrote:
> >Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully?
> The default value of executor_shutdown_grace_period is 5 seconds, you could change it by specify the `--executor_shutdown_grace_period` flag when launch mesos agent.
> 
> >Are there other steps I can take to avoid this mildly calamitous occurrence?
> >mesos-slaves get shutdown
> Do you know where your mesos-master stuck when it happens? Any error log or related log about this? In addition, is there any log when mesos-slave shut down?
> 
> On Wed, May 18, 2016 at 6:12 AM, Paul Bell <ar...@gmail.com> wrote:
> Hi All,
> 
> I probably have the following account partly wrong, but let me present it just the same and those who know better can correct me as needed.
> 
> I've an application that runs several MongoDB shards, each a Dockerized container, each on a distinct node (VM); in fact, some of the VMs are on separate ESXi hosts.
> 
> I've lately seen situations where, because of very slow disks for the database, the following sequence occurs (I think):
> 	• Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
> 	• Synchronous flushing of many, many pages occurs, writing to a slow disk
> 	• (Around this time one might see in /var/log/syslog "task X blocked for more than 120 seconds" for all kinds of tasks, including mesos-master)
> 	• mesos-slaves get shutdown (this is the part I'm unclear about; but I am quite certain that on 2 nodes the executors and their in-flight MongoDB tasks got zapped because I can see that Marathon restarted them).
> The consequences of this are a corrupt MongoDB database. In the case at hand, the job had run for over 50 hours, processing close to 120 million files.
> 
> Steps I've taken so far to remedy include:
> 	• tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent, smaller flushes and thus avoid less frequent massive flushes that suspend threads for very long periods.
> 	• increase agent ping timeout to 10 minutes (every 30 seconds, 20 times)
> So the questions are:
> 	• Is there some way to be given control (a callback, or an "exit" routine) so that the container about to be nuked can be given a chance to exit gracefully?
> 	• Are there other steps I can take to avoid this mildly calamitous occurrence?
> 	• (Also, I'd be grateful for more clarity on anything in steps 1-4 above that is a bit hand-wavy!)
> As always, thanks.
> 
> -Paul
> 
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang
> 
> 
> 
> 
> --
> Best Regards,
> Haosdent Huang


Re: Consequences of health-check timeouts?

Posted by haosdent <ha...@gmail.com>.
>In re executor_shutdown_grace_period: how would this enable the task
(MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
STDOUT as captured by Mesos shows that it received signal 15 just before it
said good-bye). My naive understanding of this grace period is that it
simply delays the termination of the executor.

If you use the DockerContainerizer, Mesos uses executor_shutdown_grace_period
as the graceful shutdown timeout for the task as well. If you use the
MesosContainerizer, it sends SIGTERM (15) first; if the task is still alive
after 3 seconds, Mesos then sends SIGKILL (9) to the task.
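
One way to make use of that grace period is to make sure the signal actually
reaches mongod and that the container stays up until mongod has flushed. A
minimal entrypoint sketch (the dbpath and mongod options here are
placeholders, not taken from this thread):

    #!/bin/sh
    # Start mongod in the background so this script (PID 1) can trap signals.
    mongod --dbpath /data/db &
    MONGOD_PID=$!

    shutdown() {
      # Forward SIGTERM and wait: mongod exits cleanly on TERM, but only if
      # it receives the signal and is given time before SIGKILL arrives.
      kill -TERM "$MONGOD_PID"
      wait "$MONGOD_PID"
      exit $?
    }
    trap shutdown TERM INT

    # Block until mongod exits or a trapped signal arrives.
    wait "$MONGOD_PID"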

>I'm not sure what the java task is. This took place on the mesos-master
node and none of our applications runs there. It runs master, Marathon, and
ZK. Maybe the java task is Marathon or ZK?

Not sure about this; maybe others have had a similar experience. Did Marathon
or ZooKeeper behave abnormally at that time? Could you also provide the
mesos-master/mesos-slave logs from when the incident happened?


On Wed, May 18, 2016 at 7:11 PM, Paul Bell <ar...@gmail.com> wrote:

> Hi Hasodent,
>
> Thanks for your reply.
>
> In re executor_shutdown_grace_period: how would this enable the task
> (MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
> STDOUT as captured by Mesos shows that it received signal 15 just before it
> said good-bye). My naive understanding of this grace period is that it
> simply delays the termination of the executor.
>
> The following snippet is rom /var/log/syslog. I believe it shows the stack
> trace (largely in the kernel) that led to mesos-master being blocked for
> more than 120 seconds. Please note that immediately above (before) the
> blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the
> blocked mesos-master is a blocked java task. I'm not sure what the java
> task is. This took place on the mesos-master node and none of our
> applications runs there. It runs master, Marathon, and ZK. Maybe the java
> task is Marathon or ZK?
>
> Thanks again.
>
> -Paul
>
> May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013
> blocked for more than 120 seconds.
>
> May 16 20:06:53 71 kernel: [193339.890873]       Not tainted
> 3.13.0-32-generic #57-Ubuntu
>
> May 16 20:06:53 71 kernel: [193339.890889] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D
> ffff88013fd94440     0  4013      1 0x00000000
>
> May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28
> 0000000000000002 ffff880135778000 ffff880137429fd8
>
> May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440
> 0000000000014440 ffff880135778000 ffff88013fd94cd8
>
> May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0
> 0000000000000002 ffffffff81284630 ffff880137429aa0
>
> May 16 20:06:53 71 kernel: [193339.890919] Call Trace:
>
> May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ?
> start_this_handle+0x590/0x590
>
> May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>]
> io_schedule+0x9d/0x140
>
> May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>]
> sleep_on_shadow_bh+0xe/0x20
>
> May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>]
> __wait_on_bit+0x62/0x90
>
> May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ?
> start_this_handle+0x590/0x590
>
> May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>]
> out_of_line_wait_on_bit+0x77/0x90
>
> May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ?
> autoremove_wake_function+0x40/0x40
>
> May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ?
> wake_up_bit+0x25/0x30
>
> May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>]
> do_get_write_access+0x2ad/0x4f0
>
> May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ?
> __getblk+0x2d/0x2e0
>
> May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>]
> jbd2_journal_get_write_access+0x27/0x40
>
> May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>]
> __ext4_journal_get_write_access+0x3b/0x80
>
> May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>]
> ext4_reserve_inode_write+0x70/0xa0
>
> May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ?
> ext4_dirty_inode+0x40/0x60
>
> May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>]
> ext4_mark_inode_dirty+0x44/0x1f0
>
> May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>]
> ext4_dirty_inode+0x40/0x60
>
> May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>]
> __mark_inode_dirty+0x10a/0x2d0
>
> May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>]
> update_time+0x81/0xd0
>
> May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>]
> file_update_time+0x80/0xd0
>
> May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>]
> __generic_file_aio_write+0x180/0x3d0
>
> May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>]
> generic_file_aio_write+0x58/0xa0
>
> May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>]
> ext4_file_write+0x99/0x400
>
> May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ?
> wake_up_state+0x10/0x20
>
> May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ?
> wake_futex+0x66/0x90
>
> May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ?
> futex_wake+0x1b1/0x1d0
>
> May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>]
> do_sync_write+0x5a/0x90
>
> May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>]
> vfs_write+0xb4/0x1f0
>
> May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>]
> SyS_write+0x49/0xa0
>
> May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>]
> tracesys+0xe1/0xe6
>
>
>
> On Wed, May 18, 2016 at 2:33 AM, haosdent <ha...@gmail.com> wrote:
>
>> >Is there some way to be given control (a callback, or an "exit" routine)
>> so that the container about to be nuked can be given a chance to exit
>> gracefully?
>> The default value of executor_shutdown_grace_period is 5 seconds, you
>> could change it by specify the `--executor_shutdown_grace_period` flag when
>> launch mesos agent.
>>
>> >Are there other steps I can take to avoid this mildly calamitous
>> occurrence?
>> >mesos-slaves get shutdown
>> Do you know where your mesos-master stuck when it happens? Any error log
>> or related log about this? In addition, is there any log when mesos-slave
>> shut down?
>>
>> On Wed, May 18, 2016 at 6:12 AM, Paul Bell <ar...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I probably have the following account partly wrong, but let me present
>>> it just the same and those who know better can correct me as needed.
>>>
>>> I've an application that runs several MongoDB shards, each a Dockerized
>>> container, each on a distinct node (VM); in fact, some of the VMs are on
>>> separate ESXi hosts.
>>>
>>> I've lately seen situations where, because of very slow disks for the
>>> database, the following sequence occurs (I think):
>>>
>>>    1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
>>>    defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
>>>    2. Synchronous flushing of many, many pages occurs, writing to a
>>>    slow disk
>>>    3. (Around this time one might see in /var/log/syslog "task X
>>>    blocked for more than 120 seconds" for all kinds of tasks, including
>>>    mesos-master)
>>>    4. mesos-slaves get shutdown (this is the part I'm unclear about;
>>>    but I am quite certain that on 2 nodes the executors and their in-flight
>>>    MongoDB tasks got zapped because I can see that Marathon restarted them).
>>>
>>> The consequences of this are a corrupt MongoDB database. In the case at
>>> hand, the job had run for over 50 hours, processing close to 120 million
>>> files.
>>>
>>> Steps I've taken so far to remedy include:
>>>
>>>    - tune vm.dirty_background_ratio and vm.dirty_ratio down,
>>>    respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate
>>>    more frequent, smaller flushes and thus avoid less frequent massive flushes
>>>    that suspend threads for very long periods.
>>>    - increase agent ping timeout to 10 minutes (every 30 seconds, 20
>>>    times)
>>>
>>> So the questions are:
>>>
>>>    - Is there some way to be given control (a callback, or an "exit"
>>>    routine) so that the container about to be nuked can be given a chance to
>>>    exit gracefully?
>>>    - Are there other steps I can take to avoid this mildly calamitous
>>>    occurrence?
>>>    - (Also, I'd be grateful for more clarity on anything in steps 1-4
>>>    above that is a bit hand-wavy!)
>>>
>>> As always, thanks.
>>>
>>> -Paul
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


-- 
Best Regards,
Haosdent Huang

Re: Consequences of health-check timeouts?

Posted by Paul Bell <ar...@gmail.com>.
Hi Haosdent,

Thanks for your reply.

In re executor_shutdown_grace_period: how would this enable the task
(MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
STDOUT as captured by Mesos shows that it received signal 15 just before it
said good-bye). My naive understanding of this grace period is that it
simply delays the termination of the executor.

The following snippet is from /var/log/syslog. I believe it shows the stack
trace (largely in the kernel) that led to mesos-master being blocked for
more than 120 seconds. Please note that immediately above (before) the
blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the
blocked mesos-master is a blocked java task. I'm not sure what the java
task is. This took place on the mesos-master node and none of our
applications runs there. It runs master, Marathon, and ZK. Maybe the java
task is Marathon or ZK?

Thanks again.

-Paul

May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013
blocked for more than 120 seconds.

May 16 20:06:53 71 kernel: [193339.890873]       Not tainted
3.13.0-32-generic #57-Ubuntu

May 16 20:06:53 71 kernel: [193339.890889] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

May 16 20:06:53 71 kernel: [193339.890912] mesos-master    D
ffff88013fd94440     0  4013      1 0x00000000

May 16 20:06:53 71 kernel: [193339.890914]  ffff880137429a28
0000000000000002 ffff880135778000 ffff880137429fd8

May 16 20:06:53 71 kernel: [193339.890916]  0000000000014440
0000000000014440 ffff880135778000 ffff88013fd94cd8

May 16 20:06:53 71 kernel: [193339.890918]  ffff88013ffd34b0
0000000000000002 ffffffff81284630 ffff880137429aa0

May 16 20:06:53 71 kernel: [193339.890919] Call Trace:

May 16 20:06:53 71 kernel: [193339.890922]  [<ffffffff81284630>] ?
start_this_handle+0x590/0x590

May 16 20:06:53 71 kernel: [193339.890924]  [<ffffffff817203fd>]
io_schedule+0x9d/0x140

May 16 20:06:53 71 kernel: [193339.890925]  [<ffffffff8128463e>]
sleep_on_shadow_bh+0xe/0x20

May 16 20:06:53 71 kernel: [193339.890927]  [<ffffffff81720882>]
__wait_on_bit+0x62/0x90

May 16 20:06:53 71 kernel: [193339.890929]  [<ffffffff81284630>] ?
start_this_handle+0x590/0x590

May 16 20:06:53 71 kernel: [193339.890930]  [<ffffffff81720927>]
out_of_line_wait_on_bit+0x77/0x90

May 16 20:06:53 71 kernel: [193339.890932]  [<ffffffff810aaf40>] ?
autoremove_wake_function+0x40/0x40

May 16 20:06:53 71 kernel: [193339.890934]  [<ffffffff810aab15>] ?
wake_up_bit+0x25/0x30

May 16 20:06:53 71 kernel: [193339.890936]  [<ffffffff81285a3d>]
do_get_write_access+0x2ad/0x4f0

May 16 20:06:53 71 kernel: [193339.890938]  [<ffffffff811ef90d>] ?
__getblk+0x2d/0x2e0

May 16 20:06:53 71 kernel: [193339.890939]  [<ffffffff81285ca7>]
jbd2_journal_get_write_access+0x27/0x40

May 16 20:06:53 71 kernel: [193339.890942]  [<ffffffff8126c37b>]
__ext4_journal_get_write_access+0x3b/0x80

May 16 20:06:53 71 kernel: [193339.890946]  [<ffffffff81242200>]
ext4_reserve_inode_write+0x70/0xa0

May 16 20:06:53 71 kernel: [193339.890948]  [<ffffffff81245490>] ?
ext4_dirty_inode+0x40/0x60

May 16 20:06:53 71 kernel: [193339.890949]  [<ffffffff81242274>]
ext4_mark_inode_dirty+0x44/0x1f0

May 16 20:06:53 71 kernel: [193339.890951]  [<ffffffff81245490>]
ext4_dirty_inode+0x40/0x60

May 16 20:06:53 71 kernel: [193339.890953]  [<ffffffff811e65aa>]
__mark_inode_dirty+0x10a/0x2d0

May 16 20:06:53 71 kernel: [193339.890956]  [<ffffffff811d7761>]
update_time+0x81/0xd0

May 16 20:06:53 71 kernel: [193339.890957]  [<ffffffff811d7970>]
file_update_time+0x80/0xd0

May 16 20:06:53 71 kernel: [193339.890961]  [<ffffffff8114fc40>]
__generic_file_aio_write+0x180/0x3d0

May 16 20:06:53 71 kernel: [193339.890963]  [<ffffffff8114fee8>]
generic_file_aio_write+0x58/0xa0

May 16 20:06:53 71 kernel: [193339.890965]  [<ffffffff81239ea9>]
ext4_file_write+0x99/0x400

May 16 20:06:53 71 kernel: [193339.890967]  [<ffffffff8109a7f0>] ?
wake_up_state+0x10/0x20

May 16 20:06:53 71 kernel: [193339.890970]  [<ffffffff810d7c56>] ?
wake_futex+0x66/0x90

May 16 20:06:53 71 kernel: [193339.890972]  [<ffffffff810d8c01>] ?
futex_wake+0x1b1/0x1d0

May 16 20:06:53 71 kernel: [193339.890974]  [<ffffffff811bc3da>]
do_sync_write+0x5a/0x90

May 16 20:06:53 71 kernel: [193339.890976]  [<ffffffff811bcb64>]
vfs_write+0xb4/0x1f0

May 16 20:06:53 71 kernel: [193339.890978]  [<ffffffff811bd599>]
SyS_write+0x49/0xa0

May 16 20:06:53 71 kernel: [193339.890980]  [<ffffffff8172c87f>]
tracesys+0xe1/0xe6



On Wed, May 18, 2016 at 2:33 AM, haosdent <ha...@gmail.com> wrote:

> >Is there some way to be given control (a callback, or an "exit" routine)
> so that the container about to be nuked can be given a chance to exit
> gracefully?
> The default value of executor_shutdown_grace_period is 5 seconds, you
> could change it by specify the `--executor_shutdown_grace_period` flag when
> launch mesos agent.
>
> >Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> >mesos-slaves get shutdown
> Do you know where your mesos-master stuck when it happens? Any error log
> or related log about this? In addition, is there any log when mesos-slave
> shut down?
>
> On Wed, May 18, 2016 at 6:12 AM, Paul Bell <ar...@gmail.com> wrote:
>
>> Hi All,
>>
>> I probably have the following account partly wrong, but let me present it
>> just the same and those who know better can correct me as needed.
>>
>> I've an application that runs several MongoDB shards, each a Dockerized
>> container, each on a distinct node (VM); in fact, some of the VMs are on
>> separate ESXi hosts.
>>
>> I've lately seen situations where, because of very slow disks for the
>> database, the following sequence occurs (I think):
>>
>>    1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
>>    defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
>>    2. Synchronous flushing of many, many pages occurs, writing to a slow
>>    disk
>>    3. (Around this time one might see in /var/log/syslog "task X blocked
>>    for more than 120 seconds" for all kinds of tasks, including mesos-master)
>>    4. mesos-slaves get shutdown (this is the part I'm unclear about; but
>>    I am quite certain that on 2 nodes the executors and their in-flight
>>    MongoDB tasks got zapped because I can see that Marathon restarted them).
>>
>> The consequences of this are a corrupt MongoDB database. In the case at
>> hand, the job had run for over 50 hours, processing close to 120 million
>> files.
>>
>> Steps I've taken so far to remedy include:
>>
>>    - tune vm.dirty_background_ratio and vm.dirty_ratio down,
>>    respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate
>>    more frequent, smaller flushes and thus avoid less frequent massive flushes
>>    that suspend threads for very long periods.
>>    - increase agent ping timeout to 10 minutes (every 30 seconds, 20
>>    times)
>>
>> So the questions are:
>>
>>    - Is there some way to be given control (a callback, or an "exit"
>>    routine) so that the container about to be nuked can be given a chance to
>>    exit gracefully?
>>    - Are there other steps I can take to avoid this mildly calamitous
>>    occurrence?
>>    - (Also, I'd be grateful for more clarity on anything in steps 1-4
>>    above that is a bit hand-wavy!)
>>
>> As always, thanks.
>>
>> -Paul
>>
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>

Re: Consequences of health-check timeouts?

Posted by haosdent <ha...@gmail.com>.
>Is there some way to be given control (a callback, or an "exit" routine)
so that the container about to be nuked can be given a chance to exit
gracefully?
The default value of executor_shutdown_grace_period is 5 seconds; you can
change it by specifying the `--executor_shutdown_grace_period` flag when
launching the mesos agent.
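
For example, something like this on each agent would stretch the window to
60 seconds (the master address and containerizer list are placeholders, not
taken from this thread):

    mesos-slave --master=zk://zk1:2181/mesos \
                --containerizers=docker,mesos \
                --executor_shutdown_grace_period=60secs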

>Are there other steps I can take to avoid this mildly calamitous
occurrence?
>mesos-slaves get shutdown
Do you know where your mesos-master got stuck when this happened? Is there
any error log or other related log about this? In addition, is there any log
from when the mesos-slave shut down?

On Wed, May 18, 2016 at 6:12 AM, Paul Bell <ar...@gmail.com> wrote:

> Hi All,
>
> I probably have the following account partly wrong, but let me present it
> just the same and those who know better can correct me as needed.
>
> I've an application that runs several MongoDB shards, each a Dockerized
> container, each on a distinct node (VM); in fact, some of the VMs are on
> separate ESXi hosts.
>
> I've lately seen situations where, because of very slow disks for the
> database, the following sequence occurs (I think):
>
>    1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
>    defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
>    2. Synchronous flushing of many, many pages occurs, writing to a slow
>    disk
>    3. (Around this time one might see in /var/log/syslog "task X blocked
>    for more than 120 seconds" for all kinds of tasks, including mesos-master)
>    4. mesos-slaves get shutdown (this is the part I'm unclear about; but
>    I am quite certain that on 2 nodes the executors and their in-flight
>    MongoDB tasks got zapped because I can see that Marathon restarted them).
>
> The consequences of this are a corrupt MongoDB database. In the case at
> hand, the job had run for over 50 hours, processing close to 120 million
> files.
>
> Steps I've taken so far to remedy include:
>
>    - tune vm.dirty_background_ratio and vm.dirty_ratio down,
>    respectively, to 5 and 10 (from 10 and 20). The intent here is to tolerate
>    more frequent, smaller flushes and thus avoid less frequent massive flushes
>    that suspend threads for very long periods.
>    - increase agent ping timeout to 10 minutes (every 30 seconds, 20
>    times)
>
> So the questions are:
>
>    - Is there some way to be given control (a callback, or an "exit"
>    routine) so that the container about to be nuked can be given a chance to
>    exit gracefully?
>    - Are there other steps I can take to avoid this mildly calamitous
>    occurrence?
>    - (Also, I'd be grateful for more clarity on anything in steps 1-4
>    above that is a bit hand-wavy!)
>
> As always, thanks.
>
> -Paul
>
>
>


-- 
Best Regards,
Haosdent Huang