You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Chun-Hung Hsiao (JIRA)" <ji...@apache.org> on 2018/01/24 01:41:00 UTC

[jira] [Updated] (MESOS-8481) Agent reboot during checkpointing may result in empty checkpoints.

     [ https://issues.apache.org/jira/browse/MESOS-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chun-Hung Hsiao updated MESOS-8481:
-----------------------------------
    Description: 
An empty checkpoint file was created due to the following incident.

At 17:12:25, the master assigned a task to an agent:
{noformat}
I0123 17:12:25.000000 18618 master.cpp:11457] Adding task 5602 with resources cpus(allocated: *):0.1; mem(allocated: *):128 on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 (<agent_ip>)
I0123 17:12:25.000000 18618 master.cpp:5017] Launching task 5602 of framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@<ip>:10015 with resources [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 (<agent_ip>)
{noformat}
Meanwhile, the agent is being rebooted:
{noformat}
$ last reboot
reboot   system boot  3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09  (06:55)
{noformat}
The agent log did not show any information about the task, possibly because there was no fsync before reboot:
{noformat}
I0123 17:12:09.000000 17237 http.cpp:851] Authorizing principal 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot'
-- Reboot --
I0123 17:15:40.000000  2689 logsink.cpp:89] Added FileSink for glog logs to: /var/log/mesos/mesos-agent.log
{noformat}
However, the agent was checkpointing the task before reboot:
{noformat}
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’
  Size: 39        	Blocks: 0          IO Block: 4096   directory
Device: ca40h/51776d	Inode: 67306254    Links: 3
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-24 00:23:43.237322609 +0000
Modify: 2018-01-23 17:12:25.751463030 +0000
Change: 2018-01-23 17:12:25.751463030 +0000
 Birth: -
{noformat}
And since there was no fsync before reboot, all checkpoints resulted in empty files:
{noformat}
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: ca40h/51776d    Inode: 33967500    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:15:41.485506070 +0000
Modify: 2018-01-23 17:12:25.749463047 +0000
Change: 2018-01-23 17:12:25.749463047 +0000
 Birth: -
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: ca40h/51776d    Inode: 33967495    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 23:00:42.190975780 +0000
Modify: 2018-01-23 17:12:25.749463047 +0000
Change: 2018-01-23 17:12:25.749463047 +0000
 Birth: -
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: ca40h/51776d	Inode: 67306255    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:12:25.751463030 +0000
Modify: 2018-01-23 17:12:25.751463030 +0000
Change: 2018-01-23 17:12:25.751463030 +0000
 Birth: -
{noformat}
So were {{forked.pid}} and {{task.info}}.

As a result, the agent failed to recover after reboot:
{noformat}
E0123 17:15:41.000000  2709 slave.cpp:6800] EXIT with status 1: Failed to perform recovery: Failed to recover framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from '/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info': Found an empty file
{noformat}
The error came from [https://github.com/apache/mesos/blob/a892a2e80255291e6cd5cb3b0e90b9a029989c99/3rdparty/stout/include/stout/protobuf.hpp#L367].
This is related to a recent change of the checkpoint recovery logic: we used to ignore empty checkpoint files before, but now it return errors on empty checkpoint files.

  was:
An empty checkpoint file was created due to the following incident.

At 17:12:25, the master assigned a task to an agent:
{noformat}
I0123 17:12:25.000000 18618 master.cpp:11457] Adding task 5602 with resources cpus(allocated: *):0.1; mem(allocated: *):128 on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 (<agent_ip>)
I0123 17:12:25.000000 18618 master.cpp:5017] Launching task 5602 of framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@<ip>:10015 with resources [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 (<agent_ip>)
{noformat}
Meanwhile, the agent is being rebooted:
{noformat}
$ last reboot
reboot   system boot  3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09  (06:55)
{noformat}
The agent log did not show any information about the task, possibly because there was no fsync before reboot:
{noformat}
I0123 17:12:09.000000 17237 http.cpp:851] Authorizing principal 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot'
-- Reboot --
I0123 17:15:40.000000  2689 logsink.cpp:89] Added FileSink for glog logs to: /var/log/mesos/mesos-agent.log
{noformat}
However, the agent was checkpointing the task before reboot:
{noformat}
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’
  Size: 39        	Blocks: 0          IO Block: 4096   directory
Device: ca40h/51776d	Inode: 67306254    Links: 3
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-24 00:23:43.237322609 +0000
Modify: 2018-01-23 17:12:25.751463030 +0000
Change: 2018-01-23 17:12:25.751463030 +0000
 Birth: -
{noformat}
And since there was no fsync before reboot, all checkpoints resulted in empty files:
{noformat}
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: ca40h/51776d    Inode: 33967500    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:15:41.485506070 +0000
Modify: 2018-01-23 17:12:25.749463047 +0000
Change: 2018-01-23 17:12:25.749463047 +0000
 Birth: -
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: ca40h/51776d    Inode: 33967495    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 23:00:42.190975780 +0000
Modify: 2018-01-23 17:12:25.749463047 +0000
Change: 2018-01-23 17:12:25.749463047 +0000
 Birth: -
$ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info
  File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: ca40h/51776d	Inode: 67306255    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:unlabeled_t:s0
Access: 2018-01-23 17:12:25.751463030 +0000
Modify: 2018-01-23 17:12:25.751463030 +0000
Change: 2018-01-23 17:12:25.751463030 +0000
 Birth: -
{noformat}
So were {{forked.pid}} and {{task.info}}.

As a result, the agent failed to recover after reboot:
{noformat}
E0123 17:15:41.000000  2709 slave.cpp:6800] EXIT with status 1: Failed to perform recovery: Failed to recover framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from '/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info': Found an empty file
{noformat}
https://github.com/apache/mesos/blob/a892a2e80255291e6cd5cb3b0e90b9a029989c99/3rdparty/stout/include/stout/protobuf.hpp#L367


> Agent reboot during checkpointing may result in empty checkpoints.
> ------------------------------------------------------------------
>
>                 Key: MESOS-8481
>                 URL: https://issues.apache.org/jira/browse/MESOS-8481
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Chun-Hung Hsiao
>            Assignee: Michael Park
>            Priority: Major
>
> An empty checkpoint file was created due to the following incident.
> At 17:12:25, the master assigned a task to an agent:
> {noformat}
> I0123 17:12:25.000000 18618 master.cpp:11457] Adding task 5602 with resources cpus(allocated: *):0.1; mem(allocated: *):128 on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 (<agent_ip>)
> I0123 17:12:25.000000 18618 master.cpp:5017] Launching task 5602 of framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112 (Balloon Framework OOM) at scheduler-fbba22f7-ebbc-4864-8394-0aa558f8ffaa@<ip>:10015 with resources [...] on agent aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4 at slave(1)@<agent_ip>:5051 (<agent_ip>)
> {noformat}
> Meanwhile, the agent is being rebooted:
> {noformat}
> $ last reboot
> reboot   system boot  3.10.0-693.11.6. Tue Jan 23 17:14 - 00:09  (06:55)
> {noformat}
> The agent log did not show any information about the task, possibly because there was no fsync before reboot:
> {noformat}
> I0123 17:12:09.000000 17237 http.cpp:851] Authorizing principal 'dcos_checks_agent' to GET the endpoint '/metrics/snapshot'
> -- Reboot --
> I0123 17:15:40.000000  2689 logsink.cpp:89] Added FileSink for glog logs to: /var/log/mesos/mesos-agent.log
> {noformat}
> However, the agent was checkpointing the task before reboot:
> {noformat}
> $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/
>   File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/’
>   Size: 39        	Blocks: 0          IO Block: 4096   directory
> Device: ca40h/51776d	Inode: 67306254    Links: 3
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-24 00:23:43.237322609 +0000
> Modify: 2018-01-23 17:12:25.751463030 +0000
> Change: 2018-01-23 17:12:25.751463030 +0000
>  Birth: -
> {noformat}
> And since there was no fsync before reboot, all checkpoints resulted in empty files:
> {noformat}
> $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info
>   File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info’
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: ca40h/51776d    Inode: 33967500    Links: 1
> Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-23 17:15:41.485506070 +0000
> Modify: 2018-01-23 17:12:25.749463047 +0000
> Change: 2018-01-23 17:12:25.749463047 +0000
>  Birth: -
> $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid
>   File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.pid’
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: ca40h/51776d    Inode: 33967495    Links: 1
> Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-23 23:00:42.190975780 +0000
> Modify: 2018-01-23 17:12:25.749463047 +0000
> Change: 2018-01-23 17:12:25.749463047 +0000
>  Birth: -
> $ sudo stat /var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info
>   File: ‘/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/executors/5602/executor.info’
>   Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
> Device: ca40h/51776d	Inode: 67306255    Links: 1
> Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2018-01-23 17:12:25.751463030 +0000
> Modify: 2018-01-23 17:12:25.751463030 +0000
> Change: 2018-01-23 17:12:25.751463030 +0000
>  Birth: -
> {noformat}
> So were {{forked.pid}} and {{task.info}}.
> As a result, the agent failed to recover after reboot:
> {noformat}
> E0123 17:15:41.000000  2709 slave.cpp:6800] EXIT with status 1: Failed to perform recovery: Failed to recover framework 6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112: Failed to read framework info from '/var/lib/mesos/slave/meta/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S4/frameworks/6f9b0688-38f7-4b38-bb1c-421f55e486e5-0112/framework.info': Found an empty file
> {noformat}
> The error came from [https://github.com/apache/mesos/blob/a892a2e80255291e6cd5cb3b0e90b9a029989c99/3rdparty/stout/include/stout/protobuf.hpp#L367].
> This is related to a recent change of the checkpoint recovery logic: we used to ignore empty checkpoint files before, but now it return errors on empty checkpoint files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)