Posted to user@mesos.apache.org by Stephan Erb <st...@blue-yonder.com> on 2018/01/08 22:28:02 UTC

Mesos replicated log fills disk with logging output

Hi everyone,

a few days ago we bumped into an interesting issue that we had not seen before. Essentially, one of our toy clusters dissolved itself:


  *   3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper (3.4.5) for leader election
  *   Master 1 and Master 2 had 100% disk usage, because /var/lib/mesos/replicated_log/LOG had grown to about 170 GB
  *   The replicated logs of both Master 1 and Master 2 were corrupted. A process restart did not fix them.
  *   The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was caused by the full disk.
  *   Master 3 was the leading Mesos master and healthy. Its disk usage was normal.


The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from (begin) .. (end); will stop at '0000003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: /var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763
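
For what it's worth, my understanding (from the leveldb documentation, not verified against the Mesos sources) is that the replicated log is backed by leveldb and that LOG is leveldb's informational log: with the default Options, leveldb creates <dir>/LOG itself, renames the previous one to LOG.old, and never caps its size, so a compaction that fails and immediately retries grows it without bound. A minimal standalone sketch of that default setup (illustrative path, not the actual Mesos code):

#include <cassert>
#include <leveldb/db.h>

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  // options.info_log is left unset (nullptr), so leveldb writes its own
  // informational log to <path>/LOG, renaming any existing one to LOG.old.
  // There is no built-in size limit on that file.

  leveldb::DB* db = nullptr;
  leveldb::Status status =
      leveldb::DB::Open(options, "/tmp/replicated_log_example", &db);
  assert(status.ok());

  delete db;
  return 0;
}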

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos      4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos      4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos       724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos      7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos     22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos     14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos      7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos     15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos     65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos        16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos         0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos 463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos     65536 Jan  5 10:08 MANIFEST-001764
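
The listing also shows what the compaction keeps tripping over: 001735.sst no longer exists (only 001735.ldb from Dec 14 is left), yet the compaction still refers to it. In case someone else ends up in the same state: after copying the directory aside and truncating the huge LOG file, one option we considered (untested) is leveldb's built-in repair, which rebuilds the database metadata from whatever table files are still readable; anything referenced only by the missing .sst is lost, so the replica would presumably still have to catch up from the healthy master afterwards. Rough sketch:

#include <iostream>
#include <leveldb/db.h>

int main() {
  leveldb::Options options;
  // Attempt to resurrect as much of the database as possible by rebuilding
  // its metadata from the remaining table/log files. Run this against a
  // copy of the directory, never the live one.
  leveldb::Status status =
      leveldb::RepairDB("/var/lib/mesos/replicated_log.corrupted", options);
  std::cout << status.ToString() << std::endl;
  return status.ok() ? 0 : 1;
}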

Monitoring indicates that the disk usage started to grow shortly after a badly coordinated rollout of a configuration change:


  *   Master 1 was leading and restarted after a few hours of uptime
  *   Master 2 was now leading. After a few seconds (30-60 seconds or so) it was restarted as well
  *   Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that this restart scenario could lead to the issues described above. Has anyone seen a similar issue?

Thanks and best regards,
Stephan

Re: Mesos replicated log fills disk with logging output

Posted by Stephan Erb <st...@blue-yonder.com>.
Thanks for the hint! The cluster is using ext4, and judging from the linked thread, this could indeed have been caused by a stalling hypervisor.



Re: Mesos replicated log fills disk with logging output

Posted by Jie Yu <yu...@gmail.com>.
Stephan,

I haven't seen that before. A quick Google search suggests that it might be
related to leveldb. The following thread looks relevant:
https://groups.google.com/d/msg/leveldb/lRrbv4Y0YgU/AtfRTfQXNoYJ

What is the filesystem you're using?

- Jie
