Posted to dev@mesos.apache.org by Ilya Pronin <ip...@twopensource.com> on 2018/07/02 23:49:17 UTC

[Proposal] Replicated log storage compaction

Hi everyone,

I'd like to propose adding "manual" LevelDB compaction to the
replicated log truncation process.

Motivation

Mesos Master and Aurora Scheduler use the replicated log to persist
information about the cluster. This log is periodically truncated to
prune outdated entries. However, the replicated log storage itself is
never compacted and grows without bound. This can lead to problems such
as a synchronous failover of all master/scheduler replicas because all
of them ran out of disk space at the same time.

The only time log storage compaction happens is during recovery.
Because of that, periodic failovers are required to control replicated
log storage growth. But this solution is suboptimal: failovers are not
instant, e.g. the Aurora Scheduler needs to recover its storage, which
depending on the cluster can take several minutes. During this downtime
tasks cannot be (re-)scheduled and users cannot interact with the
service.

Proposal

In MESOS-184 John Sirois pointed out that our usage pattern doesn't
work well with LevelDB's background compaction algorithm. Fortunately,
LevelDB provides a way to force compaction with the DB::CompactRange()
method. The replicated log storage can trigger it after persisting a
learned TRUNCATE action and deleting the truncated log positions. The
compacted range will span from the previous first position of the log
to the new first position (the one the log was truncated up to).
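As a rough sketch of this hook (using an illustrative stand-in for
leveldb::DB and a hypothetical fixed-width key encoding; the actual
replicated log storage may encode positions differently), compacting
exactly the truncated range could look like this:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative stand-in for leveldb::DB. The real leveldb API is
// DB::CompactRange(const Slice* begin, const Slice* end), where a null
// pointer means "from the start of the keyspace" / "to the end".
struct FakeDB {
  std::string compactedBegin;
  std::string compactedEnd;

  void CompactRange(const std::string* begin, const std::string* end) {
    compactedBegin = begin ? *begin : "";
    compactedEnd = end ? *end : "";
  }
};

// Hypothetical key encoding: log positions stored as fixed-width decimal
// strings so that lexicographic key order matches numeric position order.
std::string encodeKey(uint64_t position) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%020llu",
                static_cast<unsigned long long>(position));
  return std::string(buf);
}

// After a learned TRUNCATE action is persisted and positions
// [prevFirst, newFirst) are deleted, force compaction of exactly the
// key range that was just deleted.
void compactTruncatedRange(FakeDB* db, uint64_t prevFirst, uint64_t newFirst) {
  const std::string begin = encodeKey(prevFirst);
  const std::string end = encodeKey(newFirst);
  db->CompactRange(&begin, &end);
}
```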

Performance impact

Mesos Master and Aurora Scheduler have two different replicated log
usage profiles. For the Mesos Master, every registry update (agent
(re-)registration/marking, maintenance schedule update, etc.) induces
writing a complete snapshot, which depending on the cluster size can
get pretty big (in a scale-test fake cluster with 55k agents it is
~15MB). Every snapshot is followed by a truncation of all previous
entries, which doesn't block the registrar and happens asynchronously
in the background. In the scale-test cluster with 55k agents,
compactions after such truncations take ~680ms.

To reduce the performance impact on the Master, compaction can be
triggered only after more than a configurable number of keys has been
deleted.
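A minimal sketch of that threshold idea (the class and names below are
hypothetical, not from an actual patch): accumulate the number of keys
deleted across truncations and only request a compaction once the
configurable interval is exceeded.

```cpp
#include <cstdint>

// Sketch of a deletion-counting trigger: compaction is requested only
// after more than `interval` keys have been deleted since the last
// compaction.
class CompactionTrigger {
public:
  explicit CompactionTrigger(uint64_t interval) : interval_(interval) {}

  // Record `n` keys deleted by a truncation. Returns true if a
  // compaction should be triggered now, resetting the counter.
  bool onKeysDeleted(uint64_t n) {
    deleted_ += n;
    if (deleted_ > interval_) {
      deleted_ = 0;
      return true;
    }
    return false;
  }

private:
  const uint64_t interval_;
  uint64_t deleted_ = 0;
};
```

With, say, an interval of 1000 keys, frequent small truncations would
accumulate and only occasionally pay the ~680ms compaction cost.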

The Aurora Scheduler writes incremental changes of its storage to the
replicated log. Every hour a storage snapshot is created and persisted
to the log, followed by a truncation of all entries preceding the
snapshot. Therefore, storage compactions will be infrequent but will
deal with a potentially large number of keys. In the scale-test cluster
such compactions took ~425ms each.

Please let me know what you think about it.

Thanks!

-- 
Ilya Pronin

Re: [Proposal] Replicated log storage compaction

Posted by Ilya Pronin <ip...@twopensource.com>.
As discussed, I've filed a LevelDB issue:
https://github.com/google/leveldb/issues/603. So far it seems that the
LevelDB behavior that we see is unexpected.

I'll post a patch with the temporary workaround that I described in
the first email in https://issues.apache.org/jira/browse/MESOS-184. It
will be disabled by default.

On Wed, Jul 11, 2018 at 2:23 PM, Judith Malnick <jm...@mesosphere.io> wrote:

Re: [Proposal] Replicated log storage compaction

Posted by Judith Malnick <jm...@mesosphere.io>.
Hey Ilya,
If you'd like to generate some real-time conversation about your proposal
this might be a good thing to talk about during tomorrow's developer sync
at 10:00 am Pacific time. If you're interested please feel free to put it
on the agenda
<https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing>
!
All the best!
Judith

On Fri, Jul 6, 2018 at 2:35 PM Benjamin Mahler <bm...@apache.org> wrote:

-- 
Judith Malnick
Community Manager
310-709-1517

Re: [Proposal] Replicated log storage compaction

Posted by Benjamin Mahler <bm...@apache.org>.
I was chatting with Ilya on slack and I'll re-post here:

* Like Jie, I was hoping for a toggle (maybe it should start default off
until we have production experience? sounds like Ilya has already
experience with it running in test clusters so far)

* I was asking whether this would be considered a flaw in leveldb's
compaction algorithm. Ilya didn't see any changes in recent leveldb
releases that would affect this. So, we probably should file an issue to
see if they think it's a flaw and whether our workaround makes sense to
them. We can reference this in the code for posterity.

On Fri, Jul 6, 2018 at 2:24 PM, Jie Yu <ji...@mesosphere.io> wrote:

> Sounds good to me.
>
> My only ask is to have a way to turn this feature off (flag, env var, etc)
>
> - Jie
>
> On Fri, Jul 6, 2018 at 1:39 PM, Vinod Kone <vi...@apache.org> wrote:

Re: [Proposal] Replicated log storage compaction

Posted by Vinod Kone <vi...@apache.org>.
I don't know about the replicated log, but the proposal seems fine to me.

Jie/BenM, do you guys have an opinion?

On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham
<ss...@twitter.com.invalid> wrote:


Re: [Proposal] Replicated log storage compaction

Posted by Santhosh Kumar Shanmugham <ss...@twitter.com.INVALID>.
+1. Aurora will hugely benefit from this change.
