Posted to dev@aurora.apache.org by Maxim Khutornenko <ma...@apache.org> on 2016/03/02 18:18:25 UTC

Re: [PROPOSAL] DB snapshotting

Thanks Bill! Filed https://issues.apache.org/jira/browse/AURORA-1627
to track it.

On Mon, Feb 29, 2016 at 11:41 AM, Bill Farner <wf...@apache.org> wrote:
> Thanks for the detailed write-up and real-world details!  I generally
> support momentum towards a single task store implementation, so +1
> on dealing with that.
>
> I anticipated there would be a performance win from straight-to-SQL
> snapshots, so I am a +1 on that as well.
>
> In summary, +1 on all fronts!
>
> On Monday, February 29, 2016, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> (Apologies for the wordy problem statement, but I feel it's really
>> necessary to justify the proposal.)
>>
>> Over the past two weeks we have been battling a nasty scheduler issue
>> in production: the scheduler suddenly stops responding to any user
>> requests and subsequently gets killed by our health monitoring. Upon
>> restart, a leader may only function for a few seconds before hanging
>> again almost immediately.
>>
>> The long and painful investigation pointed towards internal H2 table
>> lock contention that resulted in massive db-write starvation and a
>> state where a scheduler write lock would *never* be released. This was
>> relatively easy to replicate in Vagrant by creating a large update
>> (~4K instances) with a large batch_size (~1K) while bombarding the
>> scheduler with getJobUpdateDetails() requests for that job. The
>> scheduler would enter a locked-up state on the very first write op
>> following the update creation (e.g., a status update for an instance
>> transition from the first batch) and stay in that state for minutes
>> until all getJobUpdateDetails() requests were served.
>> well explained by the following sentence from [1]:
>>
>>     "When a lock is released, and multiple connections are waiting for
>> it, one of them is picked at random."
>>
>> What happens here is that when many read requests are competing for a
>> shared table lock, the H2 PageStore does nothing to help a write
>> request that requires an exclusive table lock succeed. This leads to
>> db-write starvation and, eventually, scheduler native store write
>> starvation, as there is no timeout on a scheduler write lock.
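>>
>> For illustration, here is a minimal, self-contained JDBC sketch (not
>> Aurora code; the table, column, and class names are made up) that
>> approximates this contention: a pool of reader threads hammering a
>> PageStore-backed H2 table while a single writer waits for the
>> exclusive table lock.
>>
>>   import java.sql.Connection;
>>   import java.sql.DriverManager;
>>   import java.sql.Statement;
>>
>>   public class PageStoreContentionRepro {
>>     // MV_STORE=FALSE forces the legacy PageStore engine; LOCK_TIMEOUT keeps
>>     // the blocked writer waiting instead of failing fast.
>>     static final String URL =
>>         "jdbc:h2:mem:repro;MV_STORE=FALSE;DB_CLOSE_DELAY=-1;LOCK_TIMEOUT=60000";
>>
>>     public static void main(String[] args) throws Exception {
>>       try (Connection c = DriverManager.getConnection(URL);
>>            Statement st = c.createStatement()) {
>>         st.execute("CREATE TABLE job_updates (id INT PRIMARY KEY, details CLOB)");
>>         st.execute("INSERT INTO job_updates VALUES (1, SPACE(100000))");
>>       }
>>
>>       // Readers: analogous to a flood of getJobUpdateDetails() calls.
>>       for (int i = 0; i < 16; i++) {
>>         Thread reader = new Thread(() -> {
>>           try (Connection c = DriverManager.getConnection(URL);
>>                Statement st = c.createStatement()) {
>>             while (true) {
>>               st.executeQuery("SELECT details FROM job_updates").close();
>>             }
>>           } catch (Exception e) {
>>             throw new RuntimeException(e);
>>           }
>>         });
>>         reader.setDaemon(true);
>>         reader.start();
>>       }
>>
>>       // Writer: a single status-update-like write needing the exclusive lock.
>>       try (Connection c = DriverManager.getConnection(URL);
>>            Statement st = c.createStatement()) {
>>         long start = System.nanoTime();
>>         st.execute("UPDATE job_updates SET details = SPACE(100000) WHERE id = 1");
>>         System.out.printf("write took %d ms%n",
>>             (System.nanoTime() - start) / 1_000_000);
>>       }
>>     }
>>   }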
>>
>> We have played with various available H2/MyBatis configuration
>> settings to mitigate the above, with no noticeable impact. That is,
>> until we switched to the H2 MVStore [2], at which point we were able
>> to completely eliminate the scheduler lockup without making any other
>> code changes! So, has the solution finally been found? The answer
>> would be YES, until you try MVStore-enabled H2 with any reasonably
>> sized production DB on scheduler restart. There was a reason we
>> disabled MVStore in the scheduler [3] in the first place, and that
>> reason was poor MVStore performance with bulk inserts. Re-populating
>> an MVStore-enabled H2 DB took at least 2.5 times longer than normal.
>> This is unacceptable in prod, where every second of scheduler downtime
>> counts.
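>>
>> For reference, the engine switch itself is just a connection-URL
>> change. A sketch, assuming an in-memory URL along the lines of what
>> DbModule [3] builds (the exact flags we use may differ):
>>
>>   import java.sql.Connection;
>>   import java.sql.DriverManager;
>>
>>   class EngineSwitch {
>>     // Only the MV_STORE flag differs; MyBatis mappers, DDL, and queries
>>     // are unchanged either way.
>>     static final String PAGE_STORE_URL =
>>         "jdbc:h2:mem:aurora;MV_STORE=FALSE;DB_CLOSE_DELAY=-1";
>>     static final String MV_STORE_URL =
>>         "jdbc:h2:mem:aurora;MV_STORE=TRUE;DB_CLOSE_DELAY=-1";
>>
>>     static Connection open(boolean mvStore) throws Exception {
>>       return DriverManager.getConnection(mvStore ? MV_STORE_URL : PAGE_STORE_URL);
>>     }
>>   }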
>>
>> Back to the drawing board, we tried all relevant settings and
>> approaches to speed up MVStore inserts on restart, but nothing really
>> helped. Finally, the only reasonable way forward was to eliminate the
>> point of slowness altogether - namely, the thrift-to-SQL migration on
>> restart. Fortunately, H2 supports an easy-to-use command that
>> generates the entire DB dump with a single statement [4]. We were now
>> able to bypass the lengthy DB repopulation on restart by storing the
>> entire DB dump in the snapshot and replaying it on scheduler restart.
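>>
>> A minimal sketch of the dump/replay idea; the actual POC [5] is the
>> reference, and the class and method names here are illustrative only.
>>
>>   import java.sql.Connection;
>>   import java.sql.ResultSet;
>>   import java.sql.Statement;
>>   import java.util.ArrayList;
>>   import java.util.List;
>>
>>   final class DbSnapshots {
>>     // Dump: with no "TO <file>" clause, H2's SCRIPT command [4] returns the
>>     // full DDL+data dump as a result set, one SQL statement per row.
>>     static List<String> dump(Connection db) throws Exception {
>>       List<String> statements = new ArrayList<>();
>>       try (Statement st = db.createStatement();
>>            ResultSet rs = st.executeQuery("SCRIPT")) {
>>         while (rs.next()) {
>>           statements.add(rs.getString(1));
>>         }
>>       }
>>       return statements;  // stored inside the snapshot instead of thrift structs
>>     }
>>
>>     // Replay: executing the stored statements against an empty H2 instance
>>     // rebuilds schema and data, bypassing the thrift-to-SQL re-population.
>>     static void restore(Connection emptyDb, List<String> statements) throws Exception {
>>       try (Statement st = emptyDb.createStatement()) {
>>         for (String sql : statements) {
>>           st.execute(sql);
>>         }
>>       }
>>     }
>>   }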
>>
>>
>> Now, the proposal. Given that MVStore vastly outperforms the PageStore
>> we currently use, I suggest we move our H2 to it AND adopt DB
>> snapshotting instead of thrift snapshotting to speed up scheduler
>> restarts. A rough POC is available here [5]. We have been running a
>> version of this build in production since last week and have
>> completely eliminated scheduler lockups. As a welcome side effect, we
>> also observed faster scheduler restarts due to eliminating
>> thrift-to-SQL chattiness. Depending on snapshot freshness, the
>> observed failover downtime was reduced by ~40%.
>>
>> Moving to DB snapshotting will require us to rethink DB schema
>> versioning and our thrift deprecation/removal policy. We will have to
>> move to pre-/post-snapshot-restore SQL migration scripts to handle any
>> schema changes, which is a common industry pattern but something we
>> have not tried yet. The upside, though, is that we can get an early
>> start here, as we will have to adopt strict SQL migration rules anyway
>> when we move to persistent DB storage. Also, given that migrating to
>> the H2 TaskStore will likely further degrade scheduler restart times,
>> having a better-performing DB snapshotting solution in place will
>> definitely aid that migration.
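>>
>> To illustrate the pre-/post-restore migration pattern, a purely
>> hypothetical sketch - none of these table, column, or statement
>> contents exist in the POC:
>>
>>   import java.sql.Connection;
>>   import java.sql.Statement;
>>   import java.util.Arrays;
>>   import java.util.List;
>>
>>   final class RestoreMigrations {
>>     // Versioned, append-only statements applied to the freshly restored DB
>>     // after the snapshot dump is replayed and before the stores take
>>     // traffic. A pre-restore list would work the same way, run before replay.
>>     private static final List<String> POST_RESTORE = Arrays.asList(
>>         "ALTER TABLE job_updates ADD COLUMN IF NOT EXISTS metadata VARCHAR",
>>         "UPDATE schema_version SET version = 2 WHERE version = 1");
>>
>>     static void apply(Connection restoredDb) throws Exception {
>>       try (Statement st = restoredDb.createStatement()) {
>>         for (String sql : POST_RESTORE) {
>>           st.execute(sql);
>>         }
>>       }
>>     }
>>   }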
>>
>> Thanks,
>> Maxim
>>
>> [1] - http://www.h2database.com/html/advanced.html?#transaction_isolation
>> [2] - http://www.h2database.com/html/mvstore.html
>> [3] -
>> https://github.com/apache/aurora/blob/824e396ab80874cfea98ef47829279126838a3b2/src/main/java/org/apache/aurora/scheduler/storage/db/DbModule.java#L119
>> [4] - http://www.h2database.com/html/grammar.html#script
>> [5] -
>> https://github.com/maxim111333/incubator-aurora/blob/mv_store/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L317-L370
>>

Re: [PROPOSAL] DB snapshotting

Posted by Maxim Khutornenko <ma...@apache.org>.
According to the MVStore docs, data is moved along the MVStore -> MVMap ->
Page -> FileStore path. A quick look at the MVStore internals shows that
the closest possible interception point could be the FileStore [1].
Implementing a custom FileStore would require running H2 in persistent
(rather than in-memory) mode, and our storage medium would have to
support random access for things like "readFully(long pos, int len)" to
read arbitrary Pages. This is not something our LevelDB-backed native
log is capable of.
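
To make that interception point concrete, here is a rough sketch. It
assumes the org.h2.mvstore.FileStore API from the 1.3.x javadoc [1] and
a hypothetical random-access snapshot abstraction, which is exactly
what our LevelDB-backed log cannot provide.

  import java.nio.ByteBuffer;
  import org.h2.mvstore.FileStore;

  // Hypothetical: an MVStore FileStore backed by something other than a local
  // file. MVStore requests arbitrary Pages by position, hence readFully(pos, len).
  class LogBackedFileStore extends FileStore {
    interface RandomAccessSnapshot {
      ByteBuffer read(long pos, int len);
      void write(long pos, ByteBuffer src);
    }

    private final RandomAccessSnapshot snapshot;

    LogBackedFileStore(RandomAccessSnapshot snapshot) {
      this.snapshot = snapshot;
    }

    @Override
    public ByteBuffer readFully(long pos, int len) {
      return snapshot.read(pos, len);
    }

    @Override
    public void writeFully(long pos, ByteBuffer src) {
      snapshot.write(pos, src);
    }
  }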

Another possibility could be a hybrid approach where a custom
FileStore keeps an in-memory cache of everything, but that would
fully defeat the purpose of what we are trying to accomplish with this
change.

Thanks,
Maxim

[1] - http://www.atetric.com/atetric/javadoc/com.h2database/h2/1.3.174/src-html/org/h2/mvstore/FileStore.html

Re: [PROPOSAL] DB snapshotting

Posted by Bill Farner <wf...@apache.org>.
Seems prudent to explore rather than write it off, though.  For all we know it
simplifies a lot.

Re: [PROPOSAL] DB snapshotting

Posted by Maxim Khutornenko <ma...@apache.org>.
Ah, sorry, I missed that conversation on IRC.

I have not looked into that. It would be interesting to explore that
route. Given that our ultimate goal is to get rid of the replicated log
altogether, it does not stand as an immediate priority to me, though.

Re: [PROPOSAL] DB snapshotting

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
+1 for the plan and the ticket.

In addition, for reference, a couple of messages from IRC yesterday:

23:42 <serb> mkhutornenko:  interesting storage proposal on the mailing list! I only wondered about one thing...
23:42 <serb> it feels kind of weird that we use H2 as a non-replicated database and build some scaffolding around it in order to distribute its state via the Mesos replicated log.
23:42 <serb> Have you looked into H2, if it would be possible to replace/subclass their in-process transaction log with a replicated Mesos one?
23:43 <serb> Then we would not need the logic that performs simultaneous inserts into the log and the taskstore, as the backend would handle that by itself
23:44 <serb> (I know close to nothing about the storage layer, so that's like my perspective from 10,000 feet)

00:22 <wfarner> serb: that crossed my mind as well.  I have only drilled in a bit, would love to dig in more
