Posted to dev@aurora.apache.org by Zameer Manji <zm...@apache.org> on 2016/10/01 00:34:11 UTC

A mini postmortem on snapshot failures

Aurora Developers and Users,

I would like to share a failure case I experienced recently. In a modestly
sized production cluster with a high read load, snapshot creation began to
fail. The logs showed the following:

````
W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
org.apache.ibatis.exceptions.PersistenceException:
### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
at org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:173)
at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
at org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:521)
at org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:551)
at org.apache.aurora.scheduler.storage.log.LogStorage.doSnapshot(LogStorage.java:489)
at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
at org.apache.aurora.scheduler.storage.log.LogStorage.snapshot(LogStorage.java:565)
at org.apache.aurora.scheduler.storage.log.LogStorage.lambda$scheduleSnapshots$20(LogStorage.java:468)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
at org.apache.ibatis.datasource.pooled.PooledConnection.checkConnection(PooledConnection.java:254)
at org.apache.ibatis.datasource.pooled.PooledConnection.invoke(PooledConnection.java:243)
at com.sun.proxy.$Proxy135.getAutoCommit(Unknown Source)
at org.apache.ibatis.transaction.jdbc.JdbcTransaction.rollback(JdbcTransaction.java:79)
at org.apache.ibatis.executor.BaseExecutor.rollback(BaseExecutor.java:249)
at org.apache.ibatis.executor.CachingExecutor.rollback(CachingExecutor.java:119)
at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:213)
... 19 common frames omitted
````

This failure is silent and can be observed only through the
`scheduler_log_snapshots` metric; if it isn't increasing, snapshots are not
being created. In this cluster, no snapshot was taken for about 4 days.

For those unfamiliar with Aurora's replicated log storage system, snapshot
creation is important because it truncates the replicated log down to a
single large entry. This matters because log recovery time is proportional
to the number of entries in the log. Operators can observe how long it takes
to recover the log at startup via the `scheduler_log_recover_nanos_total`
metric.
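
As an illustration only (not something Aurora ships), a small watcher along
these lines would have caught the stalled counter much earlier. It assumes
the scheduler's plain-text /vars endpoint; the URL and poll interval are
placeholders to adapt to your own monitoring setup.

````
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical sketch: poll the scheduler's /vars endpoint and warn when the
// scheduler_log_snapshots counter has not advanced between polls. The URL and
// poll interval below are placeholders.
public class SnapshotWatch {
  public static void main(String[] args) throws Exception {
    String varsUrl = "http://scheduler.example.com:8081/vars";
    long previous = -1;
    while (true) {
      long current = readCounter(varsUrl, "scheduler_log_snapshots");
      if (previous >= 0 && current <= previous) {
        System.err.println("WARNING: scheduler_log_snapshots has not increased");
      }
      previous = current;
      Thread.sleep(60 * 60 * 1000L); // poll roughly once per snapshot interval
    }
  }

  // /vars serves plain-text "name value" lines; return the named counter.
  static long readCounter(String url, String name) throws Exception {
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith(name + " ")) {
          return Long.parseLong(line.substring(name.length() + 1).trim());
        }
      }
    }
    return -1; // counter not found
  }
}
````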

The largest value observed for `scheduler_log_recover_nanos_total` during
this period was the equivalent of 8 minutes. This means that recovery after
a failover would take at least 8 minutes. For reference, a system aiming for
99.99% uptime can only sustain about 4 minutes of downtime a month (0.01% of
a 30-day month is 30 × 24 × 60 × 0.0001 ≈ 4.3 minutes).

The root cause can be found in the exception at the bottom of the stack
trace:
`Caused by: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid.`
This originates from the MyBatis connection pool used to communicate with
the in-memory SQL store. To create a snapshot, we run a `SCRIPT` query to
dump the entire database into the replicated log [1].
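
For readers unfamiliar with H2, `SCRIPT` returns the SQL statements needed to
recreate the schema and data, one statement per row of the result set. The
following is only a rough sketch of that mechanism over plain JDBC, with a
placeholder connection URL; the actual Aurora code is in SnapshotStoreImpl [1].

````
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough sketch of dumping an H2 database with SCRIPT. The JDBC URL is a
// placeholder and the result is simply concatenated; Aurora's real
// implementation streams this into the snapshot written to the replicated log.
public class ScriptDump {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:aurora");
         Statement stmt = conn.createStatement();
         // Each row of the result set is one line of the SQL dump.
         ResultSet rs = stmt.executeQuery("SCRIPT")) {
      StringBuilder dump = new StringBuilder();
      while (rs.next()) {
        dump.append(rs.getString(1)).append('\n');
      }
      System.out.println("dumped " + dump.length() + " characters");
    }
  }
}
````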

This exception is thrown because we use a connection pool to communicate
with the H2 SQL database. By default the connection pool has the following
properties (see the sketch after this list):
* A maximum of 10 active connections
* A maximum connection checkout time of 20s before a connection is
considered for eviction
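
These defaults correspond to MyBatis' `PooledDataSource` settings
`poolMaximumActiveConnections` and `poolMaximumCheckoutTime`. A minimal
sketch of a pool configured with those values follows; the driver, URL, and
credentials are placeholders, not Aurora's actual wiring.

````
import org.apache.ibatis.datasource.pooled.PooledDataSource;

// Sketch of the two pool settings involved in this failure. The connection
// details are placeholders.
public class PoolDefaults {
  public static void main(String[] args) {
    PooledDataSource ds = new PooledDataSource(
        "org.h2.Driver", "jdbc:h2:mem:aurora", "sa", "");
    // MyBatis defaults, matching the list above.
    ds.setPoolMaximumActiveConnections(10); // at most 10 checked-out connections
    ds.setPoolMaximumCheckoutTime(20_000);  // 20s before a checkout can be reclaimed
    System.out.println("max active: " + ds.getPoolMaximumActiveConnections());
  }
}
````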

Under high read load, there can be many SQL queries waiting for a
connection. If a single connection is checked out for more than 20s, it will
likely be evicted. In this case, the `SCRIPT` query was taking more than 20s
while many other queries were pending, which caused MyBatis to evict the
connection running the `SCRIPT` query and made snapshot creation fail.

To fix this issue, operators used the `-db_max_active_connection_count`
flag to increase the maximum number of active MyBatis connections to 100.
Once the scheduler was able to serve requests again, operators used
`aurora_admin scheduler_snapshot` to force the creation of a snapshot. A
scheduler failover was then induced, and recovery time dropped to about 40
seconds.

Today this cluster continues to run with this flag and value so that it can
keep serving a high read load.

I would like to raise three questions:
* Should we add a flag to tune the maximum connection checkout time for
MyBatis?
* Should a snapshot creation failure be fatal?
* Should we change the default maximum connection checkout time and maximum
number of active connections?

[1]:
https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java#L107-L127

--
Zameer Manji

Re: A mini postmortem on snapshot failures

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
Thanks for the pointers regarding the broken documentation. I will fix that.

The configuration options have moved and are now described here: http://aurora.apache.org/documentation/latest/operations/configuration/#replicated-log-configuration


On 03/10/16 09:05, "meghdoot bhattacharya" <me...@yahoo.com.INVALID> wrote:

    Zameer, thanks for sharing this. For folks who are looking to operate Aurora with HA, this is very valuable. Operational insights from Aurora experts are always welcome. Not to hijack the conversation on the 3 questions you asked, but I found that in http://aurora.apache.org/documentation/latest/operations/storage/ the links to "Replicated log configuration" and "Backup configuration" are broken. The last reference I can find is in the 0.12 release: http://aurora.apache.org/documentation/0.12.0/storage-config/

    If the operational documents can be reviewed and enhanced further, that would be helpful. Highlighting stats like 'scheduler_log_snapshots' or the load time, for example, can be alert points for operators.
    Thx



Re: A mini postmortem on snapshot failures

Posted by meghdoot bhattacharya <me...@yahoo.com.INVALID>.
Zameer, thanks for sharing this. For folks who are looking to operate Aurora with HA, this is very valuable. Operational insights from Aurora experts are always welcome. Not to hijack the conversation on the 3 questions you asked, but I found that in http://aurora.apache.org/documentation/latest/operations/storage/ the links to "Replicated log configuration" and "Backup configuration" are broken. The last reference I can find is in the 0.12 release: http://aurora.apache.org/documentation/0.12.0/storage-config/

If the operational documents can be reviewed and enhanced further, that would be helpful. Highlighting stats like 'scheduler_log_snapshots' or the load time, for example, can be alert points for operators.
Thx

Re: A mini postmortem on snapshot failures

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
I am not aware of tickets or work on the _exact_ problem described in this thread.

However, there is currently some ongoing work to improve scheduler performance and availability during snapshots on large clusters:

• https://reviews.apache.org/r/54883/  (still pending)
• https://reviews.apache.org/r/54774/ with associated ticket https://issues.apache.org/jira/browse/AURORA-1861 (already merged)

Maybe those are also of interest to you.


On 31/12/2016, 00:03, "meghdoot bhattacharya" <me...@yahoo.com.INVALID> wrote:

    Is there a ticket for this? In our large test cluster, we ran into this under heavy load.
    Thx
    


Re: A mini postmortem on snapshot failures

Posted by meghdoot bhattacharya <me...@yahoo.com.INVALID>.
Is there a ticket for this? In our large test cluster, we ran into this under heavy load.
Thx


Re: A mini postmortem on snapshot failures

Posted by John Sirois <js...@apache.org>.
On Tue, Oct 4, 2016 at 11:55 AM, Joshua Cohen <jc...@apache.org> wrote:

> Hi Zameer,
>
> Thanks for this writeup!
>
> I think one other option to consider would be using a connection for
> writing the snapshots that's not bound by the pool's maximum checkout time.
> I'm not sure if this is feasible or not, but I worry that there's
> potentially no upper bound on raising the maximum checkout time as the size
> of a cluster grows or its read traffic grows. It feels a bit heavyweight
> to up the max checkout time when potentially the only connection exceeding
> the limits is the one writing the snapshot.
>

+1 to the above; it would be great if the snapshotter thread could grab its
own dedicated connection.  In fact, since the operation is relatively rare
(compared to all other reads and writes), it could even grab and dispose of
a new connection per snapshot to simplify things (i.e., no need to check for
a stale connection, as you would have to if you grabbed one and tried to
hold it for the lifetime of the scheduler process).
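
A rough sketch of that idea, assuming a short-lived JDBC connection opened
outside the MyBatis pool just for the snapshot (the URL and the SnapshotWriter
callback are placeholders, not Aurora's implementation):

````
import java.sql.Connection;
import java.sql.DriverManager;

// Hypothetical sketch: open a fresh connection per snapshot, outside the
// pool, so it is never subject to the pool's checkout-time eviction.
final class DedicatedSnapshotConnection {
  private static final String DB_URL = "jdbc:h2:mem:aurora"; // placeholder

  interface SnapshotWriter {
    void writeSnapshot(Connection conn) throws Exception;
  }

  void snapshot(SnapshotWriter writer) throws Exception {
    try (Connection conn = DriverManager.getConnection(DB_URL)) {
      writer.writeSnapshot(conn); // e.g. run the SCRIPT dump over this connection
    } // connection discarded here, so no stale-connection handling is needed
  }
}
````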


> I'd definitely be in favor of adding a flag to tune the maximum connection
> checkout.
>
> I'm neutral to negative on having snapshot creation failures be fatal. I
> don't necessarily think that one failed snapshot should take the scheduler
> down, but I agree that some number of consecutive failures is a Bad Thing
> that is worthy of investigation. My concern with having failures be fatal
> is the pathological case where snapshots always fail causing your scheduler
> to failover once every SNAPSHOT_INTERVAL. Do you think it would be
> sufficient to add `scheduler_log_snapshots` to the list of important
> stats[1]?
>
> I'm also neutral on changing the defaults. I'm not sure if it's warranted,
> as the behavior will vary based on cluster. It seems like you guys got bit
> by this due to a comparatively heavy read load? Our cluster, on the other
> hand, is probably significantly larger, but is not queried as much, and we
> haven't run into issues with the defaults. However, as long as there is
> no adverse impact to bumping the default values, I've got no objections.
>
> Cheers,
>
> Joshua
>
> [1]
> https://github.com/apache/aurora/blob/master/docs/operations/monitoring.md#important-stats
>

Re: A mini postmortem on snapshot failures

Posted by Joshua Cohen <jc...@apache.org>.
Hi Zameer,

Thanks for this writeup!

I think one other option to consider would be using a connection for
writing the snapshots that's not bound by the pool's maximum checkout time.
I'm not sure whether this is feasible, but I worry that there's potentially
no upper bound on raising the maximum checkout time as a cluster's size or
read traffic grows. It feels a bit heavyweight to raise the max checkout
time when potentially the only connection exceeding the limit is the one
writing the snapshot.
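
As a rough illustration of that idea, here's a minimal sketch that runs the
`SCRIPT` dump on a dedicated, unpooled MyBatis connection so it is never
subject to the pool's checkout-time eviction. The driver class, JDBC URL,
and credentials are placeholders rather than Aurora's real settings, and
this is only a sketch of the approach, not the actual snapshot code:

````
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.ibatis.datasource.unpooled.UnpooledDataSource;

public class UnpooledSnapshotDump {
  public static List<String> dumpSchema() throws Exception {
    // UnpooledDataSource opens a fresh connection per getConnection() call
    // and applies no maximum checkout time, unlike PooledDataSource.
    UnpooledDataSource ds = new UnpooledDataSource(
        "org.h2.Driver", "jdbc:h2:mem:aurora", "sa", "");

    List<String> statements = new ArrayList<>();
    try (Connection conn = ds.getConnection();
         Statement stmt = conn.createStatement();
         // H2's SCRIPT command returns the dump one SQL statement per row.
         ResultSet rs = stmt.executeQuery("SCRIPT")) {
      while (rs.next()) {
        statements.add(rs.getString(1));
      }
    }
    return statements;
  }
}
````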

I'd definitely be in favor of adding a flag to tune the maximum connection
checkout time.

I'm neutral to negative on having snapshot creation failures be fatal. I
don't necessarily think that one failed snapshot should take the scheduler
down, but I agree that some number of consecutive failures is a Bad Thing
worthy of investigation. My concern with making failures fatal is the
pathological case where snapshots always fail, causing your scheduler to
fail over once every SNAPSHOT_INTERVAL. Do you think it would be
sufficient to add `scheduler_log_snapshots` to the list of important
stats[1]?
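
For what it's worth, here's a rough sketch of that middle ground: only treat
the failure as fatal after N consecutive failures, while exporting a counter
so monitoring still sees individual failures. The class name and the two
callbacks are hypothetical stand-ins, not Aurora's actual LogStorage wiring:

````
import java.util.concurrent.atomic.AtomicLong;

public class GuardedSnapshotter {
  private final Runnable snapshotter;    // e.g. a LogStorage::doSnapshot stand-in
  private final Runnable escalate;       // e.g. trigger a failover / exit
  private final int maxConsecutiveFailures;
  private final AtomicLong failedSnapshots = new AtomicLong();  // exportable stat
  private int consecutiveFailures = 0;

  public GuardedSnapshotter(
      Runnable snapshotter, Runnable escalate, int maxConsecutiveFailures) {
    this.snapshotter = snapshotter;
    this.escalate = escalate;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  // Called once per SNAPSHOT_INTERVAL by the scheduling executor.
  public synchronized void runOnce() {
    try {
      snapshotter.run();
      consecutiveFailures = 0;           // a success resets the streak
    } catch (RuntimeException e) {
      failedSnapshots.incrementAndGet(); // visible to monitoring
      if (++consecutiveFailures >= maxConsecutiveFailures) {
        escalate.run();                  // only now is the failure fatal
      }
    }
  }
}
````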

I'm also neutral on changing the defaults. I'm not sure if it's warranted,
as the behavior will vary based on cluster. It seems like you guys got bit
by this due to a comparatively heavy read load? Our cluster, on the other
hand, is probably significantly larger, but is not queried as much, and we
haven't run into issues with the defaults. However, as long as there is no
adverse impact to bumping the default values, I've got no objections.

Cheers,

Joshua

[1]
https://github.com/apache/aurora/blob/master/docs/operations/monitoring.md#important-stats


Re: A mini postmortem on snapshot failures

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
An immediate failover seems rather drastic to me. However, I have no anecdotal evidence to back up this feeling, nor to motivate any other default config changes. Maybe Joshua can share what they are using so that we can adjust the default values accordingly?

Other thoughts:
• Have you tried this magic trick? https://issues.apache.org/jira/browse/AURORA-1211?focusedCommentId=14483533&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14483533
• A counter for failed scheduler backups sounds like a good idea. We could also mention this in our monitoring documentation (a rough external check is sketched below).
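
A rough sketch of such an external check, assuming the scheduler exposes its stats as plain text at /vars; the host and port are placeholders, and the two-hour wait is just an example interval longer than the snapshot interval:

````
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SnapshotProgressCheck {
  static long readSnapshotCounter(String varsUrl) throws Exception {
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new URL(varsUrl).openStream()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // /vars emits one "name value" pair per line.
        if (line.startsWith("scheduler_log_snapshots ")) {
          return Long.parseLong(line.split(" ")[1]);
        }
      }
    }
    throw new IllegalStateException("scheduler_log_snapshots not found");
  }

  public static void main(String[] args) throws Exception {
    String varsUrl = "http://scheduler.example.com:8081/vars";
    long before = readSnapshotCounter(varsUrl);
    Thread.sleep(2 * 60 * 60 * 1000L);  // wait longer than the snapshot interval
    long after = readSnapshotCounter(varsUrl);
    if (after <= before) {
      System.err.println("ALERT: no snapshot taken in the last interval");
    }
  }
}
````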

Best regards,
Stephan


