Posted to dev@cloudstack.apache.org by Alex Huang <Al...@citrix.com> on 2014/04/01 00:04:53 UTC

[DESIGN] No agent calls within database transactions....

Hi All,

I was alerted to this problem recently, and since it affects developers I want to bring it up.  It is a design principle in CloudStack that we do not make agent calls within database transactions.  The reason is that when you make a call to an external system, there's no guarantee on how long the call takes or even whether it returns.  When a call takes a long time, several bad things can happen:
	- The MySQL DB connection held open by the DB transaction goes idle.  Eventually a MySQL timeout hits, the connection is severed, and the transaction is rolled back.  By default, this timeout is 45 seconds but can be changed via a parameter in my.cnf.  So the problem is that the agent call completes just fine but the DB transaction rolls back and the changes are undone.
	- The rows locked in that transaction before the remote agent call could be holding up foreign key checks from other transactions against that table.  MySQL runs foreign key checks within transactions to make sure the data modification and the checks are done atomically.  Therefore, these checks must wait for the other transactions to complete.  Hence, an agent call that takes some time can severely slow down the system, particularly at scale.

We have two solutions to this:
	- Drive agent interactions with states.  There are many examples of this in VM, Volume, etc.
	- When the above cannot be done, acquire a lock in the lock table via a DAO method call.  Locks do not hold DB transactions open and therefore will not run into this problem.  However, you are responsible for releasing locks.  It used to be that if you forgot to release a lock, the @DB annotation automatically released it once execution went out of scope and asserted to alert the developer.  However, the @DB annotation has been removed in the Spring work so I'm not sure if that is still done.
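
As a rough illustration of the first option, the sketch below drives a VM start with states: one short transaction records the intent, the agent call runs with no transaction open, and a second short transaction records the outcome.  Every name in it (StateDrivenStart, Vm, Agent, updateStateInTxn) is a hypothetical stand-in, not a CloudStack class, and the "transaction" is only simulated by log entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are CloudStack classes.
class StateDrivenStart {
    enum State { STOPPED, STARTING, RUNNING, ERROR }

    /** Stand-in for a persisted VM row. */
    static class Vm {
        State state = State.STOPPED;
    }

    /** Stand-in for a remote agent; the call may be slow or may fail. */
    interface Agent {
        boolean start(Vm vm) throws Exception;
    }

    /** Records the order of events so it can be inspected. */
    static final List<String> log = new ArrayList<>();

    /** Simulates a short DB transaction that only updates the state. */
    static void updateStateInTxn(Vm vm, State next) {
        log.add("txn begin");
        vm.state = next;
        log.add("txn commit -> " + next);
    }

    static State startVm(Vm vm, Agent agent) {
        // 1. Short transaction: record the intent and commit immediately.
        updateStateInTxn(vm, State.STARTING);

        // 2. Agent call with no transaction open, so no rows stay locked
        //    and no DB connection sits idle while we wait.
        boolean ok;
        try {
            log.add("agent call");
            ok = agent.start(vm);
        } catch (Exception e) {
            ok = false;
        }

        // 3. Second short transaction: record the outcome.
        updateStateInTxn(vm, ok ? State.RUNNING : State.ERROR);
        return vm.state;
    }

    public static void main(String[] args) {
        Vm vm = new Vm();
        System.out.println(startVm(vm, v -> true));
        System.out.println(log);
    }
}
```

If the agent call hangs or fails, the row is left in an intermediate or error state that other code can see and clean up, instead of an open transaction that eventually rolls back.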

This is a tough problem to solve because:
	1. It usually works just fine during functional testing.  The problem surfaces during scale testing, often in unexpected places because of the foreign key check problem.
	2. It is difficult for developers to know whether a method they're calling within a transaction ends up making an agent call.

There is an assert in AgentManager to ensure that there are no DB transactions open before making an agent call.  Apparently, since the conversion to Maven, no one actually runs with asserts enabled any more.  Because of that, this design principle has been lost in CloudStack and we're finding more and more agent calls being made in DB transactions.  To counter that, I decided to add a global parameter that turns the assert into an actual exception.  All developers are advised to set this global parameter, check.txn.before.sending.agent.commands, during their own testing to make sure their code doesn't make agent calls in transactions.
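
To make the difference between the assert and the exception concrete, here is a minimal sketch of such a guard.  It is not the AgentManager code: the class AgentCallGuard, the thread-local transaction counter, and the boolean flag are all assumptions made for illustration, with the flag standing in for the check.txn.before.sending.agent.commands setting.

```java
// Hypothetical sketch of the guard idea; not the actual AgentManager code.
class AgentCallGuard {
    // Assumed stand-in for per-thread "is a DB transaction open?" tracking.
    private static final ThreadLocal<Integer> txnDepth =
            ThreadLocal.withInitial(() -> 0);

    // Stands in for the check.txn.before.sending.agent.commands setting;
    // how the real setting is read is not shown here.
    static volatile boolean checkTxnBeforeSending = true;

    static void beginTxn()  { txnDepth.set(txnDepth.get() + 1); }
    static void commitTxn() { txnDepth.set(txnDepth.get() - 1); }

    static String send(String command) {
        boolean inTxn = txnDepth.get() > 0;
        // With the flag on, the check fires whether or not the JVM was
        // started with -ea.
        if (inTxn && checkTxnBeforeSending) {
            throw new IllegalStateException(
                    "agent command '" + command + "' sent inside a DB transaction");
        }
        // With the flag off, only a JVM running with -ea catches the problem.
        assert !inTxn : "agent command sent inside a DB transaction";
        return "sent: " + command;
    }

    public static void main(String[] args) {
        System.out.println(send("StartCommand"));
        beginTxn();
        try {
            send("StopCommand");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        } finally {
            commitTxn();
        }
    }
}
```

The point of the explicit exception is that Java disables assert statements unless the JVM is launched with -ea (or -enableassertions), so a check that relies only on assert silently disappears in a default run.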

--Alex

  

RE: [DESIGN] No agent calls within database transactions....

Posted by Alex Huang <Al...@citrix.com>.
Assert should always be on when we're QAing the system, including BVT.  I've alerted many people who work on QA about that.  Unfortunately, that's a deployment-time setting, so it's up to the deployer who's running the BVT to set it.  Perhaps we can write a script that performs the QA deployment with all of these settings already in place.  That way the QA deployment is always the same.

--Alex

> -----Original Message-----
> From: Marcus [mailto:shadowsor@gmail.com]
> Sent: Monday, March 31, 2014 3:15 PM
> To: dev@cloudstack.apache.org
> Subject: Re: [DESIGN] No agent calls within database transactions....
> 
> Yeah, that assert issue has bitten us once or twice, and I know Ryan
> squawked about it at some point.  Do we have any point where enforcement
> will occur (BVT or some other tests)?
> 

Re: [DESIGN] No agent calls within database transactions....

Posted by Marcus <sh...@gmail.com>.
Yeah, that assert issue has bitten us once or twice, and I know Ryan
squawked about it at some point.  Do we have any point where
enforcement will occur (BVT or some other tests)?
