Posted to dev@ambari.apache.org by John Speidel <js...@hortonworks.com> on 2015/05/26 21:37:37 UTC

Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/
-----------------------------------------------------------

Review request for Ambari, Robert Nettleton and Tom Beerbower.


Bugs: AMBARI-11394
    https://issues.apache.org/jira/browse/AMBARI-11394


Repository: ambari


Description
-------

Provisioning a cluster may occasionally fail to complete as a result of an out-of-order database write.
This error presents itself as one or more start tasks that never progress beyond the PENDING state. For these logical pending tasks, there are no associated physical tasks.
When a host is matched to a host request, an install request is submitted, followed immediately by a start request. The install task transitions the desired_state of all host components on the host from INIT to INSTALLED. But because of an error in the persistence layer, after the desired_state is set to INSTALLED, it is overwritten with INIT on another thread (the heartbeat handler thread). As a result, the component is never started, because its desired state is INIT and it is therefore not processed by the start operation.
The root cause is that the public method ServiceComponentHostImpl.handleEvent() is annotated with '@Transactional'. The proper locks are acquired inside this method, but because the method is marked @Transactional, its invocation is wrapped in a proxy that runs the method inside a transaction. As a result, the transaction is committed in the proxy after the method returns, outside of any synchronization, which allows out-of-order writes.
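
To make the ordering problem concrete, here is a minimal, self-contained sketch. It is not the Ambari code and not the patch under review: the class, method, and field names are hypothetical, the "database" is a plain field, and the transactional proxy is modelled as a Runnable that the caller runs after the locked body has returned. It replays one unlucky interleaving step by step on a single thread, then shows one possible fix shape in which the write is committed before the lock is released.

```java
import java.util.concurrent.locks.ReentrantLock;

class CommitOutsideLockSketch {
    // Hypothetical stand-ins: the in-memory desired state and the persisted column.
    static String inMemory;
    static String persisted;
    static final ReentrantLock LOCK = new ReentrantLock();

    /**
     * Broken shape: the body runs under the lock, but the write it prepared is
     * only committed later by the caller (playing the transactional proxy).
     * A null newState models an event that does not change the desired state.
     */
    static Runnable bodyWithDeferredCommit(String newState) {
        LOCK.lock();
        try {
            if (newState != null) {
                inMemory = newState;
            }
            final String snapshot = inMemory;        // value this transaction will write
            return () -> { persisted = snapshot; };  // deferred commit
        } finally {
            LOCK.unlock();                           // lock is released before the commit
        }
    }

    /** Possible fix shape (an assumption): commit before releasing the lock. */
    static void bodyWithCommitUnderLock(String newState) {
        LOCK.lock();
        try {
            if (newState != null) {
                inMemory = newState;
            }
            persisted = inMemory;
        } finally {
            LOCK.unlock();
        }
    }

    static void reset() {
        inMemory = "INIT";
        persisted = "INIT";
    }

    /** Replays one unlucky interleaving of the two callers. */
    static String runBroken() {
        reset();
        Runnable heartbeatCommit = bodyWithDeferredCommit(null);       // snapshot: INIT
        Runnable installCommit = bodyWithDeferredCommit("INSTALLED");  // snapshot: INSTALLED
        installCommit.run();     // the install commit lands first
        heartbeatCommit.run();   // the stale heartbeat commit lands last
        return persisted;
    }

    /** Same two events, but each commit happens while the lock is held. */
    static String runFixed() {
        reset();
        bodyWithCommitUnderLock(null);
        bodyWithCommitUnderLock("INSTALLED");
        return persisted;
    }

    public static void main(String[] args) {
        System.out.println("commit outside lock: " + runBroken());
        System.out.println("commit under lock:   " + runFixed());
    }
}
```

In the first run the persisted value ends up as INIT even though the install event was applied under the lock, which matches the symptom described above. In the second run the commit order always follows the lock order, so the stale write cannot land last.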


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java dd06eb5 

Diff: https://reviews.apache.org/r/34677/diff/


Testing
-------

- provisioned clusters via BP
- currently re-running unit test suite and will update with results prior to merging


Thanks,

John Speidel


Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

Posted by Robert Nettleton <rn...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/#review85232
-----------------------------------------------------------

Ship it!



- Robert Nettleton


Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

Posted by John Speidel <js...@hortonworks.com>.

> On May 26, 2015, 7:50 p.m., Robert Levas wrote:
> > Ship It!

Forgot to add test results:
Results :

Tests run: 3011, Failures: 0, Errors: 0, Skipped: 21
...
----------------------------------------------------------------------
Total run:743
Total errors:0
Total failures:0


- John


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/#review85239
-----------------------------------------------------------


On May 26, 2015, 7:42 p.m., John Speidel wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34677/
> -----------------------------------------------------------
> 
> (Updated May 26, 2015, 7:42 p.m.)
> 
> 
> Review request for Ambari, Robert Nettleton and Tom Beerbower.
> 
> 
> Bugs: AMBARI-11394
>     https://issues.apache.org/jira/browse/AMBARI-11394
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Provisioning a cluster may occasionally fail to complete as a result of an out-of-order database write.
> This error presents itself as one or more start tasks that never progress beyond the PENDING state. For these logical pending tasks, there are no associated physical tasks.
> When a host is matched to a host request, an install request is submitted, followed immediately by a start request. The install task transitions the desired_state of all host components on the host from INIT to INSTALLED. But because of an error in the persistence layer, after the desired_state is set to INSTALLED, it is overwritten with INIT on another thread (the heartbeat handler thread). As a result, the component is never started, because its desired state is INIT and it is therefore not processed by the start operation.
> The root cause is that the public method ServiceComponentHostImpl.handleEvent() is annotated with '@Transactional'. The proper locks are acquired inside this method, but because the method is marked @Transactional, its invocation is wrapped in a proxy that runs the method inside a transaction. As a result, the transaction is committed in the proxy after the method returns, outside of any synchronization, which allows out-of-order writes.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java dd06eb5 
> 
> Diff: https://reviews.apache.org/r/34677/diff/
> 
> 
> Testing
> -------
> 
> - provisioned clusters via BP
> - currently re-running unit test suite and will update with results prior to merging
> 
> Because this is a timing issue which, according to a user, occurs for them only once every ~150 clusters, and I have been unable to reproduce it, I wasn't able to verify that this patch completely fixes the issue.  But I can say with certainty that the issue that was fixed could manifest itself precisely as the bug describes.
> 
> 
> Thanks,
> 
> John Speidel
> 
>


Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

Posted by Robert Levas <rl...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/#review85239
-----------------------------------------------------------

Ship it!



- Robert Levas


Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

Posted by Erik Bergenholtz <eb...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/#review85235
-----------------------------------------------------------

Ship it!


- Erik Bergenholtz


Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

Posted by Tom Beerbower <tb...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/#review85233
-----------------------------------------------------------

Ship it!



- Tom Beerbower


Re: Review Request 34677: Blueprint cluster provision occasionally fails due to out of order database writes

Posted by John Speidel <js...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/
-----------------------------------------------------------

(Updated May 26, 2015, 7:42 p.m.)


Review request for Ambari, Robert Nettleton and Tom Beerbower.


Bugs: AMBARI-11394
    https://issues.apache.org/jira/browse/AMBARI-11394


Repository: ambari


Description
-------

Provisioning a cluster may occasionally fail to complete as a result of an out-of-order database write.
This error presents itself as one or more start tasks that never progress beyond the PENDING state. For these logical pending tasks, there are no associated physical tasks.
When a host is matched to a host request, an install request is submitted, followed immediately by a start request. The install task transitions the desired_state of all host components on the host from INIT to INSTALLED. But because of an error in the persistence layer, after the desired_state is set to INSTALLED, it is overwritten with INIT on another thread (the heartbeat handler thread). As a result, the component is never started, because its desired state is INIT and it is therefore not processed by the start operation.
The root cause is that the public method ServiceComponentHostImpl.handleEvent() is annotated with '@Transactional'. The proper locks are acquired inside this method, but because the method is marked @Transactional, its invocation is wrapped in a proxy that runs the method inside a transaction. As a result, the transaction is committed in the proxy after the method returns, outside of any synchronization, which allows out-of-order writes.


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java dd06eb5 

Diff: https://reviews.apache.org/r/34677/diff/


Testing (updated)
-------

- provisioned clusters via BP
- currently re-running unit test suite and will update with results prior to merging

Because this is a timing issue which, according to a user, occurs for them only once every ~150 clusters, and I have been unable to reproduce it, I wasn't able to verify that this patch completely fixes the issue.  But I can say with certainty that the issue that was fixed could manifest itself precisely as the bug describes.


Thanks,

John Speidel