You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ambari.apache.org by Jonathan Hurley <jh...@hortonworks.com> on 2015/02/24 06:36:09 UTC

Review Request 31341: Performance: Cluster Installation Deadlocks When Setting Component States

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31341/
-----------------------------------------------------------

Review request for Ambari, Nate Cole and Tom Beerbower.


Bugs: AMBARI-9761
    https://issues.apache.org/jira/browse/AMBARI-9761


Repository: ambari


Description
-------

Another case of misunderstanding how locks work.

During provisioning of a cluster with at least 200 hosts, Ambari Server becomes unresponsive. Based on the thread dump, there exists a deadlock between:
- Cluster readers
- Cluster writers
- ServiceComponentHost writers

qtp626652285-97   ClusterImpl.convertToResponse() (cluster readLock)
qtp1282624353-47  ServiceComponentHostImpl.setRestartRequired() (sch writeLock)
qtp626652285-97   ServiceComponentHostImpl.getMaintenanceState() (sch readLock BLOCKED by qtp1282624353-47)
qtp1282624353-60  ClusterImpl.recalculateClusterVersionState() (cluster writeLock BLOCKED by qtp626652285-97)
qtp1282624353-47  ServiceComponentHostImpl.isPersisted() (cluster readLock BLOCKED by qtp1282624353-60)

The underlying problem is that a writeLock.lock() is parked which causes all subsequent readLock.lock() requests to also park. This includes the request from qtp1282624353-47 which is holding a writeLock on the SCH which, in turn, is blocking qtp626652285-97 (the original cluster readLock reader which blocks the cluster write)

Long story short is that I think we need to revisit locks again after 2.0.0; I just don't see a need for locking on reads in most places - that's what the database is doing for us.


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/events/listeners/upgrade/StackVersionListener.java 117526c 
  ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 0de62ea 
  ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java c43044c 
  ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java 96a1443 

Diff: https://reviews.apache.org/r/31341/diff/


Testing
-------

Reproduced the deadlock in a unit test first, and then verified the deadlock does not occur anymore in the test after applying the patch.


Thanks,

Jonathan Hurley


Re: Review Request 31341: Performance: Cluster Installation Deadlocks When Setting Component States

Posted by Jonathan Hurley <jh...@hortonworks.com>.

> On Feb. 24, 2015, 7:23 a.m., Nate Cole wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java, lines 413-417
> > <https://reviews.apache.org/r/31341/diff/1/?file=873469#file873469line413>
> >
> >     Do we even need a read lock around the simple boolean check?  There is a write lock around the persist() method which seems like the only place where it's required.

Thanks for the review! I was debating the same issue. I've left the locks in place that involve non-thread-safe members. In this case though, you're right that the `persist()` method is enough. Furthermore, this member is only ever set in the event that the service is brand new.

Overall, we really need to remove the locks from many of the read methods. I'm seriously debating removing the read locks around `convertToResponse()` but felt that this close to release, that's risky.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31341/#review73801
-----------------------------------------------------------


On Feb. 24, 2015, 12:36 a.m., Jonathan Hurley wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31341/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2015, 12:36 a.m.)
> 
> 
> Review request for Ambari, Nate Cole and Tom Beerbower.
> 
> 
> Bugs: AMBARI-9761
>     https://issues.apache.org/jira/browse/AMBARI-9761
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Another case of misunderstanding how locks work.
> 
> During provisioning of a cluster with at least 200 hosts, Ambari Server becomes unresponsive. Based on the thread dump, there exists a deadlock between:
> - Cluster readers
> - Cluster writers
> - ServiceComponentHost writers
> 
> qtp626652285-97   ClusterImpl.convertToResponse() (cluster readLock)
> qtp1282624353-47  ServiceComponentHostImpl.setRestartRequired() (sch writeLock)
> qtp626652285-97   ServiceComponentHostImpl.getMaintenanceState() (sch readLock BLOCKED by qtp1282624353-47)
> qtp1282624353-60  ClusterImpl.recalculateClusterVersionState() (cluster writeLock BLOCKED by qtp626652285-97)
> qtp1282624353-47  ServiceComponentHostImpl.isPersisted() (cluster readLock BLOCKED by qtp1282624353-60)
> 
> The underlying problem is that a writeLock.lock() is parked which causes all subsequent readLock.lock() requests to also park. This includes the request from qtp1282624353-47 which is holding a writeLock on the SCH which, in turn, is blocking qtp626652285-97 (the original cluster readLock reader which blocks the cluster write)
> 
> Long story short is that I think we need to revisit locks again after 2.0.0; I just don't see a need for locking on reads in most places - that's what the database is doing for us.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/events/listeners/upgrade/StackVersionListener.java 117526c 
>   ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 0de62ea 
>   ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java c43044c 
>   ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java 96a1443 
> 
> Diff: https://reviews.apache.org/r/31341/diff/
> 
> 
> Testing
> -------
> 
> Reproduced the deadlock in a unit test first, and then verified the deadlock does not occur anymore in the test after applying the patch.
> 
> 
> Thanks,
> 
> Jonathan Hurley
> 
>


Re: Review Request 31341: Performance: Cluster Installation Deadlocks When Setting Component States

Posted by Nate Cole <nc...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31341/#review73801
-----------------------------------------------------------

Ship it!



ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java
<https://reviews.apache.org/r/31341/#comment120214>

    Do we even need a read lock around the simple boolean check?  There is a write lock around the persist() method which seems like the only place where it's required.


- Nate Cole


On Feb. 24, 2015, 12:36 a.m., Jonathan Hurley wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31341/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2015, 12:36 a.m.)
> 
> 
> Review request for Ambari, Nate Cole and Tom Beerbower.
> 
> 
> Bugs: AMBARI-9761
>     https://issues.apache.org/jira/browse/AMBARI-9761
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Another case of misunderstanding how locks work.
> 
> During provisioning of a cluster with at least 200 hosts, Ambari Server becomes unresponsive. Based on the thread dump, there exists a deadlock between:
> - Cluster readers
> - Cluster writers
> - ServiceComponentHost writers
> 
> qtp626652285-97   ClusterImpl.convertToResponse() (cluster readLock)
> qtp1282624353-47  ServiceComponentHostImpl.setRestartRequired() (sch writeLock)
> qtp626652285-97   ServiceComponentHostImpl.getMaintenanceState() (sch readLock BLOCKED by qtp1282624353-47)
> qtp1282624353-60  ClusterImpl.recalculateClusterVersionState() (cluster writeLock BLOCKED by qtp626652285-97)
> qtp1282624353-47  ServiceComponentHostImpl.isPersisted() (cluster readLock BLOCKED by qtp1282624353-60)
> 
> The underlying problem is that a writeLock.lock() is parked which causes all subsequent readLock.lock() requests to also park. This includes the request from qtp1282624353-47 which is holding a writeLock on the SCH which, in turn, is blocking qtp626652285-97 (the original cluster readLock reader which blocks the cluster write)
> 
> Long story short is that I think we need to revisit locks again after 2.0.0; I just don't see a need for locking on reads in most places - that's what the database is doing for us.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/events/listeners/upgrade/StackVersionListener.java 117526c 
>   ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 0de62ea 
>   ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java c43044c 
>   ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java 96a1443 
> 
> Diff: https://reviews.apache.org/r/31341/diff/
> 
> 
> Testing
> -------
> 
> Reproduced the deadlock in a unit test first, and then verified the deadlock does not occur anymore in the test after applying the patch.
> 
> 
> Thanks,
> 
> Jonathan Hurley
> 
>


Re: Review Request 31341: Performance: Cluster Installation Deadlocks When Setting Component States

Posted by Tom Beerbower <tb...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31341/#review73795
-----------------------------------------------------------

Ship it!


Ship It!

- Tom Beerbower


On Feb. 24, 2015, 5:36 a.m., Jonathan Hurley wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/31341/
> -----------------------------------------------------------
> 
> (Updated Feb. 24, 2015, 5:36 a.m.)
> 
> 
> Review request for Ambari, Nate Cole and Tom Beerbower.
> 
> 
> Bugs: AMBARI-9761
>     https://issues.apache.org/jira/browse/AMBARI-9761
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Another case of misunderstanding how locks work.
> 
> During provisioning of a cluster with at least 200 hosts, Ambari Server becomes unresponsive. Based on the thread dump, there exists a deadlock between:
> - Cluster readers
> - Cluster writers
> - ServiceComponentHost writers
> 
> qtp626652285-97   ClusterImpl.convertToResponse() (cluster readLock)
> qtp1282624353-47  ServiceComponentHostImpl.setRestartRequired() (sch writeLock)
> qtp626652285-97   ServiceComponentHostImpl.getMaintenanceState() (sch readLock BLOCKED by qtp1282624353-47)
> qtp1282624353-60  ClusterImpl.recalculateClusterVersionState() (cluster writeLock BLOCKED by qtp626652285-97)
> qtp1282624353-47  ServiceComponentHostImpl.isPersisted() (cluster readLock BLOCKED by qtp1282624353-60)
> 
> The underlying problem is that a writeLock.lock() is parked which causes all subsequent readLock.lock() requests to also park. This includes the request from qtp1282624353-47 which is holding a writeLock on the SCH which, in turn, is blocking qtp626652285-97 (the original cluster readLock reader which blocks the cluster write)
> 
> Long story short is that I think we need to revisit locks again after 2.0.0; I just don't see a need for locking on reads in most places - that's what the database is doing for us.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/java/org/apache/ambari/server/events/listeners/upgrade/StackVersionListener.java 117526c 
>   ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 0de62ea 
>   ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java c43044c 
>   ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java 96a1443 
> 
> Diff: https://reviews.apache.org/r/31341/diff/
> 
> 
> Testing
> -------
> 
> Reproduced the deadlock in a unit test first, and then verified the deadlock does not occur anymore in the test after applying the patch.
> 
> 
> Thanks,
> 
> Jonathan Hurley
> 
>