You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ambari.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2015/08/04 22:52:04 UTC
[jira] [Commented] (AMBARI-12570) Cluster creates stuck at 9x%
(deadlock sql exception)
[ https://issues.apache.org/jira/browse/AMBARI-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654328#comment-14654328 ]
Hadoop QA commented on AMBARI-12570:
------------------------------------
{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12748729/AMBARI-12570.patch.1
against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .
Test results: https://builds.apache.org/job/Ambari-trunk-test-patch/3524//testReport/
Console output: https://builds.apache.org/job/Ambari-trunk-test-patch/3524//console
This message is automatically generated.
> Cluster creates stuck at 9x% (deadlock sql exception)
> -----------------------------------------------------
>
> Key: AMBARI-12570
> URL: https://issues.apache.org/jira/browse/AMBARI-12570
> Project: Ambari
> Issue Type: Bug
> Affects Versions: 2.1.0
> Reporter: Jonathan Hurley
> Assignee: Jonathan Hurley
> Priority: Blocker
> Fix For: 2.1.1
>
> Attachments: AMBARI-12570.patch, AMBARI-12570.patch.1
>
>
> Similar to AMBARI-12526, Ambari installation via a blueprint on SQL Azure gets stuck somewhere between 90% and 100% because of a SQL Database deadlock.
> This is always between {{hostcomponentstate.current_state}} and {{hostcomponentstate.version}}.
> {code}
> Rollback reason:
> Local Exception Stack:
> Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.DatabaseException
> Internal Exception: com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 62) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
> Error Code: 1205
> Call: UPDATE hostcomponentstate SET current_state = ? WHERE ((((component_name = ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
> bind => [5 parameters bound]
> at org.eclipse.persistence.exceptions.DatabaseException.sqlException(DatabaseException.java:331)
> at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(DatabaseAccessor.java:900)
> at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(DatabaseAccessor.java:962)
> at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(DatabaseAccessor.java:631)
> at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(ParameterizedSQLBatchWritingMechanism.java:149)
> at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(ParameterizedSQLBatchWritingMechanism.java:134)
> at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(DatabaseAccessor.java:1836)
> at org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(AbstractSession.java:4244)
> at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(UnitOfWorkImpl.java:5594)
> at org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.writeChanges(RepeatableWriteUnitOfWork.java:453)
> at org.eclipse.persistence.internal.jpa.EntityManagerImpl.flush(EntityManagerImpl.java:863)
> at org.eclipse.persistence.internal.jpa.QueryImpl.performPreQueryFlush(QueryImpl.java:963)
> at org.eclipse.persistence.internal.jpa.QueryImpl.executeReadQuery(QueryImpl.java:207)
> at org.eclipse.persistence.internal.jpa.QueryImpl.getSingleResult(QueryImpl.java:517)
> at org.eclipse.persistence.internal.jpa.EJBQueryImpl.getSingleResult(EJBQueryImpl.java:400)
> at org.apache.ambari.server.orm.dao.DaoUtils.selectOne(DaoUtils.java:80)
> at org.apache.ambari.server.orm.dao.StackDAO.find(StackDAO.java:93)
> at org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(AmbariLocalSessionInterceptor.java:53)
> at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setStackVersion(ServiceComponentHostImpl.java:1058)
> at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:628)
> at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:610)
> at org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
> at org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
> at org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
> at org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
> at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.handleEvent(ServiceComponentHostImpl.java:901)
> at org.apache.ambari.server.state.cluster.ClusterImpl.processServiceComponentHostEvents(ClusterImpl.java:2508)
> at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:68)
> at org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:343)
> at org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:195)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> - We have dual X-locks on {{hostcomponentstate}} asking for U-locks when updating the CLUSTERED INDEX.
> - Both dual X-locks, from different transactions and different processes, are on the same row (technically impossible) - based on the XML execution plan, we can see that the concurrent UPDATE statements are executing on different rows due to their CLUSTERED INDEX predicate.
> - In Java, Ambari has locks which prevent concurrent U- or X-locks on the same row
> - Only happens on SQL Server
> My best suspicion right now is that we have a key hash collision happening on this table. That's why two processes appear to have the same lock even though they are on different rows.
> I was able to use a database dump that I took to compare hash values from {{hostcomponentstate}}:
> {code:sql}
> SELECT %%lockres%% as lock_hash, cluster_id, host_id, service_name, component_name
> FROM hostcomponentstate
> ORDER BY host_id, service_name, %%lockres%%
> {code}
> {code}
> lock_hash cluster_id host_id service_name component_name
> (0d4a8b0869f5) 2 1 HDFS SECONDARY_NAMENODE
> (0d4a8b0869f5) 2 1 HDFS HDFS_CLIENT
> (99fb0081b824) 2 1 MAPREDUCE2 HISTORYSERVER
> (7086998db3dc) 2 1 YARN APP_TIMELINE_SERVER
> (7086998db3dc) 2 1 YARN RESOURCEMANAGER
> (3bef52323322) 2 1 ZOOKEEPER ZOOKEEPER_SERVER
> ...
> {code}
> SQL Server is producing lock hashes that collide! It seems like the issue here is that we are using a CLUSTERED INDEX on 4 columns, 3 of which are always the same (cluster, host, service) in many cases. The only variable is the component name. When the hash gets truncated to 6 bytes, we get duplicates.
> So, I think this totally aligns with my suspicions as to why this is only a SQL Server problem since other database don't lock like this. It also makes sense that this is the only table this happens on since we are not using a surrogate PK here. I think we have a few options here:
> - Add more columns into the CLUSTERED INDEX in hopes we get a more unique hash. The problem is that the other columns are also basically the same.
> - Change the CLUSTERED INDEX to an UNCLUSTERED INDEX (since this is our main query criteria) and use a single, unique BIGINT PK as we do for many other tables. I'm just not sure how SQL Server locks on a row when there is a CLUSTERED INDEX which is not part of the predicate.
> - Remove the CLUSTERED INDEX entirely (performance would probably tank)
> - It's possible we can try to partition this table differently so that lock space is more unique.
> - Change the existing CLUSTERED INDEX so that it disallows row and page level locks, forcing all X-locks to be table-level locks. This would, in theory, prevent the deadlock and would not require us to change any data. But it would introduce a bottleneck on the table for anything more than a single read.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)