Posted to issues@trafodion.apache.org by "Suresh Subbiah (JIRA)" <ji...@apache.org> on 2015/10/08 07:11:26 UTC

[jira] [Assigned] (TRAFODION-924) LP Bug: 1413241 - ENDTRANSACTION hang, transaction state FORGETTING

     [ https://issues.apache.org/jira/browse/TRAFODION-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Subbiah reassigned TRAFODION-924:
----------------------------------------

    Assignee: Atanu Mishra  (was: John de Roo)

> LP Bug: 1413241 - ENDTRANSACTION hang, transaction state FORGETTING
> -------------------------------------------------------------------
>
>                 Key: TRAFODION-924
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-924
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: dtm
>            Reporter: Apache Trafodion
>            Assignee: Atanu Mishra
>            Priority: Critical
>             Fix For: 2.0-incubating
>
>
> A loop to reexecute the seabase developer regression suite hung on the 14th iteration in TEST016. The sqlci console looked like this:
> >>-- char type
> >>create table mcStatPart1
> +>(a int not null not droppable,
> +>b char(10) not null not droppable,
> +>f int, txt char(100),
> +>primary key (a,b))
> +>salt using 8 partitions ;
> --- SQL operation complete.
> >>
> >>insert into mcStatPart1 values (1,'123',1,'xyz'),(1,'133',1,'xyz'),(1,'423',1,'xyz'),(2,'111',1,'xyz'),(2,'223',1,'xyz'),(2,'323',1,'xyz'),(2,'423',1,'xyz'),
> +>                           (3,'123',1,'xyz'),(3,'133',1,'xyz'),(3,'423',1,'xyz'),(4,'111',1,'xyz'),(4,'223',1,'xyz'),(4,'323',1,'xyz'),(4,'423',1,'xyz');
> A pstack of the sqlci process (0,13231) showed it blocked in a call to ENDTRANSACTION, and dtmci showed this for the transaction:
> DTMCI > list
> Transid         Owner	eventQ	pending	Joiners	TSEs	State
> (0,13742)       0,13231	0	0	0	0	FORGETTING
> Here's a copy of Sean's analysis:
> From: Broeder, Sean 
> Sent: Wednesday, January 21, 2015 8:43 AM
> To: Hanlon, Mike; Cooper, Joanie
> Cc: DeRoo, John
> Subject: RE: ENDTRANSACTION hang, transaction state FORGETTING
> Hi Mike,
> It looks like we have a ZooKeeper problem right at the time of the commit. A table is offline:
> 2015-01-21 11:13:45,529 WARN zookeeper.ZKUtil: hconnection-0x1646b7c-0x14aefd0ac4a5e18, quorum=localhost:47570, baseZNode=/hbase Unable to get data of znode /hbase/table/TRAFODION.HBASE.MCSTATPART1
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/table/TRAFODION.HBASE.MCSTATPART1
> Then we fail after 3 retries of sending the commit request:
> 2015-01-21 11:14:04,405 ERROR transactional.TransactionManager: doCommitX, result size: 0
> 2015-01-21 11:14:04,405 ERROR transactional.TransactionManager: doCommitX, result size: 0
> Normally we would create a recovery entry for this transaction to redrive the commit, but it appears we are unable to do that because of the ZooKeeper errors:
> 2015-01-21 11:14:04,408 DEBUG client.HConnectionManager$HConnectionImplementation: Removed all cached region locations that map to g4t3005.houston.hp.com,42243,1421362639257
> 2015-01-21 11:14:05,255 WARN zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:47570, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/table/TRAFODION.HBASE.MCSTATPART1
> 2015-01-21 11:14:05,256 WARN zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:47570, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/table/TRAFODION.HBASE.MCSTATPART1
> 2015-01-21 11:14:05,256 INFO util.RetryCounter: Sleeping 1000ms before retry #0...
> 2015-01-21 11:14:05,256 INFO util.RetryCounter: Sleeping 1000ms before retry #0...
> HBase itself appears to be in trouble, as I can’t even run a list operation from the hbase shell:
> 2015-01-21 14:40:28,816 ERROR [main] client.HConnectionManager$HConnectionImplementation: Can't get connection to ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
> We need to think about how to handle this better in the TransactionManager, but realistically I’m not sure what we can do if ZooKeeper fails. You can open an LP bug so we have a record of it and can discuss what to do.
> Thanks,
> Sean
> _____________________________________________
> From: Hanlon, Mike 
> Sent: Wednesday, January 21, 2015 6:17 AM
> To: Cooper, Joanie
> Cc: Broeder, Sean; DeRoo, John
> Subject: ENDTRANSACTION hang, transaction state FORGETTING
> Hi Joanie,
> Have we seen this before? A SQL regression test (in this case seabase/TEST016) hangs in a call to ENDTRANSACTION, and dtmci shows the transaction state as FORGETTING. It is probably not easy to reproduce, since the problem occurred on the 14th iteration of a loop re-executing the seabase suite.
> There are a lot of messages in /opt/home/mhanlon/trafodion/core/sqf/logs/trafodion.dtm.log on my workstation, sqws112. The transid in question is 13742. Would somebody like to look while things are still hung, before I try to force a cleanup?
> thanks
> Mike
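
Sean's analysis in the quoted thread describes three steps: the commit request is retried a fixed number of times, a recovery entry is normally written so the commit can be redriven later, and in this incident the recovery entry could not be written either. The Java sketch below illustrates one way that flow could report an explicit outcome to the caller instead of leaving it blocked. It is not the Trafodion TransactionManager code. Every name in it (CommitRetrySketch, Outcome, RecoveryLog, commitWithRetry) is hypothetical, and the retry count and backoff are parameters chosen only for illustration.

```java
import java.util.concurrent.Callable;

public class CommitRetrySketch {

    /** Hypothetical outcomes a caller could act on instead of waiting forever. */
    enum Outcome { COMMITTED, RECOVERY_QUEUED, FAILED }

    /** Hypothetical stand-in for whatever persists a "redrive this commit later" entry. */
    interface RecoveryLog {
        void record(long transId) throws Exception;
    }

    /**
     * Tries the commit up to maxAttempts times. If every attempt fails, tries to
     * queue a recovery entry. If that fails as well (for example because the
     * coordination service is unreachable), returns FAILED so the caller is told.
     */
    static Outcome commitWithRetry(long transId,
                                   Callable<Boolean> sendCommit,
                                   RecoveryLog recoveryLog,
                                   int maxAttempts,
                                   long backoffMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (sendCommit.call()) {
                    return Outcome.COMMITTED;
                }
            } catch (InterruptedException e) {
                throw e; // do not swallow interruption
            } catch (Exception e) {
                // Assumption: any other exception is treated as transient and retried.
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoffMillis);
            }
        }
        try {
            recoveryLog.record(transId);
            return Outcome.RECOVERY_QUEUED;
        } catch (Exception e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate the reported situation: commits fail and the recovery log is unreachable.
        Outcome outcome = commitWithRetry(
                13742L,
                () -> false,
                id -> { throw new Exception("simulated connection loss"); },
                3,
                10L);
        System.out.println("transaction 13742 outcome: " + outcome);
    }
}
```

Returning a distinct FAILED value is only one possible design. Whether the real fix should abort, keep retrying in the background, or escalate to an operator is exactly the open question raised in the thread.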



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)