Posted to notifications@accumulo.apache.org by "Teng Qiu (JIRA)" <ji...@apache.org> on 2016/03/25 11:51:25 UTC
[jira] [Commented] (ACCUMULO-3336) ZK session reconnect still results in loss of ZK lock

    [ https://issues.apache.org/jira/browse/ACCUMULO-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211692#comment-15211692 ] 

Teng Qiu commented on ACCUMULO-3336:
------------------------------------

Hi, our Accumulo cluster keeps running into this problem: after a few days, all of the tserver nodes are gone. The tserver processes are still running and their ports are still open, but we keep getting exceptions like:
{code}
2016-03-24 19:40:46,358 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/xxxxxx/tables/2/conf/table.split.threshold
{code}
or
{code}
2016-03-24 19:40:56,464 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/xxxxxx/config/tserver.assignment.duration.warning
{code}

(xxxxxx is the Accumulo instance id)

We tried both Accumulo 1.7.0 and 1.7.1, and ZooKeeper 3.4.6 and 3.4.8; we always lose all tservers.

[~elserj]
bq. is it all/many of your tabletservers that are losing their lock, or just the one that is co-located with the master, GC, monitor, and tracer?

Our cluster runs on AWS, on EC2 t2.medium instances (2 cores, 4 GB RAM): 1 master node with the GC and monitor, and more than 15 tserver nodes, all dead...

The tserver memory settings were configured with the official bootstrap_config for the 3GB size: https://github.com/apache/accumulo/blob/rel/1.7.1/assemble/bin/bootstrap_config.sh#L159-L171

GC looks good, with no stop-the-world pauses. The last GC messages were:
{code}
2016-03-22 21:49:05,996 [server.GarbageCollectionLogger] DEBUG: gc ParNew=18.77(+0.00) secs ConcurrentMarkSweep=0.24(+0.00) secs freemem=845,489,560(-120,178,064) totalmem=1,021,313,024
2016-03-22 21:54:36,010 [server.GarbageCollectionLogger] DEBUG: gc ParNew=18.77(+0.00) secs ConcurrentMarkSweep=0.24(+0.00) secs freemem=844,930,600(-120,737,024) totalmem=1,021,313,024
2016-03-22 22:00:01,024 [server.GarbageCollectionLogger] DEBUG: gc ParNew=18.78(+0.00) secs ConcurrentMarkSweep=0.24(+0.00) secs freemem=845,521,112(-120,146,512) totalmem=1,021,313,024
{code}

After March 22nd there are no more GC messages...
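
To check the "no stop-the-world" reading of lines like these, the per-collector increments can be pulled out mechanically. This is a minimal sketch, assuming the first number after each collector name is cumulative collection time and the parenthesised number is the increase since the previous log line. The class and method names (GcLogDeltas, deltas, hasLongPause) are made up for illustration and are not part of Accumulo:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Accumulo class.
final class GcLogDeltas {
    // Matches e.g. "ParNew=18.77(+0.00) secs":
    // group 1 = collector name, group 2 = total seconds, group 3 = increase in seconds.
    private static final Pattern COLLECTOR =
        Pattern.compile("(\\w+)=([\\d,.]+)\\(\\+([\\d,.]+)\\) secs");

    /** Returns collector name -> seconds added since the previous log line. */
    static Map<String, Double> deltas(String logLine) {
        Map<String, Double> result = new LinkedHashMap<>();
        Matcher m = COLLECTOR.matcher(logLine);
        while (m.find()) {
            // Strip thousands separators before parsing.
            result.put(m.group(1), Double.parseDouble(m.group(3).replace(",", "")));
        }
        return result;
    }

    /** True if any collector added more than thresholdSecs since the previous line. */
    static boolean hasLongPause(String logLine, double thresholdSecs) {
        for (double delta : deltas(logLine).values()) {
            if (delta > thresholdSecs) {
                return true;
            }
        }
        return false;
    }
}
```

For the three lines above every increment is +0.00, so hasLongPause(line, 1.0) is false for each of them. This only measures time attributed to the collectors; it says nothing about pauses from other causes.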

The strangest thing is that we have a smaller cluster, with 1 master node and 3 tserver nodes, the same instance type and the same memory settings, which has been running very stably for around 2 months... all of its tserver nodes are still there.

Any ideas?

Thanks

> ZK session reconnect still results in loss of ZK lock
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3336
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3336
>             Project: Accumulo
>          Issue Type: Bug
>          Components: zookeeper
>    Affects Versions: 1.5.2, 1.6.1
>            Reporter: Josh Elser
>             Fix For: 1.8.0
>
>
> Saw the following
> {noformat}
> 2014-11-14 08:38:30,612 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:30,621 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/config/tserver.compaction.warn.time
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:96)
> 	at org.apache.accumulo.server.conf.ZooConfiguration._get(ZooConfiguration.java:65)
> 	at org.apache.accumulo.server.conf.ZooConfiguration.get(ZooConfiguration.java:90)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getTimeInMillis(AccumuloConfiguration.java:136)
> 	at org.apache.accumulo.tserver.CompactionWatcher.run(CompactionWatcher.java:84)
> 	at org.apache.accumulo.server.util.time.SimpleTimer$LoggingTimerTask.run(SimpleTimer.java:42)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> 2014-11-14 08:38:30,672 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:30,672 [zookeeper.ZooLock] DEBUG: event null None Disconnected
> 2014-11-14 08:38:31,484 [zookeeper.ZooReader] WARN : Saw (possibly) transient exception communicating with ZooKeeper
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tservers/ip-172-31-13-177:37709
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooReader.getStatus(ZooReader.java:109)
> 	at org.apache.accumulo.fate.zookeeper.ZooLock.process(ZooLock.java:381)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2014-11-14 08:38:31,484 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tables/!0/namespace
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:260)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:157)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:285)
> 	at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:232)
> 	at org.apache.accumulo.core.client.impl.Tables.getNamespaceId(Tables.java:304)
> 	at org.apache.accumulo.server.conf.TableParentConfiguration.getNamespaceId(TableParentConfiguration.java:47)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.getPath(NamespaceConfiguration.java:85)
> 	at org.apache.accumulo.server.conf.NamespaceConfiguration.get(NamespaceConfiguration.java:98)
> 	at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:107)
> 	at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:103)
> 	at org.apache.accumulo.core.conf.AccumuloConfiguration.getCount(AccumuloConfiguration.java:193)
> 	at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:2636)
> 	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> 	at java.lang.Thread.run(Thread.java:745)
> 2014-11-14 08:38:31,484 [zookeeper.Retry] DEBUG: Sleeping for 250ms before retrying operation
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,485 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Removing closed ZooKeeper session to localhost:12644
> 2014-11-14 08:38:31,588 [zookeeper.ZooSession] DEBUG: Connecting to localhost:12644 with timeout 30000 with auth
> 2014-11-14 08:38:31,692 [tserver.TabletServer] DEBUG: gc ParNew=0.10(+0.04) secs ConcurrentMarkSweep=0.05(+0.00) secs freemem=118,013,904(+6,412,200) totalmem=129,761,280
> 2014-11-14 08:38:31,692 [tserver.TabletServer] WARN : GC pause checker not called in a timely fashion. Expected every 5.0 seconds but was 43.1 seconds since last check
> 2014-11-14 08:38:31,700 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,701 [zookeeper.DistributedWorkQueue] DEBUG: Looking for work in /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,715 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.13.177:35935 !0 1 entries in 0.03 secs, nbTimes = [24 24 24.00 1] 
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Scanning trace hosts in zookeeper: /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/tracers
> 2014-11-14 08:38:31,737 [trace.ZooTraceClient] DEBUG: Trace hosts: []
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/replication/workqueue
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/bulk_failed_copyq
> 2014-11-14 08:38:31,739 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None for /accumulo/00f38e67-3c29-4cf0-a394-f423a0b33b6b/recovery
> 2014-11-14 08:38:31,739 [zookeeper.ZooSession] DEBUG: Session expired, state of current session : Expired
> 2014-11-14 08:38:31,739 [zookeeper.ZooLock] DEBUG: event null None Expired
> 2014-11-14 08:38:31,741 [tserver.TabletServer] FATAL: Lost tablet server lock (reason = SESSION_EXPIRED), exiting.
> {noformat}
> ZooKeeper code appears to have disconnected, closed the disconnected connection and then opened a new session. However, the ZooLock, IIRC, didn't reconnect and hung the tserver.
> If we want to support this, it might require rehashing some of the ZooLock code (to prevent the tserver from processing while the tserver doesn't have its lock).
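
The quoted log contains the two numbers needed to reason about the expiry: the session was requested with a timeout of 30000 ms, and the pause checker reports a 43.1 second gap. This is a minimal sketch of that comparison, assuming a client that is silent for longer than the session timeout gets its session expired by the server (the timeout actually negotiated with the server may differ from the requested one). PauseVsSessionTimeout and its methods are hypothetical helpers, not Accumulo or ZooKeeper APIs:

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Accumulo or ZooKeeper class.
final class PauseVsSessionTimeout {
    // Matches the tserver WARN message; group 2 is the observed gap in seconds.
    private static final Pattern PAUSE = Pattern.compile(
        "Expected every ([\\d.]+) seconds but was ([\\d.]+) seconds since last check");

    /** Returns the observed gap in seconds, or empty if the line is not a pause warning. */
    static OptionalDouble observedGapSecs(String logLine) {
        Matcher m = PAUSE.matcher(logLine);
        return m.find()
            ? OptionalDouble.of(Double.parseDouble(m.group(2)))
            : OptionalDouble.empty();
    }

    /** True if the line reports a gap longer than the given session timeout. */
    static boolean exceedsTimeout(String logLine, long sessionTimeoutMillis) {
        OptionalDouble gap = observedGapSecs(logLine);
        return gap.isPresent() && gap.getAsDouble() * 1000.0 > sessionTimeoutMillis;
    }
}
```

With the WARN line from the log and a 30000 ms timeout, exceedsTimeout returns true, which is consistent with the SESSION_EXPIRED that follows. The GC increments on the preceding line are only +0.04 and +0.00 seconds, so the gap was not time attributed to the collectors.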



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)