Posted to user@accumulo.apache.org by Noe Detore <nd...@minerkasch.com> on 2016/10/07 14:34:25 UTC

Lost tablet server lock..SESSION_EXPIRED

Are there any updates on this issue,
https://issues.apache.org/jira/browse/ACCUMULO-3336 ? I am seeing this
behavior with 1.7.2 on one of our clusters but not on the others. What
could be some causes? Swap does not look like the problem, as there is none.
Are there particular configurations to adjust?

org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired ...
2016-10-06 23:22:30,633 [zookeeper.DistributedWorkQueue] INFO : Got
unexpected zookeeper event: None for ...
2016-10-06 23:22:30,679 [tserver.TabletServer] ERROR: Lost tablet server
lock (reason = SESSION_EXPIRED), exiting

Thanks
Noe

Re: Lost tablet server lock..SESSION_EXPIRED

Posted by Josh Elser <jo...@gmail.com>.
Was the server busy? Was there high iowait? Was the JVM spending a lot 
of time in GC?

You haven't provided enough information to prove that you saw the same 
thing I reported in ACCUMULO-3336.

This is most likely a matter of configuring Accumulo properly for your 
system, not a bug in Accumulo.

Re: Lost tablet server lock..SESSION_EXPIRED

Posted by Josh Elser <jo...@gmail.com>.
Jeff Kubina wrote:
> If you are doing a lot of ingesting via batch writes (which the Upsess
> implies), you might consider increasing tserver.walog.max.size to 2G
> instead of 1G (but doing so will cause the loss of more data if a
> tserver dies).

There is data loss only if you lose every datanode that hosts the 
blocks of that WAL. Losing a TabletServer does *not* imply data loss.

In the common case, increasing the size of the WAL will just increase 
the amount of time it takes to perform recovery of the WAL after a 
TabletServer dies.

> The troubleshooting
> <https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/troubleshooting.txt>
> documentation with accumulo is helpful in finding latency issues too.
>

- Josh

Re: Lost tablet server lock..SESSION_EXPIRED

Posted by Jeff Kubina <je...@gmail.com>.
I have not tried the G1 gc yet but it does look like it is production ready
according to Oracle.

You can use jstat to monitor GC on a tserver to see whether GC really is
what is causing the pauses.
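
As a rough sketch of that kind of monitoring, the Python below parses the text printed by `jstat -gcutil <pid> <interval> <count>` and reports how much total GC time (the GCT column) accumulated between samples. The function names and the sample values are made up for illustration; only the jstat column names are real, and the parser reads them from the header rather than assuming a fixed layout.

```python
import subprocess


def to_number(field):
    """Convert one jstat field to float; returns None for '-' or other non-numeric fields."""
    try:
        return float(field.replace(",", "."))
    except ValueError:
        return None


def parse_gcutil(text):
    """Parse `jstat -gcutil` output into a list of {column name: value} samples."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (to_number(f) for f in row))) for row in rows]


def gc_time_deltas(samples, column="GCT"):
    """Seconds added to a cumulative GC time column between consecutive samples."""
    values = [s[column] for s in samples]
    return [round(b - a, 3) for a, b in zip(values, values[1:])]


def run_jstat(pid, interval_ms=1000, count=5):
    """Run jstat against a live JVM (needs the JDK's jstat on the PATH)."""
    cmd = ["jstat", "-gcutil", str(pid), str(interval_ms), str(count)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Hypothetical output, not taken from a real tserver.
SAMPLE = """\
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  45.12  60.30  71.05  97.10  93.20   5120   63.450    40  102.450  165.900
  0.00  47.80  12.75  71.40  97.10  93.20   5121   64.690    40  102.510  167.200
"""

if __name__ == "__main__":
    print(gc_time_deltas(parse_gcutil(SAMPLE)))
```

Against a running tserver you would pass `run_jstat(pid)` to `parse_gcutil` instead of `SAMPLE`. A large jump in GCT within one sampling interval is the thing to look for.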

My usual gc related options for tservers are

-XX:NewSize=2G
-XX:MaxNewSize=2G
-XX:MaxPermSize=512m
-XX:CMSInitiatingOccupancyFraction=50
-XX:+UseParNewGC
-XX:SurvivorRatio=6
-XX:ParallelGCThreads=16
-XX:ConcGCThreads=8
-XX:+UseCondCardMark
-XX:+UnlockDiagnosticVMOptions
-XX:ParGCCardsPerStrideChunk=4096
-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled
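
Since a single mistyped option can keep a JVM from starting, a small shape check on a list like this can be handy before restarting a tserver. The sketch below is only an illustration (the function name is made up): it checks that each option looks like `-XX:+Name`, `-XX:-Name`, or `-XX:Name=value`. It does not know whether the JVM actually recognizes the option name.

```python
import re

# -XX:+Name / -XX:-Name for boolean options, -XX:Name=value for valued options.
XX_OPTION = re.compile(r"-XX:(?:[+-][A-Za-z]\w*|[A-Za-z]\w*=\S+)")


def malformed_xx_options(options):
    """Return the options that do not have a well-formed -XX: shape."""
    return [opt for opt in options if not XX_OPTION.fullmatch(opt)]


if __name__ == "__main__":
    candidate = [
        "-XX:NewSize=2G",
        "-XX:+UseConcMarkSweepGC",
        "-XX+UseCondCardMark",  # deliberately missing the colon
        "-XX:SurvivorRatio=6",
    ]
    print(malformed_xx_options(candidate))
```

To check the names themselves, running `java` with the same options plus `-version` lets the JVM report any option it does not recognize.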

If you are doing a lot of ingesting via batch writes (which the UpSess
messages imply), you might consider increasing tserver.walog.max.size to 2G
instead of 1G (but doing so will cause the loss of more data if a tserver
dies).

The troubleshooting
<https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/troubleshooting.txt>
documentation for Accumulo is also helpful for finding latency issues.



-- 
Jeff Kubina
410-988-4436


On Thu, Oct 13, 2016 at 10:49 AM, Noe Detore <nd...@minerkasch.com> wrote:

> Yes, seeing a lot of DEBUG:Upsess. Also seeing  [server.GarbageCollectionLogger]
> DEBUG: gc ParNew=64.69(+1.24) secs ConcurrentMarkSweep=102.51(+0.06) secs
> freemem=4,844,821,808(-20,292,780,896) totalmem=25,525,551,104
> 2016-10-13 11:22:17,963 [zookeeper.ZooLock] DEBUG: event null None
> Disconnected
>
> During hotspot seems like a java gc pause is causing zk heart beat to miss
> and then expire. Are there recommend java gc configurations?  We are using
> native memory. Would trying G1 gc be advised?
>
> Thank you

Re: Lost tablet server lock..SESSION_EXPIRED

Posted by Noe Detore <nd...@minerkasch.com>.
Yes, I am seeing a lot of DEBUG: UpSess messages. I am also seeing
 [server.GarbageCollectionLogger] DEBUG: gc ParNew=64.69(+1.24) secs
ConcurrentMarkSweep=102.51(+0.06) secs
freemem=4,844,821,808(-20,292,780,896) totalmem=25,525,551,104
2016-10-13 11:22:17,963 [zookeeper.ZooLock] DEBUG: event null None
Disconnected

During hotspots, it seems like a Java GC pause is causing ZooKeeper
heartbeats to be missed and the session to expire. Are there recommended
Java GC configurations? We are using native memory. Would trying the G1 GC
be advised?

Thank you
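
For anyone wanting to check lines like the GarbageCollectionLogger entry above against a session timeout, here is a rough Python sketch. It assumes the number in parentheses is the GC time added since the previous log entry, which is my reading of the format, and it uses a 30 second timeout purely as an example value; check instance.zookeeper.timeout on your own cluster. The function names are made up for illustration.

```python
import re

# Matches e.g. "ParNew=64.69(+1.24) secs"; freemem/totalmem fields are not matched.
COLLECTOR = re.compile(r"(\w+)=([\d,.]+)\(\+([\d,.]+)\) secs")


def gc_deltas(line):
    """Map collector name -> (cumulative seconds, seconds added since the previous entry)."""
    return {
        name: (float(total.replace(",", "")), float(delta.replace(",", "")))
        for name, total, delta in COLLECTOR.findall(line)
    }


def exceeds_timeout(line, timeout_secs):
    """True if any collector's increment in this entry is at least timeout_secs."""
    return any(delta >= timeout_secs for _, delta in gc_deltas(line).values())


LINE = (
    "[server.GarbageCollectionLogger] DEBUG: gc ParNew=64.69(+1.24) secs "
    "ConcurrentMarkSweep=102.51(+0.06) secs "
    "freemem=4,844,821,808(-20,292,780,896) totalmem=25,525,551,104"
)

if __name__ == "__main__":
    print(gc_deltas(LINE))
    print(exceeds_timeout(LINE, timeout_secs=30.0))  # 30.0 is an assumed example value
```

Run over a whole tserver log, this would show whether any single entry reports an increment close to the timeout, as opposed to many short collections.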

On Fri, Oct 7, 2016 at 8:23 PM, Jeff Kubina <je...@gmail.com> wrote:

> Noe,
>
> Do you have a lot (1000s) of "[tserver.TabletServer] DEBUG: UpSess ..."
> messages in your tserver logs prior to the FATAL or "ERROR: Lost tablet
> server lock" error message?
>
> Jeff

Re: Lost tablet server lock..SESSION_EXPIRED

Posted by Jeff Kubina <je...@gmail.com>.
Noe,

Do you have a lot (1000s) of "[tserver.TabletServer] DEBUG: UpSess ..."
messages in your tserver logs prior to the FATAL or "ERROR: Lost tablet
server lock" error message?
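
A quick way to answer that is to count the messages directly. The sketch below is only an illustration: the function name is made up, the sample lines are placeholders, and the match is case-insensitive on "UpSess" because I am not assuming anything about the rest of the message.

```python
def upsess_before_lock_loss(log_lines):
    """Count UpSess lines seen before the first 'Lost tablet server lock' line.

    Returns (count, lock_loss_seen).
    """
    count = 0
    for line in log_lines:
        if "Lost tablet server lock" in line:
            return count, True
        if "upsess" in line.lower():
            count += 1
    return count, False


# Placeholder lines; the "..." parts are elided, not real log content.
SAMPLE_LOG = [
    "... [tserver.TabletServer] DEBUG: UpSess ...",
    "... [tserver.TabletServer] DEBUG: UpSess ...",
    "... [tserver.TabletServer] ERROR: Lost tablet server lock (reason = SESSION_EXPIRED), exiting",
    "... [tserver.TabletServer] DEBUG: UpSess ...",
]

if __name__ == "__main__":
    print(upsess_before_lock_loss(SAMPLE_LOG))
```

For a real log, pass an open file object instead of the list, since the function only needs an iterable of lines.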

Jeff


-- 
Jeff Kubina
410-988-4436

