Posted to user@accumulo.apache.org by "mohit.kaushik" <mo...@orkash.com> on 2015/11/05 08:18:17 UTC

Unable to write data, tablet servers lose their locks

I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6) which was 
working fine before I ran into this issue. Whenever I start writing data 
with a BatchWriter, the tablet servers lose their locks one by one. In 
the ZooKeeper logs I found it repeatedly accepting and closing socket 
connections for the servers, and the log has endless repetitions of the 
following lines.

2015-11-05 12:11:23,860 [myid:3] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - 
Accepted socket connection from /192.168.10.124:47503
2015-11-05 12:11:23,861 [myid:3] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - 
Processing stat command from /192.168.10.124:47503
2015-11-05 12:11:23,869 [myid:3] - INFO 
[Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
2015-11-05 12:11:23,870 [myid:3] - INFO [Thread-244:NIOServerCnxn@1007] 
- Closed socket connection for client /192.168.10.124:47503 (no session 
established for client)

It looks similar to ZOOKEEPER-832, if that is in fact the problem. There 
is one thread discussing socket connections, but it does not provide much 
help in my case:
http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3CCAM1_12YvaXoe+KQ9-qCqTpv1VEGpwQvTkhn3iCTiFw6VQ7Lm0w@mail.gmail.com%3E

There are no exceptions in the tserver logs; the tablet servers simply 
lose their locks.

I can scan data without any problem/exception. I need to know the cause 
of the problem and a workaround. Would upgrading resolve the issue, or 
does it need some configuration changes? My current zoo.cfg is as 
follows.

clientPort=2181
syncLimit=5
tickTime=2000
initLimit=10
maxClientCnxn=100

I can upload the full logs if anyone needs them. Please let me know if 
you need any other info.
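
For reference, the ingest code is nothing unusual, essentially a plain 
BatchWriter loop along the lines of the sketch below (the instance name, 
ZooKeeper hosts, table, credentials, and values are placeholders, not my 
real ones):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class IngestSketch {
        public static void main(String[] args) throws Exception {
            // connect through ZooKeeper to the Accumulo instance
            Connector conn = new ZooKeeperInstance("myInstance",
                    "zk1:2181,zk2:2181,zk3:2181")
                    .getConnector("writer", new PasswordToken("secret"));

            // 50 MB client-side buffer, 4 write threads
            BatchWriterConfig cfg = new BatchWriterConfig()
                    .setMaxMemory(50 * 1024 * 1024)
                    .setMaxWriteThreads(4);

            BatchWriter bw = conn.createBatchWriter("mytable", cfg);
            try {
                for (int i = 0; i < 1000000; i++) {
                    Mutation m = new Mutation(String.format("row_%08d", i));
                    m.put("cf", "cq", new Value(("value_" + i).getBytes()));
                    bw.addMutation(m);
                }
            } finally {
                bw.close(); // flushes any buffered mutations
            }
        }
    }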

-Mohit

Re: Unable to write data, tablet servers lose their locks

Posted by "mohit.kaushik" <mo...@orkash.com>.
  Eric/Josef,

The issue is resolved now. You were right: I think the OS swapped out the 
tservers because the Accumulo GC was not working properly. It had a port 
conflict with another service after some recent changes I made, and I have 
also increased the GC heap memory limit. And yes, my monitor is running on 
192.168.10.124 :) .
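
For anyone who hits the same thing, the two changes were roughly the 
following (the port and heap values below are only examples; use whatever 
is free and appropriate on your boxes):

    # accumulo-site.xml: move the Accumulo GC client port away from the
    # conflicting service (property shown as name=value for brevity)
    gc.port.client=50091

    # accumulo-env.sh: give the Accumulo GC process a larger heap
    export ACCUMULO_GC_OPTS="${POLICY} -Xmx256m -Xms256m"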

Thanks

On 11/05/2015 07:46 PM, Josef Roehrl - PHEMI wrote:
> Everything else notwithstanding, if you see any swap space being 
> used, you need to adjust things to prevent swapping first.
>
> My 2 cents.
>
> On Thu, Nov 5, 2015 at 2:12 PM, Eric Newton <eric.newton@gmail.com> wrote:
>
>     Comments inline:
>
>     On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik
>     <mohit.kaushik@orkash.com> wrote:
>
>
>         I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6)
>         which was working fine before I ran into this issue. Whenever
>         I start writing data with a BatchWriter, the tablet servers
>         lose their locks one by one. In the ZooKeeper logs I found it
>         repeatedly accepting and closing socket connections for the
>         servers, and the log has endless repetitions of the following
>         lines.
>
>
>     By far, the most common reason why locks are lost is due to java
>     gc pauses.  In turn, these pauses are almost always due to memory
>     pressure within the entire system. The OS sees a nice big hunk of
>     memory in the tserver and swaps it out. Over the years we've tuned
>     various settings to prevent this, and other memory-hogging, but if
>     you are pushing the system hard, you may have to tune your
>     existing memory settings.
>
>     The tserver occasionally prints some gc stats in the debug log. If
>     you see a >30s pause between these messages, memory pressure is
>     probably the problem.
>
>
>         2015-11-05 12:11:23,860 [myid:3] - INFO
>         [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] -
>         Accepted socket connection from /192.168.10.124:47503
>         2015-11-05 12:11:23,861 [myid:3] - INFO
>         [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] -
>         Processing stat command from /192.168.10.124:47503
>         2015-11-05 12:11:23,869 [myid:3] - INFO
>         [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
>         2015-11-05 12:11:23,870 [myid:3] - INFO
>         [Thread-244:NIOServerCnxn@1007] - Closed socket connection for
>         client /192.168.10.124:47503 (no session established for client)
>
>
>     Yes, this is quite annoying: you get these messages when the
>     monitor grabs the zookeeper status EVERY 5s.  Your monitor is
>     running on 192.168.10.124, right?
>
>     These messages are expected.
>
>         It looks similar to ZOOKEEPER-832, if that is in fact the
>         problem. There is one thread discussing socket connections,
>         but it does not provide much help in my case:
>         http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3CCAM1_12YvaXoe+KQ9-qCqTpv1VEGpwQvTkhn3iCTiFw6VQ7Lm0w@mail.gmail.com%3E
>
>         There are no exceptions in the tserver logs; the tablet
>         servers simply lose their locks.
>
>
>     Ah, is it possible the JVM is killing itself because GC overhead
>     is climbing too high? You can check the .out (or .err) file for
>     this error.
>
>         I can scan data without any problem/exception. I need to know
>         the cause of the problem and a workaround. Would upgrading
>         resolve the issue, or does it need some configuration changes?
>
>
>     Check all your system processes. I know old versions of the SNMP
>     servers would leak resources, putting memory pressure on the
>     system after a few months.  Check to see if your tserver is
>     approximately the size you need. If you aren't already doing it,
>     you will want to monitor system memory/swap usage, and see if it
>     correlates to the lost servers.  Zookeeper itself is also subject
>     to gc pauses, so they can die from the same cause, although it's a
>     much smaller process.
>
>         My current zoo.cfg is as follows.
>
>         clientPort=2181
>         syncLimit=5
>         tickTime=2000
>         initLimit=10
>         maxClientCnxn=100
>
>
>     That's all fine, but you may want to turn on the zookeeper clean-up:
>
>     http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration
>
>
>     Search for "autopurge".
>
>
>         I can upload full logs if anyone needs. Please do let me know
>         if you need any other info.
>
>
>     How much memory is allocated to the various processes? Do you have
>     swap turned on? Do you see the delay in the debug GC messages?
>
>     You could try turning off swap, so the OS will kill your process
>     instead of killing itself. :-)
>
>     -Eric
>
>
>
>
> -- 
>>
>> Josef Roehrl
>> Senior Software Developer
>> *PHEMI Systems*
>> 180-887 Great Northern Way
>> Vancouver, BC V5T 4T5
>> 604-336-1119
>> Website <http://www.phemi.com/> Twitter 
>> <https://twitter.com/PHEMISystems> Linkedin 
>> <http://www.linkedin.com/company/3561810?trk=tyah&amp;trkInfo=tarId%3A1403279580554%2Ctas%3Aphemi%20hea%2Cidx%3A1-1-1> 
>>

Re: Unable to write data, tablet servers lose their locks

Posted by Josef Roehrl - PHEMI <jr...@phemi.com>.
Everything else notwithstanding, if you see any swap space being used, you
need to adjust things to prevent swapping first.

My 2 cents.

On Thu, Nov 5, 2015 at 2:12 PM, Eric Newton <er...@gmail.com> wrote:

> Comments inline:
>
> On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik <mo...@orkash.com>
> wrote:
>
>>
>> I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6) which was
>> working fine before I ran into this issue. Whenever I start writing data
>> with a BatchWriter, the tablet servers lose their locks one by one. In the
>> ZooKeeper logs I found it repeatedly accepting and closing socket
>> connections for the servers, and the log has endless repetitions of the
>> following lines.
>>
>
> By far, the most common reason why locks are lost is due to java gc
> pauses.  In turn, these pauses are almost always due to memory pressure
> within the entire system. The OS sees a nice big hunk of memory in the
> tserver and swaps it out. Over the years we've tuned various settings to
> prevent this, and other memory-hogging, but if you are pushing the system
> hard, you may have to tune your existing memory settings.
>
> The tserver occasionally prints some gc stats in the debug log. If you see
> a >30s pause between these messages, memory pressure is probably the
> problem.
>
>
>>
>> 2015-11-05 12:11:23,860 [myid:3] - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
>> connection from /192.168.10.124:47503
>> 2015-11-05 12:11:23,861 [myid:3] - INFO  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing stat command from /
>> 192.168.10.124:47503
>> 2015-11-05 12:11:23,869 [myid:3] - INFO
>> [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
>> 2015-11-05 12:11:23,870 [myid:3] - INFO  [Thread-244:NIOServerCnxn@1007]
>> - Closed socket connection for client /192.168.10.124:47503 (no session
>> established for client)
>>
>
> Yes, this is quite annoying: you get these messages when the monitor grabs
> the zookeeper status EVERY 5s.  Your monitor is running on 192.168.10.124,
> right?
>
> These messages are expected.
>
>
>> It looks similar to ZOOKEEPER-832, if that is in fact the problem. There is
>> one thread discussing socket connections, but it does not provide much help
>> in my case:
>> http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3CCAM1_12YvaXoe+KQ9-qCqTpv1VEGpwQvTkhn3iCTiFw6VQ7Lm0w@mail.gmail.com%3E
>>
>> There are no exceptions in the tserver logs; the tablet servers simply
>> lose their locks.
>>
>
> Ah, is it possible the JVM is killing itself because GC overhead is
> climbing too high? You can check the .out (or .err) file for this error.
>
>
>> I can scan data without any problem/exception. I need to know the cause of
>> the problem and a workaround. Would upgrading resolve the issue, or does it
>> need some configuration changes?
>>
>
> Check all your system processes. I know old versions of the SNMP servers
> would leak resources, putting memory pressure on the system after a few
> months.  Check to see if your tserver is approximately the size you need.
> If you aren't already doing it, you will want to monitor system memory/swap
> usage, and see if it correlates to the lost servers.  Zookeeper itself is
> also subject to gc pauses, so they can die from the same cause, although
> it's a much smaller process.
>
>
>
>> My current zoo.cfg is as follows.
>>
>> clientPort=2181
>> syncLimit=5
>> tickTime=2000
>> initLimit=10
>> maxClientCnxn=100
>>
>
> That's all fine, but you may want to turn on the zookeeper clean-up:
>
>
> http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration
>
>
> Search for "autopurge".
>
>
>>
>> I can upload full logs if anyone needs. Please do let me know if you need
>> any other info.
>>
>
> How much memory is allocated to the various processes? Do you have swap
> turned on? Do you see the delay in the debug GC messages?
>
> You could try turning off swap, so the OS will kill your process instead
> of killing itself. :-)
>
> -Eric
>



-- 


Josef Roehrl
Senior Software Developer
*PHEMI Systems*
180-887 Great Northern Way
Vancouver, BC V5T 4T5
604-336-1119
Website <http://www.phemi.com/> Twitter <https://twitter.com/PHEMISystems>
Linkedin
<http://www.linkedin.com/company/3561810?trk=tyah&amp;trkInfo=tarId%3A1403279580554%2Ctas%3Aphemi%20hea%2Cidx%3A1-1-1>

Re: Unable to write data, tablet servers lose their locks

Posted by Eric Newton <er...@gmail.com>.
Comments inline:

On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik <mo...@orkash.com>
wrote:

>
> I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6) which was
> working fine before I ran into this issue. Whenever I start writing data
> with a BatchWriter, the tablet servers lose their locks one by one. In the
> ZooKeeper logs I found it repeatedly accepting and closing socket
> connections for the servers, and the log has endless repetitions of the
> following lines.
>

By far, the most common reason why locks are lost is due to java gc
pauses.  In turn, these pauses are almost always due to memory pressure
within the entire system. The OS sees a nice big hunk of memory in the
tserver and swaps it out. Over the years we've tuned various settings to
prevent this, and other memory-hogging, but if you are pushing the system
hard, you may have to tune your existing memory settings.

The tserver occasionally prints some gc stats in the debug log. If you see
a >30s pause between these messages, memory pressure is probably the
problem.
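
If you do end up tuning, these are the usual knobs (the numbers below are
only illustrative for a small node; size them against your actual RAM):

    # accumulo-env.sh: tserver JVM heap
    export ACCUMULO_TSERVER_OPTS="${POLICY} -Xmx2g -Xms2g"

    # accumulo-site.xml (properties shown as name=value for brevity)
    tserver.memory.maps.max=1G      # off-heap if native maps are enabled
    tserver.cache.data.size=256M
    tserver.cache.index.size=128M

The important part is that the tserver's memory, the other Hadoop and
ZooKeeper processes, and the OS page cache together stay comfortably inside
physical RAM, so the OS never has a reason to swap.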


>
> 2015-11-05 12:11:23,860 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection from /192.168.10.124:47503
> 2015-11-05 12:11:23,861 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing stat command from /
> 192.168.10.124:47503
> 2015-11-05 12:11:23,869 [myid:3] - INFO
> [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
> 2015-11-05 12:11:23,870 [myid:3] - INFO  [Thread-244:NIOServerCnxn@1007]
> - Closed socket connection for client /192.168.10.124:47503 (no session
> established for client)
>

Yes, this is quite annoying: you get these messages when the monitor grabs
the zookeeper status EVERY 5s.  Your monitor is running on 192.168.10.124,
right?

These messages are expected.
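
(You can trigger the exact same log lines yourself with ZooKeeper's
four-letter "stat" command, which is presumably what the monitor uses to
collect status:

    echo stat | nc <zookeeper-host> 2181

No session is created by that command, which is why ZooKeeper logs "no
session established for client".)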


> It looks similar to ZOOKEEPER-832, if that is in fact the problem. There is
> one thread discussing socket connections, but it does not provide much help
> in my case:
> http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3CCAM1_12YvaXoe+KQ9-qCqTpv1VEGpwQvTkhn3iCTiFw6VQ7Lm0w@mail.gmail.com%3E
>
> There are no exceptions in the tserver logs; the tablet servers simply
> lose their locks.
>

Ah, is it possible the JVM is killing itself because GC overhead is
climbing too high? You can check the .out (or .err) file for this error.
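
The message to look for is the JVM's own error, e.g.
"java.lang.OutOfMemoryError: GC overhead limit exceeded". Something like
this will find it (the log directory and file naming depend on how you
start the servers):

    grep -i OutOfMemoryError $ACCUMULO_LOG_DIR/tserver_*.out $ACCUMULO_LOG_DIR/tserver_*.err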


> I can scan data without any problem/exception. I need to know the cause of
> the problem and a workaround. Would upgrading resolve the issue, or does it
> need some configuration changes?
>

Check all your system processes. I know old versions of the SNMP servers
would leak resources, putting memory pressure on the system after a few
months.  Check to see if your tserver is approximately the size you need.
If you aren't already doing it, you will want to monitor system memory/swap
usage, and see if it correlates to the lost servers.  Zookeeper itself is
also subject to gc pauses, so they can die from the same cause, although
it's a much smaller process.
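
Even something as simple as watching these while an ingest is running is
usually enough to see the correlation:

    free -m       # overall memory and swap usage, in MB
    vmstat 5      # the si/so columns show pages swapped in/out per interval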



> My current zoo.cfg is as follows.
>
> clientPort=2181
> syncLimit=5
> tickTime=2000
> initLimit=10
> maxClientCnxn=100
>

That's all fine, but you may want to turn on the zookeeper clean-up:

http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration


Search for "autopurge".
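
For example, in zoo.cfg (the retention count and interval below are just
reasonable starting points):

    autopurge.snapRetainCount=3
    autopurge.purgeInterval=24

That keeps the three most recent snapshots and their transaction logs, and
purges the rest once a day.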


>
> I can upload full logs if anyone needs. Please do let me know if you need
> any other info.
>

How much memory is allocated to the various processes? Do you have swap
turned on? Do you see the delay in the debug GC messages?

You could try turning off swap, so the OS will kill your process instead of
killing itself. :-)
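
Roughly:

    sudo swapoff -a                          # turn off swap right now
    echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p                           # reload so the setting applies without a reboot

(On newer kernels vm.swappiness=0 only strongly discourages swapping rather
than forbidding it, but combined with sensible heap sizes it is usually
enough.)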

-Eric