Posted to user@cassandra.apache.org by Bill Au <bi...@gmail.com> on 2010/10/20 22:09:42 UTC

Cassandra crashed - possible JMX threads leak

One of my Cassandra servers crashed with the following:

ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
CassandraDaemon.java (line 82) Uncaught exception in thread
Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:597)
        at
org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)


I took thread dumps of the JVMs on all the other Cassandra servers in my
cluster.  They all have thousands of threads that look like this:

"JMX server connection timeout 183373" daemon prio=10 tid=0x00002aad230db800
nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at
com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
        - locked <0x00002aab056ccee0> (a [I)
        at java.lang.Thread.run(Thread.java:619)

It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
creates a JMXConnector but never calls its close() method.  I tried setting
jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
JMX server connection timeout threads.  But that did not make any
difference.
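
For reference, this is roughly the pattern I would expect (a hand-written
sketch, not the actual NodeProbe code; host and port are placeholders):

    import java.util.HashMap;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxClientSketch
    {
        public static void main(String[] args) throws Exception
        {
            // Placeholder host/port; use the node's JMX address.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url, new HashMap<String, Object>());
            try
            {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                System.out.println(mbs.getMBeanCount() + " mbeans registered");
            }
            finally
            {
                // Without this the server keeps a "JMX server connection timeout"
                // thread for the connection until it times out on its own.
                jmxc.close();
            }
        }
    }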

Has anyone else seen this?

Bill

Re: Cassandra crashed - possible JMX threads leak

Posted by Frank LoVecchio <fr...@isidorey.com>.
Ah, missed that.

Thanks Aaron.

On Wed, Oct 20, 2010 at 11:49 PM, Aaron Morton <aa...@thelastpickle.com>wrote:

> Sounds like the problem discussed here
> http://wiki.apache.org/cassandra/Operations?highlight=(allocate)|(memory)
>
> If you have the JNA jar it should work
> http://www.mail-archive.com/user@cassandra.apache.org/msg06284.html
> http://www.riptano.com/blog/whats-new-cassandra-065
>
> Aaron
>
>
> On 21 Oct, 2010,at 06:29 PM, Frank LoVecchio <fr...@isidorey.com> wrote:
>
> I have a cluster of 3 0.7 beta 2 nodes (built today from the latest trunk)
> running on Large, EBS-backed, x64 EC2 instances; RF=3.  I attempted to write
> somewhere near 500,000 records every 15 minutes from a total of 5 different
> computers (using Pelops and multi-threading).   Though my network blew up
> and I'm not quite sure how many records were inserted, I lost a node a
> couple hours later, and the other 2 were at severely high memory useage.  Is
> this a memory leak of some kind, or something I can configure / watch for in
> the future?
>
> A nodetool does this:
>
> [ec2-user@XXX bin]$ ./nodetool -h localhost ring
> Address  Status State   Load         Token
>
>                                                     XXX
> ipXXX   Down   Normal  564.76 MB       XXX
> ipXXX   Up       Normal  564.83 MB       XXX
> ipXXX   Up       Normal  563.06 MB       XXX
>
> A top on the box that is down shows this: (dual core x64)
>
> Cpu(s): 19.9%us,  5.9%sy,  0.0%ni,  8.8%id, 57.3%wa,  0.0%hi,  0.0%si,
>  8.1%st
> Mem:   7651528k total,  7611112k used,    40416k free,    66056k buffers
> Swap:        0k total,        0k used,        0k free,  3294076k cached
>
> PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>
> 22514 root      20   0 5790m 4.0g 167m S    91.9        54.8 152:45.08 java
>
>
> I see this error in the log file:
>
> ERROR [CompactionExecutor:1] 2010-10-21 01:35:05,318
> AbstractCassandraDaemon.java (line 88) Fatal exception in thread
> Thread[CompactionExecutor:1,1,main]
> java.io.IOError: java.io.IOException: Cannot run program "ln":
> java.io.IOException: error=12, Cannot allocate memory
> at
> org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1368)
>  at org.apache.cassandra.db.Table.snapshot(Table.java:163)
> at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:232)
>  at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:106)
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:84)
>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>  at java.lang.Thread.run(Thread.java:636)
> Caused by: java.io.IOException: Cannot run program "ln":
> java.io.IOException: error=12, Cannot allocate memory
>  at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
> at
> org.apache.cassandra.io.util.FileUtils.createHardLinkWithExec(FileUtils.java:263)
>  at
> org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:229)
> at
> org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1360)
>  ... 9 more
> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot
> allocate memory
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
>  at java.lang.ProcessImpl.start(ProcessImpl.java:81)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
>  ... 12 more
>
>
> On Wed, Oct 20, 2010 at 3:16 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> can you reproduce this by, say, running nodeprobe ring in a bash while
>> loop?
>>
>>
>> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
>> > One of my Cassandra server crashed with the following:
>> >
>> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
>>
>> > CassandraDaemon.java (line 82) Uncaught exception in thread
>> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
>> > java.lang.OutOfMemoryError: unable to create new native thread
>> >         at java.lang.Thread.start0(Native Method)
>> >         at java.lang.Thread.start(Thread.java:597)
>> >         at
>> >
>> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>> >
>> >
>> > I took threads dump in the JVM on all the other Cassandra severs in my
>> > cluster.  They all have thousand of threads looking like this:
>> >
>> > "JMX server connection timeout 183373" daemon prio=10
>> tid=0x00002aad230db800
>> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> >         at java.lang.Object.wait(Native Method)
>> >         at
>> >
>> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>>
>> >         - locked <0x00002aab056ccee0> (a [I)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
>> > creates a JMXConnector but never calls its close() method.  I tried
>> setting
>> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable
>> the
>> > JMX server connection timeout threads.  But that did not make any
>> > difference.
>> >
>> > Has anyone else seen this?
>> >
>> > Bill
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>

Re: Cassandra crashed - possible JMX threads leak

Posted by Aaron Morton <aa...@thelastpickle.com>.
Sounds like the problem discussed here http://wiki.apache.org/cassandra/Operations?highlight=(allocate)|(memory)

If you have the JNA jar it should work 
http://www.mail-archive.com/user@cassandra.apache.org/msg06284.html
http://www.riptano.com/blog/whats-new-cassandra-065
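
Roughly what happens without JNA (an illustrative sketch, not the actual
FileUtils code): the snapshot falls back to exec'ing "ln", and the fork that
requires can fail with error=12 when the JVM is large:

    import java.io.File;
    import java.io.IOException;

    public class HardLinkByExec
    {
        // Hard-link a file by shelling out to "ln".  Forking a copy of a
        // multi-GB JVM just to run "ln" is what produces
        // "java.io.IOException: error=12, Cannot allocate memory";
        // with the JNA jar present the link can be made in-process instead.
        public static void createHardLink(File from, File to)
                throws IOException, InterruptedException
        {
            Process p = new ProcessBuilder("ln", from.getAbsolutePath(),
                                           to.getAbsolutePath()).start();
            if (p.waitFor() != 0)
                throw new IOException("ln failed with exit code " + p.exitValue());
        }
    }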

Aaron


On 21 Oct, 2010,at 06:29 PM, Frank LoVecchio <fr...@isidorey.com> wrote:

I have a cluster of 3 0.7 beta 2 nodes (built today from the latest trunk) running on Large, EBS-backed, x64 EC2 instances; RF=3.  I attempted to write somewhere near 500,000 records every 15 minutes from a total of 5 different computers (using Pelops and multi-threading).   Though my network blew up and I'm not quite sure how many records were inserted, I lost a node a couple hours later, and the other 2 were at severely high memory useage.  Is this a memory leak of some kind, or something I can configure / watch for in the future?

A nodetool does this:

[ec2-user@XXX bin]$ ./nodetool -h localhost ring
Address  Status State   Load         Token                                       
                                                    XXX  
ipXXX   Down   Normal  564.76 MB       XXX      
ipXXX   Up       Normal  564.83 MB       XXX     
ipXXX   Up       Normal  563.06 MB       XXX     

A top on the box that is down shows this: (dual core x64)

Cpu(s): 19.9%us,  5.9%sy,  0.0%ni,  8.8%id, 57.3%wa,  0.0%hi,  0.0%si,  8.1%st
Mem:   7651528k total,  7611112k used,    40416k free,    66056k buffers
Swap:        0k total,        0k used,        0k free,  3294076k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                        
22514 root      20   0 5790m 4.0g 167m S    91.9        54.8 152:45.08 java           

I see this error in the log file:

ERROR [CompactionExecutor:1] 2010-10-21 01:35:05,318 AbstractCassandraDaemon.java (line 88) Fatal exception in thread Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
	at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1368)
	at org.apache.cassandra.db.Table.snapshot(Table.java:163)
	at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:232)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:106)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:84)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Cannot run program "ln": java.io.IOException: error=12, Cannot allocate memory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
	at org.apache.cassandra.io.util.FileUtils.createHardLinkWithExec(FileUtils.java:263)
	at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:229)
	at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1360)
	... 9 more
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
	at java.lang.ProcessImpl.start(ProcessImpl.java:81)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
	... 12 more
 

On Wed, Oct 20, 2010 at 3:16 PM, Jonathan Ellis <jb...@gmail.com> wrote:
can you reproduce this by, say, running nodeprobe ring in a bash while loop?


On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
> One of my Cassandra server crashed with the following:
>
> ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> CassandraDaemon.java (line 82) Uncaught exception in thread
> Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:597)
>         at
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>
>
> I took threads dump in the JVM on all the other Cassandra severs in my
> cluster.  They all have thousand of threads looking like this:
>
> "JMX server connection timeout 183373" daemon prio=10 tid=0x00002aad230db800
> nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>         - locked <0x00002aab056ccee0> (a [I)
>         at java.lang.Thread.run(Thread.java:619)
>
> It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
> creates a JMXConnector but never calls its close() method.  I tried setting
> jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
> JMX server connection timeout threads.  But that did not make any
> difference.
>
> Has anyone else seen this?
>
> Bill
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Cassandra crashed - possible JMX threads leak

Posted by Frank LoVecchio <fr...@isidorey.com>.
I have a cluster of 3 0.7 beta 2 nodes (built today from the latest trunk)
running on Large, EBS-backed, x64 EC2 instances; RF=3.  I attempted to write
somewhere near 500,000 records every 15 minutes from a total of 5 different
computers (using Pelops and multi-threading).   Though my network blew up
and I'm not quite sure how many records were inserted, I lost a node a
couple hours later, and the other 2 were at severely high memory usage.  Is
this a memory leak of some kind, or something I can configure / watch for in
the future?

nodetool ring shows this:

[ec2-user@XXX bin]$ ./nodetool -h localhost ring
Address  Status State   Load         Token

                                                    XXX
ipXXX   Down   Normal  564.76 MB       XXX
ipXXX   Up       Normal  564.83 MB       XXX
ipXXX   Up       Normal  563.06 MB       XXX

A top on the box that is down shows this: (dual core x64)

Cpu(s): 19.9%us,  5.9%sy,  0.0%ni,  8.8%id, 57.3%wa,  0.0%hi,  0.0%si,
 8.1%st
Mem:   7651528k total,  7611112k used,    40416k free,    66056k buffers
Swap:        0k total,        0k used,        0k free,  3294076k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

22514 root      20   0 5790m 4.0g 167m S    91.9        54.8 152:45.08 java


I see this error in the log file:

ERROR [CompactionExecutor:1] 2010-10-21 01:35:05,318
AbstractCassandraDaemon.java (line 88) Fatal exception in thread
Thread[CompactionExecutor:1,1,main]
java.io.IOError: java.io.IOException: Cannot run program "ln":
java.io.IOException: error=12, Cannot allocate memory
at
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1368)
at org.apache.cassandra.db.Table.snapshot(Table.java:163)
at
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:232)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:106)
at
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:84)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Cannot run program "ln":
java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at
org.apache.cassandra.io.util.FileUtils.createHardLinkWithExec(FileUtils.java:263)
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:229)
at
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1360)
... 9 more
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot
allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 12 more


On Wed, Oct 20, 2010 at 3:16 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> can you reproduce this by, say, running nodeprobe ring in a bash while
> loop?
>
> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
> > One of my Cassandra server crashed with the following:
> >
> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> > CassandraDaemon.java (line 82) Uncaught exception in thread
> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> > java.lang.OutOfMemoryError: unable to create new native thread
> >         at java.lang.Thread.start0(Native Method)
> >         at java.lang.Thread.start(Thread.java:597)
> >         at
> >
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
> >
> >
> > I took threads dump in the JVM on all the other Cassandra severs in my
> > cluster.  They all have thousand of threads looking like this:
> >
> > "JMX server connection timeout 183373" daemon prio=10
> tid=0x00002aad230db800
> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at
> >
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
> >         - locked <0x00002aab056ccee0> (a [I)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
> > creates a JMXConnector but never calls its close() method.  I tried
> setting
> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
> > JMX server connection timeout threads.  But that did not make any
> > difference.
> >
> > Has anyone else seen this?
> >
> > Bill
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Cassandra crashed - possible JMX threads leak

Posted by Norman Maurer <no...@apache.org>.
Depending on finalize() is really not what you want to do, so I think
the API change would be preferable.

Bye,
Norman


2010/10/26 Bill Au <bi...@gmail.com>:
> I would be happy to submit a patch but is it a bit more trickier than simply
> calling JMXConenctor.close().  NodeProbe's use of the JMXConnector is not
> exposed in its API  The JMX connection is created in NodeProbe's
> constructor.  Without changing the API, the only place to call close() would
> be in NodeProbe's finalize().  I am not sure if that's the best thing to
> do.  I think this warrant a discussion on the developer mailing list.  I
> will start an new mail thread there.
>
> Anyways, I am still trying to understand why the JMX server connection
> timeout threads pile up rather quickly when I restart a node in a live
> cluster.  I took a look at the Cassandra source and see that NodeProbe is
> the only place that creates and uses a JMX connection.  And NobeProbe is
> only used by the tools.  So it seems that there is another JMX thread leak
> in Cassandra.
>
> Bill
>
> On Fri, Oct 22, 2010 at 4:33 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> Is the fix as simple as calling close() then?  Can you submit a patch for
>> that?
>>
>> On Fri, Oct 22, 2010 at 2:49 PM, Bill Au <bi...@gmail.com> wrote:
>> > Not with the nodeprobe or nodetool command because the JVM these two
>> > commands spawn has a very short life span.
>> >
>> > I am using a webapp to monitor my cassandra cluster.  It pretty much
>> > uses
>> > the same code as NodeCmd class.  For each incoming request, it creates
>> > an
>> > NodeProbe object and use it to get get various status of the cluster.  I
>> > can
>> > reproduce the Cassandra JVM crash by issuing requests to this webapp in
>> > a
>> > bash while loop.  I took a deeper look and here is what I discovered:
>> >
>> > In the webapp when NodeProbe creates a JMXConnector to connect to the
>> > Cassandra JMX port, a thread
>> > (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created
>> > and
>> > run in the webapp's JVM.  Meanwhile in the Cassamdra JVM there is a
>> > com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
>> > timeout remote JMX connection.  However, since NodeProbe does not call
>> > JMXConnector.close(), the JMX client checker threads remains in the
>> > webapp's
>> > JVM even after the NobeProbe object has been garbage collected.  So this
>> > JMX
>> > connection is still considered open and that keeps the JMX timeout
>> > thread
>> > running inside the Cassandra JVM.  The number of JMX client checker
>> > threads
>> > in my webapp's JVM matches up with the number of JMX server timeout
>> > threads
>> > in my Cassandra's JVM.  If I stop my webapp's JVM,
>> > all the JMX server timeout threads in my Cassandra's JVM all disappear
>> > after
>> > 2 minutes, the default timeout for a JMX connection.  This is why the
>> > problem cannot be reproduced by nodeprobe or nodetool.  Even though
>> > JMXConnector.close() is not called, the JVM exits shortly so the JMX
>> > client
>> > checker thread do not stay around.  So their corresponding JMX server
>> > timeout thread goes away after two minutes.  This is not the case with
>> > my
>> > weabpp since its JVM keeps running, so all the JMX client checker
>> > threads
>> > keep running as well.  The threads keep piling up until it crashes
>> > Casssandra's JVM.
>> >
>> > In my case I think I can change my webapp to use a static NodeProbe
>> > instead
>> > of creating a new one for every request.  That should get around the
>> > leak.
>> >
>> > However, I have seen the leak occurs in another situation.  On more than
>> > one
>> > occasions when I restarted one node in a live multi-node clusters, I see
>> > that the JMX server timeout threads quickly piled up (number in the
>> > thousands) in Cassandra's JVM.  It only happened on a live cluster that
>> > is
>> > servicing read and write requests.  I am guessing the hinted hand off
>> > might
>> > have something to do with it.  I am still trying to understand what is
>> > happening there.
>> >
>> > Bill
>> >
>> >
>> > On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >>
>> >> can you reproduce this by, say, running nodeprobe ring in a bash while
>> >> loop?
>> >>
>> >> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
>> >> > One of my Cassandra server crashed with the following:
>> >> >
>> >> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
>> >> > CassandraDaemon.java (line 82) Uncaught exception in thread
>> >> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
>> >> > java.lang.OutOfMemoryError: unable to create new native thread
>> >> >         at java.lang.Thread.start0(Native Method)
>> >> >         at java.lang.Thread.start(Thread.java:597)
>> >> >         at
>> >> >
>> >> >
>> >> > org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>> >> >
>> >> >
>> >> > I took threads dump in the JVM on all the other Cassandra severs in
>> >> > my
>> >> > cluster.  They all have thousand of threads looking like this:
>> >> >
>> >> > "JMX server connection timeout 183373" daemon prio=10
>> >> > tid=0x00002aad230db800
>> >> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>> >> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> >> >         at java.lang.Object.wait(Native Method)
>> >> >         at
>> >> >
>> >> >
>> >> > com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>> >> >         - locked <0x00002aab056ccee0> (a [I)
>> >> >         at java.lang.Thread.run(Thread.java:619)
>> >> >
>> >> > It seems to me that there is a JMX threads leak in Cassandra.
>> >> > NodeProbe
>> >> > creates a JMXConnector but never calls its close() method.  I tried
>> >> > setting
>> >> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable
>> >> > the
>> >> > JMX server connection timeout threads.  But that did not make any
>> >> > difference.
>> >> >
>> >> > Has anyone else seen this?
>> >> >
>> >> > Bill
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of Riptano, the source for professional Cassandra support
>> >> http://riptano.com
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>

Re: Cassandra crashed - possible JMX threads leak

Posted by Bill Au <bi...@gmail.com>.
I would be happy to submit a patch but it is a bit trickier than simply
calling JMXConnector.close().  NodeProbe's use of the JMXConnector is not
exposed in its API.  The JMX connection is created in NodeProbe's
constructor.  Without changing the API, the only place to call close() would
be in NodeProbe's finalize().  I am not sure if that's the best thing to
do.  I think this warrants a discussion on the developer mailing list.  I
will start a new mail thread there.
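
To make the tradeoff concrete, the two options look roughly like this (just a
sketch, not a patch; I am assuming the connector field is the one created in
NodeProbe's constructor):

    import java.io.Closeable;
    import java.io.IOException;
    import javax.management.remote.JMXConnector;

    public class NodeProbeSketch implements Closeable
    {
        private final JMXConnector jmxc;   // created in the constructor today

        public NodeProbeSketch(JMXConnector jmxc)
        {
            this.jmxc = jmxc;
        }

        // Option 1: API change - expose close() so callers can release the
        // JMX connection deterministically.
        public void close() throws IOException
        {
            jmxc.close();
        }

        // Option 2: no API change - rely on finalize(), which only runs
        // (if at all) when the instance is garbage collected.
        @Override
        protected void finalize() throws Throwable
        {
            try { jmxc.close(); } finally { super.finalize(); }
        }
    }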

Anyways, I am still trying to understand why the JMX server connection
timeout threads pile up rather quickly when I restart a node in a live
cluster.  I took a look at the Cassandra source and see that NodeProbe is
the only place that creates and uses a JMX connection.  And NodeProbe is
only used by the tools.  So it seems that there is another JMX thread leak
in Cassandra.

Bill

On Fri, Oct 22, 2010 at 4:33 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> Is the fix as simple as calling close() then?  Can you submit a patch for
> that?
>
> On Fri, Oct 22, 2010 at 2:49 PM, Bill Au <bi...@gmail.com> wrote:
> > Not with the nodeprobe or nodetool command because the JVM these two
> > commands spawn has a very short life span.
> >
> > I am using a webapp to monitor my cassandra cluster.  It pretty much uses
> > the same code as NodeCmd class.  For each incoming request, it creates an
> > NodeProbe object and use it to get get various status of the cluster.  I
> can
> > reproduce the Cassandra JVM crash by issuing requests to this webapp in a
> > bash while loop.  I took a deeper look and here is what I discovered:
> >
> > In the webapp when NodeProbe creates a JMXConnector to connect to the
> > Cassandra JMX port, a thread
> > (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created
> and
> > run in the webapp's JVM.  Meanwhile in the Cassamdra JVM there is a
> > com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
> > timeout remote JMX connection.  However, since NodeProbe does not call
> > JMXConnector.close(), the JMX client checker threads remains in the
> webapp's
> > JVM even after the NobeProbe object has been garbage collected.  So this
> JMX
> > connection is still considered open and that keeps the JMX timeout thread
> > running inside the Cassandra JVM.  The number of JMX client checker
> threads
> > in my webapp's JVM matches up with the number of JMX server timeout
> threads
> > in my Cassandra's JVM.  If I stop my webapp's JVM,
> > all the JMX server timeout threads in my Cassandra's JVM all disappear
> after
> > 2 minutes, the default timeout for a JMX connection.  This is why the
> > problem cannot be reproduced by nodeprobe or nodetool.  Even though
> > JMXConnector.close() is not called, the JVM exits shortly so the JMX
> client
> > checker thread do not stay around.  So their corresponding JMX server
> > timeout thread goes away after two minutes.  This is not the case with my
> > weabpp since its JVM keeps running, so all the JMX client checker threads
> > keep running as well.  The threads keep piling up until it crashes
> > Casssandra's JVM.
> >
> > In my case I think I can change my webapp to use a static NodeProbe
> instead
> > of creating a new one for every request.  That should get around the
> leak.
> >
> > However, I have seen the leak occurs in another situation.  On more than
> one
> > occasions when I restarted one node in a live multi-node clusters, I see
> > that the JMX server timeout threads quickly piled up (number in the
> > thousands) in Cassandra's JVM.  It only happened on a live cluster that
> is
> > servicing read and write requests.  I am guessing the hinted hand off
> might
> > have something to do with it.  I am still trying to understand what is
> > happening there.
> >
> > Bill
> >
> >
> > On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> can you reproduce this by, say, running nodeprobe ring in a bash while
> >> loop?
> >>
> >> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
> >> > One of my Cassandra server crashed with the following:
> >> >
> >> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> >> > CassandraDaemon.java (line 82) Uncaught exception in thread
> >> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> >> > java.lang.OutOfMemoryError: unable to create new native thread
> >> >         at java.lang.Thread.start0(Native Method)
> >> >         at java.lang.Thread.start(Thread.java:597)
> >> >         at
> >> >
> >> >
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
> >> >
> >> >
> >> > I took threads dump in the JVM on all the other Cassandra severs in my
> >> > cluster.  They all have thousand of threads looking like this:
> >> >
> >> > "JMX server connection timeout 183373" daemon prio=10
> >> > tid=0x00002aad230db800
> >> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
> >> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >> >         at java.lang.Object.wait(Native Method)
> >> >         at
> >> >
> >> >
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
> >> >         - locked <0x00002aab056ccee0> (a [I)
> >> >         at java.lang.Thread.run(Thread.java:619)
> >> >
> >> > It seems to me that there is a JMX threads leak in Cassandra.
> NodeProbe
> >> > creates a JMXConnector but never calls its close() method.  I tried
> >> > setting
> >> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable
> >> > the
> >> > JMX server connection timeout threads.  But that did not make any
> >> > difference.
> >> >
> >> > Has anyone else seen this?
> >> >
> >> > Bill
> >> >
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Cassandra crashed - possible JMX threads leak

Posted by Jonathan Ellis <jb...@gmail.com>.
Is the fix as simple as calling close() then?  Can you submit a patch for that?

On Fri, Oct 22, 2010 at 2:49 PM, Bill Au <bi...@gmail.com> wrote:
> Not with the nodeprobe or nodetool command because the JVM these two
> commands spawn has a very short life span.
>
> I am using a webapp to monitor my cassandra cluster.  It pretty much uses
> the same code as NodeCmd class.  For each incoming request, it creates an
> NodeProbe object and use it to get get various status of the cluster.  I can
> reproduce the Cassandra JVM crash by issuing requests to this webapp in a
> bash while loop.  I took a deeper look and here is what I discovered:
>
> In the webapp when NodeProbe creates a JMXConnector to connect to the
> Cassandra JMX port, a thread
> (com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created and
> run in the webapp's JVM.  Meanwhile in the Cassamdra JVM there is a
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
> timeout remote JMX connection.  However, since NodeProbe does not call
> JMXConnector.close(), the JMX client checker threads remains in the webapp's
> JVM even after the NobeProbe object has been garbage collected.  So this JMX
> connection is still considered open and that keeps the JMX timeout thread
> running inside the Cassandra JVM.  The number of JMX client checker threads
> in my webapp's JVM matches up with the number of JMX server timeout threads
> in my Cassandra's JVM.  If I stop my webapp's JVM,
> all the JMX server timeout threads in my Cassandra's JVM all disappear after
> 2 minutes, the default timeout for a JMX connection.  This is why the
> problem cannot be reproduced by nodeprobe or nodetool.  Even though
> JMXConnector.close() is not called, the JVM exits shortly so the JMX client
> checker thread do not stay around.  So their corresponding JMX server
> timeout thread goes away after two minutes.  This is not the case with my
> weabpp since its JVM keeps running, so all the JMX client checker threads
> keep running as well.  The threads keep piling up until it crashes
> Casssandra's JVM.
>
> In my case I think I can change my webapp to use a static NodeProbe instead
> of creating a new one for every request.  That should get around the leak.
>
> However, I have seen the leak occurs in another situation.  On more than one
> occasions when I restarted one node in a live multi-node clusters, I see
> that the JMX server timeout threads quickly piled up (number in the
> thousands) in Cassandra's JVM.  It only happened on a live cluster that is
> servicing read and write requests.  I am guessing the hinted hand off might
> have something to do with it.  I am still trying to understand what is
> happening there.
>
> Bill
>
>
> On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> can you reproduce this by, say, running nodeprobe ring in a bash while
>> loop?
>>
>> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
>> > One of my Cassandra server crashed with the following:
>> >
>> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
>> > CassandraDaemon.java (line 82) Uncaught exception in thread
>> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
>> > java.lang.OutOfMemoryError: unable to create new native thread
>> >         at java.lang.Thread.start0(Native Method)
>> >         at java.lang.Thread.start(Thread.java:597)
>> >         at
>> >
>> > org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>> >
>> >
>> > I took threads dump in the JVM on all the other Cassandra severs in my
>> > cluster.  They all have thousand of threads looking like this:
>> >
>> > "JMX server connection timeout 183373" daemon prio=10
>> > tid=0x00002aad230db800
>> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>> >         at java.lang.Object.wait(Native Method)
>> >         at
>> >
>> > com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>> >         - locked <0x00002aab056ccee0> (a [I)
>> >         at java.lang.Thread.run(Thread.java:619)
>> >
>> > It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
>> > creates a JMXConnector but never calls its close() method.  I tried
>> > setting
>> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable
>> > the
>> > JMX server connection timeout threads.  But that did not make any
>> > difference.
>> >
>> > Has anyone else seen this?
>> >
>> > Bill
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra crashed - possible JMX threads leak

Posted by Bill Au <bi...@gmail.com>.
Not with the nodeprobe or nodetool command because the JVM these two
commands spawn has a very short life span.

I am using a webapp to monitor my Cassandra cluster.  It pretty much uses
the same code as the NodeCmd class.  For each incoming request, it creates a
NodeProbe object and uses it to get various status of the cluster.  I can
reproduce the Cassandra JVM crash by issuing requests to this webapp in a
bash while loop.  I took a deeper look and here is what I discovered:

In the webapp, when NodeProbe creates a JMXConnector to connect to the
Cassandra JMX port, a thread
(com.sun.jmx.remote.internal.ClientCommunicatorAdmin$Checker) is created and
run in the webapp's JVM.  Meanwhile, in the Cassandra JVM there is a
com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout thread to
time out remote JMX connections.  However, since NodeProbe does not call
JMXConnector.close(), the JMX client checker thread remains in the webapp's
JVM even after the NodeProbe object has been garbage collected.  So this JMX
connection is still considered open and that keeps the JMX timeout thread
running inside the Cassandra JVM.  The number of JMX client checker threads
in my webapp's JVM matches up with the number of JMX server timeout threads
in my Cassandra JVM.  If I stop my webapp's JVM,
all the JMX server timeout threads in my Cassandra JVM disappear after
2 minutes, the default timeout for a JMX connection.  This is why the
problem cannot be reproduced by nodeprobe or nodetool.  Even though
JMXConnector.close() is not called, the JVM exits shortly, so the JMX client
checker threads do not stay around.  So their corresponding JMX server
timeout threads go away after two minutes.  This is not the case with my
webapp, since its JVM keeps running, so all the JMX client checker threads
keep running as well.  The threads keep piling up until they crash the
Cassandra JVM.

In my case I think I can change my webapp to use a static NodeProbe instead
of creating a new one for every request.  That should get around the leak.
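
Something along these lines (a sketch under the assumption that NodeProbe
still has the (host, port) constructor; reconnect and error handling omitted):

    import java.io.IOException;
    import org.apache.cassandra.tools.NodeProbe;

    // Share one NodeProbe (and so one JMX connection) across all requests
    // instead of creating a new one per request.
    public final class ProbeHolder
    {
        private static NodeProbe probe;

        public static synchronized NodeProbe get() throws IOException, InterruptedException
        {
            if (probe == null)
                probe = new NodeProbe("localhost", 8080);   // placeholder host/port
            return probe;
        }
    }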

However, I have seen the leak occur in another situation.  On more than one
occasion when I restarted one node in a live multi-node cluster, I saw
the JMX server timeout threads quickly pile up (numbering in the
thousands) in Cassandra's JVM.  It only happened on a live cluster that is
servicing read and write requests.  I am guessing that hinted handoff might
have something to do with it.  I am still trying to understand what is
happening there.

Bill


On Wed, Oct 20, 2010 at 5:16 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> can you reproduce this by, say, running nodeprobe ring in a bash while
> loop?
>
> On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
> > One of my Cassandra server crashed with the following:
> >
> > ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> > CassandraDaemon.java (line 82) Uncaught exception in thread
> > Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> > java.lang.OutOfMemoryError: unable to create new native thread
> >         at java.lang.Thread.start0(Native Method)
> >         at java.lang.Thread.start(Thread.java:597)
> >         at
> >
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
> >
> >
> > I took threads dump in the JVM on all the other Cassandra severs in my
> > cluster.  They all have thousand of threads looking like this:
> >
> > "JMX server connection timeout 183373" daemon prio=10
> tid=0x00002aad230db800
> > nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
> >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> >         at java.lang.Object.wait(Native Method)
> >         at
> >
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
> >         - locked <0x00002aab056ccee0> (a [I)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
> > creates a JMXConnector but never calls its close() method.  I tried
> setting
> > jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
> > JMX server connection timeout threads.  But that did not make any
> > difference.
> >
> > Has anyone else seen this?
> >
> > Bill
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Cassandra crashed - possible JMX threads leak

Posted by Jonathan Ellis <jb...@gmail.com>.
can you reproduce this by, say, running nodeprobe ring in a bash while loop?

On Wed, Oct 20, 2010 at 3:09 PM, Bill Au <bi...@gmail.com> wrote:
> One of my Cassandra server crashed with the following:
>
> ERROR [ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn] 2010-10-19 00:25:10,419
> CassandraDaemon.java (line 82) Uncaught exception in thread
> Thread[ACCEPT-xxx.xxx.xxx/nnn.nnn.nnn.nnn,5,main]
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:597)
>         at
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:533)
>
>
> I took threads dump in the JVM on all the other Cassandra severs in my
> cluster.  They all have thousand of threads looking like this:
>
> "JMX server connection timeout 183373" daemon prio=10 tid=0x00002aad230db800
> nid=0x5cf6 in Object.wait() [0x00002aad7a316000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at
> com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(ServerCommunicatorAdmin.java:150)
>         - locked <0x00002aab056ccee0> (a [I)
>         at java.lang.Thread.run(Thread.java:619)
>
> It seems to me that there is a JMX threads leak in Cassandra.  NodeProbe
> creates a JMXConnector but never calls its close() method.  I tried setting
> jmx.remote.x.server.connection.timeout to 0 hoping that would disable the
> JMX server connection timeout threads.  But that did not make any
> difference.
>
> Has anyone else seen this?
>
> Bill
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com