Posted to users@tomcat.apache.org by David Rees <dr...@gmail.com> on 2008/04/01 01:23:54 UTC

Re: Cluster Memory Leak - ClusterData and LinkObject classes

On Mon, Mar 31, 2008 at 12:49 PM, Rainer Jung <ra...@kippdata.de> wrote:
>  First to make sure: counting objects in general only makes sense after a
>  full GC. Otherwise the heap dump will contain garbage too.

Yes, I made sure the objects I was looking at had a valid GC
reference. They really were getting stuck in the queue.

>  Just some basic info: the LinkObject objects can be either in a
>  FastQueue, or they are used in a FastAsyncSocketSender directly after
>  removing from the FastQueue and before actually sending.
<snip>

Thank you for the detailed description on how the Queues work with the cluster.

>  Why you had that many LinkObjects is not clear. You could first try to
>  check whether the LinkObjects actually belong to a Queue or not (if not,
>  they are already in the Sender). Have a look at your log files for
>  errors or unexpected cluster membership messages.

One problem I've intermittently had with clustering is that after a
Tomcat restart (we shut down one node and it immediately restarts,
generally within 30 seconds), the two nodes don't consistently sync
up. (The restarted node would not have the sessions from the other
node, but new sessions would get replicated over.) I have to think that
this may be related to this issue.

I checked the Tomcat logs and didn't see any issues with members
dropping from the cluster until the JVM got close to running out of
memory and performing a lot of full GCs. When examining the dump, the
vast majority of space in the heap (600+ MB out of 1 GB) was taken up
by byte arrays referenced by LinkObjects.

>  In general I would suggest not using the waitForAck feature. That's not
>  a strict rule, but if you do async replication and use session
>  stickiness for load balancing, then you usually put a strong focus on
>  the replication not influencing your webapp negatively. Activating
>  waitForAck lets you detect more reliably if there is a replication
>  problem, but it also increases the overhead. Your mileage may vary.

So what would cause the FastQueue to accumulate ClusterData even when
the cluster is apparently running properly? Is there any failsafe
(besides setting a maximum queue size) that allows old data to be purged?
I mean, 600k ClusterData objects is a lot!
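
For reference, waitForAck is set on the Sender element of the Cluster
configuration in server.xml. Below is a minimal Tomcat 5.5-style sketch of an
async setup with acks disabled; the class name and attributes follow the stock
5.5 cluster example, so treat it as an illustration rather than a drop-in
config for any particular setup:

  <!-- inside <Cluster> in server.xml (Tomcat 5.5, illustrative only) -->
  <Sender className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
          replicationMode="fastasyncqueue"
          ackTimeout="15000"
          waitForAck="false"/>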

-Dave

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org


Re: Cluster Memory Leak - ClusterData and LinkObject classes

Posted by Filip Hanik - Dev Lists <de...@hanik.com>.
David Rees wrote:
> On Mon, Mar 31, 2008 at 4:48 PM, Filip Hanik - Dev Lists
> <de...@hanik.com> wrote:
>   
>> David Rees wrote:
>>  > One problem I've intermittently had with clustering is that after a
>>  > Tomcat restart (we shut down one node and it immediately restarts,
>>  > generally within 30 seconds), the two nodes don't consistently sync
>>  > up. (The restarted node would not have the sessions from the other
>>  > node, but new sessions would get replicated over.) I have to think that
>>  > this may be related to this issue.
>>
>>  I believe you have to wait at least 30 seconds before you bring up the
>>  other node.
>>  Especially if you are using mcastDropTime="30000" (could be the
>>  default?), your nodes won't even realize this one is gone, and when
>>  you bring it back up within 30 seconds, to the other nodes it's like
>>  nothing ever changed.
>>     
>
> OK, I'll have to try to figure out how to keep Tomcat from starting up
> until 30 seconds have passed, then. It seems like this type of limitation
> deserves a big fat warning in the Cluster HOW-TO. Do you think that
> this could trigger the Queue build-up issue?
>   
My guess is no; the queue build-up would be caused by something else.
>   
>>  As Rainer mentioned, if you are just starting to use clustering, switch to
>>  TC6 to avoid the migration you would otherwise have to make later. TC6 also
>>  handles this scenario regardless of what you set your drop time to.
>>     
>
> I'd like to migrate to TC6, but that also means that we have to
> perform full validation testing, which takes time and effort... Not to
> mention that it seems that TC6 has only recently stabilized compared
> to TC5. But it's good to know that TC6 resolves this issue. Can you
> explain the features in TC6 which prevent this from being an issue?
>   
In TC6 a member/node receives a unique ID each time the JVM starts up
and the member is created.
In TC5.5 a member is identified by IP:listenPort, which is the same
each time you restart a Tomcat. TC6 also uses the unique ID, so
Tomcat 6 does recognize the difference between a node whose process was
restarted and a node that simply got disconnected from the network.
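
To make the identity difference concrete, the membership piece of a TC6
Channel config looks roughly like the sketch below (McastService and its
attributes are the usual TC6 ones, the values are just the common defaults,
shown for illustration); the unique ID itself is never configured, it is
generated at runtime when the member starts:

  <!-- Tomcat 6 <Channel> membership, illustrative values -->
  <Membership className="org.apache.catalina.tribes.membership.McastService"
              address="228.0.0.4"
              port="45564"
              frequency="500"
              dropTime="3000"/>
  <!-- besides the host:port announced in the heartbeat, each member carries
       a uniqueId created when the JVM starts, so a restarted process shows
       up as a new member even though its IP:listenPort has not changed -->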

Filip
> -Dave




Re: Cluster Memory Leak - ClusterData and LinkObject classes

Posted by David Rees <dr...@gmail.com>.
On Mon, Mar 31, 2008 at 4:48 PM, Filip Hanik - Dev Lists
<de...@hanik.com> wrote:
> David Rees wrote:
>  > One problem I've intermittently had with clustering is that after a
>  > Tomcat restart (we shut down one node and it immediately restarts,
>  > generally within 30 seconds), the two nodes don't consistently sync
>  > up. (The restarted node would not have the sessions from the other
>  > node, but new sessions would get replicated over.) I have to think that
>  > this may be related to this issue.
>
>  I believe you have to wait at least 30 seconds before you bring up the
>  other node.
>  Especially if you are using mcastDropTime="30000" (could be the
>  default?), your nodes won't even realize this one is gone, and when
>  you bring it back up within 30 seconds, to the other nodes it's like
>  nothing ever changed.

OK, I'll have to try to figure out how to keep Tomcat from starting up
until 30 seconds have passed, then. It seems like this type of limitation
deserves a big fat warning in the Cluster HOW-TO. Do you think that
this could trigger the Queue build-up issue?

>  As Rainer mentioned, if you are just starting to use clustering, switch to
>  TC6 to avoid the migration you would otherwise have to make later. TC6 also
>  handles this scenario regardless of what you set your drop time to.

I'd like to migrate to TC6, but that also means that we have to
perform full validation testing, which takes time and effort... Not to
mention that it seems that TC6 has only recently stabilized compared
to TC5. But it's good to know that TC6 resolves this issue. Can you
explain the features in TC6 which prevent this from being an issue?

-Dave



Re: Cluster Memory Leak - ClusterData and LinkObject classes

Posted by Filip Hanik - Dev Lists <de...@hanik.com>.
David Rees wrote:
> On Mon, Mar 31, 2008 at 12:49 PM, Rainer Jung <ra...@kippdata.de> wrote:
>   
>>  First to make sure: counting objects in general only makes sense after a
>>  full GC. Otherwise the heap dump will contain garbage too.
>>     
>
> Yes, I made sure the objects I was looking at had a valid GC
> reference. They really were getting stuck in the queue.
>
>   
>>  Just some basic info: the LinkObject objects can be either in a
>>  FastQueue, or they are used in a FastAsyncSocketSender directly after
>>  removing from the FastQueue and before actually sending.
>>     
> <snip>
>
> Thank you for the detailed description on how the Queues work with the cluster.
>
>   
>>  Why you had that many LinkObjects is not clear. You could first try to
>>  check whether the LinkObjects actually belong to a Queue or not (if not,
>>  they are already in the Sender). Have a look at your log files for
>>  errors or unexpected cluster membership messages.
>>     
>
> One problem I've intermittently had with clustering is that after a
> Tomcat restart (we shut down one node and it immediately restarts,
> generally within 30 seconds), the two nodes don't consistently sync
> up. (The restarted node would not have the sessions from the other
> node, but new sessions would get replicated over.) I have to think that
> this may be related to this issue.
>   
I believe you have to wait at least 30 seconds before you bring up the
other node.
Especially if you are using mcastDropTime="30000" (could be the
default?), your nodes won't even realize this one is gone, and when
you bring it back up within 30 seconds, to the other nodes it's like
nothing ever changed.
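
In the TC5.5 cluster config that drop time is the mcastDropTime attribute on
the multicast Membership element. A rough sketch along the lines of the stock
5.5 example follows (the 30000 value is only there to match the 30 second
window discussed above, not a recommendation):

  <!-- inside <Cluster> in server.xml (Tomcat 5.5, illustrative only) -->
  <Membership className="org.apache.catalina.cluster.mcast.McastService"
              mcastAddr="228.0.0.4"
              mcastPort="45564"
              mcastFrequency="500"
              mcastDropTime="30000"/>
  <!-- a node is only declared gone after mcastDropTime ms of missed
       heartbeats, so restarting the other node faster than that means the
       drop and rejoin are never noticed -->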

As Rainer mentioned, if you are just starting to use clustering, switch to 
TC6 to avoid the migration you would otherwise have to make later. TC6 also 
handles this scenario regardless of what you set your drop time to.

Filip
> I checked the Tomcat logs and didn't see any issues with members
> dropping from the cluster until the JVM got close to running out of
> memory and performing a lot of full GCs. When examining the dump, the
> vast majority of space in the heap (600+ MB out of 1 GB) was taken up
> by byte arrays referenced by LinkObjects.
>
>   
>>  In general I would suggest not using the waitForAck feature. That's not
>>  a strict rule, but if you do async replication and use session
>>  stickiness for load balancing, then you usually put a strong focus on
>>  the replication not influencing your webapp negatively. Activating
>>  waitForAck lets you detect more reliably if there is a replication
>>  problem, but it also increases the overhead. Your mileage may vary.
>>     
>
> So what would cause the FastQueue to accumulate ClusterData even when
> the cluster is apparently running properly? Is there any failsafe
> (besides setting a maximum queue size) that allows old data to be purged?
> I mean, 600k ClusterData objects is a lot!
>
> -Dave

