You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by shimi <sh...@gmail.com> on 2010/09/20 16:51:59 UTC

FatClient Gossip error and some other problems

I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5).
My old cluster was unbalanced since I was using Random Partitioner and I
bootstrapped all the nodes without specifying their tokens.

Since I wanted the the cluster to be balanced I first added all the new
nodes one after the other (with the right tokens this time) and then I run
decommission on all the old ones, one after the other.
One of the decommissioned nodes began throwing too many open files errors
while It was decommissioning taking other nodes with him. After the second
try I decided to stop it and run removetoken on his token from one of the
other nodes. After that everything went well except that in the end one of
the nodes looked unbalanced.

I decided to run repair on the cluster. What I got is totally unbalanced
nodes with way to much data then what is suppose to be. each node had x2-x4
more data.
I run cleanup and all of them except the one which was unbalanced to begin
with got back to the size they were suppose to be.
Now whenever I try to run cleanup on this node I get:

 INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java
(line 339) AntiCompacting ...
 INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC
for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used;
max is 6552551424
 INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150)
Pool Name                    Active   Pending
 INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
STREAM-STAGE                      0         0
 INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
RESPONSE-STAGE                    0         0
 INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156)
ROW-READ-STAGE                    8       717
 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
LB-OPERATIONS                     0         0
 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
MISCELLANEOUS-POOL                0         0
 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
GMFD                              0         2
 INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
CONSISTENCY-MANAGER               0         1
 INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
LB-TARGET                         0         0
 INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156)
ROW-MUTATION-STAGE                0         0
 INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
MESSAGE-STREAMING-POOL            0         0
 INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
LOAD-BALANCER-STAGE               0         0
 INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156)
FLUSH-SORTER-POOL                 0         0
 INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
MEMTABLE-POST-FLUSHER             0         0
 INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
AE-SERVICE-STAGE                  0         0
 INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
FLUSH-WRITER-POOL                 0         0
 INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
HINTED-HANDOFF-POOL               0         0
 INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161)
CompactionManager               n/a         0
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
SSTableDeletingReference.java (line 104) Deleted ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
SSTableDeletingReference.java (line 104) Deleted ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
SSTableDeletingReference.java (line 104) Deleted ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
SSTableDeletingReference.java (line 104) Deleted ...

and after that I saw an increase in the node response time and the number
ROW-READ-STAGE pending tasks. Since there was no indication that something
is wrong or that the node is doing anyuthing (logs ,nodetool and JMX), the
only thing that I could have done is to restart the server.

I don't know if this is related but every hour I see this error (I think it
is the IP of the machine that I couldn't decommission properly):

 INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient
/X.X.X.X has been silent for 3600000ms, removing from gossip
ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
java.util.ConcurrentModificationException
    at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
    at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
    at
org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
 INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node
/X.X.X.X is now part of the cluster

Does anyone have any idea how can I cleanup the problematic node?
Does anyone have any idea how can I get rid of the Gossip error?

Shimi

Re: FatClient Gossip error and some other problems

Posted by shimi <sh...@gmail.com>.
I was patient (although it is hard when you have millions of requests which
are not served in time). I was waiting for a long time. There was nothing in
the Logs and in JMX.

Shimi

On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek <gd...@gmail.com> wrote:

> On Mon, Sep 20, 2010 at 09:51, shimi <sh...@gmail.com> wrote:
> > I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
> > I replaced all of the servers in the cluster (0.6.4) with new ones
> (0.6.5).
> > My old cluster was unbalanced since I was using Random Partitioner and I
> > bootstrapped all the nodes without specifying their tokens.
> >
> > Since I wanted the the cluster to be balanced I first added all the new
> > nodes one after the other (with the right tokens this time) and then I
> run
> > decommission on all the old ones, one after the other.
> > One of the decommissioned nodes began throwing too many open files errors
> > while It was decommissioning taking other nodes with him. After the
> second
> > try I decided to stop it and run removetoken on his token from one of the
> > other nodes. After that everything went well except that in the end one
> of
> > the nodes looked unbalanced.
> >
> > I decided to run repair on the cluster. What I got is totally unbalanced
> > nodes with way to much data then what is suppose to be. each node had
> x2-x4
> > more data.
> > I run cleanup and all of them except the one which was unbalanced to
> begin
> > with got back to the size they were suppose to be.
> > Now whenever I try to run cleanup on this node I get:
> >
> >  INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java
> > (line 339) AntiCompacting ...
> >  INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129)
> GC
> > for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520
> used;
> > max is 6552551424
> >  INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150)
> > Pool Name                    Active   Pending
> >  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> > STREAM-STAGE                      0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> > RESPONSE-STAGE                    0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156)
> > ROW-READ-STAGE                    8       717
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> > LB-OPERATIONS                     0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> > MISCELLANEOUS-POOL                0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> > GMFD                              0         2
> >  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> > CONSISTENCY-MANAGER               0         1
> >  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> > LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156)
> > ROW-MUTATION-STAGE                0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> > LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156)
> > FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> > MEMTABLE-POST-FLUSHER             0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> > AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> > FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> > HINTED-HANDOFF-POOL               0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161)
> > CompactionManager               n/a         0
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
> > SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
> > SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
> > SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
> > SSTableDeletingReference.java (line 104) Deleted ...
> >
> > and after that I saw an increase in the node response time and the number
> > ROW-READ-STAGE pending tasks. Since there was no indication that
> something
> > is wrong or that the node is doing anyuthing (logs ,nodetool and JMX),
> the
> > only thing that I could have done is to restart the server.
> >
> > I don't know if this is related but every hour I see this error (I think
> it
> > is the IP of the machine that I couldn't decommission properly):
> >
> >  INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402)
> FatClient
> > /X.X.X.X has been silent for 3600000ms, removing from gossip
> > ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip
> error
> > java.util.ConcurrentModificationException
> >     at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
> >     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
> >     at
> > org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> >  INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node
> > /X.X.X.X is now part of the cluster
> >
> > Does anyone have any idea how can I cleanup the problematic node?
>
> You may just need to be patient.  Have you tried monitoring the
> CompactionManager in jmx to see if it is doing things?
>
> > Does anyone have any idea how can I get rid of the Gossip error?
>
> This is CASSANDRA-1494. You can ignore it.
>
> Gary.
>

Re: FatClient Gossip error and some other problems

Posted by Gary Dusbabek <gd...@gmail.com>.
On Mon, Sep 20, 2010 at 09:51, shimi <sh...@gmail.com> wrote:
> I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
> I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5).
> My old cluster was unbalanced since I was using Random Partitioner and I
> bootstrapped all the nodes without specifying their tokens.
>
> Since I wanted the the cluster to be balanced I first added all the new
> nodes one after the other (with the right tokens this time) and then I run
> decommission on all the old ones, one after the other.
> One of the decommissioned nodes began throwing too many open files errors
> while It was decommissioning taking other nodes with him. After the second
> try I decided to stop it and run removetoken on his token from one of the
> other nodes. After that everything went well except that in the end one of
> the nodes looked unbalanced.
>
> I decided to run repair on the cluster. What I got is totally unbalanced
> nodes with way to much data then what is suppose to be. each node had x2-x4
> more data.
> I run cleanup and all of them except the one which was unbalanced to begin
> with got back to the size they were suppose to be.
> Now whenever I try to run cleanup on this node I get:
>
>  INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java
> (line 339) AntiCompacting ...
>  INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC
> for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used;
> max is 6552551424
>  INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150)
> Pool Name                    Active   Pending
>  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> STREAM-STAGE                      0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
> RESPONSE-STAGE                    0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156)
> ROW-READ-STAGE                    8       717
>  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> LB-OPERATIONS                     0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> MISCELLANEOUS-POOL                0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
> GMFD                              0         2
>  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> CONSISTENCY-MANAGER               0         1
>  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
> LB-TARGET                         0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156)
> ROW-MUTATION-STAGE                0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> MESSAGE-STREAMING-POOL            0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
> LOAD-BALANCER-STAGE               0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156)
> FLUSH-SORTER-POOL                 0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> MEMTABLE-POST-FLUSHER             0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
> AE-SERVICE-STAGE                  0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> FLUSH-WRITER-POOL                 0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
> HINTED-HANDOFF-POOL               0         0
>  INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161)
> CompactionManager               n/a         0
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
> SSTableDeletingReference.java (line 104) Deleted ...
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
> SSTableDeletingReference.java (line 104) Deleted ...
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
> SSTableDeletingReference.java (line 104) Deleted ...
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
> SSTableDeletingReference.java (line 104) Deleted ...
>
> and after that I saw an increase in the node response time and the number
> ROW-READ-STAGE pending tasks. Since there was no indication that something
> is wrong or that the node is doing anyuthing (logs ,nodetool and JMX), the
> only thing that I could have done is to restart the server.
>
> I don't know if this is related but every hour I see this error (I think it
> is the IP of the machine that I couldn't decommission properly):
>
>  INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient
> /X.X.X.X has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
>     at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
>     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
>     at
> org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
>  INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node
> /X.X.X.X is now part of the cluster
>
> Does anyone have any idea how can I cleanup the problematic node?

You may just need to be patient.  Have you tried monitoring the
CompactionManager in jmx to see if it is doing things?

> Does anyone have any idea how can I get rid of the Gossip error?

This is CASSANDRA-1494. You can ignore it.

Gary.