Posted to user@cassandra.apache.org by Terje Marthinussen <tm...@gmail.com> on 2011/04/24 16:24:39 UTC

0.8 losing nodes?

World as seen from .81 in the below ring
.81     Up     Normal  85.55 GB        8.33%   Token(bytes[30])
.82     Down   Normal  83.23 GB        8.33%   Token(bytes[313230])
.83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
.84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
.85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
.86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
.87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
.88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
.89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
.90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
.91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
.92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])


From .82
.81     Down   Normal  85.55 GB        8.33%   Token(bytes[30])
.82     Up     Normal  83.23 GB        8.33%   Token(bytes[313230])
.83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
.84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
.85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
.86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
.87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
.88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
.89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
.90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
.91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
.92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])

From .84
10.10.42.81     Down   Normal  85.55 GB        8.33%   Token(bytes[30])
10.10.42.82     Down   Normal  83.23 GB        8.33%   Token(bytes[313230])
10.10.42.83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
10.10.42.84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
10.10.42.85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
10.10.42.86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
10.10.42.87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
10.10.42.88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
10.10.42.89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
10.10.42.90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
10.10.42.91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
10.10.42.92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])

All of the nodes seem to be working when looked at individually, and on
for instance .84 I can see that
 INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611)
InetAddress /.81 is now dead.

but there are no other messages related to the nodes "disappearing" as far
as I can see in the 18 hours since that message occurred.

Restarting seems to recover things, but nodes seem to go away again (0.8
also seems to be prone to commit logs being unreadable in some cases?)

This is 0.8 build from trunk last Friday.

I will try to enable some more debugging tomorrow to see if there is
anything interesting; just curious if anyone else has noticed something
like this.

Regards,
Terje

Re: 0.8 losing nodes?

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Apr 25, 2011 at 12:21 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> I bet the problem is with the other tasks on the executor that Gossip
> heartbeat runs on.
>
> I see at least two that could cause blocking: hint cleanup
> post-delivery and flush-expired-memtables, both of which call
> forceFlush which will block if the flush queue + threads are full.
>
> We've run into this before (CASSANDRA-2253); we should move Gossip
> back to its own dedicated executor or it will keep happening whenever
> someone accidentally puts something on the "shared" executor that can
> block.
>
> Created https://issues.apache.org/jira/browse/CASSANDRA-2554 to fix
> this.  Thanks for tracking down the problem!

This is good to have too, but isn't the problem: I broke it in the
gossiper refactoring.

https://issues.apache.org/jira/browse/CASSANDRA-2565

-Brandon

Re: 0.8 losing nodes?

Posted by Jonathan Ellis <jb...@gmail.com>.
I bet the problem is with the other tasks on the executor that Gossip
heartbeat runs on.

I see at least two that could cause blocking: hint cleanup
post-delivery and flush-expired-memtables, both of which call
forceFlush which will block if the flush queue + threads are full.

We've run into this before (CASSANDRA-2253); we should move Gossip
back to its own dedicated executor or it will keep happening whenever
someone accidentally puts something on the "shared" executor that can
block.

Created https://issues.apache.org/jira/browse/CASSANDRA-2554 to fix
this.  Thanks for tracking down the problem!
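The shared-executor hazard described above can be sketched in plain Java.
This is a hypothetical stand-alone demo, not Cassandra code; the class and
method names are invented. A periodic "heartbeat" task shares one scheduler
thread with a task that blocks (standing in for a forceFlush waiting on a
full flush queue), and the longest gap between heartbeats grows to roughly
the length of the block:

```java
import java.util.concurrent.*;

// Hypothetical sketch (not Cassandra's actual classes): a blocking task on
// the same single-threaded scheduled executor starves a periodic heartbeat.
public class SharedExecutorDemo {
    // Schedules a 50 ms heartbeat and a task that blocks for blockMillis on
    // the same single thread; returns the longest observed gap between beats.
    static long maxGapMillis(long blockMillis) throws Exception {
        ScheduledExecutorService shared = Executors.newScheduledThreadPool(1);
        long[] last = {System.nanoTime()};
        long[] maxGap = {0};
        shared.scheduleAtFixedRate(() -> {
            long now = System.nanoTime();
            maxGap[0] = Math.max(maxGap[0], now - last[0]);
            last[0] = now;
        }, 0, 50, TimeUnit.MILLISECONDS);
        // Stand-in for hint cleanup / flush-expired-memtables calling
        // forceFlush and blocking while the flush queue + threads are full.
        shared.submit(() -> {
            try { Thread.sleep(blockMillis); } catch (InterruptedException e) {}
        });
        Thread.sleep(blockMillis + 300);
        shared.shutdownNow();
        return maxGap[0] / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("longest heartbeat gap: " + maxGapMillis(500) + " ms");
    }
}
```

With a dedicated executor for the heartbeat, the gap stays near the 50 ms
period no matter what other tasks do, which is the point of moving Gossip
off the shared executor.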

On Mon, Apr 25, 2011 at 11:51 AM, Terje Marthinussen
<tm...@gmail.com> wrote:
> Got just enough time to look into this today, enough to verify that:
>
> Sometimes nodes (under pressure) fail to send heartbeats for long
> enough to get marked as dead by other nodes (why is a good question,
> which I need to check better; it does not seem to be GC).
>
> The node does, however, start sending heartbeats again, and other nodes
> log that they receive the heartbeats, but this will not get the node marked
> as UP again until it is restarted.
>
> So, it seems like there are 2 issues:
> - Nodes pausing (may be just node overload)
> - Nodes not being marked as UP unless restarted
>
> Regards,
> Terje



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: 0.8 losing nodes?

Posted by Terje Marthinussen <tm...@gmail.com>.
Got just enough time to look into this today, enough to verify that:

Sometimes nodes (under pressure) fail to send heartbeats for long
enough to get marked as dead by other nodes (why is a good question,
which I need to check better; it does not seem to be GC).

The node does, however, start sending heartbeats again, and other nodes
log that they receive the heartbeats, but this will not get the node marked
as UP again until it is restarted.

So, it seems like there are 2 issues:
- Nodes pausing (may be just node overload)
- Nodes not being marked as UP unless restarted
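For what it's worth, the second symptom (heartbeats arrive but the node
stays DOWN until restart) is the signature of a state bug in the up/down
transition rather than in heartbeat delivery. A toy illustration of the
symptom only, entirely hypothetical and not the actual gossiper code
(ToyDetector and its fields are invented):

```java
// Toy sketch of the symptom, not Cassandra's Gossiper: the detector marks a
// peer DOWN when heartbeats go stale, but its UP transition is guarded by
// per-peer state that the DOWN path corrupts, so resumed heartbeats never
// satisfy the guard and the peer stays DOWN until the state is reset
// (i.e. the node is restarted).
public class ToyDetector {
    public boolean up = true;
    public long versionAtDeath = 0;

    public void onHeartbeat(long version) {
        // UP transition: only when the heartbeat version has advanced past
        // the version recorded at the time of death.
        if (!up && version > versionAtDeath) {
            up = true;
        }
    }

    public void markDead() {
        up = false;
        // The kind of state bug that strands a peer DOWN forever: the
        // recorded death version is set to a value no heartbeat can exceed.
        versionAtDeath = Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        ToyDetector peer = new ToyDetector();
        peer.markDead();        // heartbeat goes stale -> marked dead
        peer.onHeartbeat(100);  // heartbeats resume...
        peer.onHeartbeat(101);
        System.out.println("up=" + peer.up);  // ...but it never comes back UP
    }
}
```

If the debug logging shows heartbeat versions advancing on the receiving
side while the node stays DOWN, something like this broken guard in the
alive-transition path would fit what I'm seeing.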

Regards,
Terje
