Posted to user@cassandra.apache.org by Terje Marthinussen <tm...@gmail.com> on 2011/04/24 16:24:39 UTC
0.8 losing nodes?
World as seen from .81 in the below ring
.81 Up Normal 85.55 GB 8.33% Token(bytes[30])
.82 Down Normal 83.23 GB 8.33% Token(bytes[313230])
.83 Up Normal 70.43 GB 8.33% Token(bytes[313437])
.84 Up Normal 81.7 GB 8.33% Token(bytes[313836])
.85 Up Normal 108.39 GB 8.33% Token(bytes[323336])
.86 Up Normal 126.19 GB 8.33% Token(bytes[333234])
.87 Up Normal 127.16 GB 8.33% Token(bytes[333939])
.88 Up Normal 135.92 GB 8.33% Token(bytes[343739])
.89 Up Normal 117.1 GB 8.33% Token(bytes[353730])
.90 Up Normal 101.67 GB 8.33% Token(bytes[363635])
.91 Down Normal 88.33 GB 8.33% Token(bytes[383036])
.92 Up Normal 129.95 GB 8.33% Token(bytes[6a])
From .82
.81 Down Normal 85.55 GB 8.33% Token(bytes[30])
.82 Up Normal 83.23 GB 8.33% Token(bytes[313230])
.83 Up Normal 70.43 GB 8.33% Token(bytes[313437])
.84 Up Normal 81.7 GB 8.33% Token(bytes[313836])
.85 Up Normal 108.39 GB 8.33% Token(bytes[323336])
.86 Up Normal 126.19 GB 8.33% Token(bytes[333234])
.87 Up Normal 127.16 GB 8.33% Token(bytes[333939])
.88 Up Normal 135.92 GB 8.33% Token(bytes[343739])
.89 Up Normal 117.1 GB 8.33% Token(bytes[353730])
.90 Up Normal 101.67 GB 8.33% Token(bytes[363635])
.91 Down Normal 88.33 GB 8.33% Token(bytes[383036])
.92 Up Normal 129.95 GB 8.33% Token(bytes[6a])
From .84
10.10.42.81 Down Normal 85.55 GB 8.33% Token(bytes[30])
10.10.42.82 Down Normal 83.23 GB 8.33% Token(bytes[313230])
10.10.42.83 Up Normal 70.43 GB 8.33% Token(bytes[313437])
10.10.42.84 Up Normal 81.7 GB 8.33% Token(bytes[313836])
10.10.42.85 Up Normal 108.39 GB 8.33% Token(bytes[323336])
10.10.42.86 Up Normal 126.19 GB 8.33% Token(bytes[333234])
10.10.42.87 Up Normal 127.16 GB 8.33% Token(bytes[333939])
10.10.42.88 Up Normal 135.92 GB 8.33% Token(bytes[343739])
10.10.42.89 Up Normal 117.1 GB 8.33% Token(bytes[353730])
10.10.42.90 Up Normal 101.67 GB 8.33% Token(bytes[363635])
10.10.42.91 Down Normal 88.33 GB 8.33% Token(bytes[383036])
10.10.42.92 Up Normal 129.95 GB 8.33% Token(bytes[6a])
All of the nodes seem to be working when looked at individually, and I can
see on, for instance, .84 that
INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611) InetAddress /.81 is now dead.
but there are no other messages related to the nodes "disappearing" as far
as I can see in the 18 hours since that message occurred.
Restarting seems to recover things, but nodes seem to go away again (0.8
also seems to be prone to commit logs being unreadable in some cases?)
This is 0.8 built from trunk last Friday.
I will try to enable some more debugging tomorrow to see if there is
anything interesting; just curious if anyone else has noticed something
like this.
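For the record, the extra gossip debugging can most likely be enabled by
raising the log level for the gms package (assuming the stock
conf/log4j-server.properties that ships with 0.8; adjust if your install
differs):

log4j.logger.org.apache.cassandra.gms=DEBUG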
Regards,
Terje
Re: 0.8 losing nodes?
Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Apr 25, 2011 at 12:21 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> I bet the problem is with the other tasks on the executor that Gossip
> heartbeat runs on.
>
> I see at least two that could cause blocking: hint cleanup
> post-delivery and flush-expired-memtables, both of which call
> forceFlush which will block if the flush queue + threads are full.
>
> We've run into this before (CASSANDRA-2253); we should move Gossip
> back to its own dedicated executor or it will keep happening whenever
> someone accidentally puts something on the "shared" executor that can
> block.
>
> Created https://issues.apache.org/jira/browse/CASSANDRA-2554 to fix
> this. Thanks for tracking down the problem!
This is good to have too, but isn't the problem: I broke it in the
gossiper refactoring.
https://issues.apache.org/jira/browse/CASSANDRA-2565
-Brandon
Re: 0.8 losing nodes?
Posted by Jonathan Ellis <jb...@gmail.com>.
I bet the problem is with the other tasks on the executor that Gossip
heartbeat runs on.
I see at least two that could cause blocking: hint cleanup
post-delivery and flush-expired-memtables, both of which call
forceFlush which will block if the flush queue + threads are full.
We've run into this before (CASSANDRA-2253); we should move Gossip
back to its own dedicated executor or it will keep happening whenever
someone accidentally puts something on the "shared" executor that can
block.
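To make the failure mode concrete, here is a minimal, self-contained
sketch (toy code, not Cassandra's actual scheduling; the class and method
names are invented) of how one blocking task starves everything else on a
shared single-threaded scheduled executor, while a dedicated executor
keeps ticking:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExecutorStarvationSketch {
    public static void main(String[] args) {
        // Shared single-threaded scheduler: one task that blocks (standing
        // in for a forceFlush waiting on a full flush queue) stalls the
        // heartbeat task scheduled on the same thread.
        ScheduledExecutorService shared = Executors.newSingleThreadScheduledExecutor();
        shared.scheduleWithFixedDelay(ExecutorStarvationSketch::blockForAMinute,
                                      0, 1, TimeUnit.SECONDS);
        shared.scheduleWithFixedDelay(() -> System.out.println("heartbeat (shared)"),
                                      0, 1, TimeUnit.SECONDS);

        // Dedicated scheduler for the heartbeat: unaffected by the blocker.
        ScheduledExecutorService gossip = Executors.newSingleThreadScheduledExecutor();
        gossip.scheduleWithFixedDelay(() -> System.out.println("heartbeat (dedicated)"),
                                      0, 1, TimeUnit.SECONDS);
    }

    private static void blockForAMinute() {
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Left running, only the dedicated heartbeat prints once a second; the
shared one gets stuck behind the blocker almost immediately.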
Created https://issues.apache.org/jira/browse/CASSANDRA-2554 to fix
this. Thanks for tracking down the problem!
On Mon, Apr 25, 2011 at 11:51 AM, Terje Marthinussen
<tm...@gmail.com> wrote:
> Got just enough time to look at this today to verify that:
>
> Sometimes nodes (under pressure) fail to send heartbeats for long
> enough to get marked as dead by other nodes (why is a good question,
> which I need to check further; it does not seem to be GC).
>
> The node does, however, start sending heartbeats again, and other nodes
> log that they receive them, but it will not get marked
> as UP again until restarted.
>
> So, there seem to be 2 issues:
> - Nodes pausing (may just be node overload)
> - Nodes not being marked as UP unless restarted
>
> Regards,
> Terje
>
> On 24 Apr 2011, at 23:24, Terje Marthinussen <tm...@gmail.com> wrote:
>
>> [original report and ring output quoted in full; snipped, see the
>> first message in this thread]
>
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
Re: 0.8 losing nodes?
Posted by Terje Marthinussen <tm...@gmail.com>.
Got just enough time to look at this today to verify that:
Sometimes nodes (under pressure) fail to send heartbeats for long
enough to get marked as dead by other nodes (why is a good question,
which I need to check further; it does not seem to be GC).
The node does, however, start sending heartbeats again, and other nodes
log that they receive them, but it will not get marked
as UP again until restarted.
So, there seem to be 2 issues (a sketch of the expected up/down
transitions follows below):
- Nodes pausing (may just be node overload)
- Nodes not being marked as UP unless restarted
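To spell out what "marked as UP" should look like, here is a toy sketch
(invented names and a fixed timeout; Cassandra's real failure detector is
the adaptive phi-accrual one, so this is only the shape of the logic): a
peer is declared dead when its last heartbeat is older than a threshold,
and declared UP again as soon as a fresh heartbeat arrives. The second
transition is the one that never seems to fire here.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitorSketch {
    private static final long TIMEOUT_MILLIS = 10_000; // invented threshold
    private final Map<String, Long> lastHeard = new ConcurrentHashMap<>();
    private final Map<String, Boolean> alive = new ConcurrentHashMap<>();

    // Called whenever a heartbeat arrives from a peer.
    public void onHeartbeat(String node) {
        lastHeard.put(node, System.currentTimeMillis());
        Boolean previous = alive.put(node, Boolean.TRUE);
        if (Boolean.FALSE.equals(previous))
            System.out.println(node + " is now UP"); // the missing transition
    }

    // Called periodically (e.g. once a second) to expire silent peers.
    public void checkTimeouts() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            boolean silent = now - entry.getValue() > TIMEOUT_MILLIS;
            if (silent && Boolean.TRUE.equals(alive.put(entry.getKey(), Boolean.FALSE)))
                System.out.println(entry.getKey() + " is now dead");
        }
    }
}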
Regards,
Terje
On 24 Apr 2011, at 23:24, Terje Marthinussen <tm...@gmail.com> wrote:
> [original report and ring output quoted in full; snipped, see the
> first message in this thread]