Posted to user@cassandra.apache.org by Jason Hill <ja...@gmail.com> on 2012/10/23 04:41:19 UTC

Node Dead/Up

Hello,

I'm on version 1.0.11.

I'm seeing this in my system log occasionally:

INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818)
InetAddress /10.50.10.21 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804)
InetAddress /10.50.10.21 is now UP


INFO [StreamStage:1] 2012-10-23 02:24:38,763 StreamOutSession.java
(line 228) Streaming to /10.50.10.25 <--this line included for context
INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818)
InetAddress /10.50.10.25 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804)
InetAddress /10.50.10.25 is now UP
INFO [AntiEntropyStage:1] 2012-10-23 02:27:30,249
AntiEntropyService.java (line 233) [repair
#5a3383c0-1cb5-11e2-0000-56b66459adef] Sending completed merkle tree
to /10.50.10.25 for (Innovari,TICCompressedLoad) <--this line included
for context

What is this telling me? Is my network dropping for less than a
second? Are my nodes really dead and then up? Can someone shed some
light on this for me?

cheers,
Jason

Re: Node Dead/Up

Posted by Jason Wee <pe...@gmail.com>.
On Wed, Oct 24, 2012 at 2:32 PM, aaron morton <aa...@thelastpickle.com> wrote:

>  I don't see errors in the logs, but I do see
> a lot of dropped mutations and reads. Any correlation?
>
> Yes.
> The dropped messages mean the server is overloaded.
>
+1. Been there. An overloaded system normally produces frequent dropped
mutations and/or reads. Run nodetool tpstats; it will reveal many indicators.
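
As a rough sketch of how the dropped counts could be pulled out of the tpstats output for monitoring: the parser below assumes the output contains a two-column "Message type / Dropped" table after the thread pool table. That layout is an assumption and may differ between Cassandra versions, and the function name dropped_counts is made up for this example.

```python
def dropped_counts(tpstats_output):
    """Parse the assumed 'Message type / Dropped' table into {type: count}."""
    counts = {}
    in_dropped_section = False
    for line in tpstats_output.splitlines():
        fields = line.split()
        # The dropped-message table is assumed to start at this header row.
        if fields[:2] == ["Message", "type"]:
            in_dropped_section = True
            continue
        # Rows are assumed to look like "MUTATION   340".
        if in_dropped_section and len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts
```

The text to parse could come from something like subprocess.run(["nodetool", "-h", "localhost", "tpstats"], capture_output=True, text=True).stdout, assuming nodetool is on the PATH and the node answers on localhost. Non-zero counts that keep growing between runs would be consistent with the overload described above.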


> Look for log messages from the GCInspector in
> /var/log/cassandra/system.log and/or an overloaded IO system see
> http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html

Re: Node Dead/Up

Posted by aaron morton <aa...@thelastpickle.com>.
>  I don't see errors in the logs, but I do see
> a lot of dropped mutations and reads. Any correlation?
Yes. 
The dropped messages mean the server is overloaded. 

Look for log messages from the GCInspector in /var/log/cassandra/system.log, and/or check for an overloaded IO system; see http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html
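
A minimal sketch of that log check, for illustration only. It keeps lines that mention GCInspector and, as an assumption about the message format, treats the first "<number> ms" in the line as the pause duration. The function name gc_inspector_lines and the min_ms threshold are hypothetical.

```python
import re

# Assumption: the pause duration appears as "<number> ms" in the message.
DURATION_MS = re.compile(r"(\d+) ms")


def gc_inspector_lines(lines, min_ms=0):
    """Yield (duration_ms or None, line) for lines logged by GCInspector."""
    for line in lines:
        if "GCInspector" not in line:
            continue
        match = DURATION_MS.search(line)
        duration = int(match.group(1)) if match else None
        # Keep lines with no parseable duration so nothing is silently hidden.
        if duration is None or duration >= min_ms:
            yield duration, line.rstrip("\n")
```

It could be run with open("/var/log/cassandra/system.log") as the lines argument and, say, min_ms=200, and the timestamps of the reported pauses compared with the times of the "is now dead" messages.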

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com



Re: Node Dead/Up

Posted by Jason Hill <ja...@gmail.com>.
thanks for the replies.

I'll check the load on the node that is reported as DOWN/UP. At first
glance it does not appear to be overloaded, but I will dig in deeper.
Is there a specific indicator on an Ubuntu server that would be useful
to me?

Also, I didn't make it clear, but in my original post, there are logs
from 2 different nodes: 10.21 and 10.25. They are each reporting that
the other is DOWN/UP at the same time. Would that still point me to
the suggestions you made? I don't see errors in the logs, but I do see
a lot of dropped mutations and reads. Any correlation?

thanks again,
Jason


Re: Node Dead/Up

Posted by aaron morton <aa...@thelastpickle.com>.
> check 10.50.10.21 for what is the system load.
+1

And take a look in the logs on 10.21. 

10.21 is being seen as down by the other nodes. It could be:

* 10.21 failing to gossip fast enough, say by being overloaded or in long ParNew GC pauses.
* This node failing to process gossip fast enough, say by being overloaded or in long ParNew GC pauses.
* Problems with the tubes used to connect the nodes.

(It's probably the first one.)
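
To see how long each dead/UP flap actually lasted, the Gossiper lines from the original post can be paired up per address. The following is a sketch, not a supported tool: the regular expression is based only on the log lines quoted in this thread, and the function name flap_durations is made up for this example.

```python
import re
from datetime import datetime

# Pattern derived from the Gossiper lines quoted in the original post.
# \s+ tolerates the line wrap that the mail client introduced.
GOSSIP_EVENT = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) Gossiper\.java \(line \d+\)"
    r"\s+InetAddress /(\S+) is now (dead|UP)"
)


def flap_durations(log_text):
    """Return (address, seconds) for each dead -> UP transition found."""
    went_dead = {}
    flaps = []
    for stamp, address, state in GOSSIP_EVENT.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
        if state == "dead":
            went_dead[address] = when
        elif address in went_dead:
            gap = when - went_dead.pop(address)
            flaps.append((address, gap.total_seconds()))
    return flaps


SAMPLE = """\
INFO [GossipTasks:1] 2012-10-23 02:26:34,449 Gossiper.java (line 818)
InetAddress /10.50.10.21 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:34,620 Gossiper.java (line 804)
InetAddress /10.50.10.21 is now UP
INFO [GossipTasks:1] 2012-10-23 02:26:30,603 Gossiper.java (line 818)
InetAddress /10.50.10.25 is now dead.
INFO [GossipStage:1] 2012-10-23 02:26:40,763 Gossiper.java (line 804)
InetAddress /10.50.10.25 is now UP
"""

for address, seconds in flap_durations(SAMPLE):
    print(f"{address} was marked dead for {seconds:.3f} s")
```

On the quoted lines this reports a gap of well under a second for 10.50.10.21 and roughly ten seconds for 10.50.10.25, so the two flaps are not the same length.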
 
Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com



Re: Node Dead/Up

Posted by Jason Wee <pe...@gmail.com>.
Check the system load on 10.50.10.21.
