Posted to commits@cassandra.apache.org by "Ramzi Rabah (JIRA)" <ji...@apache.org> on 2009/12/23 18:36:29 UTC

[jira] Created: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
----------------------------------------------------------------------------------------

                 Key: CASSANDRA-651
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
             Project: Cassandra
          Issue Type: Bug
         Environment: latest in 0.5 branch
            Reporter: Ramzi Rabah


From the cassandra user message board:
"I just recently upgraded to latest in 0.5 branch, and I am running
into a serious issue. I have a cluster with 4 nodes, rackunaware
strategy, and using my own tokens distributed evenly over the hash
space. I am writing/reading equally to them at an equal rate of about
230 reads/writes per second (and cfstats shows that). The first 3 nodes
are seeds, the last one isn't. When I start all the nodes together at
the same time, they all receive equal amounts of reads/writes (about
230).
When I bring node 4 down and bring it back up again, node 4's load
fluctuates between the 230 it used to get to sometimes no traffic at
all. The other 3 still have the same amount of traffic. And no errors
whatsoever seen in logs."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by Jonathan Ellis <jb...@gmail.com>.

If you want this to be part of the jira record, you need to add it as
a comment on the issue; jira is not configured to turn emails into
comments automatically.

On Sun, Dec 27, 2009 at 11:07 PM, Michael Lee
<ma...@gmail.com> wrote:
> Confirm this issue by following tests
> suppose a cluster contained 8 nodes, which contained about 10000 rows(key range from 1 to 10000):
> Address       Status     Load          Range                                      Ring
>                                       170141183460469231731687303715884105728
> 10.237.4.85   Up         757.13 MB     21267647932558653966460912964485513216     |<--|
> 10.237.1.135  Up         761.54 MB     42535295865117307932921825928971026432     |   ^
> 10.237.1.137  Up         748.02 MB     63802943797675961899382738893456539648     v   |
> 10.237.1.139  Up         732.36 MB     85070591730234615865843651857942052864     |   ^
> 10.237.1.140  Up         725.6 MB      106338239662793269832304564822427566080    v   |
> 10.237.1.141  Up         726.59 MB     127605887595351923798765477786913079296    |   ^
> 10.237.1.143  Up         728.16 MB     148873535527910577765226390751398592512    v   |
> 10.237.1.144  Up         745.69 MB     170141183460469231731687303715884105728    |-->|
>
> (1)     Read keys range [1-10000], all keys read out ok ( client send read request directly to 10.237.4.85, 10.237.1.137, 10.237.1.140, 10.237.1.143 )
> (2)     Turn-off 10.237.1.135 while remain pressure, some read request will time out,
> after all nodes know 10.237.1.135 has down (about 10 s later), all read request become ok again, that’s fine
> (3)     After turn-on 10.237.1.135(and cassandra service also), some read request will time out again, and will remain FOREVER even all nodes know 10.237.1.135 has up,
> That’s a PROBLEM!
> (4)     Reboot 10.237.1.135, problem remains.
> (5)     If stop pressure and reboot whole cluster then perform step 1, all things are fine, again…..
>
> All read request use Quorum policy, version of Cassandra is apache-cassandra-incubating-0.5.0-beta2, and I’ve tested apache-cassandra-incubating-0.5.0-RC1, problem remains.
>
> After read system.log, I found after 10.237.1.135 down and up again, other nodes will not establish tcp connection to it(on tcp port 7000 ) forever!
> And read request sent to 10.237.1.135(into Pending-Writes because socket channel is closed) will not sent to net forever(from observing tcpdump).
>
> It’s seems when 10.237.1.135 going down in step2, some socket channel was reset ,
> after 10.237.1.135 come back, these socket channel remain closed, forever
> ---------END----------
>
>
> -----Original Message-----
> From: Jonathan Ellis (JIRA) [mailto:jira@apache.org]
> Sent: Thursday, December 24, 2009 10:47 AM
> To: cassandra-commits@incubator.apache.org
> Subject: [jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
>
>
>     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Jonathan Ellis updated CASSANDRA-651:
> -------------------------------------
>
>    Fix Version/s: 0.5
>         Assignee: Jaakko Laine
>
>> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
>> ----------------------------------------------------------------------------------------
>>
>>                 Key: CASSANDRA-651
>>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>>             Project: Cassandra
>>          Issue Type: Bug
>>          Components: Core
>>    Affects Versions: 0.5
>>         Environment: latest in 0.5 branch
>>            Reporter: Ramzi Rabah
>>            Assignee: Jaakko Laine
>>             Fix For: 0.5

RE: [jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by Michael Lee <ma...@gmail.com>.

I can confirm this issue with the following test.
Suppose a cluster contains 8 nodes holding about 10000 rows (keys range from 1 to 10000):
Address       Status     Load          Range                                      Ring
                                       170141183460469231731687303715884105728    
10.237.4.85   Up         757.13 MB     21267647932558653966460912964485513216     |<--|
10.237.1.135  Up         761.54 MB     42535295865117307932921825928971026432     |   ^
10.237.1.137  Up         748.02 MB     63802943797675961899382738893456539648     v   |
10.237.1.139  Up         732.36 MB     85070591730234615865843651857942052864     |   ^
10.237.1.140  Up         725.6 MB      106338239662793269832304564822427566080    v   |
10.237.1.141  Up         726.59 MB     127605887595351923798765477786913079296    |   ^
10.237.1.143  Up         728.16 MB     148873535527910577765226390751398592512    v   |
10.237.1.144  Up         745.69 MB     170141183460469231731687303715884105728    |-->|

(1)	Read keys in the range [1-10000]; all keys are read OK (the client sends read requests directly to 10.237.4.85, 10.237.1.137, 10.237.1.140, and 10.237.1.143).
(2)	Turn off 10.237.1.135 while keeping up the load; some read requests time out.
After all nodes know 10.237.1.135 is down (about 10 s later), all read requests succeed again. That’s fine.
(3)	After turning 10.237.1.135 back on (and the Cassandra service with it), some read requests time out again, and keep timing out FOREVER, even after all nodes know 10.237.1.135 is up.
That’s a PROBLEM!
(4)	Reboot 10.237.1.135; the problem remains.
(5)	If I stop the load and reboot the whole cluster, then perform step 1, everything is fine again.

All read requests use the Quorum policy. The Cassandra version is apache-cassandra-incubating-0.5.0-beta2; I have also tested apache-cassandra-incubating-0.5.0-RC1, and the problem remains.

After reading system.log, I found that after 10.237.1.135 goes down and comes back up, the other nodes never establish a TCP connection to it (on TCP port 7000) again!
Read requests sent to 10.237.1.135 (queued in Pending-Writes because the socket channel is closed) are never sent to the network (observed with tcpdump).

It seems that when 10.237.1.135 went down in step 2, some socket channels were reset,
and after 10.237.1.135 came back, these socket channels remained closed, forever.
---------END----------


-----Original Message-----
From: Jonathan Ellis (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, December 24, 2009 10:47 AM
To: cassandra-commits@incubator.apache.org
Subject: [jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.


     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-651:
-------------------------------------

    Fix Version/s: 0.5
         Assignee: Jaakko Laine

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Jaakko Laine
>             Fix For: 0.5

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-651:
-------------------------------------

    Attachment:     (was: 651-v4.patch)

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch, 651-v4.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Ramzi Rabah (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ramzi Rabah updated CASSANDRA-651:
----------------------------------

          Component/s: Core
    Affects Version/s: 0.5

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795433#action_12795433 ] 

Jonathan Ellis commented on CASSANDRA-651:
------------------------------------------

Got it: the problem is that sendOneWay calls shutdown on SocketException, not errorClose.  So the part that really matters in Gary's fix is adding the nulling out to shutdown.
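
A minimal sketch of the failure mode described in this comment, under simplifying assumptions. The names used here (PeerConnection, Channel, send, connectCount) are hypothetical stand-ins and are not Cassandra's actual connection-manager code; the fake Channel replaces a real socket so the example runs without a network. It only illustrates why a shutdown path that closes the cached connection without nulling out the reference leaves every later send stuck on a dead channel, while nulling it out lets the next send reconnect.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a cached outbound connection to one peer. */
public class PeerConnection {
    /** Minimal fake transport so the sketch runs without a network. */
    static final class Channel {
        private boolean open = true;
        final List<String> sent = new ArrayList<>();

        boolean isOpen() { return open; }
        void close() { open = false; }
        void write(String msg) {
            if (!open) throw new IllegalStateException("channel closed");
            sent.add(msg);
        }
    }

    private final boolean nullOutOnShutdown;
    private Channel channel;       // cached connection, reused across sends
    private int connectCount = 0;  // how many times we (re)connected

    public PeerConnection(boolean nullOutOnShutdown) {
        this.nullOutOnShutdown = nullOutOnShutdown;
    }

    /** Connects lazily, then writes. Returns false if the message could not be written. */
    public synchronized boolean send(String msg) {
        if (channel == null) {
            channel = new Channel();
            connectCount++;
        }
        if (!channel.isOpen()) {
            return false;  // stale cached channel: the message never leaves this node
        }
        channel.write(msg);
        return true;
    }

    /** Called when a send fails at the socket level. */
    public synchronized void shutdown() {
        if (channel != null) {
            channel.close();
            if (nullOutOnShutdown) {
                channel = null;  // forget the dead channel so the next send reconnects
            }
        }
    }

    public synchronized int connectCount() { return connectCount; }

    public static void main(String[] args) {
        PeerConnection buggy = new PeerConnection(false);
        buggy.send("a");
        buggy.shutdown();
        System.out.println("buggy resend ok: " + buggy.send("b"));

        PeerConnection fixed = new PeerConnection(true);
        fixed.send("a");
        fixed.shutdown();
        System.out.println("fixed resend ok: " + fixed.send("b"));
    }
}
```

Running main prints "buggy resend ok: false" followed by "fixed resend ok: true": only the variant that nulls out the reference opens a second connection.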

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795466#action_12795466 ] 

Gary Dusbabek commented on CASSANDRA-651:
-----------------------------------------

Reviewed.  +1 on the v4 patch.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch, 651-v4.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-651:
------------------------------------

    Attachment: 651.patch

Patched into trunk.  Links the gossip and messaging services so that invalid connection pools can be shut down when a node goes offline.
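
The linkage described in this update can be sketched as a simple listener pattern. This is an illustration only, assuming hypothetical names (EndpointStateListener, FailureDetector, ConnectionPools, convict, poolFor); it is not the content of 651.patch and does not reproduce Cassandra's real gossip or messaging classes. The idea shown is that the messaging side registers with the failure detector and drops its cached pool for an endpoint as soon as that endpoint is marked down, so a fresh pool is built on the next send. The sketch is single-threaded for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GossipMessagingLink {
    /** Hypothetical callback invoked when an endpoint is considered down. */
    interface EndpointStateListener {
        void onDown(String endpoint);
    }

    /** Hypothetical failure detector: notifies listeners when a node is convicted. */
    static final class FailureDetector {
        private final List<EndpointStateListener> listeners = new ArrayList<>();

        void register(EndpointStateListener listener) { listeners.add(listener); }

        void convict(String endpoint) {
            for (EndpointStateListener listener : listeners) {
                listener.onDown(endpoint);
            }
        }
    }

    /** Hypothetical messaging side: one cached pool object per endpoint. */
    static final class ConnectionPools implements EndpointStateListener {
        private final Map<String, Object> pools = new HashMap<>();
        private int created = 0;

        /** Returns the cached pool for the endpoint, creating one if none exists. */
        Object poolFor(String endpoint) {
            return pools.computeIfAbsent(endpoint, e -> { created++; return new Object(); });
        }

        /** Drops the pool so no later send reuses connections to a dead node. */
        @Override
        public void onDown(String endpoint) {
            pools.remove(endpoint);
        }

        boolean hasPool(String endpoint) { return pools.containsKey(endpoint); }
        int created() { return created; }
    }

    public static void main(String[] args) {
        FailureDetector fd = new FailureDetector();
        ConnectionPools pools = new ConnectionPools();
        fd.register(pools);

        pools.poolFor("10.237.1.135");   // traffic creates a pool
        fd.convict("10.237.1.135");      // node goes offline
        System.out.println("pool dropped: " + !pools.hasPool("10.237.1.135"));

        pools.poolFor("10.237.1.135");   // node is back; next send builds a new pool
        System.out.println("pools created: " + pools.created());
    }
}
```

Running main prints "pool dropped: true" and then "pools created: 2".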

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Jaakko Laine
>             Fix For: 0.5
>
>         Attachments: 651.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Ramzi Rabah (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794124#action_12794124 ] 

Ramzi Rabah commented on CASSANDRA-651:
---------------------------------------

This is definitely a regression in 0.5. I tested out 0.4.2 and it works perfectly fine, and load goes back up to 100% on the restarted node. 

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-651:
-------------------------------------

    Fix Version/s: 0.5
         Assignee: Jaakko Laine

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Jaakko Laine
>             Fix For: 0.5

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795110#action_12795110 ] 

Jonathan Ellis commented on CASSANDRA-651:
------------------------------------------

That said, it could be good to have the FD (failure detector) *in addition* to the other, so that if a node that doesn't have much traffic goes down for a while, we don't unnecessarily lose the first attempted message once it's back up.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-651:
-------------------------------------

    Fix Version/s:     (was: 0.9)

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch, 651-v4.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-651:
-------------------------------------

    Attachment: 651-v4.patch

This version gets rid of the formerly-problematic-and-now-redundant SocketException block, and renames TCM.shutdown to reset.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch, 651-v4.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795125#action_12795125 ] 

Brandon Williams commented on CASSANDRA-651:
--------------------------------------------

I can still reproduce the issue with this patch applied.  I'm receiving the following traceback:

ERROR - Fatal exception in thread Thread[TCP Selector Manager,5,main]
java.lang.AssertionError
        at org.apache.cassandra.net.TcpConnectionManager.destroy(TcpConnectionManager.java:85)
        at org.apache.cassandra.net.TcpConnection.errorClose(TcpConnection.java:319)
        at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:364)
        at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:143)
        at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:107)

This happens because ackCon is already null.  With the assert removed, I'm unable to reproduce the issue.
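
The failure mode in the traceback (a destroy call asserting on a connection that has already been cleared) can be illustrated with a minimal sketch. It assumes a hypothetical ConnectionSlot class with a single ackCon field; these names are for illustration only and this is not the actual TcpConnectionManager code or the patch. The idea is simply that tearing down a connection twice should be a no-op rather than an AssertionError.

```java
// Hypothetical sketch: ConnectionSlot and its methods are illustrative names,
// not the real TcpConnectionManager API.
final class ConnectionSlot {
    private Object ackCon;      // stand-in for the pooled connection
    private int destroyCount;   // how many times a connection was actually cleared

    ConnectionSlot(Object initial) {
        this.ackCon = initial;
    }

    // Clears the slot only if it still holds `con`. A repeated call for the
    // same connection returns false instead of tripping an assertion.
    synchronized boolean destroy(Object con) {
        if (ackCon == null || ackCon != con) {
            return false; // already cleared or replaced: nothing to do
        }
        ackCon = null;
        destroyCount++;
        return true;
    }

    synchronized boolean isEmpty() {
        return ackCon == null;
    }

    synchronized int destroyCount() {
        return destroyCount;
    }
}
```

With this shape, an error path that runs after the slot has already been cleared just sees false and moves on.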

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch

[jira] Issue Comment Edited: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795103#action_12795103 ] 

Jonathan Ellis edited comment on CASSANDRA-651 at 12/29/09 6:55 PM:
--------------------------------------------------------------------

relying on FD to notice is not going to work, though, since FD is not instantaneous (and cannot be made so).  [edit: that is, a node could die and come back, or be partitioned and be available again, quickly enough that FD does not notice but old connections are still invalid.] 

can we have it attempt to reconnect when it encounters an error sending instead?

      was (Author: jbellis):
    relying on FD to notice is not going to work, though, since FD is not instantaneous (and cannot be made so).  can we have it attempt to reconnect when it encounters an error sending instead?
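
The "reconnect when it encounters an error sending" idea can be sketched as follows. This is an illustration under stated assumptions, not the patch attached to this issue: Conn, ReconnectingSender, and the connector supplier are hypothetical names, and the single-retry policy is an arbitrary choice made for the example.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical connection abstraction; not a Cassandra interface.
interface Conn {
    void write(String msg) throws IOException;
}

final class ReconnectingSender {
    private final Supplier<Conn> connector; // opens a fresh connection on demand
    private Conn current;                   // cached connection, possibly stale
    private int reconnects;

    ReconnectingSender(Supplier<Conn> connector) {
        this.connector = connector;
    }

    // Sends on the cached connection. If the write fails, the cached connection
    // is discarded, a new one is opened, and the write is retried once.
    synchronized void send(String msg) throws IOException {
        if (current == null) {
            current = connector.get();
        }
        try {
            current.write(msg);
        } catch (IOException e) {
            current = connector.get(); // drop the stale connection
            reconnects++;
            current.write(msg);        // a second failure propagates to the caller
        }
    }

    synchronized int reconnects() {
        return reconnects;
    }
}
```

Because the recovery is driven by the send error itself, it does not depend on the failure detector having noticed the outage, which is the point made about short restarts and partitions.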
  
> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795405#action_12795405 ] 

Gary Dusbabek commented on CASSANDRA-651:
-----------------------------------------

MessagingService.sendOneWay inconveniently swallows errors, so StorageProxy is never wise about them.
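
To make the point about swallowed errors concrete, here is a minimal sketch of a one-way send that reports failures instead of hiding them. OneWaySender, Transport, and the onFailure callback are hypothetical names introduced for illustration; the real MessagingService.sendOneWay signature is not shown in this thread and is not reproduced here.

```java
import java.io.IOException;
import java.util.function.Consumer;

// Hypothetical sketch; not the MessagingService API.
final class OneWaySender {
    // Stand-in for whatever actually writes bytes to an endpoint.
    interface Transport {
        void write(String endpoint, String msg) throws IOException;
    }

    private final Transport transport;
    private final Consumer<String> onFailure; // notified with the failing endpoint

    OneWaySender(Transport transport, Consumer<String> onFailure) {
        this.transport = transport;
        this.onFailure = onFailure;
    }

    // Fire-and-forget send that still surfaces errors: the caller gets a
    // boolean and the listener learns which endpoint failed.
    boolean sendOneWay(String endpoint, String msg) {
        try {
            transport.write(endpoint, msg);
            return true;
        } catch (IOException e) {
            onFailure.accept(endpoint);
            return false;
        }
    }
}
```

A caller in the position of StorageProxy could then use the return value or the callback to react to a dead connection rather than waiting for a timeout.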

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795062#action_12795062 ] 

Jonathan Ellis commented on CASSANDRA-651:
------------------------------------------

Brandon Williams said: "I see what's happening with 651 -- TcpConnectionManager keeps trying to reuse a closed connection and never opens new ones to the recovered node"

sounds like a regression from CASSANDRA-488 to me.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Jaakko Laine
>             Fix For: 0.5

[jira] Assigned: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek reassigned CASSANDRA-651:
---------------------------------------

    Assignee: Gary Dusbabek  (was: Jaakko Laine)

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795390#action_12795390 ] 

Jonathan Ellis commented on CASSANDRA-651:
------------------------------------------

I still think we should not rely on FD, or we will still hit this bug with short-lived partitions (which do occur in the wild).

Something like Brandon's idea of throwing an exception if not connected or awaiting connection would do.

I'm still baffled that write() apparently doesn't throw when the connection dies...  Are we missing something there?

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794945#action_12794945 ] 

Brandon Williams commented on CASSANDRA-651:
--------------------------------------------

I was able to reproduce this in a 4 node setup as well.  The recovered node does not appear to receive any writes after rejoining the cluster, and I receive TimedOutExceptions in the client.   I was able to write directly to the recovered node and things appeared to work, however after a while a different node OOM'd.  I examined the dump in MAT and it shows org.apache.cassandra.net.MessagingService occupies 72.53% of the available heap, followed by java.util.concurrent.LinkedBlockingQueue using 11.87%.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Jaakko Laine
>             Fix For: 0.5

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795593#action_12795593 ] 

Hudson commented on CASSANDRA-651:
----------------------------------

Integrated in Cassandra #309 (See [http://hudson.zones.apache.org/hudson/job/Cassandra/309/])
      TcpConnectionManager was holding on to disconnected connections, giving the false indication they were being used.
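
As an illustration of the problem described in the commit message (a manager that keeps handing out connections that are no longer connected), the following sketch shows a pool that checks liveness before reuse. EvictingPool, PooledConn, and the factory supplier are hypothetical names chosen for this example; it is not the committed change.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical connection abstraction; not a Cassandra interface.
interface PooledConn {
    boolean isOpen();
}

final class EvictingPool {
    private final Deque<PooledConn> idle = new ArrayDeque<>();
    private final Supplier<PooledConn> factory; // opens a new connection

    EvictingPool(Supplier<PooledConn> factory) {
        this.factory = factory;
    }

    synchronized void release(PooledConn conn) {
        idle.addLast(conn);
    }

    // Returns the first idle connection that is still open. Closed connections
    // are dropped as they are encountered; if none survive, a new one is opened.
    synchronized PooledConn acquire() {
        PooledConn conn;
        while ((conn = idle.pollFirst()) != null) {
            if (conn.isOpen()) {
                return conn;
            }
        }
        return factory.get();
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

A liveness check like this only helps when the local side knows the connection is closed; a peer that vanished silently still has to be detected by a failed write.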


> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5, 0.9
>
>         Attachments: 651-v2.patch, 651-v3.patch, 651-v4.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795103#action_12795103 ] 

Jonathan Ellis commented on CASSANDRA-651:
------------------------------------------

relying on FD to notice is not going to work, though, since FD is not instantaneous (and cannot be made so).  can we have it attempt to reconnect when it encounters an error sending instead?

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-651:
-------------------------------------

    Attachment: 651-v4.patch

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch, 651-v4.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795212#action_12795212 ] 

Brandon Williams commented on CASSANDRA-651:
--------------------------------------------

+1, no longer reproducible with this patch.  The recovered node begins receiving writes normally.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795403#action_12795403 ] 

Jonathan Ellis commented on CASSANDRA-651:
------------------------------------------

My mistake: write _was_ throwing, but clearing the old conn out was not working.  Still trying to understand why.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-651:
------------------------------------

    Attachment: 651-v2.patch

The last patch diffed in the wrong direction.  This one is correct.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Ramzi Rabah (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795388#action_12795388 ] 

Ramzi Rabah commented on CASSANDRA-651:
---------------------------------------

+1 from me too, this seems to have fixed the problem. 

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Ramzi Rabah (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794121#action_12794121 ] 

Ramzi Rabah commented on CASSANDRA-651:
---------------------------------------

More info:
I do see that Node X.X.X.X is dead, and Node X.X.X.X has restarted.

This shows up on all 3 of the other servers:
 INFO [Timer-1] 2009-12-22 20:38:43,738 Gossiper.java (line 194) InetAddress /10.6.168.20 is now dead.

Node /10.6.168.20 has restarted, now UP again
 INFO [GMFD:1] 2009-12-22 20:43:12,812 StorageService.java (line 475) Node /10.6.168.20 state jump to normal



> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795310#action_12795310 ] 

Gary Dusbabek commented on CASSANDRA-651:
-----------------------------------------

Jaakko and I had a discussion in which we agreed that it would be better to have MessagingService implement IEndPointStateChangeSubscriber and subscribe to the gossiper rather than just implement IFailureDetector.  This would have provided a way to basically turn connection pools on and off and would allow writes to fail a bit faster.

I implemented that this morning.  It's a few more lines of code and all it really buys us is more descriptive error messages.  We'll just have to put up with errored writes until gossip takes the failed node out of the ring.
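
For readers following along, the subscriber idea discussed above can be sketched roughly as follows. This is only an illustration and not the code from any of the attached patches: EndpointStateListener and GatedMessaging are hypothetical names, and the listener interface is a simplified stand-in that does not match the real IEndPointStateChangeSubscriber signature.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for a gossip state-change listener.
// The real IEndPointStateChangeSubscriber has a different signature.
interface EndpointStateListener {
    void onAlive(String endpoint);
    void onDead(String endpoint);
}

// Hypothetical messaging layer that switches outbound traffic on and off
// per endpoint, driven by gossip notifications.
class GatedMessaging implements EndpointStateListener {
    private final Map<String, Boolean> poolEnabled = new ConcurrentHashMap<>();

    @Override
    public void onAlive(String endpoint) {
        poolEnabled.put(endpoint, Boolean.TRUE);
    }

    @Override
    public void onDead(String endpoint) {
        poolEnabled.put(endpoint, Boolean.FALSE);
    }

    // Endpoints that gossip has not reported as alive are treated as disabled.
    boolean isEnabled(String endpoint) {
        return poolEnabled.getOrDefault(endpoint, Boolean.FALSE);
    }

    // Fail fast with a descriptive error instead of queueing for a dead node.
    void send(String endpoint, String message) {
        if (!isEnabled(endpoint)) {
            throw new IllegalStateException(
                "endpoint " + endpoint + " is marked down; message not sent");
        }
        // A real implementation would hand the message to a connection here.
    }
}
```

In this sketch the only visible gain over a plain failure-detector check is the earlier, more descriptive error, which matches the trade-off described in the comment.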

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-651:
------------------------------------

    Attachment: 651-v3.patch

Updated to include replacing calls to destroy() with shutdown().  Chances are if one TC is crappy, the other is not going to be useful either.

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch, 651-v3.patch

[jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Gary Dusbabek (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Dusbabek updated CASSANDRA-651:
------------------------------------

    Attachment:     (was: 651.patch)

> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Gary Dusbabek
>             Fix For: 0.5
>
>         Attachments: 651-v2.patch

[jira] Commented: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

Posted by "Michael Lee (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795010#action_12795010 ] 

Michael Lee commented on CASSANDRA-651:
---------------------------------------

Confirmed this issue in tests with 8 nodes.

After reading system.log, I found that after one node goes down and comes back up, some other nodes never re-establish a TCP connection to it (on TCP port 7000).
Read requests sent to it (which go into Pending-Writes because the socket channel is closed) are never sent over the network (observed with tcpdump).

The PendingWrites queue may then consume a lot of memory and cause an OOM (not on the recovered node, but on the node that is trying to send requests to the recovered node).

It seems that when the recovered node went down, some other nodes' socket channels were reset, and after it came back, those socket channels remained closed forever.
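
The behavior described above can be pictured with a toy model. This is a sketch under stated assumptions, not Cassandra code: OutboundConnection, peerReset() and reconnect() are hypothetical names chosen for illustration, and real delivery is replaced by a counter.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of one outbound connection to a peer node.
// It is not Cassandra's connection class; delivery is simulated.
class OutboundConnection {
    private boolean open = true;
    private final Queue<String> pendingWrites = new ArrayDeque<>();
    private int delivered = 0;

    // The remote node went down and the channel was reset.
    void peerReset() {
        open = false;
    }

    // Reopen the channel and drain whatever accumulated while it was closed.
    void reconnect() {
        open = true;
        flush();
    }

    // Messages are always queued; they only leave the queue while the
    // channel is open. If nothing ever calls reconnect(), the queue grows
    // without bound, which is the memory risk described above.
    void write(String message) {
        pendingWrites.add(message);
        if (open) {
            flush();
        }
    }

    private void flush() {
        while (!pendingWrites.isEmpty()) {
            pendingWrites.remove();
            delivered++;
        }
    }

    int pendingCount() {
        return pendingWrites.size();
    }

    int deliveredCount() {
        return delivered;
    }
}
```

In this model, writes made after peerReset() stay queued until reconnect() is called, so a sender that never reopens the channel keeps accumulating messages for a node that is already back up.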


> cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>         Environment: latest in 0.5 branch
>            Reporter: Ramzi Rabah
>            Assignee: Jaakko Laine
>             Fix For: 0.5