
Posted to user@cassandra.apache.org by Vram Kouramajian <vr...@gmail.com> on 2011/04/09 20:52:43 UTC

Site Not Surviving a Single Cassandra Node Crash

We have 5 Cassandra nodes with the following configuration:

Cassandra Version: 0.6.11
Number of Nodes: 5
Replication Factor: 3
Client: Hector 0.6.0-14
Write Consistency Level: Quorum
Read Consistency Level: Quorum
Ring Topology:
Address         Status  Load     Owns    Range                                     Ring
                                         132756707369141912386052673276321963528
192.168.89.153  Up      4.15 GB  33.87%  20237398133070283622632741498697119875    |<--|
192.168.89.155  Up      5.17 GB  18.29%  51358066040236348437506517944084891398    |   ^
192.168.89.154  Up      7.41 GB  33.97%  109158969152851862753910401160326064203   v   |
192.168.89.152  Up      5.07 GB  6.34%   119944993359936402983569623214763193674   |   ^
192.168.89.151  Up      4.22 GB  7.53%   132756707369141912386052673276321963528   |-->|

We believe that our setup should survive the crash of one of the
Cassandra nodes. But we had a few crashes, and the system stopped
functioning until we brought the Cassandra nodes back up.

Any clues?

Vram
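
As a rough check of the expectation in the post above: with a replication factor of 3, a quorum is 2 replicas, so each key range should tolerate one unavailable replica. The sketch below assumes replicas are placed on the range owner and the next nodes clockwise around the ring (a simple, rack-unaware placement; the thread does not say which strategy is configured). The function names are illustrative only and are not part of Cassandra or Hector.

```python
REPLICATION_FACTOR = 3

# Nodes in ring order, taken from the ring listing in the post.
RING = [
    "192.168.89.153",
    "192.168.89.155",
    "192.168.89.154",
    "192.168.89.152",
    "192.168.89.151",
]


def quorum(replication_factor):
    """Number of replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1


def replicas_for(index, ring, replication_factor):
    """ASSUMPTION: replicas sit on the range owner and the next nodes clockwise."""
    return [ring[(index + k) % len(ring)] for k in range(replication_factor)]


def quorum_available(down_nodes, ring=RING, replication_factor=REPLICATION_FACTOR):
    """True if every key range still has a quorum of live replicas."""
    needed = quorum(replication_factor)
    for index in range(len(ring)):
        live = [node for node in replicas_for(index, ring, replication_factor)
                if node not in down_nodes]
        if len(live) < needed:
            return False
    return True


if __name__ == "__main__":
    for node in RING:
        status = "quorum OK" if quorum_available({node}) else "quorum lost"
        print(node, "down ->", status)
```

Under these assumptions, any single node failure still leaves a quorum for every range, so a cluster-side outage from one crash would be unexpected.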

Re: Site Not Surviving a Single Cassandra Node Crash

Posted by Roland Gude <ro...@yoochoose.com>.

I'm not sure about that Hector version, but there was a Hector bug where Hector did not stop using a dead node as a proxy and did not load-balance requests properly. If you enable trace logging for Hector, you can see which nodes it uses for requests. If there is a newer 0.6 Hector release, you should give it a try.
Furthermore, I suggest bringing down one node and requesting data with the CLI. If that works, it is probably the Hector bug.

On 10.04.2011 at 06:57, "Patricio Echagüe" <pa...@gmail.com> wrote:

> What is the consistency level you are using ?
>
> And as Ed said, if you can provide the stacktrace that would help too.


Re: Site Not Surviving a Single Cassandra Node Crash

Posted by Patricio Echagüe <pa...@gmail.com>.

What consistency level are you using?

And as Ed said, if you can provide the stack trace, that would help too.

On Sat, Apr 9, 2011 at 7:02 PM, aaron morton <aa...@thelastpickle.com> wrote:

> btw, the nodes are a tad out of balance was that deliberate ?
>
> http://wiki.apache.org/cassandra/Operations#Token_selection
> http://wiki.apache.org/cassandra/Operations#Load_balancing
>
>
> Aaron

Re: Site Not Surviving a Single Cassandra Node Crash

Posted by aaron morton <aa...@thelastpickle.com>.

btw, the nodes are a tad out of balance. Was that deliberate?

http://wiki.apache.org/cassandra/Operations#Token_selection
http://wiki.apache.org/cassandra/Operations#Load_balancing


Aaron
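
To make the imbalance concrete, here is a small sketch that recomputes each node's share of the ring from the tokens in the original post and prints evenly spaced tokens for comparison. It assumes RandomPartitioner, whose token space runs from 0 to 2**127; the thread does not state the partitioner, and the helper names are illustrative only.

```python
RING_SIZE = 2 ** 127  # ASSUMPTION: RandomPartitioner token space

# Tokens copied from the ring listing in the original post.
TOKENS = {
    "192.168.89.153": 20237398133070283622632741498697119875,
    "192.168.89.155": 51358066040236348437506517944084891398,
    "192.168.89.154": 109158969152851862753910401160326064203,
    "192.168.89.152": 119944993359936402983569623214763193674,
    "192.168.89.151": 132756707369141912386052673276321963528,
}


def ownership(tokens):
    """Fraction of the ring each node owns: from the previous token up to its own."""
    ordered = sorted(tokens.items(), key=lambda item: item[1])
    shares = {}
    for i, (node, token) in enumerate(ordered):
        previous = ordered[i - 1][1]  # i == 0 wraps around to the last node
        shares[node] = ((token - previous) % RING_SIZE) / RING_SIZE
    return shares


def balanced_tokens(node_count):
    """Evenly spaced tokens: i * 2**127 / N for i in 0..N-1."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for node, share in ownership(TOKENS).items():
        print(f"{node}  {share * 100:.2f}%")
    print("balanced tokens for 5 nodes:")
    for token in balanced_tokens(5):
        print(token)
```

The computed shares reproduce the "Owns" column from the post; with evenly spaced tokens each of the 5 nodes would own 20%.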

On 10 Apr 2011, at 08:44, Ed Anuff wrote:

> Sounds like the problem might be on the hector side.  Lots of hector
> users on this list, but usually not a bad idea to ask on
> hector-users@googlegroups.com (cc'd).
> 
> The jetty servers stopping responding is a bit vague, somewhere in
> your logs is an error message that should shed some light on where
> things are going awry.  If you can find the exception that's being
> thrown in hector and post that, it'd make it much easier to help you
> out.
> 
> Ed

Re: Site Not Surviving a Single Cassandra Node Crash

Posted by Ed Anuff <ed...@anuff.com>.

Sounds like the problem might be on the Hector side. There are lots of
Hector users on this list, but it's usually not a bad idea to ask on
hector-users@googlegroups.com (cc'd).

"The Jetty servers stop responding" is a bit vague. Somewhere in
your logs is an error message that should shed some light on where
things are going awry. If you can find the exception that's being
thrown in Hector and post it, it'd make it much easier to help you
out.

Ed

On Sat, Apr 9, 2011 at 12:11 PM, Vram Kouramajian
<vr...@gmail.com> wrote:
> The hector clients are used as part of our jetty servers. And, the
> jetty servers stop responding when one of the Cassandra nodes go down.
>
> Vram

Re: Site Not Surviving a Single Cassandra Node Crash

Posted by Vram Kouramajian <vr...@gmail.com>.

The Hector clients run as part of our Jetty servers, and the
Jetty servers stop responding when one of the Cassandra nodes goes down.

Vram

On Sat, Apr 9, 2011 at 11:54 AM, Joe Stump <jo...@joestump.net> wrote:
> Did the Cassandra cluster go down or did you start getting failures from the client when it routed queries to the downed node? The key in the client is to keep working around the ring if the initial node is down.
>
> --Joe

Re: Site Not Surviving a Single Cassandra Node Crash

Posted by Joe Stump <jo...@joestump.net>.

Did the Cassandra cluster go down or did you start getting failures from the client when it routed queries to the downed node? The key in the client is to keep working around the ring if the initial node is down.

--Joe
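
A minimal sketch of the "keep working around the ring" idea, in case it helps when checking the client's behaviour. This is not Hector's API: `execute_with_failover`, `AllHostsFailed`, and the `operation` callable are hypothetical names, and the sketch assumes an unreachable node surfaces as a `ConnectionError`.

```python
class AllHostsFailed(Exception):
    """Raised when no host in the list could serve the request."""


def execute_with_failover(hosts, operation):
    """Try operation(host) on each host in turn and return the first result.

    `operation` is a hypothetical callable standing in for a real client
    call; it is expected to raise ConnectionError if the host is unreachable.
    """
    errors = {}
    for host in hosts:
        try:
            return operation(host)
        except ConnectionError as exc:
            errors[host] = exc  # remember the failure, move on to the next node
    raise AllHostsFailed(f"no host available, tried: {sorted(errors)}")


if __name__ == "__main__":
    down = {"192.168.89.153"}

    def fake_read(host):
        # Stand-in for a real read; fails only for nodes marked as down.
        if host in down:
            raise ConnectionError(f"{host} is down")
        return f"row served via {host}"

    print(execute_with_failover(["192.168.89.153", "192.168.89.155"], fake_read))
```

A client that instead keeps retrying the first (dead) host, or blocks on it without a timeout, would look from the outside like the whole cluster being unavailable.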
