Posted to user@cassandra.apache.org by Brian Tarbox <ta...@cabotresearch.com> on 2014/03/20 20:31:32 UTC

this seems like a flaw in Node Selection

I've noticed that one of my systems is getting hammered...and that more and
more traffic is being sent to the system having trouble.  Looking at
LeastLoadedNodeSelector.java I can see why.

LeastLoadedNodeSelector finds the least-loaded node in the cluster, but its
calculation of load is based on the number of active connections and ignores
failures. That tends to cause more connections to be made to the machine
that failed on a previous attempt.

Here is the code for the compare function that sorts the list of nodes.  It
checks the active count, then the borrowed count, and lastly the corrupted
count.  Corrupted count is the interesting one, but it's almost never reached,
since the borrowed count will almost always differ between the nodes in the
cluster.

        public int compareTo(Candidate candidate) {
            int value = numActive - candidate.numActive;

            if (value == 0)
                value = numBorrowed - candidate.numBorrowed;

            if (value == 0)
                value = numCorrupted - candidate.numCorrupted;

            return value;
        }
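
To make the ordering concrete, here is a minimal, self-contained sketch that
reuses the comparison logic quoted above. The SelectionDemo wrapper, the
constructor, the pick() helper, the node names and all of the counter values
are made up for illustration; only the three counters and the compareTo body
come from the quoted code. It assumes a failing node ends up with few active
connections because its connections keep getting dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical harness; not the real Pelops classes.
class SelectionDemo {
    static final class Candidate implements Comparable<Candidate> {
        final String name;
        final int numActive;
        final int numBorrowed;
        final int numCorrupted;

        Candidate(String name, int numActive, int numBorrowed, int numCorrupted) {
            this.name = name;
            this.numActive = numActive;
            this.numBorrowed = numBorrowed;
            this.numCorrupted = numCorrupted;
        }

        // Same ordering as the quoted compareTo: active, then borrowed,
        // then corrupted as a last tie-breaker.
        @Override
        public int compareTo(Candidate candidate) {
            int value = numActive - candidate.numActive;
            if (value == 0)
                value = numBorrowed - candidate.numBorrowed;
            if (value == 0)
                value = numCorrupted - candidate.numCorrupted;
            return value;
        }
    }

    // Sort a copy of the list and take the first (smallest) entry.
    static Candidate pick(List<Candidate> nodes) {
        List<Candidate> sorted = new ArrayList<>(nodes);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    public static void main(String[] args) {
        List<Candidate> nodes = Arrays.asList(
            new Candidate("healthy-a", 4, 120, 0),
            new Candidate("healthy-b", 4, 118, 0),
            // Assumed: few active connections because they keep failing.
            new Candidate("failing", 0, 40, 25));
        System.out.println(pick(nodes).name);
    }
}
```

With these made-up numbers the node with 25 corrupted connections sorts
first, because numCorrupted is only consulted when both other counters tie.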

I've seen this problem with other companies and products: leastloaded as a
means of picking servers is almost always liable to death spirals when a
server can have a failure.
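
One possible way out of the spiral, sketched under assumptions: rank nodes by
failure count first and only then by load. This is not an existing Pelops
option or API; the FailureAwareSelection class, the Node type, the
FAILURES_FIRST comparator and the sample values are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names exist in Pelops.
class FailureAwareSelection {
    static final class Node {
        final String name;
        final int numActive;
        final int numBorrowed;
        final int numCorrupted;

        Node(String name, int numActive, int numBorrowed, int numCorrupted) {
            this.name = name;
            this.numActive = numActive;
            this.numBorrowed = numBorrowed;
            this.numCorrupted = numCorrupted;
        }
    }

    // Fewest failures wins; load counters only break ties between
    // nodes with the same failure count.
    static final Comparator<Node> FAILURES_FIRST =
        Comparator.comparingInt((Node n) -> n.numCorrupted)
                  .thenComparingInt(n -> n.numActive)
                  .thenComparingInt(n -> n.numBorrowed);

    static Node pick(List<Node> nodes) {
        return Collections.min(nodes, FAILURES_FIRST);
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new Node("healthy-a", 4, 120, 0),
            new Node("healthy-b", 4, 118, 0),
            new Node("failing", 0, 40, 25));
        System.out.println(pick(nodes).name);
    }
}
```

A raw failure counter that never decays would shun a recovered node forever,
so a real selector would probably want to age out or reset old failures; that
part is left out of the sketch.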

Is there any way to configure away from this in C*?

Thanks,

Brian Tarbox

Re: this seems like a flaw in Node Selection

Posted by Robert Coli <rc...@eventbrite.com>.

On Thu, Mar 20, 2014 at 1:17 PM, Brian Tarbox <ta...@cabotresearch.com> wrote:

> Yes, I was going to say (sorry for the brain-freeze) that this is behavior
> in Pelops not in C* itself.
>

*OHHHHH*

I presumed you were talking about the code in Cassandra that does the
analogous thing.. :D

Yes, obviously don't disable the dynamic snitch to deal with client
behavior!

=Rob

Re: this seems like a flaw in Node Selection

Posted by Brian Tarbox <ta...@cabotresearch.com>.

Yes, I was going to say (sorry for the brain-freeze) that this is behavior
in Pelops not in C* itself.


On Thu, Mar 20, 2014 at 4:15 PM, Tyler Hobbs <ty...@datastax.com> wrote:

> Brian,
>
> Are you referring to Pelops?  The code you mentioned doesn't exist in
> Cassandra.
>
>
> On Thu, Mar 20, 2014 at 3:07 PM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Thu, Mar 20, 2014 at 1:03 PM, Brian Tarbox <ta...@cabotresearch.com> wrote:
>>
>>> Does this still apply since we're using 1.2.13?  (should have said that
>>> in the original message)
>>>
>>
>> I checked the cassandra-1.2 branch to verify that the "dynamic_snitch"
>> config file option is still supported there; it is.
>>
>> =Rob
>>
>>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>

Re: this seems like a flaw in Node Selection

Posted by Tyler Hobbs <ty...@datastax.com>.

Brian,

Are you referring to Pelops?  The code you mentioned doesn't exist in
Cassandra.


On Thu, Mar 20, 2014 at 3:07 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Mar 20, 2014 at 1:03 PM, Brian Tarbox <ta...@cabotresearch.com> wrote:
>
>> Does this still apply since we're using 1.2.13?  (should have said that
>> in the original message)
>>
>
> I checked the cassandra-1.2 branch to verify that the "dynamic_snitch"
> config file option is still supported there; it is.
>
> =Rob
>
>



-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: this seems like a flaw in Node Selection

Posted by Robert Coli <rc...@eventbrite.com>.

On Thu, Mar 20, 2014 at 1:03 PM, Brian Tarbox <ta...@cabotresearch.com> wrote:

> Does this still apply since we're using 1.2.13?  (should have said that in
> the original message)
>

I checked the cassandra-1.2 branch to verify that the "dynamic_snitch"
config file option is still supported there; it is.

=Rob

Re: this seems like a flaw in Node Selection

Posted by Brian Tarbox <ta...@cabotresearch.com>.

Does this still apply since we're using 1.2.13?  (I should have said that in
the original message.)

Thank you.


On Thu, Mar 20, 2014 at 3:57 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Thu, Mar 20, 2014 at 12:31 PM, Brian Tarbox <ta...@cabotresearch.com> wrote:
>
>> I've seen this problem with other companies and products: leastloaded as
>> a means of picking servers is almost always liable to death spirals when a
>> server can have a failure.
>>
>> Is there any way to configure away from this in C*?
>>
>
> Disable the dynamic snitch... via the hidden but still valid [1]
> configuration directive "dynamic_snitch" in cassandra.yaml.
>
> "dynamic_snitch: false"
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-3229
>
> Given the various other bad edge cases with the Dynamic Snitch (sending
> requests to the "wrong" DC on reset, etc.) this might be worth considering
> in general... especially with the speculative execution stuff in 2.0...
>
> I do note that https://issues.apache.org/jira/browse/CASSANDRA-6465 has a
> patch in 2.0.5, yay...
>
> =Rob
>

Re: this seems like a flaw in Node Selection

Posted by Robert Coli <rc...@eventbrite.com>.

On Thu, Mar 20, 2014 at 12:31 PM, Brian Tarbox <ta...@cabotresearch.com> wrote:

> I've seen this problem with other companies and products: leastloaded as a
> means of picking servers is almost always liable to death spirals when a
> server can have a failure.
>
> Is there any way to configure away from this in C*?
>

Disable the dynamic snitch... via the hidden but still valid [1]
configuration directive "dynamic_snitch" in cassandra.yaml.

"dynamic_snitch: false"

[1] https://issues.apache.org/jira/browse/CASSANDRA-3229

Given the various other bad edge cases with the Dynamic Snitch (sending
requests to the "wrong" DC on reset, etc.) this might be worth considering
in general... especially with the speculative execution stuff in 2.0...

I do note that https://issues.apache.org/jira/browse/CASSANDRA-6465 has a
patch in 2.0.5, yay...

=Rob