Posted to user@cassandra.apache.org by Daniel Doubleday <da...@gmx.net> on 2010/12/03 11:00:36 UTC

Dont bogart that connection my friend

Hi all

I found an anti-pattern the other day which I wanted to share, although it's a pretty special case.

A special case because our production cluster is somewhat strange: 3 servers, RF = 3. We do consistent reads/writes at quorum.

I did a long-running read series (loads of reads, as fast as I could) over one connection. Since all queries could be handled by that node, the overall latency is determined by its own latency and that of the faster of the other two nodes (because the quorum is satisfied with 2 reads). What happens then is that after a couple of minutes one of the other two nodes goes into 100% I/O wait and drops most of its read messages, leaving it practically dead while the other 2 nodes keep responding at an average of ~10ms. The node that died was only a little slower (~13ms average), but it inevitably queues up messages. Its average response time climbs to the timeout (10 secs) flat. It never recovers.
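
(Back-of-the-envelope with the numbers above, assuming all three replicas see every read, e.g. with read repair on: the two fast nodes satisfy the quorum in ~10ms, so the single connection pushes roughly 100 reads/sec at each replica, and nothing slows the client down on the third replica's behalf because the quorum is already met without it. A node that needs ~13ms per read can only drain about 77 reads/sec, so its queue grows by roughly 20 messages every second, and its effective latency -- backlog times ~13ms -- keeps climbing until it crosses the 10 sec rpc timeout, after which it mostly drops read messages.)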

It happened every time, and it wasn't always the same node that died.

The solution was to return the connection to the pool and get a new one for every read, balancing the load on the client side.

Obviously this will not happen in a cluster where the percentage of all rows held by any one node is small enough. But the same thing will probably happen if you scan by contiguous tokens (meaning that you will read from the same node for a long time).
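
In case it makes the fix more concrete, here is a minimal sketch in Java. ConnectionPool and CassandraConnection are made-up stand-ins for whatever your client library provides (Hector, a homegrown pool, ...); the only point is checking a connection out per read instead of holding one for the whole scan.

// Hypothetical client-side types, just for illustration -- not any real client API.
interface CassandraConnection {
    byte[] get(byte[] key);
}

interface ConnectionPool {
    CassandraConnection borrow();                 // picks a host, e.g. round-robin
    void release(CassandraConnection conn);
}

class ScanExample {
    // Anti-pattern: hold one connection (== one coordinator node) for the whole scan.
    static void scanWithOneConnection(ConnectionPool pool, Iterable<byte[]> keys) {
        CassandraConnection conn = pool.borrow();
        try {
            for (byte[] key : keys)
                conn.get(key);                    // every read is coordinated by the same node
        } finally {
            pool.release(conn);
        }
    }

    // Fix: borrow a fresh connection per read so the coordination work gets
    // spread over the cluster by the pool's host selection.
    static void scanWithRotatingConnections(ConnectionPool pool, Iterable<byte[]> keys) {
        for (byte[] key : keys) {
            CassandraConnection conn = pool.borrow();   // may land on a different host each time
            try {
                conn.get(key);
            } finally {
                pool.release(conn);
            }
        }
    }
}

The per-read borrow also means the scan occasionally has to wait on the slower node directly, which throttles it instead of silently burying that node.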

Cheers,

Daniel Doubleday
smeet.com, Berlin

Re: Dont bogart that connection my friend

Posted by Jonathan Ellis <jb...@gmail.com>.
Ah, got it.  Thanks for clearing that up!

On Sat, Dec 4, 2010 at 11:56 AM, Daniel Doubleday
<da...@gmx.net> wrote:
> Ah, OK. No, that was not the case.
>
> The client which did the long-running scan didn't wait for the slowest node.
> Only other clients that asked the slow node directly were affected.
>
> Sorry about the confusion.
>
>
> On 04.12.10 05:44, Jonathan Ellis wrote:
>>
>> That makes sense, but this shouldn't make requests last for the
>> timeout duration -- at quorum, it should be responding to the client
>> as soon as it gets that second-fastest reply.  If I'm understanding
>> right that this was making the response to the client block until the
>> overwhelmed node timed out, that's a bug.  What version of Cassandra
>> is this?
>>
>> On Fri, Dec 3, 2010 at 7:27 PM, Daniel Doubleday
>> <da...@gmx.net>  wrote:
>>>
>>> Yes.
>>>
>>> I thought that would make sense, no? I guessed that the quorum read
>>> forces the slowest of the 3 nodes to keep the pace of the faster ones.
>>> But it can't, no matter how small the performance difference is. So it
>>> will just fill up.
>>>
>>> Also, when saying 'practically dead' and 'never recovers' I meant for
>>> the time I kept the reads up. As soon as I stopped the scan it
>>> recovered. It just was not able to recover during the load, because
>>> for that it would have to become faster than the other nodes, and with
>>> full queues that just wouldn't happen.
>>>
>>> By changing the node for every read I would hit the slower node every
>>> couple of reads. This forced the client to wait for the slower node.
>>>
>>> I guess to change that behavior you would need to use something like
>>> the dynamic snitch and ask only as many peer nodes as necessary to
>>> satisfy quorum, and only ask other nodes when reads fail. But that
>>> would probably increase latency and cause other problems. Since you
>>> probably don't want to run the cluster at a load at which the weakest
>>> node of a replication group can't keep up, I don't think this is an
>>> issue at all.
>>>
>>> Just wanted to prevent others from shooting themselves in the foot as
>>> I did.
>>>
>>> On 03.12.10 23:36, Jonathan Ellis wrote:
>>>>
>>>> Am I understanding correctly that you had all connections going to one
>>>> cassandra node, which caused one of the *other* nodes to die, and
>>>> spreading the connections around the cluster fixed it?
>>>>
>>>> On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday
>>>> <da...@gmx.net>    wrote:
>>>>>
>>>>> Hi all
>>>>>
>>>>> I found an anti-pattern the other day which I wanted to share,
>>>>> although it's a pretty special case.
>>>>>
>>>>> A special case because our production cluster is somewhat strange: 3
>>>>> servers, RF = 3. We do consistent reads/writes at quorum.
>>>>>
>>>>> I did a long-running read series (loads of reads, as fast as I
>>>>> could) over one connection. Since all queries could be handled by
>>>>> that node, the overall latency is determined by its own latency and
>>>>> that of the faster of the other two nodes (because the quorum is
>>>>> satisfied with 2 reads). What happens then is that after a couple of
>>>>> minutes one of the other two nodes goes into 100% I/O wait and drops
>>>>> most of its read messages, leaving it practically dead while the
>>>>> other 2 nodes keep responding at an average of ~10ms. The node that
>>>>> died was only a little slower (~13ms average), but it inevitably
>>>>> queues up messages. Its average response time climbs to the timeout
>>>>> (10 secs) flat. It never recovers.
>>>>>
>>>>> It happened every time, and it wasn't always the same node that died.
>>>>>
>>>>> The solution was to return the connection to the pool and get a new
>>>>> one for every read, balancing the load on the client side.
>>>>>
>>>>> Obviously this will not happen in a cluster where the percentage of
>>>>> all rows held by any one node is small enough. But the same thing
>>>>> will probably happen if you scan by contiguous tokens (meaning that
>>>>> you will read from the same node for a long time).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Daniel Doubleday
>>>>> smeet.com, Berlin
>>>>
>>>
>>
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Dont bogart that connection my friend

Posted by Daniel Doubleday <da...@gmx.net>.
Ah, OK. No, that was not the case.

The client which did the long-running scan didn't wait for the slowest node.
Only other clients that asked the slow node directly were affected.

Sorry about the confusion.


On 04.12.10 05:44, Jonathan Ellis wrote:
> That makes sense, but this shouldn't make requests last for the
> timeout duration -- at quorum, it should be responding to the client
> as soon as it gets that second-fastest reply.  If I'm understanding
> right that this was making the response to the client block until the
> overwhelmed node timed out, that's a bug.  What version of Cassandra
> is this?
>
> On Fri, Dec 3, 2010 at 7:27 PM, Daniel Doubleday
> <da...@gmx.net>  wrote:
>> Yes.
>>
>> I thought that would make sense, no? I guessed that the quorum read forces
>> the slowest of the 3 nodes to keep the pace of the faster ones. But it
>> can't, no matter how small the performance difference is. So it will just
>> fill up.
>>
>> Also, when saying 'practically dead' and 'never recovers' I meant for the
>> time I kept the reads up. As soon as I stopped the scan it recovered. It
>> just was not able to recover during the load, because for that it would
>> have to become faster than the other nodes, and with full queues that just
>> wouldn't happen.
>>
>> By changing the node for every read I would hit the slower node every
>> couple of reads. This forced the client to wait for the slower node.
>>
>> I guess to change that behavior you would need to use something like the
>> dynamic snitch and ask only as many peer nodes as necessary to satisfy
>> quorum, and only ask other nodes when reads fail. But that would probably
>> increase latency and cause other problems. Since you probably don't want
>> to run the cluster at a load at which the weakest node of a replication
>> group can't keep up, I don't think this is an issue at all.
>>
>> Just wanted to prevent others from shooting themselves in the foot as I did.
>>
>> On 03.12.10 23:36, Jonathan Ellis wrote:
>>> Am I understanding correctly that you had all connections going to one
>>> cassandra node, which caused one of the *other* nodes to die, and
>>> spreading the connections around the cluster fixed it?
>>>
>>> On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday
>>> <da...@gmx.net>    wrote:
>>>> Hi all
>>>>
>>>> I found an anti-pattern the other day which I wanted to share,
>>>> although it's a pretty special case.
>>>>
>>>> A special case because our production cluster is somewhat strange: 3
>>>> servers, RF = 3. We do consistent reads/writes at quorum.
>>>>
>>>> I did a long-running read series (loads of reads, as fast as I could)
>>>> over one connection. Since all queries could be handled by that node,
>>>> the overall latency is determined by its own latency and that of the
>>>> faster of the other two nodes (because the quorum is satisfied with 2
>>>> reads). What happens then is that after a couple of minutes one of
>>>> the other two nodes goes into 100% I/O wait and drops most of its
>>>> read messages, leaving it practically dead while the other 2 nodes
>>>> keep responding at an average of ~10ms. The node that died was only a
>>>> little slower (~13ms average), but it inevitably queues up messages.
>>>> Its average response time climbs to the timeout (10 secs) flat. It
>>>> never recovers.
>>>>
>>>> It happened every time, and it wasn't always the same node that died.
>>>>
>>>> The solution was to return the connection to the pool and get a new
>>>> one for every read, balancing the load on the client side.
>>>>
>>>> Obviously this will not happen in a cluster where the percentage of
>>>> all rows held by any one node is small enough. But the same thing
>>>> will probably happen if you scan by contiguous tokens (meaning that
>>>> you will read from the same node for a long time).
>>>>
>>>> Cheers,
>>>>
>>>> Daniel Doubleday
>>>> smeet.com, Berlin
>>>
>>
>
>


Re: Dont bogart that connection my friend

Posted by Jonathan Ellis <jb...@gmail.com>.
That makes sense, but this shouldn't make requests last for the
timeout duration -- at quorum, it should be responding to the client
as soon as it gets that second-fastest reply.  If I'm understanding
right that this was making the response to the client block until the
overwhelmed node timed out, that's a bug.  What version of Cassandra
is this?

On Fri, Dec 3, 2010 at 7:27 PM, Daniel Doubleday
<da...@gmx.net> wrote:
> Yes.
>
> I thought that would make sense, no? I guessed that the quorum read forces
> the slowest of the 3 nodes to keep the pace of the faster ones. But it
> can't, no matter how small the performance difference is. So it will just
> fill up.
>
> Also, when saying 'practically dead' and 'never recovers' I meant for the
> time I kept the reads up. As soon as I stopped the scan it recovered. It
> just was not able to recover during the load, because for that it would
> have to become faster than the other nodes, and with full queues that just
> wouldn't happen.
>
> By changing the node for every read I would hit the slower node every
> couple of reads. This forced the client to wait for the slower node.
>
> I guess to change that behavior you would need to use something like the
> dynamic snitch and ask only as many peer nodes as necessary to satisfy
> quorum, and only ask other nodes when reads fail. But that would probably
> increase latency and cause other problems. Since you probably don't want
> to run the cluster at a load at which the weakest node of a replication
> group can't keep up, I don't think this is an issue at all.
>
> Just wanted to prevent others from shooting themselves in the foot as I did.
>
> On 03.12.10 23:36, Jonathan Ellis wrote:
>>
>> Am I understanding correctly that you had all connections going to one
>> cassandra node, which caused one of the *other* nodes to die, and
>> spreading the connections around the cluster fixed it?
>>
>> On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday
>> <da...@gmx.net>  wrote:
>>>
>>> Hi all
>>>
>>> I found an anti-pattern the other day which I wanted to share,
>>> although it's a pretty special case.
>>>
>>> A special case because our production cluster is somewhat strange: 3
>>> servers, RF = 3. We do consistent reads/writes at quorum.
>>>
>>> I did a long-running read series (loads of reads, as fast as I could)
>>> over one connection. Since all queries could be handled by that node,
>>> the overall latency is determined by its own latency and that of the
>>> faster of the other two nodes (because the quorum is satisfied with 2
>>> reads). What happens then is that after a couple of minutes one of the
>>> other two nodes goes into 100% I/O wait and drops most of its read
>>> messages, leaving it practically dead while the other 2 nodes keep
>>> responding at an average of ~10ms. The node that died was only a
>>> little slower (~13ms average), but it inevitably queues up messages.
>>> Its average response time climbs to the timeout (10 secs) flat. It
>>> never recovers.
>>>
>>> It happened every time, and it wasn't always the same node that died.
>>>
>>> The solution was to return the connection to the pool and get a new
>>> one for every read, balancing the load on the client side.
>>>
>>> Obviously this will not happen in a cluster where the percentage of
>>> all rows held by any one node is small enough. But the same thing will
>>> probably happen if you scan by contiguous tokens (meaning that you
>>> will read from the same node for a long time).
>>>
>>> Cheers,
>>>
>>> Daniel Doubleday
>>> smeet.com, Berlin
>>
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Dont bogart that connection my friend

Posted by Daniel Doubleday <da...@gmx.net>.
Yes.

I thought that would make sense, no? I guessed that the quorum read
forces the slowest of the 3 nodes to keep the pace of the faster ones.
But it can't, no matter how small the performance difference is. So it
will just fill up.

Also, when saying 'practically dead' and 'never recovers' I meant for
the time I kept the reads up. As soon as I stopped the scan it
recovered. It just was not able to recover during the load, because for
that it would have to become faster than the other nodes, and with full
queues that just wouldn't happen.

By changing the node for every read I would hit the slower node every 
couple of reads. This forced the client to wait for the slower node.

I guess to change that behavior you would need to use something like
the dynamic snitch and ask only as many peer nodes as necessary to
satisfy quorum, and only ask other nodes when reads fail. But that
would probably increase latency and cause other problems. Since you
probably don't want to run the cluster at a load at which the weakest
node of a replication group can't keep up, I don't think this is an
issue at all.
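
To make that concrete (and only that -- this is not how Cassandra's read
path actually works), the selection step could look something like the
sketch below. Replica and recentLatencyMillis() are made up here; the
latency numbers would have to come from something like the dynamic
snitch's tracking.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Replica is a made-up stand-in for a replica endpoint.
interface Replica {
    double recentLatencyMillis();   // e.g. fed by dynamic-snitch style latency stats
}

class ReplicaSelection {
    // Send the read to just enough replicas to satisfy the consistency level,
    // preferring the ones that have been answering fastest. The remaining
    // replicas would only be contacted if one of the chosen ones times out.
    static List<Replica> pickReplicas(List<Replica> all, int blockFor) {
        List<Replica> sorted = new ArrayList<Replica>(all);
        sorted.sort(Comparator.comparingDouble(Replica::recentLatencyMillis));
        return sorted.subList(0, blockFor);     // e.g. 2 of 3 for quorum at RF = 3
    }
}

The trade-off is what I said above: a timeout on one of the chosen
replicas now costs extra latency instead of being absorbed by the spare
read.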

Just wanted to prevent others from shooting themselves in the foot as I did.

On 03.12.10 23:36, Jonathan Ellis wrote:
> Am I understanding correctly that you had all connections going to one
> cassandra node, which caused one of the *other* nodes to die, and
> spreading the connections around the cluster fixed it?
>
> On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday
> <da...@gmx.net>  wrote:
>> Hi all
>>
>> I found an anti-pattern the other day which I wanted to share, although it's a pretty special case.
>>
>> A special case because our production cluster is somewhat strange: 3 servers, RF = 3. We do consistent reads/writes at quorum.
>>
>> I did a long-running read series (loads of reads, as fast as I could) over one connection. Since all queries could be handled by that node, the overall latency is determined by its own latency and that of the faster of the other two nodes (because the quorum is satisfied with 2 reads). What happens then is that after a couple of minutes one of the other two nodes goes into 100% I/O wait and drops most of its read messages, leaving it practically dead while the other 2 nodes keep responding at an average of ~10ms. The node that died was only a little slower (~13ms average), but it inevitably queues up messages. Its average response time climbs to the timeout (10 secs) flat. It never recovers.
>>
>> It happened every time, and it wasn't always the same node that died.
>>
>> The solution was to return the connection to the pool and get a new one for every read, balancing the load on the client side.
>>
>> Obviously this will not happen in a cluster where the percentage of all rows held by any one node is small enough. But the same thing will probably happen if you scan by contiguous tokens (meaning that you will read from the same node for a long time).
>>
>> Cheers,
>>
>> Daniel Doubleday
>> smeet.com, Berlin
>
>


Re: Dont bogart that connection my friend

Posted by Jonathan Ellis <jb...@gmail.com>.
Am I understanding correctly that you had all connections going to one
cassandra node, which caused one of the *other* nodes to die, and
spreading the connections around the cluster fixed it?

On Fri, Dec 3, 2010 at 4:00 AM, Daniel Doubleday
<da...@gmx.net> wrote:
> Hi all
>
> I found an anti-pattern the other day which I wanted to share, although it's a pretty special case.
>
> A special case because our production cluster is somewhat strange: 3 servers, RF = 3. We do consistent reads/writes at quorum.
>
> I did a long-running read series (loads of reads, as fast as I could) over one connection. Since all queries could be handled by that node, the overall latency is determined by its own latency and that of the faster of the other two nodes (because the quorum is satisfied with 2 reads). What happens then is that after a couple of minutes one of the other two nodes goes into 100% I/O wait and drops most of its read messages, leaving it practically dead while the other 2 nodes keep responding at an average of ~10ms. The node that died was only a little slower (~13ms average), but it inevitably queues up messages. Its average response time climbs to the timeout (10 secs) flat. It never recovers.
>
> It happened every time, and it wasn't always the same node that died.
>
> The solution was to return the connection to the pool and get a new one for every read, balancing the load on the client side.
>
> Obviously this will not happen in a cluster where the percentage of all rows held by any one node is small enough. But the same thing will probably happen if you scan by contiguous tokens (meaning that you will read from the same node for a long time).
>
> Cheers,
>
> Daniel Doubleday
> smeet.com, Berlin



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com