Posted to user@cassandra.apache.org by Carlos Alonso <in...@mrcalonso.com> on 2016/03/01 19:30:37 UTC

Re: Consistent read timeouts for bursts of reads

We have occasionally had similar issues.

Usually the problem was that the failing queries were reading the same
partition as another query that was still running, and that partition was
too big.

Reading the same partition is why your query works on retry. The size of
the partition (or of the retrieved range) is why the nodes get overloaded
and end up dropping the read requests.

If you see GC pressure, that would support my hypothesis too.
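
One way to test this from the client side is to make sure that only one read
per partition is in flight at any time, and let the other callers share its
result. Below is a minimal Python sketch of that idea, not a tested fix. It
assumes a hypothetical `fetch(partition_key)` callable that stands in for
whatever runs your CQL query; `PartitionReadCoalescer` and `read_burst` are
names made up for illustration and are not part of Cassandra or the Datastax
driver.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class PartitionReadCoalescer:
    """Share one in-flight read per partition key among concurrent callers."""

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable: partition_key -> rows.
        self._fetch = fetch
        self._lock = threading.Lock()
        self._in_flight = {}  # partition_key -> Future of the running read

    def read(self, partition_key):
        with self._lock:
            future = self._in_flight.get(partition_key)
            owner = future is None
            if owner:
                # First caller for this partition: register a new Future.
                future = Future()
                self._in_flight[partition_key] = future
        if owner:
            try:
                future.set_result(self._fetch(partition_key))
            except Exception as exc:
                # Waiting callers see the same error as the owner.
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[partition_key]
        # Owner and waiters all return (or raise) the shared outcome.
        return future.result()


def read_burst(coalescer, partition_keys, max_workers=15):
    """Fire all reads in parallel, similar to the burst described below."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(coalescer.read, partition_keys))
```

If the timeouts disappear once reads of the same partition are no longer
issued concurrently, that would point at the partition size rather than at
the driver or the network. Results are not cached: once a read finishes, the
next request for that partition goes to the cluster again.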

Hope this helps.

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 25 February 2016 at 16:34, Emīls Šolmanis <em...@gmail.com>
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <em...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more
>> than ~ 15 queries at the same time, one or two get a read timeout and
>> then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>>  * this is not db-wide. The same pattern against another table works
>> fine.
>>  * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>>  * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>>  * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>  * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>  * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils
>>
>>