Posted to user@cassandra.apache.org by Brian Jeltema <br...@digitalenvoy.net> on 2013/06/24 18:10:13 UTC

Hadoop/Cassandra 1.2 timeouts

I'm having problems with Hadoop job failures on a Cassandra 1.2 cluster due to 

    Caused by: TimedOutException()
    2013-06-24 11:29:11,953  INFO  Driver  - 	at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12932)

This is running on a 6-node cluster, RF=3. If I run the job with CL=ONE, it usually runs pretty well, with an occasional timeout. But
if I run at CL=QUORUM, the number of timeouts is often enough to kill the job. The table being read is effectively read-only when this job runs.
It has 5 to 10 million rows, each with no more than 256 columns, and each column typically holds at most a few hundred bytes of data.
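For reference, this is roughly how the job is configured, using the stock ConfigHelper/ColumnFamilyInputFormat API from the 1.2 Hadoop integration as I understand it (the address, keyspace, and table names below are placeholders, not my real ones):

    import java.nio.ByteBuffer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;

    Job job = new Job(new Configuration(), "my-scan-job");
    Configuration conf = job.getConfiguration();
    ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");   // any node in the cluster
    ConfigHelper.setInputRpcPort(conf, "9160");
    ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner"); // match the cluster's partitioner
    ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");
    // Slice over all columns, capped at the 256 columns a row can have
    SlicePredicate predicate = new SlicePredicate().setSlice_range(
        new SliceRange(ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 256));
    ConfigHelper.setInputSlicePredicate(conf, predicate);
    // The CL used for the underlying get_range_slices calls;
    // at RF=3, QUORUM must hear from 2 replicas before the timeout fires
    ConfigHelper.setReadConsistencyLevel(conf, "ONE");       // or "QUORUM"
    job.setInputFormatClass(ColumnFamilyInputFormat.class);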

I've fiddled with the batch range size and with increasing the timeout, without much luck. I see some evidence of GC activity in the Cassandra logs, but
it's hard to see a clear correlation with the timeouts.
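For reference, this is roughly what that fiddling looks like. The batch size is a per-job setting; the timeouts moved server-side into cassandra.yaml in 1.2 (read_request_timeout_in_ms, and range_request_timeout_in_ms for these range scans), have to be raised on every node, and take effect on restart. The values below are illustrative, not recommendations:

    // Rows fetched per get_range_slices call; the default is 4096.
    // Smaller batches keep each Thrift call under the timeout,
    // at the cost of more round trips per split.
    ConfigHelper.setRangeBatchSize(conf, 512);
    // Total rows per input split (default 65536); controls how many
    // mappers the job gets, independent of the batch size above.
    ConfigHelper.setInputSplitSize(conf, 65536);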

I could use some suggestions on an approach to pin down the root cause.

TIA

Brian

Re: Hadoop/Cassandra 1.2 timeouts

Posted by aaron morton <aa...@thelastpickle.com>.
It's an inter-node timeout waiting for the read to complete. It normally means the cluster is overloaded in some fashion; check for GC activity and/or saturated disk IOPS.

If you reduce the batch_size, it should help.
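The usual places to look in 1.2: dropped READ / RANGE_SLICE counts in nodetool tpstats, read latencies from nodetool cfhistograms, GCInspector lines in system.log for long ParNew/CMS pauses, and iostat on the data volumes. If the dropped messages and the GC pauses line up with the job's timeouts, you have your culprit.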

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
