Posted to user@cassandra.apache.org by Patrik Modesto <pa...@gmail.com> on 2011/11/24 08:45:15 UTC

timeout while doing repair

Hi,

I have a test cluster of 4 nodes running Debian and Cassandra 0.8.7.
There are 3 keyspaces, all with RF=3, and a node has a load of around 40GB.
When I run "nodetool repair", after a while all Thrift clients that
read with CL.QUORUM get TimeoutException, and even some that use just
CL.ONE. I've tried running repair on just one keyspace while reading from
another keyspace, but I still get the TimeoutException.

I tried to tune compaction_throughput_mb_per_sec and
concurrent_compactors but without success. The same problem is
happening on our production cluster of 8 nodes (same setup).

Where might the problem be?

Regards,
Patrik
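
For reference, the RF=3 and CL.QUORUM figures in the message above imply a quorum of 2 replicas, since a quorum is floor(RF / 2) + 1. A minimal Python sketch of that arithmetic follows; the function name quorum is only illustrative and is not part of any Cassandra client API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


# With RF=3, a QUORUM read needs 2 of the 3 replicas to answer in time,
# so it can tolerate only one slow or busy replica.
rf = 3
needed = quorum(rf)
tolerated = rf - needed
print(f"RF={rf}: quorum={needed}, slow replicas tolerated={tolerated}")
```

This may help explain why QUORUM reads time out first: if repair keeps more than one replica of a row busy, a quorum cannot be assembled within the timeout.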

Re: timeout while doing repair

Posted by Jahangir Mohammed <md...@gmail.com>.

nodetool tpstats will give you a snapshot of the thread pools. Look at
ROW-READ-STAGE and check its pending and active counts. If there are many
pending, it means that the cluster is not able to keep up with the incoming
read requests.

Thanks,
Jahangir Mohammed.
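
As a rough illustration of the check described above, the following Python sketch scans tpstats-style text and reports thread pools with a large pending count. The sample text, the threshold of 100, and the function name backed_up_pools are assumptions made for illustration; the actual column layout and pool names depend on the Cassandra version.

```python
import re

# Hypothetical sample shaped like `nodetool tpstats` output. The numbers are
# made up, and real pool names and columns vary between Cassandra versions.
SAMPLE = """\
Pool Name                    Active   Pending      Completed
ROW-READ-STAGE                   32      1543        9876543
ROW-MUTATION-STAGE                0         0        1234567
"""

# One pool per line: a name without spaces followed by three integer columns.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def backed_up_pools(tpstats_text, pending_threshold=100):
    """Return {pool_name: (active, pending)} for pools above the threshold."""
    pools = {}
    for line in tpstats_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or any line that is not a pool row
        name, active, pending, _completed = match.groups()
        if int(pending) > pending_threshold:
            pools[name] = (int(active), int(pending))
    return pools


print(backed_up_pools(SAMPLE))
```

To use it against a live node, the text could be captured from `nodetool -h <host> tpstats` while repair is running and compared with a capture taken when the cluster is idle.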

On Thu, Nov 24, 2011 at 2:14 PM, Patrik Modesto <pa...@gmail.com> wrote:

> We have our own servers: 16-core CPU, 32GB RAM, 8 1TB disks.
>
> I didn't check tpstats, just iotop, where Cassandra used all the I/O
> capacity when compacting/repairing.
>
> I had to completely clean the test cluster, but I'll check tpstats in
> production. What should I look for?
>
> Regards,
> Patrik

Re: timeout while doing repair

Posted by Patrik Modesto <pa...@gmail.com>.

We have our own servers: 16-core CPU, 32GB RAM, 8 1TB disks.

I didn't check tpstats, just iotop, where Cassandra used all the I/O capacity
when compacting/repairing.

I had to completely clean the test cluster, but I'll check tpstats in
production. What should I look for?

Regards,
Patrik
On 24.11.2011 19:13, "Jahangir Mohammed" <md...@gmail.com> wrote:

> As far as I know, the timeout is caused by the increased load on the node
> due to repair.
>
> Hardware? EC2?
>
> Did you check tpstats?

Re: timeout while doing repair

Posted by Jahangir Mohammed <md...@gmail.com>.

As far as I know, the timeout is caused by the increased load on the node
due to repair.

Hardware? EC2?

Did you check tpstats?

On Thu, Nov 24, 2011 at 11:42 AM, Patrik Modesto
<pa...@gmail.com> wrote:

> Thanks for the reply. I know I can configure a longer timeout, but in our
> use case a reply that takes longer than 1 second is unacceptable.
>
> What I don't understand is why I get timeouts while reading from a different
> keyspace than the one the repair is working on. I get timeouts even during
> compaction.
>
> Besides the usual access, we do lots of reads and writes using Hadoop
> MapReduce jobs, so we need to compact/repair quite often.
>
> Regards
> Patrik

Re: timeout while doing repair

Posted by Patrik Modesto <pa...@gmail.com>.

Thanks for the reply. I know I can configure a longer timeout, but in our
use case a reply that takes longer than 1 second is unacceptable.

What I don't understand is why I get timeouts while reading from a different
keyspace than the one the repair is working on. I get timeouts even during
compaction.

Besides the usual access, we do lots of reads and writes using Hadoop
MapReduce jobs, so we need to compact/repair quite often.

Regards
Patrik
On 24.11.2011 15:00, "Jahangir Mohammed" <md...@gmail.com> wrote:

> Are you using a client that imposes this timeout?
>
> If you don't specify a timeout on the client side, look at
> rpc_timeout_in_ms. Increase it and see whether the problem persists.
>
> Repair is a costly process.
>
> Thanks,
> Jahangir Mohammed.

Re: timeout while doing repair

Posted by Jahangir Mohammed <md...@gmail.com>.

Are you using a client that imposes this timeout?

If you don't specify a timeout on the client side, look at rpc_timeout_in_ms.
Increase it and see whether the problem persists.

Repair is a costly process.

Thanks,
Jahangir Mohammed.

On Thu, Nov 24, 2011 at 2:45 AM, Patrik Modesto <pa...@gmail.com> wrote:

> Hi,
>
> I have a test cluster of 4 nodes running Debian and Cassandra 0.8.7.
> There are 3 keyspaces, all with RF=3, and a node has a load of around 40GB.
> When I run "nodetool repair", after a while all Thrift clients that
> read with CL.QUORUM get TimeoutException, and even some that use just
> CL.ONE. I've tried running repair on just one keyspace while reading from
> another keyspace, but I still get the TimeoutException.
>
> I tried to tune compaction_throughput_mb_per_sec and
> concurrent_compactors but without success. The same problem is
> happening on our production cluster of 8 nodes (same setup).
>
> Where might the problem be?
>
> Regards,
> Patrik
>