
Posted to user@cassandra.apache.org by Yan Chunlu <sp...@gmail.com> on 2011/07/20 10:44:09 UTC

node repair eat up all disk io and slow down entire cluster(3 nodes)

When I started using Cassandra, I had no idea that I should run "nodetool
repair" regularly. So I have 3 nodes with RF=3 and have not run repair for
months; the data size is 20G.

The problem is that when I start running repair now, it eats up all the disk
I/O, and the server load climbs to 20+ and keeps increasing. Worst of all,
the entire cluster slows down and cannot handle requests, so I have to stop
it immediately because it makes my web service unavailable.

The server has an Intel Xeon-Lynnfield 3470 quad-core [2.93GHz] CPU and 8G
of memory, with a Western Digital WD RE3 WD1002FBYS SATA disk.

I really have no idea what to do now, as I have already found some data
loss. Any suggestions would be appreciated.

Re: node repair eat up all disk io and slow down entire cluster(3 nodes)

Posted by Yan Chunlu <sp...@gmail.com>.

Thank you very much for the help. I will try adjusting minor compaction and
also repairing a single CF at a time.

-- 
闫春路

Re: node repair eat up all disk io and slow down entire cluster(3 nodes)

Posted by Yan Chunlu <sp...@gmail.com>.

compactionstats shows SSTable rebuilding, so it might be the problem
described in CASSANDRA-2280.

-- 
闫春路

Re: node repair eat up all disk io and slow down entire cluster(3 nodes)

Posted by aaron morton <aa...@thelastpickle.com>.

What are you seeing in compactionstats?

You may be seeing some of https://issues.apache.org/jira/browse/CASSANDRA-2280

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com


Re: node repair eat up all disk io and slow down entire cluster(3 nodes)

Posted by Yan Chunlu <sp...@gmail.com>.

After trying nodetool -h reagon repair key cf, I found that even repairing a
single CF involves rebuilding all SSTables (according to nodetool
compactionstats). Is that normal?

-- 
闫春路

Re: node repair eat up all disk io and slow down entire cluster(3 nodes)

Posted by Aaron Morton <aa...@thelastpickle.com>.

If you have never run repair, also check the section on repair on this page,
http://wiki.apache.org/cassandra/Operations, for how frequently it should be run.

There is an issue where repair can stream too much data, and this can lead to excessive disk use.

My non-scientific approach to the never-run-repair-before problem is to repair a single CF at a time, starting with the small ones, which are less likely to have differences and will stream the smallest amount of data.
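
As a rough illustration of the one-CF-at-a-time idea, here is a minimal Python sketch that orders column families by size and builds one repair command for each. The command form (nodetool -h <host> repair <keyspace> <cf>) is the one used elsewhere in this thread; the function names, the column family names, and the size figures are made up for the example, and the sizes are assumed to come from wherever you normally read them (cfstats, for instance).

```python
import subprocess


def repair_commands(host, keyspace, cf_sizes):
    """Build one `nodetool repair` command per column family, smallest first.

    cf_sizes maps a column family name to its approximate size in bytes.
    The mapping is a hypothetical input; how it is collected is up to you.
    """
    ordered = sorted(cf_sizes, key=cf_sizes.get)
    return [["nodetool", "-h", host, "repair", keyspace, cf] for cf in ordered]


def run_repairs(commands, dry_run=True):
    """Print each command and, unless dry_run is set, run it and wait.

    check=True stops the sequence at the first repair that fails.
    """
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Illustrative names and sizes only; they are not from the thread.
    sizes = {
        "events": 15 * 1024**3,
        "users": 2 * 1024**3,
        "sessions": 200 * 1024**2,
    }
    run_repairs(repair_commands("reagon", "key", sizes))
```

With dry_run left at its default the script only prints the commands, which makes it easy to check the order before letting it touch the cluster.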

If you really want to conserve disk IO during the repair, consider disabling minor compaction by setting the min and max thresholds to 0 via nodetool.
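
A hedged sketch of what that could look like, again in Python. It assumes the nodetool subcommand is setcompactionthreshold and that it takes keyspace, column family, min, and max in that order; the argument list has differed between Cassandra versions, so check the nodetool help output for your release before relying on it. The restore values of 4 and 32 are the usual defaults, and the helper names are invented for the example.

```python
def threshold_command(host, keyspace, cf, min_threshold, max_threshold):
    """Build a `nodetool setcompactionthreshold` command line.

    Argument order (keyspace, cf, min, max) is an assumption; verify it
    against the nodetool help output of the version you run.
    """
    return [
        "nodetool", "-h", host, "setcompactionthreshold",
        keyspace, cf, str(min_threshold), str(max_threshold),
    ]


def disable_minor_compaction(host, keyspace, cf):
    """Min and max thresholds of 0 turn minor compaction off for this CF."""
    return threshold_command(host, keyspace, cf, 0, 0)


def restore_minor_compaction(host, keyspace, cf, min_threshold=4, max_threshold=32):
    """Put the thresholds back (4 and 32 are the usual defaults)."""
    return threshold_command(host, keyspace, cf, min_threshold, max_threshold)


if __name__ == "__main__":
    # Host, keyspace and CF names reuse the example command from this thread.
    print(" ".join(disable_minor_compaction("reagon", "key", "cf")))
    print(" ".join(restore_minor_compaction("reagon", "key", "cf")))
```

The sketch only builds and prints the commands. Remember to restore the thresholds once the repair finishes, otherwise SSTables will keep accumulating.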

hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com


Re: node repair eat up all disk io and slow down entire cluster(3 nodes)

Posted by Yan Chunlu <sp...@gmail.com>.

Just found this:
https://issues.apache.org/jira/browse/CASSANDRA-2156

But it seems to be available only in 0.8, and someone submitted a patch for
0.6. I am using 0.7.4; do I need to dig into the code and make my own patch?

Would adding compaction throttling solve the I/O problem? Thanks!

-- 
闫春路