Posted to user@cassandra.apache.org by Stefano Ortolani <os...@gmail.com> on 2016/08/09 16:32:20 UTC

Incremental repairs leading to unrepaired data

Hi all,

I am running incremental repairs on a weekly basis (I can't do it every day
as one single run takes 36 hours), and every time at least one node ends up
dropping mutations as part of the process (almost always during the
anticompaction phase). Ironically, this leads to a system where repairing
some data makes it consistent at the cost of making other data inconsistent.

Does anybody know why this is happening?

My feeling is that this might be caused by anticompacting column families
with really wide rows and many SSTables. If that is the case, is there any
way I can throttle that?
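
For reference, the per-node drop counters can be checked with nodetool
tpstats (the hostname is just a placeholder):

    # the "Dropped" section at the bottom lists counts per message type (MUTATION, READ, ...)
    nodetool -h localhost tpstats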

Thanks!
Stefano

Re: Incremental repairs leading to unrepaired data

Posted by kurt Greaves <ku...@instaclustr.com>.
Can't say I have too many ideas. If load is low during the repair it
shouldn't be happening. Your disks aren't overutilised, correct? No other
processes writing loads of data to them?
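
Something like the following, sampled while the repair is running, should
tell you (this assumes the sysstat package is installed; watch %util and
await on the data disks):

    # extended device statistics every 5 seconds
    iostat -x 5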

Re: Incremental repairs leading to unrepaired data

Posted by Stefano Ortolani <os...@gmail.com>.
That is not happening anymore since I am repairing a keyspace with
much less data (the other one is still there in write-only mode).
The command I am using is the most boring one (I even shed the -pr option
so as to keep anticompactions to a minimum): nodetool -h localhost repair
<keyspace>
It is executed sequentially on each node (no overlapping; the next node
waits for the previous one to complete).
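
Roughly equivalent to this sketch (host names and keyspace are placeholders):

    # one node at a time; each repair completes before the next one starts
    KEYSPACE="<keyspace>"
    for host in node1 node2 node3; do
        nodetool -h "$host" repair "$KEYSPACE"
    done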

Regards,
Stefano Ortolani

Re: Incremental repairs leading to unrepaired data

Posted by kurt Greaves <ku...@instaclustr.com>.
Blowing out to 1k SSTables seems a bit full on. What args are you passing
to repair?
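
If you want to keep an eye on the SSTable count while the anticompaction
runs, something like this works (the keyspace name is a placeholder):

    # per-table SSTable counts, refreshed every minute
    watch -n 60 'nodetool cfstats <keyspace> | grep "SSTable count"'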

Kurt Greaves
kurt@instaclustr.com
www.instaclustr.com

Re: Incremental repairs leading to unrepaired data

Posted by Stefano Ortolani <os...@gmail.com>.
I've collected some more data points, and I still see dropped
mutations with compaction_throughput_mb_per_sec set to 8.
The only notable thing about the current setup is that I have
another keyspace (not being repaired, though) with really wide rows
(100MB per partition), but that shouldn't have any impact in theory.
The nodes do not seem that overloaded either, and I don't see any GC
spikes while those mutations are dropped :/

I'm hitting a dead end here; any idea where to look next?
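
GC pauses (or their absence) can be cross-checked against the GCInspector
entries in system.log; the path below is the usual package-install default,
so adjust as needed:

    # long GC pauses are logged by GCInspector; correlate timestamps with the dropped-mutation messages
    grep GCInspector /var/log/cassandra/system.log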

Regards,
Stefano

Re: Incremental repairs leading to unrepaired data

Posted by Stefano Ortolani <os...@gmail.com>.
That's what I was thinking. Maybe GC pressure?
Some more details: during anticompaction I have some CFs exploding to 1K
SSTables (back to ~200 upon completion).
HW specs should be quite good (12 cores / 32 GB RAM) but, I admit, we are
still relying on spinning disks, with ~150GB per node.
Current version is 3.0.8.
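
The anticompaction backlog is easy to eyeball while this is happening, e.g.:

    # pending and active compactions; anticompaction tasks show up here alongside normal compactions
    nodetool compactionstats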


Re: Incremental repairs leading to unrepaired data

Posted by Paulo Motta <pa...@gmail.com>.
That's pretty low already, but perhaps you should lower it further to see if
it will improve the dropped mutations during anti-compaction (even if it
increases repair time); otherwise the problem might be somewhere else.
Generally, dropped mutations are a signal of cluster overload, so if there's
nothing else wrong perhaps you need to increase your capacity. What version
are you on?
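
You can confirm the value currently in effect before and after changing it:

    # prints the live compaction throughput throttle in MB/s
    nodetool getcompactionthroughput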

Re: Incremental repairs leading to unrepaired data

Posted by Stefano Ortolani <os...@gmail.com>.
Not yet. Right now I have it set at 16.
Would halving it more or less double the repair time?

Re: Incremental repairs leading to unrepaired data

Posted by Paulo Motta <pa...@gmail.com>.
Anticompaction throttling can be done by setting the usual
compaction_throughput_mb_per_sec knob in cassandra.yaml or via nodetool
setcompactionthroughput. Did you try lowering that and checking whether it
improves the dropped mutations?
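
That is, something along these lines (the value is just an example):

    # cassandra.yaml (picked up on restart):
    #   compaction_throughput_mb_per_sec: 8

    # or at runtime, no restart needed (reverts to the yaml value on restart):
    nodetool setcompactionthroughput 8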
