Posted to user@cassandra.apache.org by Ezra Stuetzel <ez...@riskiq.net> on 2016/08/17 18:39:51 UTC

large number of pending compactions, sstables steadily increasing

I have one node in my 2.2.7 cluster (just upgraded from 2.2.6 hoping to fix
the issue) which seems to be stuck in a weird state -- with a large number of
pending compactions and sstables. The node is compacting about 500 GB/day, and
the number of pending compactions is going up by about 50/day. It is at about
2300 pending compactions now. I have tried increasing the number of compaction
threads and the compaction throughput, which doesn't seem to help eliminate
the many pending compactions.
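
For reference, the pending-compaction count and the per-table numbers quoted
further down come from plain nodetool; a minimal way to watch them on the
affected node (assuming the mykeyspace/mytable names used below) is:

    # Pending compactions and the compaction tasks currently running
    nodetool compactionstats

    # Per-table sstable count and per-level layout for LCS
    nodetool tablestats mykeyspace.mytable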

I have tried running 'nodetool cleanup' and 'nodetool compact'. The latter
has fixed the issue in the past, but most recently I was getting OOM
errors, probably due to the large number of sstables. I upgraded to 2.2.7
and am no longer getting OOM errors, but it also does not resolve the
issue. I do see this message in the logs:

INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
> CompactionManager.java:610 - Cannot perform a full major compaction as
> repaired and unrepaired sstables cannot be compacted together. These two
> set of sstables will be compacted separately.
>
Below are the 'nodetool tablestats' comparing a normal and the problematic
node. You can see the problematic node has many, many more sstables, and they
are all in level 1. What is the best way to fix this? Can I just delete
those sstables somehow and then run a repair?

Normal node

keyspace: mykeyspace

    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 31905656
    Write Latency: 0.051713177939359714 ms.
    Pending Flushes: 0
        Table: mytable
        SSTable count: 1908
        SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306, 0, 0, 0, 0]
        Space used (live): 301894591442
        Space used (total): 301894591442

Problematic node

Keyspace: mykeyspace

    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 30520190
    Write Latency: 0.05171286705620116 ms.
    Pending Flushes: 0
        Table: mytable
        SSTable count: 14105
        SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0, 0, 0]
        Space used (live): 561143255289
        Space used (total): 561143255289

Thanks,

Ezra

Re: large number of pending compactions, sstables steadily increasing

Posted by Ezra Stuetzel <ez...@riskiq.net>.
Yes, leveled compaction strategy.

Concurrent compactors were 2; I changed that to 8 recently, with no change. At
the same time I changed the compaction throughput from 64 to 384 MB/s. The
number of pending compactions was still increasing after the change. Other
nodes are handling the same throughput with the previous compaction settings.
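
For anyone following along, those two knobs are the compaction throughput cap
(changeable live via nodetool) and concurrent_compactors (which lives in
cassandra.yaml); a rough sketch of the changes described above:

    # Raise the live compaction throughput cap from 64 to 384 MB/s
    nodetool setcompactionthroughput 384
    nodetool getcompactionthroughput    # verify the new value

    # In cassandra.yaml (typically applied on restart):
    # concurrent_compactors: 8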

We are using c4.2xlarge instances in EC2: 8 vCPUs, SSDs, 15 GB memory.

No errors or exceptions in logs. Some possibly relevant log entries I
noticed:

INFO  [CompactionExecutor:16] 2016-08-17 19:15:04,711
> CompactionManager.java:654 - Will not compact
> /export/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/lb-961-big:
> it is not an active sstable
>
> INFO  [CompactionExecutor:16] 2016-08-17 19:15:04,711
> CompactionManager.java:654 - Will not compact
> /export/cassandra/data/system/batchlog-0290003c977e397cac3efdfdc01d626b/lb-960-big:
> it is not an active sstable
>
> INFO  [CompactionExecutor:16] 2016-08-17 19:15:04,711
> CompactionManager.java:664 - No files to compact for user defined compaction
>
WARN  [CompactionExecutor:3] 2016-08-16 19:52:07,134
> BigTableWriter.java:184 - Writing large partition
> system/hints:3b4f02ef-ac1f-4bea-9d0c-1048564b749d (150461319 bytes)

WARN  [CompactionExecutor:3] 2016-08-16 19:52:09,501
> BigTableWriter.java:184 - Writing large partition
> system/hints:3b4f02ef-ac1f-4bea-9d0c-1048564b749d (149619989 bytes)

WARN  [epollEventLoopGroup-2-2] 2016-08-16 19:52:12,911 Frame.java:203 -
> Detected connection using native protocol version 2. Both version 1 and 2
> of the native protocol are now deprecated and support will be removed in
> Cassandra 3.0. You are encouraged to upgrade to a client driver using
> version 3 of the native protocol

WARN  [GossipTasks:1] 2016-08-16 20:51:45,643 FailureDetector.java:287 -
> Not marking nodes down due to local pause of 131385662140 > 5000000000

WARN  [CompactionExecutor:5] 2016-08-17 01:50:05,200
> MajorLeveledCompactionWriter.java:63 - Many sstables involved in
> compaction, skipping storing ancestor information to avoid running out of
> memory

WARN  [CompactionExecutor:4] 2016-08-17 01:50:48,684
> MajorLeveledCompactionWriter.java:63 - Many sstables involved in
> compaction, skipping storing ancestor information to avoid running out of
> memory

WARN  [GossipTasks:1] 2016-08-17 04:35:10,697 FailureDetector.java:287 -
> Not marking nodes down due to local pause of 8628650983 > 5000000000

WARN  [GossipTasks:1] 2016-08-17 04:42:55,524 FailureDetector.java:287 -
> Not marking nodes down due to local pause of 9141089664 > 5000000000




On Wed, Aug 17, 2016 at 11:49 AM, Jeff Jirsa <je...@crowdstrike.com>
wrote:

> What compaction strategy? Looks like leveled – is that what you expect?
>
>
>
> Any exceptions in the logs?
>
>
>
> Are you throttling compaction?
>
>
>
> SSD or spinning disks?
>
>
>
> How many cores?
>
>
>
> How many concurrent compactors?
>
>
>
>
>
>
>
> *From: *Ezra Stuetzel <ez...@riskiq.net>
> *Reply-To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
> *Date: *Wednesday, August 17, 2016 at 11:39 AM
> *To: *"user@cassandra.apache.org" <us...@cassandra.apache.org>
> *Subject: *large number of pending compactions, sstables steadily
> increasing
>
>
>
> I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
> fix issue) which seems to be stuck in a weird state -- with a large number
> of pending compactions and sstables. The node is compacting about
> 500gb/day, number of pending compactions is going up at about 50/day. It is
> at about 2300 pending compactions now. I have tried increasing number of
> compaction threads and the compaction throughput, which doesn't seem to
> help eliminate the many pending compactions.
>
>
>
> I have tried running 'nodetool cleanup' and 'nodetool compact'. The latter
> has fixed the issue in the past, but most recently I was getting OOM
> errors, probably due to the large number of sstables. I upgraded to 2.2.7
> and am no longer getting OOM errors, but also it does not resolve the
> issue. I do see this message in the logs:
>
>
>
> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
> CompactionManager.java:610 - Cannot perform a full major compaction as
> repaired and unrepaired sstables cannot be compacted together. These two
> set of sstables will be compacted separately.
>
> Below are the 'nodetool tablestats' comparing a normal and the problematic
> node. You can see problematic node has many many more sstables, and they
> are all in level 1. What is the best way to fix this? Can I just delete
> those sstables somehow then run a repair?
>
> Normal node
>
> keyspace: mykeyspace
>
>     Read Count: 0
>
>     Read Latency: NaN ms.
>
>     Write Count: 31905656
>
>     Write Latency: 0.051713177939359714 ms.
>
>     Pending Flushes: 0
>
>         Table: mytable
>
>         SSTable count: 1908
>
>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306, 0,
> 0, 0, 0]
>
>         Space used (live): 301894591442
>
>         Space used (total): 301894591442
>
>
>
>
>
> Problematic node
>
> Keyspace: mykeyspace
>
>     Read Count: 0
>
>     Read Latency: NaN ms.
>
>     Write Count: 30520190
>
>     Write Latency: 0.05171286705620116 ms.
>
>     Pending Flushes: 0
>
>         Table: mytable
>
>         SSTable count: 14105
>
>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0, 0,
> 0]
>
>         Space used (live): 561143255289
>
>         Space used (total): 561143255289
>
> Thanks,
>
> Ezra
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Jeff Jirsa <je...@crowdstrike.com>.
What compaction strategy? Looks like leveled – is that what you expect? 

 

Any exceptions in the logs? 

 

Are you throttling compaction?

 

SSD or spinning disks?

 

How many cores?

 

How many concurrent compactors? 

 

 

 

From: Ezra Stuetzel <ez...@riskiq.net>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wednesday, August 17, 2016 at 11:39 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: large number of pending compactions, sstables steadily increasing

 

I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to fix issue) which seems to be stuck in a weird state -- with a large number of pending compactions and sstables. The node is compacting about 500gb/day, number of pending compactions is going up at about 50/day. It is at about 2300 pending compactions now. I have tried increasing number of compaction threads and the compaction throughput, which doesn't seem to help eliminate the many pending compactions.   

 

I have tried running 'nodetool cleanup' and 'nodetool compact'. The latter has fixed the issue in the past, but most recently I was getting OOM errors, probably due to the large number of sstables. I upgraded to 2.2.7 and am no longer getting OOM errors, but also it does not resolve the issue. I do see this message in the logs:

 

INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985 CompactionManager.java:610 - Cannot perform a full major compaction as repaired and unrepaired sstables cannot be compacted together. These two set of sstables will be compacted separately.

Below are the 'nodetool tablestats' comparing a normal and the problematic node. You can see problematic node has many many more sstables, and they are all in level 1. What is the best way to fix this? Can I just delete those sstables somehow then run a repair?

Normal node

keyspace: mykeyspace

    Read Count: 0

    Read Latency: NaN ms.

    Write Count: 31905656

    Write Latency: 0.051713177939359714 ms.

    Pending Flushes: 0

        Table: mytable

        SSTable count: 1908

        SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306, 0, 0, 0, 0]

        Space used (live): 301894591442

        Space used (total): 301894591442

 

 

Problematic node

Keyspace: mykeyspace

    Read Count: 0

    Read Latency: NaN ms.

    Write Count: 30520190

    Write Latency: 0.05171286705620116 ms.

    Pending Flushes: 0

        Table: mytable

        SSTable count: 14105

        SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0, 0, 0]

        Space used (live): 561143255289

        Space used (total): 561143255289

Thanks,

Ezra


Re: large number of pending compactions, sstables steadily increasing

Posted by Eiti Kimura <ei...@movile.com>.
Ben, Benjamin, thanks for the replies.

What you're suggesting here is to change from LeveledCompaction to
SizeTieredCompaction. That task is in progress, and we are going to measure
the results for just some column families.
Ben, thanks for the procedure, I will try it again later. When the problem
happened here we started to 'destroy' the node, which means: decommission it,
remove all of its data directories and bootstrap it again. The problem is that
the bootstrap was taking a long time to complete, more than 5 hours...
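
For what it's worth, the strategy switch itself is a single schema change per
table; a minimal sketch from the shell, assuming the mykeyspace.mytable names
used earlier in the thread and default STCS options (the existing sstables get
reorganized by the new strategy, so expect extra compaction load for a while):

    # Switch one table from leveled to size-tiered compaction
    cqlsh -e "ALTER TABLE mykeyspace.mytable WITH compaction = {'class': 'SizeTieredCompactionStrategy'};"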

Benjamin, I hope they start to take care of the ticket you are pointing to;
it looks like a bug to me, and a widespread problem, since there are a lot of
people using Cassandra 2.1.x.
Do you guys know if Cassandra 3.0 and 3.1 are affected by this problem as
well?

Regards,
Eiti



J.P. Eiti Kimura
Plataformas

+55 19 3518  <https://www.movile.com/assinaturaemail/#>5500
+ <https://www.movile.com/assinaturaemail/#>55 19 98232 2792
skype: eitikimura
<https://www.linkedin.com/company/movile>
<https://pt.pinterest.com/Movile/>  <https://twitter.com/movile_LATAM>
<https://www.facebook.com/Movile>

2016-11-07 18:49 GMT-02:00 Benjamin Roth <be...@jaumo.com>:

> Hm, this MAY somehow relate to the issue I encountered recently:
> https://issues.apache.org/jira/browse/CASSANDRA-12730
> I also made a proposal to mitigate excessive (unnecessary) flushes during
> repair streams but unfortunately nobody commented on it yet.
> Maybe there are some opinions on it around here?
>
> 2016-11-07 20:15 GMT+00:00 Ben Slater <be...@instaclustr.com>:
>
>> What I’ve seen happen a number of times is you get in a negative feedback
>> loop:
>> not enough capacity to keep up with compactions (often triggered by
>> repair or compaction hitting a large partition) -> more sstables -> more
>> expensive reads -> even less capacity to keep up with compactions -> repeat
>>
>> The way we deal with this at Instaclustr is typically to take the node
>> offline to let it catch up with compactions. We take it offline by running
>> nodetool disablegossip + disablethrift + disablebinary, unthrottle
>> compactions (nodetool setcompactionthroughput 0) and then leave it to chug
>> through compactions until it gets close to zero then reverse the settings
>> or restart C* to set things back to normal. This typically resolves the
>> issues. If you see it happening regularly your cluster probably needs more
>> processing capacity (or other tuning).
>>
>> Cheers
>> Ben
>>
>> On Tue, 8 Nov 2016 at 02:38 Eiti Kimura <ei...@movile.com> wrote:
>>
>>> Hey guys,
>>>
>>> Do we have any conclusions about this case? Ezra, did you solve your
>>> problem?
>>> We are facing a very similar problem here. LeveledCompaction with VNodes
>>> and looks like a node went to a weird state and start to consume lot of
>>> CPU, the compaction process seems to be stucked and the number of SSTables
>>> increased significantly.
>>>
>>> Do you have any clue about it?
>>>
>>> Thanks,
>>> Eiti
>>>
>>>
>>>
>>> J.P. Eiti Kimura
>>> Plataformas
>>>
>>> +55 19 3518  <https://www.movile.com/assinaturaemail/#>5500
>>> + <https://www.movile.com/assinaturaemail/#>55 19 98232 2792
>>> skype: eitikimura
>>> <https://www.linkedin.com/company/movile>
>>> <https://pt.pinterest.com/Movile/>  <https://twitter.com/movile_LATAM>
>>> <https://www.facebook.com/Movile>
>>>
>>> 2016-09-11 18:20 GMT-03:00 Jens Rantil <je...@tink.se>:
>>>
>>> I just want to chime in and say that we also had issues keeping up with
>>> compaction once (with vnodes/ssd disks) and I also want to recommend
>>> keeping track of your open file limit which might bite you.
>>>
>>> Cheers,
>>> Jens
>>>
>>>
>>> On Friday, August 19, 2016, Mark Rose <ma...@markrose.ca> wrote:
>>>
>>> Hi Ezra,
>>>
>>> Are you making frequent changes to your rows (including TTL'ed
>>> values), or mostly inserting new ones? If you're only inserting new
>>> data, it's probable using size-tiered compaction would work better for
>>> you. If you are TTL'ing whole rows, consider date-tiered.
>>>
>>> If leveled compaction is still the best strategy, one way to catch up
>>> with compactions is to have less data per partition -- in other words,
>>> use more machines. Leveled compaction is CPU expensive. You are CPU
>>> bottlenecked currently, or from the other perspective, you have too
>>> much data per node for leveled compaction.
>>>
>>> At this point, compaction is so far behind that you'll likely be
>>> getting high latency if you're reading old rows (since dozens to
>>> hundreds of uncompacted sstables will likely need to be checked for
>>> matching rows). You may be better off with size tiered compaction,
>>> even if it will mean always reading several sstables per read (higher
>>> latency than when leveled can keep up).
>>>
>>> How much data do you have per node? Do you update/insert to/delete
>>> rows? Do you TTL?
>>>
>>> Cheers,
>>> Mark
>>>
>>> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
>>> wrote:
>>> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping
>>> to fix
>>> > issue) which seems to be stuck in a weird state -- with a large number
>>> of
>>> > pending compactions and sstables. The node is compacting about
>>> 500gb/day,
>>> > number of pending compactions is going up at about 50/day. It is at
>>> about
>>> > 2300 pending compactions now. I have tried increasing number of
>>> compaction
>>> > threads and the compaction throughput, which doesn't seem to help
>>> eliminate
>>> > the many pending compactions.
>>> >
>>> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
>>> latter
>>> > has fixed the issue in the past, but most recently I was getting OOM
>>> errors,
>>> > probably due to the large number of sstables. I upgraded to 2.2.7 and
>>> am no
>>> > longer getting OOM errors, but also it does not resolve the issue. I
>>> do see
>>> > this message in the logs:
>>> >
>>> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>>> >> CompactionManager.java:610 - Cannot perform a full major compaction as
>>> >> repaired and unrepaired sstables cannot be compacted together. These
>>> two set
>>> >> of sstables will be compacted separately.
>>> >
>>> > Below are the 'nodetool tablestats' comparing a normal and the
>>> problematic
>>> > node. You can see problematic node has many many more sstables, and
>>> they are
>>> > all in level 1. What is the best way to fix this? Can I just delete
>>> those
>>> > sstables somehow then run a repair?
>>> >>
>>> >> Normal node
>>> >>>
>>> >>> keyspace: mykeyspace
>>> >>>
>>> >>>     Read Count: 0
>>> >>>
>>> >>>     Read Latency: NaN ms.
>>> >>>
>>> >>>     Write Count: 31905656
>>> >>>
>>> >>>     Write Latency: 0.051713177939359714 ms.
>>> >>>
>>> >>>     Pending Flushes: 0
>>> >>>
>>> >>>         Table: mytable
>>> >>>
>>> >>>         SSTable count: 1908
>>> >>>
>>> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000,
>>> 306, 0,
>>> >>> 0, 0, 0]
>>> >>>
>>> >>>         Space used (live): 301894591442
>>> >>>
>>> >>>         Space used (total): 301894591442
>>> >>>
>>> >>>
>>> >>>
>>> >>> Problematic node
>>> >>>
>>> >>> Keyspace: mykeyspace
>>> >>>
>>> >>>     Read Count: 0
>>> >>>
>>> >>>     Read Latency: NaN ms.
>>> >>>
>>> >>>     Write Count: 30520190
>>> >>>
>>> >>>     Write Latency: 0.05171286705620116 ms.
>>> >>>
>>> >>>     Pending Flushes: 0
>>> >>>
>>> >>>         Table: mytable
>>> >>>
>>> >>>         SSTable count: 14105
>>> >>>
>>> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0,
>>> 0,
>>> >>> 0, 0]
>>> >>>
>>> >>>         Space used (live): 561143255289
>>> >>>
>>> >>>         Space used (total): 561143255289
>>> >
>>> > Thanks,
>>> >
>>> > Ezra
>>>
>>>
>>>
>>> --
>>> Jens Rantil
>>> Backend engineer
>>> Tink AB
>>>
>>> Email: jens.rantil@tink.se
>>> Phone: +46 708 84 18 32
>>> Web: www.tink.se
>>>
>>> Facebook <https://www.facebook.com/#!/tink.se> Linkedin
>>> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>>>  Twitter <https://twitter.com/tink>
>>>
>>>
>>>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Benjamin Roth <be...@jaumo.com>.
Hm, this MAY somehow relate to the issue I encountered recently:
https://issues.apache.org/jira/browse/CASSANDRA-12730
I also made a proposal to mitigate excessive (unnecessary) flushes during
repair streams, but unfortunately nobody has commented on it yet.
Maybe there are some opinions on it around here?

2016-11-07 20:15 GMT+00:00 Ben Slater <be...@instaclustr.com>:

> What I’ve seen happen a number of times is you get in a negative feedback
> loop:
> not enough capacity to keep up with compactions (often triggered by repair
> or compaction hitting a large partition) -> more sstables -> more expensive
> reads -> even less capacity to keep up with compactions -> repeat
>
> The way we deal with this at Instaclustr is typically to take the node
> offline to let it catch up with compactions. We take it offline by running
> nodetool disablegossip + disablethrift + disablebinary, unthrottle
> compactions (nodetool setcompactionthroughput 0) and then leave it to chug
> through compactions until it gets close to zero then reverse the settings
> or restart C* to set things back to normal. This typically resolves the
> issues. If you see it happening regularly your cluster probably needs more
> processing capacity (or other tuning).
>
> Cheers
> Ben
>
> On Tue, 8 Nov 2016 at 02:38 Eiti Kimura <ei...@movile.com> wrote:
>
>> Hey guys,
>>
>> Do we have any conclusions about this case? Ezra, did you solve your
>> problem?
>> We are facing a very similar problem here. LeveledCompaction with VNodes
>> and looks like a node went to a weird state and start to consume lot of
>> CPU, the compaction process seems to be stucked and the number of SSTables
>> increased significantly.
>>
>> Do you have any clue about it?
>>
>> Thanks,
>> Eiti
>>
>>
>>
>> J.P. Eiti Kimura
>> Plataformas
>>
>> +55 19 3518  <https://www.movile.com/assinaturaemail/#>5500
>> + <https://www.movile.com/assinaturaemail/#>55 19 98232 2792
>> skype: eitikimura
>> <https://www.linkedin.com/company/movile>
>> <https://pt.pinterest.com/Movile/>  <https://twitter.com/movile_LATAM>
>> <https://www.facebook.com/Movile>
>>
>> 2016-09-11 18:20 GMT-03:00 Jens Rantil <je...@tink.se>:
>>
>> I just want to chime in and say that we also had issues keeping up with
>> compaction once (with vnodes/ssd disks) and I also want to recommend
>> keeping track of your open file limit which might bite you.
>>
>> Cheers,
>> Jens
>>
>>
>> On Friday, August 19, 2016, Mark Rose <ma...@markrose.ca> wrote:
>>
>> Hi Ezra,
>>
>> Are you making frequent changes to your rows (including TTL'ed
>> values), or mostly inserting new ones? If you're only inserting new
>> data, it's probable using size-tiered compaction would work better for
>> you. If you are TTL'ing whole rows, consider date-tiered.
>>
>> If leveled compaction is still the best strategy, one way to catch up
>> with compactions is to have less data per partition -- in other words,
>> use more machines. Leveled compaction is CPU expensive. You are CPU
>> bottlenecked currently, or from the other perspective, you have too
>> much data per node for leveled compaction.
>>
>> At this point, compaction is so far behind that you'll likely be
>> getting high latency if you're reading old rows (since dozens to
>> hundreds of uncompacted sstables will likely need to be checked for
>> matching rows). You may be better off with size tiered compaction,
>> even if it will mean always reading several sstables per read (higher
>> latency than when leveled can keep up).
>>
>> How much data do you have per node? Do you update/insert to/delete
>> rows? Do you TTL?
>>
>> Cheers,
>> Mark
>>
>> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
>> wrote:
>> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
>> fix
>> > issue) which seems to be stuck in a weird state -- with a large number
>> of
>> > pending compactions and sstables. The node is compacting about
>> 500gb/day,
>> > number of pending compactions is going up at about 50/day. It is at
>> about
>> > 2300 pending compactions now. I have tried increasing number of
>> compaction
>> > threads and the compaction throughput, which doesn't seem to help
>> eliminate
>> > the many pending compactions.
>> >
>> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
>> latter
>> > has fixed the issue in the past, but most recently I was getting OOM
>> errors,
>> > probably due to the large number of sstables. I upgraded to 2.2.7 and
>> am no
>> > longer getting OOM errors, but also it does not resolve the issue. I do
>> see
>> > this message in the logs:
>> >
>> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>> >> CompactionManager.java:610 - Cannot perform a full major compaction as
>> >> repaired and unrepaired sstables cannot be compacted together. These
>> two set
>> >> of sstables will be compacted separately.
>> >
>> > Below are the 'nodetool tablestats' comparing a normal and the
>> problematic
>> > node. You can see problematic node has many many more sstables, and
>> they are
>> > all in level 1. What is the best way to fix this? Can I just delete
>> those
>> > sstables somehow then run a repair?
>> >>
>> >> Normal node
>> >>>
>> >>> keyspace: mykeyspace
>> >>>
>> >>>     Read Count: 0
>> >>>
>> >>>     Read Latency: NaN ms.
>> >>>
>> >>>     Write Count: 31905656
>> >>>
>> >>>     Write Latency: 0.051713177939359714 ms.
>> >>>
>> >>>     Pending Flushes: 0
>> >>>
>> >>>         Table: mytable
>> >>>
>> >>>         SSTable count: 1908
>> >>>
>> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000,
>> 306, 0,
>> >>> 0, 0, 0]
>> >>>
>> >>>         Space used (live): 301894591442
>> >>>
>> >>>         Space used (total): 301894591442
>> >>>
>> >>>
>> >>>
>> >>> Problematic node
>> >>>
>> >>> Keyspace: mykeyspace
>> >>>
>> >>>     Read Count: 0
>> >>>
>> >>>     Read Latency: NaN ms.
>> >>>
>> >>>     Write Count: 30520190
>> >>>
>> >>>     Write Latency: 0.05171286705620116 ms.
>> >>>
>> >>>     Pending Flushes: 0
>> >>>
>> >>>         Table: mytable
>> >>>
>> >>>         SSTable count: 14105
>> >>>
>> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0,
>> 0,
>> >>> 0, 0]
>> >>>
>> >>>         Space used (live): 561143255289
>> >>>
>> >>>         Space used (total): 561143255289
>> >
>> > Thanks,
>> >
>> > Ezra
>>
>>
>>
>> --
>> Jens Rantil
>> Backend engineer
>> Tink AB
>>
>> Email: jens.rantil@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>>
>> Facebook <https://www.facebook.com/#!/tink.se> Linkedin
>> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>>  Twitter <https://twitter.com/tink>
>>
>>
>>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: large number of pending compactions, sstables steadily increasing

Posted by Ben Slater <be...@instaclustr.com>.
What I’ve seen happen a number of times is that you get into a negative
feedback loop:
not enough capacity to keep up with compactions (often triggered by repair
or compaction hitting a large partition) -> more sstables -> more expensive
reads -> even less capacity to keep up with compactions -> repeat

The way we deal with this at Instaclustr is typically to take the node
offline to let it catch up with compactions. We take it offline by running
nodetool disablegossip + disablethrift + disablebinary, unthrottle
compactions (nodetool setcompactionthroughput 0) and then leave it to chug
through compactions until it gets close to zero, then reverse the settings
or restart C* to set things back to normal. This typically resolves the
issues. If you see it happening regularly, your cluster probably needs more
processing capacity (or other tuning).
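
A minimal sketch of that sequence, assuming shell access on the affected node
(the 16 MB/s at the end is just the stock default; use whatever cap you
normally run with):

    # Take the node out of client traffic and gossip, but keep it compacting
    nodetool disablegossip
    nodetool disablethrift
    nodetool disablebinary

    # Remove the compaction throughput cap and let it churn
    nodetool setcompactionthroughput 0
    nodetool compactionstats            # repeat until pending is near zero

    # Restore throttling and rejoin the cluster (or simply restart C*)
    nodetool setcompactionthroughput 16
    nodetool enablebinary
    nodetool enablethrift
    nodetool enablegossip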

Cheers
Ben

On Tue, 8 Nov 2016 at 02:38 Eiti Kimura <ei...@movile.com> wrote:

> Hey guys,
>
> Do we have any conclusions about this case? Ezra, did you solve your
> problem?
> We are facing a very similar problem here. LeveledCompaction with VNodes
> and looks like a node went to a weird state and start to consume lot of
> CPU, the compaction process seems to be stucked and the number of SSTables
> increased significantly.
>
> Do you have any clue about it?
>
> Thanks,
> Eiti
>
>
>
> J.P. Eiti Kimura
> Plataformas
>
> +55 19 3518  <https://www.movile.com/assinaturaemail/#>5500
> + <https://www.movile.com/assinaturaemail/#>55 19 98232 2792
> skype: eitikimura
> <https://www.linkedin.com/company/movile>
> <https://pt.pinterest.com/Movile/>  <https://twitter.com/movile_LATAM>
> <https://www.facebook.com/Movile>
>
> 2016-09-11 18:20 GMT-03:00 Jens Rantil <je...@tink.se>:
>
> I just want to chime in and say that we also had issues keeping up with
> compaction once (with vnodes/ssd disks) and I also want to recommend
> keeping track of your open file limit which might bite you.
>
> Cheers,
> Jens
>
>
> On Friday, August 19, 2016, Mark Rose <ma...@markrose.ca> wrote:
>
> Hi Ezra,
>
> Are you making frequent changes to your rows (including TTL'ed
> values), or mostly inserting new ones? If you're only inserting new
> data, it's probable using size-tiered compaction would work better for
> you. If you are TTL'ing whole rows, consider date-tiered.
>
> If leveled compaction is still the best strategy, one way to catch up
> with compactions is to have less data per partition -- in other words,
> use more machines. Leveled compaction is CPU expensive. You are CPU
> bottlenecked currently, or from the other perspective, you have too
> much data per node for leveled compaction.
>
> At this point, compaction is so far behind that you'll likely be
> getting high latency if you're reading old rows (since dozens to
> hundreds of uncompacted sstables will likely need to be checked for
> matching rows). You may be better off with size tiered compaction,
> even if it will mean always reading several sstables per read (higher
> latency than when leveled can keep up).
>
> How much data do you have per node? Do you update/insert to/delete
> rows? Do you TTL?
>
> Cheers,
> Mark
>
> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
> wrote:
> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
> fix
> > issue) which seems to be stuck in a weird state -- with a large number of
> > pending compactions and sstables. The node is compacting about 500gb/day,
> > number of pending compactions is going up at about 50/day. It is at about
> > 2300 pending compactions now. I have tried increasing number of
> compaction
> > threads and the compaction throughput, which doesn't seem to help
> eliminate
> > the many pending compactions.
> >
> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
> latter
> > has fixed the issue in the past, but most recently I was getting OOM
> errors,
> > probably due to the large number of sstables. I upgraded to 2.2.7 and am
> no
> > longer getting OOM errors, but also it does not resolve the issue. I do
> see
> > this message in the logs:
> >
> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
> >> CompactionManager.java:610 - Cannot perform a full major compaction as
> >> repaired and unrepaired sstables cannot be compacted together. These
> two set
> >> of sstables will be compacted separately.
> >
> > Below are the 'nodetool tablestats' comparing a normal and the
> problematic
> > node. You can see problematic node has many many more sstables, and they
> are
> > all in level 1. What is the best way to fix this? Can I just delete those
> > sstables somehow then run a repair?
> >>
> >> Normal node
> >>>
> >>> keyspace: mykeyspace
> >>>
> >>>     Read Count: 0
> >>>
> >>>     Read Latency: NaN ms.
> >>>
> >>>     Write Count: 31905656
> >>>
> >>>     Write Latency: 0.051713177939359714 ms.
> >>>
> >>>     Pending Flushes: 0
> >>>
> >>>         Table: mytable
> >>>
> >>>         SSTable count: 1908
> >>>
> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306,
> 0,
> >>> 0, 0, 0]
> >>>
> >>>         Space used (live): 301894591442
> >>>
> >>>         Space used (total): 301894591442
> >>>
> >>>
> >>>
> >>> Problematic node
> >>>
> >>> Keyspace: mykeyspace
> >>>
> >>>     Read Count: 0
> >>>
> >>>     Read Latency: NaN ms.
> >>>
> >>>     Write Count: 30520190
> >>>
> >>>     Write Latency: 0.05171286705620116 ms.
> >>>
> >>>     Pending Flushes: 0
> >>>
> >>>         Table: mytable
> >>>
> >>>         SSTable count: 14105
> >>>
> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0,
> >>> 0, 0]
> >>>
> >>>         Space used (live): 561143255289
> >>>
> >>>         Space used (total): 561143255289
> >
> > Thanks,
> >
> > Ezra
>
>
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.rantil@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
> Facebook <https://www.facebook.com/#!/tink.se> Linkedin
> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>  Twitter <https://twitter.com/tink>
>
>
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Eiti Kimura <ei...@movile.com>.
Hey guys,

Do we have any conclusions about this case? Ezra, did you solve your
problem?
We are facing a very similar problem here: LeveledCompaction with vnodes.
It looks like a node went into a weird state and started to consume a lot of
CPU; the compaction process seems to be stuck and the number of SSTables has
increased significantly.

Do you have any clue about it?

Thanks,
Eiti



J.P. Eiti Kimura
Plataformas

+55 19 3518  <https://www.movile.com/assinaturaemail/#>5500
+ <https://www.movile.com/assinaturaemail/#>55 19 98232 2792
skype: eitikimura
<https://www.linkedin.com/company/movile>
<https://pt.pinterest.com/Movile/>  <https://twitter.com/movile_LATAM>
<https://www.facebook.com/Movile>

2016-09-11 18:20 GMT-03:00 Jens Rantil <je...@tink.se>:

> I just want to chime in and say that we also had issues keeping up with
> compaction once (with vnodes/ssd disks) and I also want to recommend
> keeping track of your open file limit which might bite you.
>
> Cheers,
> Jens
>
>
> On Friday, August 19, 2016, Mark Rose <ma...@markrose.ca> wrote:
>
>> Hi Ezra,
>>
>> Are you making frequent changes to your rows (including TTL'ed
>> values), or mostly inserting new ones? If you're only inserting new
>> data, it's probable using size-tiered compaction would work better for
>> you. If you are TTL'ing whole rows, consider date-tiered.
>>
>> If leveled compaction is still the best strategy, one way to catch up
>> with compactions is to have less data per partition -- in other words,
>> use more machines. Leveled compaction is CPU expensive. You are CPU
>> bottlenecked currently, or from the other perspective, you have too
>> much data per node for leveled compaction.
>>
>> At this point, compaction is so far behind that you'll likely be
>> getting high latency if you're reading old rows (since dozens to
>> hundreds of uncompacted sstables will likely need to be checked for
>> matching rows). You may be better off with size tiered compaction,
>> even if it will mean always reading several sstables per read (higher
>> latency than when leveled can keep up).
>>
>> How much data do you have per node? Do you update/insert to/delete
>> rows? Do you TTL?
>>
>> Cheers,
>> Mark
>>
>> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
>> wrote:
>> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
>> fix
>> > issue) which seems to be stuck in a weird state -- with a large number
>> of
>> > pending compactions and sstables. The node is compacting about
>> 500gb/day,
>> > number of pending compactions is going up at about 50/day. It is at
>> about
>> > 2300 pending compactions now. I have tried increasing number of
>> compaction
>> > threads and the compaction throughput, which doesn't seem to help
>> eliminate
>> > the many pending compactions.
>> >
>> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
>> latter
>> > has fixed the issue in the past, but most recently I was getting OOM
>> errors,
>> > probably due to the large number of sstables. I upgraded to 2.2.7 and
>> am no
>> > longer getting OOM errors, but also it does not resolve the issue. I do
>> see
>> > this message in the logs:
>> >
>> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>> >> CompactionManager.java:610 - Cannot perform a full major compaction as
>> >> repaired and unrepaired sstables cannot be compacted together. These
>> two set
>> >> of sstables will be compacted separately.
>> >
>> > Below are the 'nodetool tablestats' comparing a normal and the
>> problematic
>> > node. You can see problematic node has many many more sstables, and
>> they are
>> > all in level 1. What is the best way to fix this? Can I just delete
>> those
>> > sstables somehow then run a repair?
>> >>
>> >> Normal node
>> >>>
>> >>> keyspace: mykeyspace
>> >>>
>> >>>     Read Count: 0
>> >>>
>> >>>     Read Latency: NaN ms.
>> >>>
>> >>>     Write Count: 31905656
>> >>>
>> >>>     Write Latency: 0.051713177939359714 ms.
>> >>>
>> >>>     Pending Flushes: 0
>> >>>
>> >>>         Table: mytable
>> >>>
>> >>>         SSTable count: 1908
>> >>>
>> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000,
>> 306, 0,
>> >>> 0, 0, 0]
>> >>>
>> >>>         Space used (live): 301894591442
>> >>>
>> >>>         Space used (total): 301894591442
>> >>>
>> >>>
>> >>>
>> >>> Problematic node
>> >>>
>> >>> Keyspace: mykeyspace
>> >>>
>> >>>     Read Count: 0
>> >>>
>> >>>     Read Latency: NaN ms.
>> >>>
>> >>>     Write Count: 30520190
>> >>>
>> >>>     Write Latency: 0.05171286705620116 ms.
>> >>>
>> >>>     Pending Flushes: 0
>> >>>
>> >>>         Table: mytable
>> >>>
>> >>>         SSTable count: 14105
>> >>>
>> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0,
>> 0,
>> >>> 0, 0]
>> >>>
>> >>>         Space used (live): 561143255289
>> >>>
>> >>>         Space used (total): 561143255289
>> >
>> > Thanks,
>> >
>> > Ezra
>>
>
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.rantil@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
> Facebook <https://www.facebook.com/#!/tink.se> Linkedin
> <http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
>  Twitter <https://twitter.com/tink>
>
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Jens Rantil <je...@tink.se>.
I just want to chime in and say that we also had issues keeping up with
compaction once (with vnodes/SSD disks), and I also want to recommend
keeping track of your open file limit, which might bite you.
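
A quick way to keep an eye on that, assuming a Linux host and the usual single
Cassandra JVM (the pgrep pattern below is just a common convention):

    # Effective open-file limit for the Cassandra process
    CASS_PID=$(pgrep -f CassandraDaemon)
    grep 'Max open files' /proc/$CASS_PID/limits

    # File descriptors (sstable components, sockets, ...) currently in use
    ls /proc/$CASS_PID/fd | wc -l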

Cheers,
Jens

On Friday, August 19, 2016, Mark Rose <ma...@markrose.ca> wrote:

> Hi Ezra,
>
> Are you making frequent changes to your rows (including TTL'ed
> values), or mostly inserting new ones? If you're only inserting new
> data, it's probable using size-tiered compaction would work better for
> you. If you are TTL'ing whole rows, consider date-tiered.
>
> If leveled compaction is still the best strategy, one way to catch up
> with compactions is to have less data per partition -- in other words,
> use more machines. Leveled compaction is CPU expensive. You are CPU
> bottlenecked currently, or from the other perspective, you have too
> much data per node for leveled compaction.
>
> At this point, compaction is so far behind that you'll likely be
> getting high latency if you're reading old rows (since dozens to
> hundreds of uncompacted sstables will likely need to be checked for
> matching rows). You may be better off with size tiered compaction,
> even if it will mean always reading several sstables per read (higher
> latency than when leveled can keep up).
>
> How much data do you have per node? Do you update/insert to/delete
> rows? Do you TTL?
>
> Cheers,
> Mark
>
> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ezra.stuetzel@riskiq.net
> <javascript:;>> wrote:
> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
> fix
> > issue) which seems to be stuck in a weird state -- with a large number of
> > pending compactions and sstables. The node is compacting about 500gb/day,
> > number of pending compactions is going up at about 50/day. It is at about
> > 2300 pending compactions now. I have tried increasing number of
> compaction
> > threads and the compaction throughput, which doesn't seem to help
> eliminate
> > the many pending compactions.
> >
> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
> latter
> > has fixed the issue in the past, but most recently I was getting OOM
> errors,
> > probably due to the large number of sstables. I upgraded to 2.2.7 and am
> no
> > longer getting OOM errors, but also it does not resolve the issue. I do
> see
> > this message in the logs:
> >
> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
> >> CompactionManager.java:610 - Cannot perform a full major compaction as
> >> repaired and unrepaired sstables cannot be compacted together. These
> two set
> >> of sstables will be compacted separately.
> >
> > Below are the 'nodetool tablestats' comparing a normal and the
> problematic
> > node. You can see problematic node has many many more sstables, and they
> are
> > all in level 1. What is the best way to fix this? Can I just delete those
> > sstables somehow then run a repair?
> >>
> >> Normal node
> >>>
> >>> keyspace: mykeyspace
> >>>
> >>>     Read Count: 0
> >>>
> >>>     Read Latency: NaN ms.
> >>>
> >>>     Write Count: 31905656
> >>>
> >>>     Write Latency: 0.051713177939359714 ms.
> >>>
> >>>     Pending Flushes: 0
> >>>
> >>>         Table: mytable
> >>>
> >>>         SSTable count: 1908
> >>>
> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306,
> 0,
> >>> 0, 0, 0]
> >>>
> >>>         Space used (live): 301894591442
> >>>
> >>>         Space used (total): 301894591442
> >>>
> >>>
> >>>
> >>> Problematic node
> >>>
> >>> Keyspace: mykeyspace
> >>>
> >>>     Read Count: 0
> >>>
> >>>     Read Latency: NaN ms.
> >>>
> >>>     Write Count: 30520190
> >>>
> >>>     Write Latency: 0.05171286705620116 ms.
> >>>
> >>>     Pending Flushes: 0
> >>>
> >>>         Table: mytable
> >>>
> >>>         SSTable count: 14105
> >>>
> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0,
> >>> 0, 0]
> >>>
> >>>         Space used (live): 561143255289
> >>>
> >>>         Space used (total): 561143255289
> >
> > Thanks,
> >
> > Ezra
>


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.rantil@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook <https://www.facebook.com/#!/tink.se> Linkedin
<http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_photo&trkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary>
 Twitter <https://twitter.com/tink>

Re: large number of pending compactions, sstables steadily increasing

Posted by Ezra Stuetzel <ez...@riskiq.net>.
Yes, I am using vnodes. Each of our nodes has 256 tokens.

On Mon, Aug 22, 2016 at 2:57 AM, Carlos Alonso <in...@mrcalonso.com> wrote:

> Are you using vnodes? I've heard of similar sstable explosion issues when
> operating with vnodes.
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
> On 20 August 2016 at 22:22, Ezra Stuetzel <ez...@riskiq.net>
> wrote:
>
>> Hey Mark,
>> Yes, there are frequent changes to rows. In fact we re-write each row 5
>> times. 95% of our rows are TTL'ed, but it is the select 5% that aren't
>> TTL'ed that led to not use date tiered compaction. I think the node got
>> into a weird state and I'm not sure how, but it wound up with a lot of
>> sstables and many pending compactions. We did have a 6 node cluster, but we
>> wanted to change the machine type to higher CPU and SSDs. So we
>> bootstrapped 4 new nodes one at a time then removed the original 6 nodes
>> one at a time. A few of these 6 nodes were running OOM so we had to
>> assasinate them (some data loss was acceptable). When I increased the
>> compaction throughput and number compaction executors, I did not see any
>> change in the rate of increase of pending compactions. However I did not
>> look at the number of sstables then. Now, looking at the graphs below,
>> increasing those two settings showed an immediate decline in sstable count,
>> but a delayed dramatic decline (~3 day delay) in pending compactions. All
>> nodes should have the same load so I am hoping it won't occur again. If it
>> doesn't I'll try switching to size tiered or date tiered. Between 8/17 and
>> 8/18 is when I increased the compaction settings. We have about 280GB per
>> node for this table, except this one problematic node had about twice that,
>> but it seems to have recovered that space when the pending compactions
>> dropped off. Graphs for the sstables, pending compactions, and disk space
>> are below which start when the 4 nodes were being bootstrapped.
>>
>> [image: Inline image 2]
>>
>> [image: Inline image 1]
>> [image: Inline image 3]
>>
>> On Fri, Aug 19, 2016 at 11:41 AM, Mark Rose <ma...@markrose.ca> wrote:
>>
>>> Hi Ezra,
>>>
>>> Are you making frequent changes to your rows (including TTL'ed
>>> values), or mostly inserting new ones? If you're only inserting new
>>> data, it's probable using size-tiered compaction would work better for
>>> you. If you are TTL'ing whole rows, consider date-tiered.
>>>
>>> If leveled compaction is still the best strategy, one way to catch up
>>> with compactions is to have less data per partition -- in other words,
>>> use more machines. Leveled compaction is CPU expensive. You are CPU
>>> bottlenecked currently, or from the other perspective, you have too
>>> much data per node for leveled compaction.
>>>
>>> At this point, compaction is so far behind that you'll likely be
>>> getting high latency if you're reading old rows (since dozens to
>>> hundreds of uncompacted sstables will likely need to be checked for
>>> matching rows). You may be better off with size tiered compaction,
>>> even if it will mean always reading several sstables per read (higher
>>> latency than when leveled can keep up).
>>>
>>> How much data do you have per node? Do you update/insert to/delete
>>> rows? Do you TTL?
>>>
>>> Cheers,
>>> Mark
>>>
>>> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
>>> wrote:
>>> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping
>>> to fix
>>> > issue) which seems to be stuck in a weird state -- with a large number
>>> of
>>> > pending compactions and sstables. The node is compacting about
>>> 500gb/day,
>>> > number of pending compactions is going up at about 50/day. It is at
>>> about
>>> > 2300 pending compactions now. I have tried increasing number of
>>> compaction
>>> > threads and the compaction throughput, which doesn't seem to help
>>> eliminate
>>> > the many pending compactions.
>>> >
>>> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
>>> latter
>>> > has fixed the issue in the past, but most recently I was getting OOM
>>> errors,
>>> > probably due to the large number of sstables. I upgraded to 2.2.7 and
>>> am no
>>> > longer getting OOM errors, but also it does not resolve the issue. I
>>> do see
>>> > this message in the logs:
>>> >
>>> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>>> >> CompactionManager.java:610 - Cannot perform a full major compaction as
>>> >> repaired and unrepaired sstables cannot be compacted together. These
>>> two set
>>> >> of sstables will be compacted separately.
>>> >
>>> > Below are the 'nodetool tablestats' comparing a normal and the
>>> problematic
>>> > node. You can see problematic node has many many more sstables, and
>>> they are
>>> > all in level 1. What is the best way to fix this? Can I just delete
>>> those
>>> > sstables somehow then run a repair?
>>> >>
>>> >> Normal node
>>> >>>
>>> >>> keyspace: mykeyspace
>>> >>>
>>> >>>     Read Count: 0
>>> >>>
>>> >>>     Read Latency: NaN ms.
>>> >>>
>>> >>>     Write Count: 31905656
>>> >>>
>>> >>>     Write Latency: 0.051713177939359714 ms.
>>> >>>
>>> >>>     Pending Flushes: 0
>>> >>>
>>> >>>         Table: mytable
>>> >>>
>>> >>>         SSTable count: 1908
>>> >>>
>>> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000,
>>> 306, 0,
>>> >>> 0, 0, 0]
>>> >>>
>>> >>>         Space used (live): 301894591442
>>> >>>
>>> >>>         Space used (total): 301894591442
>>> >>>
>>> >>>
>>> >>>
>>> >>> Problematic node
>>> >>>
>>> >>> Keyspace: mykeyspace
>>> >>>
>>> >>>     Read Count: 0
>>> >>>
>>> >>>     Read Latency: NaN ms.
>>> >>>
>>> >>>     Write Count: 30520190
>>> >>>
>>> >>>     Write Latency: 0.05171286705620116 ms.
>>> >>>
>>> >>>     Pending Flushes: 0
>>> >>>
>>> >>>         Table: mytable
>>> >>>
>>> >>>         SSTable count: 14105
>>> >>>
>>> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0,
>>> 0,
>>> >>> 0, 0]
>>> >>>
>>> >>>         Space used (live): 561143255289
>>> >>>
>>> >>>         Space used (total): 561143255289
>>> >
>>> > Thanks,
>>> >
>>> > Ezra
>>>
>>
>>
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Carlos Alonso <in...@mrcalonso.com>.
Are you using vnodes? I've heard of similar sstable explosion issues when
operating with vnodes.

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 20 August 2016 at 22:22, Ezra Stuetzel <ez...@riskiq.net> wrote:

> Hey Mark,
> Yes, there are frequent changes to rows. In fact we re-write each row 5
> times. 95% of our rows are TTL'ed, but it is the select 5% that aren't
> TTL'ed that led to not use date tiered compaction. I think the node got
> into a weird state and I'm not sure how, but it wound up with a lot of
> sstables and many pending compactions. We did have a 6 node cluster, but we
> wanted to change the machine type to higher CPU and SSDs. So we
> bootstrapped 4 new nodes one at a time then removed the original 6 nodes
> one at a time. A few of these 6 nodes were running OOM so we had to
> assasinate them (some data loss was acceptable). When I increased the
> compaction throughput and number compaction executors, I did not see any
> change in the rate of increase of pending compactions. However I did not
> look at the number of sstables then. Now, looking at the graphs below,
> increasing those two settings showed an immediate decline in sstable count,
> but a delayed dramatic decline (~3 day delay) in pending compactions. All
> nodes should have the same load so I am hoping it won't occur again. If it
> doesn't I'll try switching to size tiered or date tiered. Between 8/17 and
> 8/18 is when I increased the compaction settings. We have about 280GB per
> node for this table, except this one problematic node had about twice that,
> but it seems to have recovered that space when the pending compactions
> dropped off. Graphs for the sstables, pending compactions, and disk space
> are below which start when the 4 nodes were being bootstrapped.
>
> [image: Inline image 2]
>
> [image: Inline image 1]
> [image: Inline image 3]
>
> On Fri, Aug 19, 2016 at 11:41 AM, Mark Rose <ma...@markrose.ca> wrote:
>
>> Hi Ezra,
>>
>> Are you making frequent changes to your rows (including TTL'ed
>> values), or mostly inserting new ones? If you're only inserting new
>> data, it's probable using size-tiered compaction would work better for
>> you. If you are TTL'ing whole rows, consider date-tiered.
>>
>> If leveled compaction is still the best strategy, one way to catch up
>> with compactions is to have less data per partition -- in other words,
>> use more machines. Leveled compaction is CPU expensive. You are CPU
>> bottlenecked currently, or from the other perspective, you have too
>> much data per node for leveled compaction.
>>
>> At this point, compaction is so far behind that you'll likely be
>> getting high latency if you're reading old rows (since dozens to
>> hundreds of uncompacted sstables will likely need to be checked for
>> matching rows). You may be better off with size tiered compaction,
>> even if it will mean always reading several sstables per read (higher
>> latency than when leveled can keep up).
>>
>> How much data do you have per node? Do you update/insert to/delete
>> rows? Do you TTL?
>>
>> Cheers,
>> Mark
>>
>> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
>> wrote:
>> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
>> fix
>> > issue) which seems to be stuck in a weird state -- with a large number
>> of
>> > pending compactions and sstables. The node is compacting about
>> 500gb/day,
>> > number of pending compactions is going up at about 50/day. It is at
>> about
>> > 2300 pending compactions now. I have tried increasing number of
>> compaction
>> > threads and the compaction throughput, which doesn't seem to help
>> eliminate
>> > the many pending compactions.
>> >
>> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
>> latter
>> > has fixed the issue in the past, but most recently I was getting OOM
>> errors,
>> > probably due to the large number of sstables. I upgraded to 2.2.7 and
>> am no
>> > longer getting OOM errors, but also it does not resolve the issue. I do
>> see
>> > this message in the logs:
>> >
>> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>> >> CompactionManager.java:610 - Cannot perform a full major compaction as
>> >> repaired and unrepaired sstables cannot be compacted together. These
>> two set
>> >> of sstables will be compacted separately.
>> >
>> > Below are the 'nodetool tablestats' comparing a normal and the
>> problematic
>> > node. You can see problematic node has many many more sstables, and
>> they are
>> > all in level 1. What is the best way to fix this? Can I just delete
>> those
>> > sstables somehow then run a repair?
>> >>
>> >> Normal node
>> >>>
>> >>> keyspace: mykeyspace
>> >>>
>> >>>     Read Count: 0
>> >>>
>> >>>     Read Latency: NaN ms.
>> >>>
>> >>>     Write Count: 31905656
>> >>>
>> >>>     Write Latency: 0.051713177939359714 ms.
>> >>>
>> >>>     Pending Flushes: 0
>> >>>
>> >>>         Table: mytable
>> >>>
>> >>>         SSTable count: 1908
>> >>>
>> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000,
>> 306, 0,
>> >>> 0, 0, 0]
>> >>>
>> >>>         Space used (live): 301894591442
>> >>>
>> >>>         Space used (total): 301894591442
>> >>>
>> >>>
>> >>>
>> >>> Problematic node
>> >>>
>> >>> Keyspace: mykeyspace
>> >>>
>> >>>     Read Count: 0
>> >>>
>> >>>     Read Latency: NaN ms.
>> >>>
>> >>>     Write Count: 30520190
>> >>>
>> >>>     Write Latency: 0.05171286705620116 ms.
>> >>>
>> >>>     Pending Flushes: 0
>> >>>
>> >>>         Table: mytable
>> >>>
>> >>>         SSTable count: 14105
>> >>>
>> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0,
>> 0,
>> >>> 0, 0]
>> >>>
>> >>>         Space used (live): 561143255289
>> >>>
>> >>>         Space used (total): 561143255289
>> >
>> > Thanks,
>> >
>> > Ezra
>>
>
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Ezra Stuetzel <ez...@riskiq.net>.
Hey Mark,
Yes, there are frequent changes to rows. In fact we re-write each row 5
times. 95% of our rows are TTL'ed, but it is the select 5% that aren't
TTL'ed that led us not to use date-tiered compaction. I think the node got
into a weird state and I'm not sure how, but it wound up with a lot of
sstables and many pending compactions. We did have a 6-node cluster, but we
wanted to change the machine type to higher CPU and SSDs. So we
bootstrapped 4 new nodes one at a time, then removed the original 6 nodes
one at a time. A few of these 6 nodes were running OOM, so we had to
assassinate them (some data loss was acceptable). When I increased the
compaction throughput and the number of compaction executors, I did not see
any change in the rate of increase of pending compactions. However, I did
not look at the number of sstables then. Now, looking at the graphs below,
increasing those two settings showed an immediate decline in sstable count,
but a delayed dramatic decline (~3 day delay) in pending compactions. All
nodes should have the same load, so I am hoping it won't occur again. If it
doesn't, I'll try switching to size-tiered or date-tiered. Between 8/17 and
8/18 is when I increased the compaction settings. We have about 280 GB per
node for this table, except this one problematic node had about twice that,
but it seems to have recovered that space when the pending compactions
dropped off. Graphs of the sstables, pending compactions, and disk space
are below, starting when the 4 nodes were being bootstrapped.

[image: Inline image 2]

[image: Inline image 1]
[image: Inline image 3]
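
For the record, the node swap itself was just the standard operations; roughly
(the IP below is a placeholder, not one of our real nodes):

    # On each old node, once its replacement had finished bootstrapping
    nodetool decommission

    # For the old nodes that were running OOM and could not decommission
    # cleanly (run from a healthy node; some data loss was acceptable)
    nodetool assassinate 10.0.0.1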

On Fri, Aug 19, 2016 at 11:41 AM, Mark Rose <ma...@markrose.ca> wrote:

> Hi Ezra,
>
> Are you making frequent changes to your rows (including TTL'ed
> values), or mostly inserting new ones? If you're only inserting new
> data, it's probable that size-tiered compaction would work better for
> you. If you are TTL'ing whole rows, consider date-tiered.
>
> If leveled compaction is still the best strategy, one way to catch up
> with compactions is to have less data per node -- in other words,
> use more machines. Leveled compaction is CPU expensive. You are CPU
> bottlenecked currently, or from the other perspective, you have too
> much data per node for leveled compaction.
>
> At this point, compaction is so far behind that you'll likely be
> getting high latency if you're reading old rows (since dozens to
> hundreds of uncompacted sstables will likely need to be checked for
> matching rows). You may be better off with size tiered compaction,
> even if it will mean always reading several sstables per read (higher
> latency than when leveled can keep up).
>
> How much data do you have per node? Do you update/insert to/delete
> rows? Do you TTL?
>
> Cheers,
> Mark
>
> On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net>
> wrote:
> > I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to
> fix
> > issue) which seems to be stuck in a weird state -- with a large number of
> > pending compactions and sstables. The node is compacting about 500gb/day,
> > number of pending compactions is going up at about 50/day. It is at about
> > 2300 pending compactions now. I have tried increasing number of
> compaction
> > threads and the compaction throughput, which doesn't seem to help
> eliminate
> > the many pending compactions.
> >
> > I have tried running 'nodetool cleanup' and 'nodetool compact'. The
> latter
> > has fixed the issue in the past, but most recently I was getting OOM
> errors,
> > probably due to the large number of sstables. I upgraded to 2.2.7 and am
> no
> > longer getting OOM errors, but also it does not resolve the issue. I do
> see
> > this message in the logs:
> >
> >> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
> >> CompactionManager.java:610 - Cannot perform a full major compaction as
> >> repaired and unrepaired sstables cannot be compacted together. These
> two set
> >> of sstables will be compacted separately.
> >
> > Below are the 'nodetool tablestats' comparing a normal and the
> problematic
> > node. You can see problematic node has many many more sstables, and they
> are
> > all in level 1. What is the best way to fix this? Can I just delete those
> > sstables somehow then run a repair?
> >>
> >> Normal node
> >>>
> >>> keyspace: mykeyspace
> >>>
> >>>     Read Count: 0
> >>>
> >>>     Read Latency: NaN ms.
> >>>
> >>>     Write Count: 31905656
> >>>
> >>>     Write Latency: 0.051713177939359714 ms.
> >>>
> >>>     Pending Flushes: 0
> >>>
> >>>         Table: mytable
> >>>
> >>>         SSTable count: 1908
> >>>
> >>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306,
> 0,
> >>> 0, 0, 0]
> >>>
> >>>         Space used (live): 301894591442
> >>>
> >>>         Space used (total): 301894591442
> >>>
> >>>
> >>>
> >>> Problematic node
> >>>
> >>> Keyspace: mykeyspace
> >>>
> >>>     Read Count: 0
> >>>
> >>>     Read Latency: NaN ms.
> >>>
> >>>     Write Count: 30520190
> >>>
> >>>     Write Latency: 0.05171286705620116 ms.
> >>>
> >>>     Pending Flushes: 0
> >>>
> >>>         Table: mytable
> >>>
> >>>         SSTable count: 14105
> >>>
> >>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0,
> >>> 0, 0]
> >>>
> >>>         Space used (live): 561143255289
> >>>
> >>>         Space used (total): 561143255289
> >
> > Thanks,
> >
> > Ezra
>

Re: large number of pending compactions, sstables steadily increasing

Posted by Mark Rose <ma...@markrose.ca>.
Hi Ezra,

Are you making frequent changes to your rows (including TTL'ed
values), or mostly inserting new ones? If you're only inserting new
data, it's probable that size-tiered compaction would work better for
you. If you are TTL'ing whole rows, consider date-tiered.
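
If you do decide to switch, it's just a table property change. A rough
sketch, using the keyspace/table names from your tablestats output; the
change is a cluster-wide schema change and will kick off a round of
re-compaction under the new strategy:

    # size-tiered:
    cqlsh -e "ALTER TABLE mykeyspace.mytable WITH compaction = {'class': 'SizeTieredCompactionStrategy'};"

    # or date-tiered, if whole rows expire together:
    cqlsh -e "ALTER TABLE mykeyspace.mytable WITH compaction = {'class': 'DateTieredCompactionStrategy'};"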

If leveled compaction is still the best strategy, one way to catch up
with compactions is to have less data per node -- in other words,
use more machines. Leveled compaction is CPU expensive. You are CPU
bottlenecked currently, or from the other perspective, you have too
much data per node for leveled compaction.

At this point, compaction is so far behind that you'll likely be
getting high latency if you're reading old rows (since dozens to
hundreds of uncompacted sstables will likely need to be checked for
matching rows). You may be better off with size tiered compaction,
even if it will mean always reading several sstables per read (higher
latency than when leveled can keep up).
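
You can measure how bad that actually is on the problem node. Assuming the
table names from your paste (and that 2.2's nodetool has the histograms
command -- cfhistograms on older versions), something like this will show
sstables touched per read along with latency percentiles:

    nodetool tablehistograms mykeyspace mytable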

How much data do you have per node? Do you update/insert to/delete
rows? Do you TTL?
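
For the data-per-node question, something like this should be enough to
check, again using the names from your paste:

    nodetool status mykeyspace               # load and ownership per node
    nodetool tablestats mykeyspace.mytable   # per-table space and sstable count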

Cheers,
Mark

On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel <ez...@riskiq.net> wrote:
> I have one node in my cluster 2.2.7 (just upgraded from 2.2.6 hoping to fix
> issue) which seems to be stuck in a weird state -- with a large number of
> pending compactions and sstables. The node is compacting about 500gb/day,
> number of pending compactions is going up at about 50/day. It is at about
> 2300 pending compactions now. I have tried increasing number of compaction
> threads and the compaction throughput, which doesn't seem to help eliminate
> the many pending compactions.
>
> I have tried running 'nodetool cleanup' and 'nodetool compact'. The latter
> has fixed the issue in the past, but most recently I was getting OOM errors,
> probably due to the large number of sstables. I upgraded to 2.2.7 and am no
> longer getting OOM errors, but also it does not resolve the issue. I do see
> this message in the logs:
>
>> INFO  [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>> CompactionManager.java:610 - Cannot perform a full major compaction as
>> repaired and unrepaired sstables cannot be compacted together. These two set
>> of sstables will be compacted separately.
>
> Below are the 'nodetool tablestats' comparing a normal and the problematic
> node. You can see problematic node has many many more sstables, and they are
> all in level 1. What is the best way to fix this? Can I just delete those
> sstables somehow then run a repair?
>>
>> Normal node
>>>
>>> keyspace: mykeyspace
>>>
>>>     Read Count: 0
>>>
>>>     Read Latency: NaN ms.
>>>
>>>     Write Count: 31905656
>>>
>>>     Write Latency: 0.051713177939359714 ms.
>>>
>>>     Pending Flushes: 0
>>>
>>>         Table: mytable
>>>
>>>         SSTable count: 1908
>>>
>>>         SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306, 0,
>>> 0, 0, 0]
>>>
>>>         Space used (live): 301894591442
>>>
>>>         Space used (total): 301894591442
>>>
>>>
>>>
>>> Problematic node
>>>
>>> Keyspace: mykeyspace
>>>
>>>     Read Count: 0
>>>
>>>     Read Latency: NaN ms.
>>>
>>>     Write Count: 30520190
>>>
>>>     Write Latency: 0.05171286705620116 ms.
>>>
>>>     Pending Flushes: 0
>>>
>>>         Table: mytable
>>>
>>>         SSTable count: 14105
>>>
>>>         SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0,
>>> 0, 0]
>>>
>>>         Space used (live): 561143255289
>>>
>>>         Space used (total): 561143255289
>
> Thanks,
>
> Ezra