Posted to commits@cassandra.apache.org by "Benjamin Roth (JIRA)" <ji...@apache.org> on 2016/09/29 06:01:20 UTC

[jira] [Created] (CASSANDRA-12730) Thousands of empty SSTables created during repair - TMOF death

Benjamin Roth created CASSANDRA-12730:
-----------------------------------------

             Summary: Thousands of empty SSTables created during repair - TMOF death
                 Key: CASSANDRA-12730
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12730
             Project: Cassandra
          Issue Type: Bug
          Components: Local Write-Read Paths
            Reporter: Benjamin Roth
            Priority: Critical


Last night I ran a repair on a keyspace with 7 tables and 4 MVs, each containing a few hundred million records. After a few hours a node died because of "too many open files".
Normally one would just raise the limit, but we had already set it to 100k. The problem was that the repair created well over 100k SSTables for one particular MV. The strange thing is that these SSTables contained almost no data (53 bytes, 90 bytes, ...). Some of them (<5%) had a few hundred KB, and very few (<1%) had normal sizes of a few MB or more. I could understand SSTables queuing up because they are flushed faster than they are compacted, but then they should be at least a few MB each (depending on config and available memory), right?
Of course the node then runs out of FDs, and I guess it is not a good idea to raise the limit even higher, as I expect that would just create even more empty SSTables before the node dies anyway.
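
For reference, a rough sketch of how the per-table SSTable counts and sizes can be checked from the data directory; the data path and the 1 KB "tiny" threshold are placeholders, not the exact values from this cluster:

#!/usr/bin/env python3
"""Sketch: count SSTables per table under a Cassandra data directory and
flag tables with many near-empty ones. Path and threshold are assumptions."""
import os
import sys
from collections import defaultdict

DATA_DIR = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra/data"
TINY_BYTES = 1024  # "tiny" threshold; the affected MV had files of ~53-90 bytes

counts = defaultdict(lambda: [0, 0])  # table dir -> [total, tiny]

for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if not name.endswith("-Data.db"):
            continue  # one Data.db component per SSTable
        size = os.path.getsize(os.path.join(root, name))
        table = os.path.relpath(root, DATA_DIR)
        counts[table][0] += 1
        if size < TINY_BYTES:
            counts[table][1] += 1

for table, (total, tiny) in sorted(counts.items(), key=lambda kv: -kv[1][0]):
    print(f"{table}: {total} SSTables, {tiny} under {TINY_BYTES} bytes")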

Only one CF (an MV) was affected; all other CFs (including MVs) behave sanely. The empty SSTables were created at a steady rate over time, roughly 100-150 per minute. Among the empty SSTables there are also some that look normal, i.e. a few MB in size.
I didn't see any errors or exceptions in the logs until the TMOF occurred, just tons of streams due to the repair (which I actually run via cs-reaper as subrange, full repairs).
After restarting that node (with no repair running anymore), the number of SSTables went down again as they were slowly compacted away.
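
A minimal sketch (Linux only; the PID argument and the 60 s polling interval are placeholders) of how the open-FD count of the Cassandra process can be watched, so the growth is visible before the limit is hit:

#!/usr/bin/env python3
"""Sketch: poll the open-file-descriptor count of a process via /proc/<pid>/fd
so the climb toward "too many open files" can be spotted early."""
import os
import sys
import time

pid = int(sys.argv[1])  # PID of the Cassandra JVM
interval = 60           # seconds between samples (assumption)

while True:
    try:
        fd_count = len(os.listdir(f"/proc/{pid}/fd"))
    except FileNotFoundError:
        print("process gone")
        break
    print(f"{time.strftime('%H:%M:%S')} open fds: {fd_count}")
    time.sleep(interval)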

According to [~zznate], this issue may be related to CASSANDRA-10342 and CASSANDRA-8641.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)