Posted to user@cassandra.apache.org by Reynald Borer <re...@gmail.com> on 2018/12/11 08:57:50 UTC
Re: 1.2.19: AssertionError when running compactions on a CF with TTLed columns

Hi everyone,

I was finally able to sort out my problem in an "interesting" manner that I
think is worth sharing on the list!

What I did is the following: on each node, I stopped Cassandra, completely
dropped the data files of the column family, started Cassandra again and
issued a repair for this column family.

The process took a while since the cluster has 40 nodes, but once it was
done, the nodes no longer exhibited this assertion error!
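
For anyone wanting to repeat this, the per-node steps above could be
sketched roughly as below. This is only an illustration, not the exact
commands that were run: the data directory, the service commands and the
keyspace / column family names are all assumptions to adapt to your own
setup. The function only builds the list of commands and does not execute
anything.

```python
import shlex
from typing import List


def rebuild_cf_plan(
    keyspace: str,
    column_family: str,
    data_dir: str = "/var/lib/cassandra/data",        # assumption: data directory
    stop_cmd: str = "sudo service cassandra stop",    # assumption: install-specific
    start_cmd: str = "sudo service cassandra start",  # assumption: install-specific
) -> List[str]:
    """Return, in order, the shell commands for one node. Nothing is executed."""
    for name in (keyspace, column_family):
        # Guard against names that would make the delete step hit another path.
        if not name or "/" in name or name in (".", ".."):
            raise ValueError(f"refusing suspicious name: {name!r}")
    # Assumes the <data_dir>/<keyspace>/<column_family>/ on-disk layout.
    cf_dir = f"{data_dir.rstrip('/')}/{keyspace}/{column_family}"
    return [
        stop_cmd,
        # Delete only the regular files directly inside the column family directory.
        f"find {shlex.quote(cf_dir)} -maxdepth 1 -type f -delete",
        start_cmd,
        # Repair only this column family so its data is streamed back from replicas.
        f"nodetool repair {shlex.quote(keyspace)} {shlex.quote(column_family)}",
    ]


if __name__ == "__main__":
    # "my_keyspace" and "my_cf" are placeholder names.
    for step in rebuild_cf_plan("my_keyspace", "my_cf"):
        print(step)
```

Printing the plan first and reviewing it before running anything by hand
seems safer than automating the delete step directly.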

I believe the error was triggered by my tweaking of the
"sstable_size_in_mb" parameter: I somehow ended up with data files of
different sizes, and that confused Cassandra.
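
To check whether a node is in a similar state, something like the
following could be used to summarise the sizes of the data files of one
column family. It is a rough diagnostic sketch only: the directory path is
an assumption, and it relies on data files having names ending in
"-Data.db". Some spread in sizes is expected with leveled compaction
(freshly flushed files are smaller), so the output is a hint, not a proof.

```python
import os
from collections import Counter
from typing import Dict


def data_file_sizes_mb(cf_dir: str) -> Dict[str, float]:
    """Map each *-Data.db file in cf_dir to its size in MB."""
    sizes = {}
    for name in sorted(os.listdir(cf_dir)):
        if name.endswith("-Data.db"):
            path = os.path.join(cf_dir, name)
            sizes[name] = os.path.getsize(path) / (1024 * 1024)
    return sizes


def size_buckets(sizes_mb: Dict[str, float], bucket_mb: int = 5) -> Counter:
    """Count files per size bucket; each key is the lower bound of a bucket in MB."""
    return Counter(int(size // bucket_mb) * bucket_mb for size in sizes_mb.values())


if __name__ == "__main__":
    # Placeholder path: adjust to your data directory, keyspace and column family.
    cf_dir = "/var/lib/cassandra/data/my_keyspace/my_cf"
    buckets = size_buckets(data_file_sizes_mb(cf_dir))
    for lower_bound in sorted(buckets):
        print(f"{lower_bound:>6} MB and up: {buckets[lower_bound]} file(s)")
```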

So, problem solved now :-)

Cheers,
Reynald


On Fri, Aug 31, 2018 at 7:45 AM Reynald Borer <re...@gmail.com>
wrote:

> Hi everyone,
>
> I'm running a Cassandra 1.2.19 cluster of 40 nodes and compactions of a
> specific column family are sporadically raising an AssertionError like this
> (full stack trace visible under
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a):
>
> ERROR [CompactionExecutor:9137] 2018-08-27 11:43:05,197 org.apache.cassandra.service.CassandraDaemon - Exception in thread Thread[CompactionExecutor:9137,1,main]
> java.lang.AssertionError: 2
>     at org.apache.cassandra.db.compaction.LeveledManifest.replace(LeveledManifest.java:267)
>
> The data written in this column family can be seen as wide rows, that is,
> rows with lots of columns. Each column has a TTL of 7 days though.
>
> Whenever this happens, it seems to block compactions of this column family
> (I see the pending compactions increasing) until I restart the failing node.
>
> I have searched JIRA and this mailing list for this issue without much
> luck. I suspect it may be related to
> https://issues.apache.org/jira/browse/CASSANDRA-6563 although it's hard
> for me to confirm.
>
> I know this version is pretty old, but does this issue ring a bell for
> any of you?
>
> Here are some more details about my cluster:
>
> - it is composed of 40 nodes
> - it is pretty old and I'm in the process of upgrading it; it previously
> ran without issues under versions 1.0.12 & 1.1.12
> - it really affects only a single column family (schema can be seen on
> https://gist.github.com/rborer/46862d6d693c0163aa8fe0e74caa2d9a#file-schema-txt
> )
> - my cluster is set up with RandomPartitioner (inherited from when it was
> set up on version 0.7) and a replication factor of 3
> - it's running weekly repairs (and this assertion happens mostly during
> repairs)
> - what I also noted is that since the cluster was upgraded to 1.2.19 the
> disk size of this column family keeps increasing (it went from 400G to
> 1.2T!)
>
> Thanks in advance for your help.
>
> Best regards,
> Reynald