Posted to commits@cassandra.apache.org by "Avraham Kalvo (JIRA)" <ji...@apache.org> on 2019/06/10 05:36:00 UTC
[jira] [Created] (CASSANDRA-15152) Batch Log - Mutation too large while bootstrapping a newly added node

Avraham Kalvo created CASSANDRA-15152:
-----------------------------------------

             Summary: Batch Log - Mutation too large while bootstrapping a newly added node
                 Key: CASSANDRA-15152
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15152
             Project: Cassandra
          Issue Type: Bug
          Components: Consistency/Batch Log
            Reporter: Avraham Kalvo


While scaling our six-node cluster out by three more nodes, we hit a case in which the bootstrap of the third node appears hung in the `UJ` (Up/Joining) state. The two nodes added before it joined within approximately 2.5 hours.

The logs show the following error, which first appeared shortly after the bootstrap process commenced for this node:
```
ERROR [BatchlogTasks:1] 2019-06-05 14:43:46,508 CassandraDaemon.java:207 - Exception in thread Thread[BatchlogTasks:1,5,main]
java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large for the maximum size of 16777216
        at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:256) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:520) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Keyspace.applyNotDeferrable(Keyspace.java:399) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:213) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendSingleReplayMutation(BatchlogManager.java:427) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendReplays(BatchlogManager.java:402) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.replay(BatchlogManager.java:318) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager.processBatchlogEntries(BatchlogManager.java:238) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager.replayFailedBatches(BatchlogManager.java:207) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_201]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_201]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_201]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_201]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
```
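To put the two numbers in the exception message in perspective, here is a minimal sketch that extracts them and compares them. The regular expression is written against the message text shown in the trace above; the function name is hypothetical. Note that 16777216 bytes is exactly 16 MiB, which would be consistent with a limit of half a 32 MB commit log segment (`commitlog_segment_size_in_mb`), but that is an assumption about this node's configuration, not something the log confirms.

```python
import re

# Matches the message format seen in the stack trace above.
MUTATION_TOO_LARGE = re.compile(
    r"Mutation of (\d+) bytes is too large for the maximum size of (\d+)"
)


def parse_mutation_error(line):
    """Return (mutation_bytes, limit_bytes), or None if the line does not match."""
    match = MUTATION_TOO_LARGE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The exception line from the log above.
line = (
    "java.lang.IllegalArgumentException: Mutation of 108035175 bytes "
    "is too large for the maximum size of 16777216"
)

size, limit = parse_mutation_error(line)
MIB = 1024 * 1024
print(
    f"mutation: {size / MIB:.1f} MiB, "
    f"limit: {limit / MIB:.0f} MiB, "
    f"ratio: {size / limit:.2f}x"
)
# prints: mutation: 103.0 MiB, limit: 16 MiB, ratio: 6.44x
```

In other words, the batch being replayed is roughly 6.4 times larger than what the commit log will accept, so raising the limit slightly would not be enough to let this particular mutation through.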

The same error has been repeating in the logs ever since.

We decided to discard the newly added, apparently still joining node by doing the following:
1. First, we simply restarted it, after which it started up apparently normally.
2. Then we decommissioned it with `nodetool decommission`. This took a long time (over 2.5 hours), and we eventually terminated it by issuing `nodetool removenode`.
3. The node removal hung on a specific token, so we completed it by force.
4. Forcing the node removal corrupted one of the `system.batches` table's SSTables, which we removed from its data directory (after backing it up; 78MB worth) as mitigation.
5. A cluster-wide repair was run.
6. The `Mutation too large` error is now repeating in three different variants (different reported sizes) on three different nodes (our standard replication factor is three).
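To confirm how many distinct oversized mutations a node keeps retrying, the reported sizes can be tallied from its log. The following is a minimal sketch, assuming the message format from the stack trace above; the function name is hypothetical, and apart from the first line, the sample sizes are made up for illustration and are not taken from our logs.

```python
import re
from collections import Counter

# Same message format as in the exception above.
PATTERN = re.compile(
    r"Mutation of (\d+) bytes is too large for the maximum size of (\d+)"
)


def tally_oversized_mutations(lines):
    """Count how often each distinct oversized mutation size is reported."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative input: the first size is from the log above, the second
# size (20000000) is a made-up placeholder.
sample = [
    "Mutation of 108035175 bytes is too large for the maximum size of 16777216",
    "Mutation of 108035175 bytes is too large for the maximum size of 16777216",
    "Mutation of 20000000 bytes is too large for the maximum size of 16777216",
    "INFO some unrelated log line",
]

for size, count in tally_oversized_mutations(sample).most_common():
    print(f"{size} bytes reported {count} time(s)")
```

Because the function accepts any iterable of lines, it can be pointed at an open log file on each node (the path depends on the installation). A size that repeats unchanged on every replay cycle suggests the same stored batch is being retried rather than new oversized writes arriving.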

We're not sure whether we're hitting https://issues.apache.org/jira/browse/CASSANDRA-11670, as it is said to be resolved in our current version, 3.0.10.
We would still like to verify the root cause, as we need to know whether to expect this in production environments.

How would you recommend verifying which keyspace.table this mutation belongs to?

Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)