Posted to commits@cassandra.apache.org by "Avraham Kalvo (JIRA)" <ji...@apache.org> on 2019/06/10 06:05:00 UTC
[jira] [Commented] (CASSANDRA-15152) Batch Log - Mutation too large while bootstrapping a newly added node

    [ https://issues.apache.org/jira/browse/CASSANDRA-15152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859724#comment-16859724 ] 

Avraham Kalvo commented on CASSANDRA-15152:
-------------------------------------------

Switching the log level to TRACE revealed the following line, logged just before the error we're getting:
`TRACE [BatchlogTasks:1] 2019-06-10 05:45:40,251 BatchlogManager.java:309 - Replaying batch 5694cca0-8834-11e9-b262-b3ace0831935`
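
The batch ID in this line has the shape of a version-1 (time-based) UUID, so its embedded timestamp can hint at when the batch was created. The following is a minimal sketch using only the Python standard library; the helper name `timeuuid_to_datetime` is hypothetical, and it assumes the ID really is a time-based UUID whose timestamp reflects the batch creation time.

```python
import uuid
from datetime import datetime, timezone

# Batch ID copied from the TRACE line above.
BATCH_ID = "5694cca0-8834-11e9-b262-b3ace0831935"

# Number of 100 ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value):
    """Return the UTC timestamp embedded in a version-1 UUID string."""
    parsed = uuid.UUID(value)
    if parsed.version != 1:
        # Only version-1 UUIDs carry a timestamp; anything else is rejected.
        raise ValueError("not a time-based (version 1) UUID: " + value)
    unix_seconds = (parsed.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


written_at = timeuuid_to_datetime(BATCH_ID)
print("Batch", BATCH_ID, "was created around", written_at.isoformat())
```

Comparing that timestamp with the application's write activity may help narrow down which workload produced the oversized batch.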

How should one query the `system.batches` table to see the actual list of mutations (blob to text? casting?)
Would this table reveal the exact keyspace.table the mutations relate to? Thanks.



> Batch Log - Mutation too large while bootstrapping a newly added node
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-15152
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15152
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Batch Log
>            Reporter: Avraham Kalvo
>            Priority: Normal
>
> While scaling our six-node cluster by three more nodes, we encountered behavior in which bootstrap appears to hang in the `UJ` state (the two nodes added previously joined within approximately 2.5 hours).
> Examining the logs, the following became apparent shortly after the bootstrap process commenced for this node:
> ```
> ERROR [BatchlogTasks:1] 2019-06-05 14:43:46,508 CassandraDaemon.java:207 - Exception in thread Thread[BatchlogTasks:1,5,main]
> java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large for the maximum size of 16777216
>         at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:256) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:520) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.db.Keyspace.applyNotDeferrable(Keyspace.java:399) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.db.Mutation.apply(Mutation.java:213) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendSingleReplayMutation(BatchlogManager.java:427) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendReplays(BatchlogManager.java:402) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.replay(BatchlogManager.java:318) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.batchlog.BatchlogManager.processBatchlogEntries(BatchlogManager.java:238) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.batchlog.BatchlogManager.replayFailedBatches(BatchlogManager.java:207) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-3.0.10.jar:3.0.10]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_201]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_201]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_201]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_201]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_201]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_201]
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> ```
> The error has been repeating in the logs ever since.
> We decided to discard the newly added, apparently still joining node by doing the following:
> 1. First, we simply restarted it, which resulted in it starting up apparently normally.
> 2. Then we decommissioned it by issuing `nodetool decommission`; this took a long time (over 2.5 hours) and was eventually terminated by issuing `nodetool removenode`.
> 3. The node removal hung on a specific token, which led us to complete it by force.
> 4. Forcing the node removal corrupted one of the `system.batches` table's SSTables, which we removed from its data directory (after backing it up; 78MB worth) as mitigation.
> 5. A cluster-wide repair was run.
> 6. The `Mutation too large` error now repeats in three different variants (different reported sizes) on three different nodes (our standard replication factor is three).
> We're not sure whether we're hitting https://issues.apache.org/jira/browse/CASSANDRA-11670, as it's said to be resolved in our current version, 3.0.10.
> We would still like to verify the root cause, as we need to know whether to expect this in production environments.
> How would you recommend verifying which keyspace.table this mutation belongs to?
> Thanks.
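
To put the sizes reported in the quoted stack trace in perspective, the numbers can be pulled out of the exception message and compared. This is only a sketch: the helper name `parse_mutation_error` is hypothetical, and the comment linking the 16777216-byte limit to half of a 32 MB commit log segment is an assumption about the configuration in use, not something the log confirms.

```python
import re

# Exception message copied from the stack trace in the issue description.
ERROR_LINE = (
    "java.lang.IllegalArgumentException: Mutation of 108035175 bytes "
    "is too large for the maximum size of 16777216"
)

PATTERN = re.compile(
    r"Mutation of (\d+) bytes is too large for the maximum size of (\d+)"
)

MIB = 1024 * 1024


def parse_mutation_error(line):
    """Return (mutation_bytes, limit_bytes), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


actual, limit = parse_mutation_error(ERROR_LINE)
ratio = actual / limit

# Assumption: a 16 MiB limit would match half of a 32 MB commit log segment;
# check the node's cassandra.yaml before relying on this.
print(f"mutation: {actual / MIB:.2f} MiB, limit: {limit / MIB:.0f} MiB")
print(f"the mutation is {ratio:.2f}x the allowed maximum")
```

Running the same parser over the logs of the three affected nodes would show whether the three reported sizes are stable over time or still changing.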



