Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2013/01/15 15:50:13 UTC

[jira] [Commented] (CASSANDRA-5158) Compaction fails after hours of frequent updates

    [ https://issues.apache.org/jira/browse/CASSANDRA-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553852#comment-13553852 ] 

Jonathan Ellis commented on CASSANDRA-5158:
-------------------------------------------

This sounds like a SAN I/O virtualization bug. Recommend switching to direct-attached storage. Could also be bad RAM.

If scrub is erroring out, we might be able to address that. But if it's just saying "yeah, this is too corrupt to recover," then it's probably right.
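
For reference, scrub can be run per column family with nodetool; the keyspace and column family names below are placeholders:

    $ nodetool -h localhost scrub <keyspace> <column_family>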

                
> Compaction fails after hours of frequent updates
> ------------------------------------------------
>
>                 Key: CASSANDRA-5158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5158
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.8
>         Environment: Virtualized Ubuntu 12.04 (linux 2.6.18-274.3.1.el5), 
> SAN disk, 
> tested with Sun JRE 1.6.0_24 and 1.6.0_38-b05, 
> 8 GB of RAM
>            Reporter: Tommi Koivula
>
> Data corruption occurs in one of our customers' environments after ~10 hours of running under a constant update load (1-2 writes/sec). Compaction of one of the column families starts failing after about 10 hours, and scrub is not able to fix it either. Writes do not produce any errors, and compaction of the other column families works fine. We are not doing any reads at this stage, and each test run starts from a fresh Cassandra instance (no data). After the failure we have tried restarting Cassandra, which succeeds, and retrieving all keys, which fails with a similar error.
> The Cassandra configuration is the default except for directory locations. The number of rows in the table stays below 100k, since our application purges old data on an hourly basis. The problematic table is very simple, with only one column:
>    CREATE TABLE xxx ( KEY uuid PRIMARY KEY, value text) WITH gc_grace_seconds=0;
> We have tried enabling and disabling JNA with no effect. The JRE was also upgraded to 1.6.0_38-b05 with no change.
> The problem reproduces every time at some point, but we haven't been able to reproduce it faster, so it seems to require several hours of updates. We had this problem with Cassandra 1.1.0 and upgraded to 1.1.8, but the upgrade didn't fix it. The cluster has only one node, so the application suffers total data loss.
> The error message varies a bit between runs; here are two, produced by two different Cassandra 1.1.8 runs:
> [CompactionExecutor:15]|||Exception in thread Thread[CompactionExecutor:15,1,main] java.lang.NegativeArraySizeException: null
>         at org.apache.cassandra.io.sstable.IndexHelper.skipBloomFilter(IndexHelper.java:57)
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:144)
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:86)
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:70)
>         at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:189)
>         at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:153)
>         at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:145)
>         at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:40)
>         at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:149)
>         at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
>         at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
>         at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
>         at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
>         at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662) [na:1.6.0_38]
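>
> To put the first trace in context: it is consistent with a corrupted 4-byte length prefix being read ahead of a per-row bloom filter, which can decode to a negative int and fail on allocation. A minimal sketch of that failure mode (illustrative only, not the actual Cassandra source; the class and method names are made up):
>
>    import java.io.ByteArrayInputStream;
>    import java.io.DataInput;
>    import java.io.DataInputStream;
>    import java.io.IOException;
>
>    public class CorruptLengthDemo {
>        // Mimics deserializing a length-prefixed bloom filter: a corrupted
>        // prefix decodes to a negative int and the array allocation fails.
>        static void readBloomFilterLike(DataInput in) throws IOException {
>            int size = in.readInt();        // corrupted bytes can yield size < 0
>            byte[] filter = new byte[size]; // NegativeArraySizeException if size < 0
>            in.readFully(filter);
>        }
>
>        public static void main(String[] args) throws IOException {
>            // These four bytes decode (big-endian) to -793390915, the
>            // negative size reported in the second log below.
>            byte[] corrupted = { (byte) 0xD0, (byte) 0xB5, (byte) 0xD0, (byte) 0xBD };
>            try {
>                readBloomFilterLike(new DataInputStream(new ByteArrayInputStream(corrupted)));
>            } catch (NegativeArraySizeException e) {
>                System.out.println("negative filter length from corrupt data");
>            }
>        }
>    }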
> [CompactionExecutor:127]|org.apache.cassandra.utils.OutputHandler$LogOutput||Non-fatal error reading row (stacktrace follows)
> java.io.IOError: java.io.EOFException: bloom filter claims to be -793390915 bytes, longer than entire row size -3407588055136546636
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:156)
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:86)
>         at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:170)
>         at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:496)
>         at org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:485)
>         at org.apache.cassandra.db.compaction.CompactionManager.access$300(CompactionManager.java:69)
>         at org.apache.cassandra.db.compaction.CompactionManager$4.perform(CompactionManager.java:235)
>         at org.apache.cassandra.db.compaction.CompactionManager$3.call(CompactionManager.java:205)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662) [na:1.6.0_38]
> Caused by: java.io.EOFException: bloom filter claims to be -793390915 bytes, longer than entire row size -3407588055136546636
>         at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:129)
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:121)
>         ... 12 common frames omitted
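>
> Both errors point at the same kind of corrupted length prefix; which exception surfaces just depends on whether a size sanity check runs before the allocation. A sketch of such a guard, matching the message in the second trace (illustrative; the real check may differ):
>
>    import java.io.EOFException;
>
>    public class FilterSizeGuard {
>        // Reject a claimed bloom filter size that is negative or larger
>        // than the bytes remaining in the row.
>        static void checkFilterSize(int claimedSize, long remainingRowBytes) throws EOFException {
>            if (claimedSize <= 0 || claimedSize > remainingRowBytes)
>                throw new EOFException("bloom filter claims to be " + claimedSize
>                        + " bytes, longer than entire row size " + remainingRowBytes);
>        }
>    }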
> The data directory for the table looks like this:
> $ ls -l data/p/m
> total 100644
> -rw-r--r-- 1 user group     3342 Jan 14 14:08 p-m_nodes-hf-1-CompressionInfo.db
> -rw-r--r-- 1 user group  8809019 Jan 14 14:08 p-m_nodes-hf-1-Data.db
> -rw-r--r-- 1 user group   155632 Jan 14 14:08 p-m_nodes-hf-1-Filter.db
> -rw-r--r-- 1 user group 11441829 Jan 14 14:08 p-m_nodes-hf-1-Index.db
> -rw-r--r-- 1 user group     4340 Jan 14 14:08 p-m_nodes-hf-1-Statistics.db
> -rw-r--r-- 1 user group     3310 Jan 14 17:06 p-m_nodes-hf-2-CompressionInfo.db
> -rw-r--r-- 1 user group  8948738 Jan 14 17:06 p-m_nodes-hf-2-Data.db
> -rw-r--r-- 1 user group   138992 Jan 14 17:06 p-m_nodes-hf-2-Filter.db
> -rw-r--r-- 1 user group 11531149 Jan 14 17:06 p-m_nodes-hf-2-Index.db
> -rw-r--r-- 1 user group     4340 Jan 14 17:06 p-m_nodes-hf-2-Statistics.db
> -rw-r--r-- 1 user group     3270 Jan 14 21:19 p-m_nodes-hf-3-CompressionInfo.db
> -rw-r--r-- 1 user group  9246229 Jan 14 21:19 p-m_nodes-hf-3-Data.db
> -rw-r--r-- 1 user group   119776 Jan 14 21:19 p-m_nodes-hf-3-Filter.db
> -rw-r--r-- 1 user group 11633474 Jan 14 21:19 p-m_nodes-hf-3-Index.db
> -rw-r--r-- 1 user group     4340 Jan 14 21:19 p-m_nodes-hf-3-Statistics.db
> -rw-r--r-- 1 user group     3350 Jan 15 06:35 p-m_nodes-hf-4-CompressionInfo.db
> -rw-r--r-- 1 user group  8766723 Jan 15 06:35 p-m_nodes-hf-4-Data.db
> -rw-r--r-- 1 user group   158792 Jan 15 06:35 p-m_nodes-hf-4-Filter.db
> -rw-r--r-- 1 user group 11425708 Jan 15 06:35 p-m_nodes-hf-4-Index.db
> -rw-r--r-- 1 user group     4340 Jan 15 06:35 p-m_nodes-hf-4-Statistics.db
> -rw-r--r-- 1 user group     3318 Jan 15 10:27 p-m_nodes-hf-6-CompressionInfo.db
> -rw-r--r-- 1 user group  8748453 Jan 15 10:27 p-m_nodes-hf-6-Data.db
> -rw-r--r-- 1 user group   141904 Jan 15 10:27 p-m_nodes-hf-6-Filter.db
> -rw-r--r-- 1 user group 11518264 Jan 15 10:27 p-m_nodes-hf-6-Index.db
> -rw-r--r-- 1 user group     4340 Jan 15 10:27 p-m_nodes-hf-6-Statistics.db
> Commit log directory:
> $ ls -l commitlog/
> total 164012
> -rw-r--r-- 1 user group 33554432 Jan 14 12:44 CommitLog-1358164638228.log
> -rw-r--r-- 1 user group 33554432 Jan 15 10:50 CommitLog-1358164638237.log
> -rw-r--r-- 1 user group 33554432 Jan 15 11:59 CommitLog-1358164638238.log
> -rw-r--r-- 1 user group 33554432 Jan 15 10:27 CommitLog-1358164638239.log
> -rw-r--r-- 1 user group 33554432 Jan 15 11:48 CommitLog-1358164638240.log
