Posted to commits@cassandra.apache.org by "Terry Cumaranatunge (Commented) (JIRA)" <ji...@apache.org> on 2012/01/10 16:12:39 UTC

[jira] [Commented] (CASSANDRA-3468) SStable data corruption in 1.0.x

    [ https://issues.apache.org/jira/browse/CASSANDRA-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183318#comment-13183318 ] 

Terry Cumaranatunge commented on CASSANDRA-3468:
------------------------------------------------

This is an update on the ticket based on an investigation that was performed with Sylvain.

The summary is that this problem is reproducible under high load if you use JNA. If you don't use JNA, the problem does not occur. It is difficult to reproduce with JNA if the load is not high. The problem also happens more frequently as the number of nodes in the cluster increases. When you run with a 30-node cluster under high load, it can be reproduced in a few hours. It can take days or weeks when you have only 8 nodes. Sylvain's analysis confirms this is not a disk corruption issue.

------------------
Hello Terry,

Sorry for the late update.

Since the log message did appear with the patch, it means the corruption is not some disk bitrot, but rather something weird happening to the data between the time we compute the sha1 and the time the file is fully written. And from what you say, it seems clear that it is related to JNA somehow. I have no clue what could be causing this so far, but it does help narrow it down. The weird part is that, as far as I know, no one else has reported this issue, while you seem to be able to reproduce it fairly easily.

In any case, I have a lot on my plate at the moment and frankly not too many ideas about what the problem could be, but I'm not forgetting you.
It wouldn't hurt, however, if you updated the JIRA ticket you've opened with all the new information we have, as that will make sure more eyes are on the issue.

--
Sylvain


On Wed, Jan 4, 2012 at 1:16 PM, Terry Cumaranatunge <Te...@infinite.com> wrote:
> Sylvain,
>
> The data corruption occurred with the patch you provided and it did produce the error messages you were expecting. Below is some additional data for this problem.
>
> Based on this data, does it mean Cassandra wrote a corrupted file? It looks that way, based on the code in the patch.
>
> BTW: with JNA turned off, we have had no corruption after more than a week of use. With JNA on, the corruption occurs pretty quickly.
>
> %md5sum /data/data/MSA/modseq-hb-1746-Data.db
> e4314b4cd53922202df644604a48e565  /data/data/MSA/modseq-hb-1746-Data.db
>
> %cat /data/data/MSA/modseq-hb-1746-Digest.sha1
> c658a73eda91879ed46f6532ff45f9903495a58d  modseq-hb-1746-Data.db
>
>
>> -----Original Message-----
>> From: Sylvain Lebresne [mailto:sylvain@datastax.com]
>> Sent: Tuesday, December 20, 2011 5:14 AM
>> To: Terry Cumaranatunge
>> Cc: Ramesh Natarajan
>> Subject: Re: Consultation session
>>
>> Hello Terry,
>>
>> As discussed yesterday, I'm attaching a patch that checks the sha1 sum
>> just after an sstable has been synced to disk and closed. It should apply
>> on top of the current 1.0 branch of Cassandra, but probably against
>> slightly older versions too. If you would rather I send you a jar of a
>> specific version with this patch included, I can do that too.
>>
>> Once the patch is applied and you run into the corruption situation again,
>> look in the log for an ERROR message like:
>> "File checksum mismatch for <filename> before rename: expecting
>> <someValue> but got <someOtherValue>"
>>
>> You should actually see 2 messages for the corrupted file, one with the
>> string 'before rename' and one with 'after rename'. The reason is that
>> when we write an sstable, we write it under a temporary name (something
>> like Standard1-tmp-hb-12-Data.db), and once it is fully written, we rename
>> it to Standard1-hb-12-Data.db.
>> Instead of choosing between placing my sha1 check before the rename or
>> after the rename, I've included both checks. In theory, the rename should
>> not change anything in the file content, so you should see either both
>> checks fail or neither of them. But if you see only one of those two
>> messages, that would be pretty interesting information.
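
For illustration only (not Sylvain's actual patch): a minimal standalone sketch of the kind of check described above, comparing a file's SHA-1 against the hex digest recorded in its -Digest.sha1 companion file, once before and once after the rename. All class and argument names here are hypothetical.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;

    public class SSTableDigestCheck
    {
        // Stream the file through MessageDigest and return the hex SHA-1.
        static String sha1Hex(File file) throws Exception
        {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            InputStream in = new FileInputStream(file);
            try
            {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0)
                    md.update(buf, 0, n);
            }
            finally
            {
                in.close();
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest())
                sb.append(String.format("%02x", b & 0xff));
            return sb.toString();
        }

        // args: <tmp sstable> <final sstable name> <Digest.sha1 file>
        public static void main(String[] args) throws Exception
        {
            File tmp = new File(args[0]);      // e.g. Standard1-tmp-hb-12-Data.db
            File fin = new File(args[1]);      // e.g. Standard1-hb-12-Data.db
            File digest = new File(args[2]);   // e.g. Standard1-hb-12-Digest.sha1

            // Digest file format: "<hex sha1>  <file name>"
            BufferedReader r = new BufferedReader(new FileReader(digest));
            String expected = r.readLine().split(" ")[0];
            r.close();

            if (!expected.equals(sha1Hex(tmp)))
                System.err.println("File checksum mismatch for " + tmp + " before rename");

            if (!tmp.renameTo(fin))
                throw new IOException("rename of " + tmp + " failed");

            if (!expected.equals(sha1Hex(fin)))
                System.err.println("File checksum mismatch for " + fin + " after rename");
        }
    }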
>>
>> For the other things we talked yesterday:
>> - You can change the log level through JMX using
>> org.apache.cassandra.db.StorageService:setLog4jLevel. The method takes
>> two arguments, the first being a logger specifier and the second the new
>> level to set. To turn everything to debug you'll want to call
>> setLog4jLevel("org", "DEBUG").
>> - I also confirm that the commit log uses the trick of not polluting
>> the OS cache (if JNA is included).
>> - For the JNA version, our Datastax Community and Server products
>> currently use version 3.3.0 (the latest one, I believe).
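
For illustration, a minimal standalone JMX client for the first point above, assuming the default JMX port 7199 on localhost and the usual org.apache.cassandra.db:type=StorageService object name (adjust both if your setup differs):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SetLogLevel
    {
        public static void main(String[] args) throws Exception
        {
            // Assumes Cassandra's JMX agent is listening on the default port 7199.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try
            {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Assumed MBean name for StorageService in the 1.0 line.
                ObjectName storageService =
                        new ObjectName("org.apache.cassandra.db:type=StorageService");
                // First argument: logger specifier; second argument: new level.
                conn.invoke(storageService,
                            "setLog4jLevel",
                            new Object[]{ "org", "DEBUG" },
                            new String[]{ "java.lang.String", "java.lang.String" });
            }
            finally
            {
                connector.close();
            }
        }
    }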
>>
>>
>> For the hints CF exception you got, it looks like
>> https://issues.apache.org/jira/browse/CASSANDRA-3579. If you don't mind,
>> it would be useful if you could add a comment to the ticket saying that
>> you're running into this too (confirming this is a widespread problem) and
>> quickly describe the associated gossip problem (this doesn't really sound
>> related a priori, but I'm not the expert on Gossip, so someone else may
>> think of some link between the two).
>>
>> --
>> Sylvain
>>
>>
>> On Tue, Dec 20, 2011 at 4:59 AM, Terry Cumaranatunge 
>> <Te...@infinite.com> wrote:
>> > Sylvain,
>> >
>> > Remember that problem I described where the Gossip detects that the
>> > node is up and down constantly without any reads or writes to Cassandra?
>> > It happened again today. I see these exceptions happening on several of
>> > the nodes at the same time. It looks like the hints CF has some column
>> > size corruption. I know for sure that this is not a disk corruption,
>> > because I have so much cache at the moment that after the change to
>> > remove JNA, no disk reads have occurred, as confirmed by iostat --
>> > everything is in cache. So this seems to suggest some type of Cassandra
>> > hints CF corruption. I don't know which happened first: Cassandra being
>> > unresponsive or the hints column family error below.
>> >
>> > Restarting Cassandra fixed the problem. I've attached the portion of the
>> > stack trace for the hints CF exception.
>> >
>> >
>> >  INFO [CompactionExecutor:648] 2011-12-19 20:41:17,399 CompactionController.java (line 133) Compacting large row system/HintsColumnFamily:77777777777777777777777777777770 (73427721 bytes) incrementally
>> >  INFO [FlushWriter:99] 2011-12-19 20:41:17,410 Memtable.java (line 275) Completed flushing /data/data/system/HintsColumnFamily-hb-141-Data.db (3022 bytes)
>> > ERROR [CompactionExecutor:648] 2011-12-19 20:41:19,445 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[CompactionExecutor:648,1,main]
>> > java.lang.AssertionError: originally calculated column size of 66102951 but now it is 66009419
>> >        at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:124)
>> >        at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
>> >        at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:158)
>> >        at org.apache.cassandra.db.compaction.CompactionManager$6.call(CompactionManager.java:275)
>> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >        at java.lang.Thread.run(Thread.java:662)
>> > ERROR [HintedHandoff:1] 2011-12-19 20:41:19,445 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[HintedHandoff:1,1,main]
>> > java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 66102951 but now it is 66009419
>> >        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >        at java.lang.Thread.run(Thread.java:662)
>> > Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 66102951 but now it is 66009419
>> >        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:330)
>> >        at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:81)
>> >        at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:353)
>> >        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>> >        ... 3 more
>> > Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError: originally calculated column size of 66102951 but now it is 66009419
>> >        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>> >        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>> >        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:326)
>> >        ... 6 more
>> > Caused by: java.lang.AssertionError: originally calculated column size of 66102951 but now it is 66009419
>> >        at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:124)
>> >        at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
>> >        at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:158)
>> >        at org.apache.cassandra.db.compaction.CompactionManager$6.call(CompactionManager.java:275)
>> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>> >        ... 3 more
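
To make the assertion above easier to read: it comes from a two-pass write, where the row's column size is computed first and the bytes actually serialized are then checked against it. A rough, hypothetical sketch of that pattern (not the real LazilyCompactedRow code):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;

    class TwoPassRowWriter
    {
        // Hypothetical column abstraction: knows its serialized size and how to serialize itself.
        interface Column
        {
            int serializedSize();
            void serialize(DataOutputStream out) throws IOException;
        }

        void write(List<Column> columns, DataOutputStream out) throws IOException
        {
            // Pass 1: compute the size we expect to write and record it up front.
            long expectedSize = 0;
            for (Column c : columns)
                expectedSize += c.serializedSize();
            out.writeLong(expectedSize);

            // Pass 2: serialize for real, counting the bytes actually produced.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream counting = new DataOutputStream(buffer);
            for (Column c : columns)
                c.serialize(counting);
            counting.flush();

            // If the underlying data changed between the two passes, the sizes disagree.
            assert buffer.size() == expectedSize
                 : "originally calculated column size of " + expectedSize
                   + " but now it is " + buffer.size();

            buffer.writeTo(out);
        }
    }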
>> >
>> >> -----Original Message-----
>> >> From: Terry Cumaranatunge
>> >> Sent: Monday, December 19, 2011 8:58 AM
>> >> To: 'Sylvain Lebresne'
>> >> Cc: Ramesh Natarajan
>> >> Subject: RE: Consultation session
>> >>
>> >> I have this GoToMeeting session in case we need to use it.
>> >>
>> >> 1.  Please join my meeting.
>> >> https://www1.gotomeeting.com/join/893565288
>> >>
>> >> Meeting ID: 893-565-288
>> >>
>> >> GoToMeeting(r)
>> >> Online Meetings Made Easy(tm)
>> >>
>> >> > -----Original Message-----
>> >> > From: Terry Cumaranatunge
>> >> > Sent: Sunday, December 18, 2011 9:15 PM
>> >> > To: 'Sylvain Lebresne'
>> >> > Cc: Ramesh Natarajan
>> >> > Subject: RE: Consultation session
>> >> >
>> >> > Sylvain,
>> >> >
>> >> > An update to the previous e-mail.
>> >> >
>> >> > We ran a test with disk_access_mode changed to index_only, but the
>> >> > problem still occurred. We then started a test with JNA removed, which
>> >> > removed the use of direct I/O. That test, started on Saturday morning,
>> >> > has been running over 36 hours without problems.
>> >> >
>> >> > In our previous tests with direct I/O, the longest we ran was 26 hours
>> >> > before a corruption occurred. That was with RHEL 5.6. With RHEL 6.0 or
>> >> > 6.1, our longest test was probably under 6 hours before corruption was
>> >> > noticed. The current test running over 36 hours without direct I/O is
>> >> > with RHEL 6.0.
>> >> >
>> >> > So, the OS version seems to influence how long it takes to hit the
>> >> > problem when using direct I/O, while not using direct I/O has so far
>> >> > not produced the problem. I have now analyzed over 5 cases of
>> >> > corruption carefully, and in every case the common pattern is that a
>> >> > compaction occurs, then a read hits the same sstable created by that
>> >> > compaction 3-40 seconds later, and corruption is seen. None of the
>> >> > corruptions are related to an sstable created by flushing a memtable.
>> >> >
>> >> > The suspicion here is that, based on current results, there is some
>> >> > interaction between direct I/O and JNA. We can talk more tomorrow.
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: Terry Cumaranatunge
>> >> > > Sent: Friday, December 16, 2011 3:09 PM
>> >> > > To: 'Sylvain Lebresne'
>> >> > > Cc: Matt Kramer; Ramesh Natarajan
>> >> > > Subject: RE: Consultation session
>> >> > >
>> >> > > Hi Sylvain,
>> >> > >
>> >> > > Ok, let's talk on Monday, December 19, 9 a.m. CST.
>> >> > >
>> >> > > The questions around this call are related to data corruption issues
>> >> > > we have been experiencing. We are running ESXi 5.0 with RHEL virtual
>> >> > > machines. We have tried RHEL 5.6, 6.0, and 6.1 and have seen this
>> >> > > problem on all of them. We have even tried going directly to the disk
>> >> > > instead of using the virtual layer. The common symptom across all of
>> >> > > this is that it seems to happen when the I/O latency increases. There
>> >> > > are no SCSI errors or disk timeouts of any kind reported. We are not
>> >> > > seeing this problem in any other use of these systems. The problem has
>> >> > > happened on several hosts and several RAIDs. We suspect there may be
>> >> > > something happening with the combination of bypassing the cache and
>> >> > > sometimes reading the file from cache, and those are part of the
>> >> > > questions here.
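
For background on the cache question above: with JNA present, Cassandra's cache-skipping in this era is, as far as I know, done by advising the kernel with posix_fadvise(POSIX_FADV_DONTNEED) rather than true O_DIRECT. A minimal sketch of that call path follows (hypothetical helper class; the reflective fd lookup is specific to HotSpot-style JVMs):

    import com.sun.jna.Native;

    import java.io.FileDescriptor;
    import java.io.RandomAccessFile;
    import java.lang.reflect.Field;

    public class SkipCache
    {
        // Linux value of POSIX_FADV_DONTNEED.
        private static final int POSIX_FADV_DONTNEED = 4;

        static
        {
            Native.register("c"); // bind the native method below against libc
        }

        // int posix_fadvise(int fd, off_t offset, off_t len, int advice)
        public static native int posix_fadvise(int fd, long offset, long len, int advice);

        // Pull the raw descriptor out of java.io.FileDescriptor (field name is JVM-specific).
        private static int getfd(FileDescriptor descriptor) throws Exception
        {
            Field field = FileDescriptor.class.getDeclaredField("fd");
            field.setAccessible(true);
            return field.getInt(descriptor);
        }

        // Tell the kernel the file's pages are no longer needed in the page cache
        // (offset 0, length 0 means "the whole file").
        public static void dontNeed(RandomAccessFile file) throws Exception
        {
            posix_fadvise(getfd(file.getFD()), 0, 0, POSIX_FADV_DONTNEED);
        }
    }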
>> >> > >
>> >> > > I'm attaching the associated files for the corruption of
>> >> > > participants-hb-5255-*. This is with Leveled compaction, but we do see
>> >> > > the same problem with SizeTiered.
>> >> > >
>> >> > > Conference call details:
>> >> > >
>> >> > > Conference dial-in number: (712) 432-0075
>> >> > > Participant access code: 338759
>> >> > >
>> >> > > > -----Original Message-----
>> >> > > > From: Sylvain Lebresne [mailto:sylvain@datastax.com]
>> >> > > > Sent: Thursday, December 15, 2011 10:28 AM
>> >> > > > To: Terry Cumaranatunge
>> >> > > > Cc: Matt Kramer
>> >> > > > Subject: Re: Consultation session
>> >> > > >
>> >> > > > Hello Terry,
>> >> > > >
>> >> > > > Would you mind if we push that to Monday? I am traveling on 
>> >> > > > Friday and thus won't be available for a call.
>> >> > > >
>> >> > > > --
>> >> > > > Sylvain
>> >> > > >
>> >> > > > On Thu, Dec 15, 2011 at 6:54 AM, Terry Cumaranatunge 
>> >> > > > <Te...@infinite.com> wrote:
>> >> > > > > Can we do a consultation tomorrow with Sylvain for Friday, 
>> >> > > > > December
>> >> > > > > 16 at 9 a.m. CST? I will send questions before the call as usual.
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>

                
> SStable data corruption in 1.0.x
> --------------------------------
>
>                 Key: CASSANDRA-3468
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3468
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>         Environment: RHEL 6 running Cassandra 1.0.x.
>            Reporter: Terry Cumaranatunge
>              Labels: patch
>
> We have noticed several instances of sstable corruption in 1.0.x. This has occurred in 1.0.0-rcx, 1.0.0, and 1.0.1. It has happened on multiple nodes and multiple hosts with different disks, which is why the software is suspected at this time. The file system used is XFS, but no resets or other failure scenarios were run to create the problem. We were basically running under load, and every so often we saw that an sstable got corrupted and compaction stopped on that node.
> I will attach the relevant sstable files if the system lets me do that when I create this ticket.
> ERROR [CompactionExecutor:23] 2011-10-27 11:14:09,309 PrecompactedRow.java (line 119) Skipping row DecoratedKey(128013852116656632841539411062933532114, 37303730303138313533) in /var/lib/cassandra/data/MSA/participants-h-8688-Data.db
> java.io.EOFException
>         at java.io.RandomAccessFile.readFully(RandomAccessFile.java:399)
>         at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
>         at org.apache.cassandra.utils.BytesReadTracker.readFully(BytesReadTracker.java:95)
>         at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:388)
>         at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:350)
>         at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:96)
>         at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:36)
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:143)
>         at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:231)
>         at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:115)
>         at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:102)
>         at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:127)
>         at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:102)
>         at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:87)
>         at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:116)
>         at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
>         at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
>         at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:179)
>         at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:47)
>         at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:131)
>         at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:114)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> This was Sylvain's analysis:
> I don't have much better news. Basically, it seems the last 2 MB of the file are complete garbage (which also explains the mmap error, btw). And given where the corruption actually starts, it suggests that it's either a very low-level bug in our file writer code that starts writing bad data at some point for some reason, or corruption that is not related to Cassandra. Given that, a Cassandra bug sounds fairly unlikely.
> You said that you saw that corruption more than once. Could you be more precise? In particular, did you get it on different hosts? Also, what file system are you using?
> If you do happen to have another instance of a corrupted sstable (ideally from some other host) that you can share, please don't hesitate. I could try to see whether I find something common between the two.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira