Posted to user@cassandra.apache.org by Dan Hendry <da...@gmail.com> on 2011/01/29 20:45:39 UTC

Argh: Data Corruption (LOST DATA) (0.7.0)

I am once again having severe problems with my Cassandra cluster. This time,
I straight up cannot read sections of data (consistency level ONE). Client
side, I am seeing timeout exceptions. On the Cassandra node, I am seeing
errors as shown below. I don't understand what has happened or how to fix
it. I also don't understand how I am seeing errors on only one node, using
consistency level ONE with rf=2, and yet clients are failing. I have tried
turning on debug logging but that has been no help; the logs roll over (20
MB) in < 10 seconds (the cluster is being used quite heavily).
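[In case it helps anyone hitting the same wall: the 20 MB roll-over is set in
conf/log4j-server.properties, and raising the file size and backup count keeps
more debug history around. The appender name below is from a stock 0.7
install, so treat this as a sketch:]

```properties
# Rolling file appender for system.log; larger files and more backups
# keep DEBUG output around long enough to inspect.
log4j.appender.R.maxFileSize=200MB
log4j.appender.R.maxBackupIndex=50
```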

 

My cluster has been working fine for weeks, then suddenly I had a corrupt
SSTable which caused me all sorts of grief (outlined in previous emails). I
was able to solve the problem by turning down the max compaction threshold
then figuring out which SSTable was corrupt by watching which minor
compactions failed. After that, I straight up deleted the on-disk data. Now
I am having problems on a different node (but adjacent in the ring) for what
I am almost certain is the same column family (presumably the same
row/column). At this point, the data is effectively lost as I know 1 of the
2 replicas was completely deleted.
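[The threshold-lowering trick above can be driven from nodetool; a sketch
against 0.7, where the keyspace and CF names are placeholders and the exact
argument order may vary by version (check `nodetool help`):]

```shell
# Lower the max compaction threshold so each minor compaction reads only a
# couple of sstables, making it obvious which file a failing compaction hit.
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 2 3

# Restore the defaults (min 4, max 32) once the bad sstable is found.
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 4 32
```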

 

Is there any advice going forward? My next course of action was going to be
exporting all of the sstables to JSON using the provided tool and trying to
look it over manually to see what the problem actually is (if exporting will
even work). I am not sure how useful this will be as there is nearly 80 GB
of data for this CF on a single node. What is more concerning is that I have
no idea how this problem initially popped up. I have performed hardware
tests and nothing seems to be malfunctioning. Furthermore, the fact that
these issues have 'jumped' nodes is a strong indication to me this is a
Cassandra problem.
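[For reference, the export tool mentioned above is bin/sstable2json, which
dumps one sstable at a time; with ~80 GB in the CF it is more practical to go
file by file than to export everything at once. The data file path below is
made up:]

```shell
# Dump a single suspect sstable to JSON for manual inspection.
bin/sstable2json /var/lib/cassandra/data/MyKeyspace/MyCF-f-123-Data.db > MyCF-123.json
```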

 

There is a Cassandra bug here somewhere, if only in the way corrupt columns
are dealt with. 

 

 

db (85098417 bytes)

ERROR [ReadStage:221] 2011-01-29 12:42:39,153 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
java.lang.RuntimeException: java.io.IOException: Invalid localDeleteTime read: -1516572672
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:124)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:47)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:108)
        at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:283)
        at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
        at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:118)
        at org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(QueryFilter.java:142)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1230)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
        at org.apache.cassandra.db.Table.getRow(Table.java:384)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:63)
        at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Invalid localDeleteTime read: -1516572672
        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:356)
        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:313)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:180)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:119)
        ... 22 more
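[A note on the number in that message: localDeleteTime is read as a signed
32-bit int (seconds since the epoch) and rejected when negative, so any
garbage bytes with the high bit set decode as a large negative value. A quick
illustration in Python; the byte values here are just one pattern that decodes
to this number, not the actual on-disk bytes:]

```python
import struct

# Java's DataInput.readInt() reads 4 bytes, big-endian, signed.
raw = bytes([0xA5, 0x9A, 0xF0, 0x00])  # high bit set -> negative when signed
(local_delete_time,) = struct.unpack(">i", raw)
print(local_delete_time)  # -1516572672, the value from the log
```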

ERROR [ReadStage:210] 2011-01-29 12:42:41,529 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
java.lang.RuntimeException: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:124)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:47)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:108)
        at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:283)
        at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
        at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:118)
        at org.apache.cassandra.db.filter.QueryFilter.collectCollatedColumns(QueryFilter.java:142)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1230)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077)
        at org.apache.cassandra.db.Table.getRow(Table.java:384)
        at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:63)
        at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.cassandra.db.ColumnSerializer$CorruptColumnException: invalid column name length 0
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:68)
        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:364)
        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:313)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:180)
        at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:119)
        ... 22 more

 

Dan Hendry

(403) 660-2297

 


Re: Argh: Data Corruption (LOST DATA) (0.7.0)

Posted by Terje Marthinussen <tm...@gmail.com>.
Hi,

Unfortunately, this patch is already included in the build I have.

Thanks for the suggestion though!
Terje

On Sat, Mar 5, 2011 at 7:47 PM, Sylvain Lebresne <sy...@datastax.com> wrote:

> Also, if you can, please be sure to try the new 0.7.3 release. We had a bug
> with the compaction of superColumns, for instance, that is fixed there (
> https://issues.apache.org/jira/browse/CASSANDRA-2104). It also ships with
> a new scrub command that tries to detect whether your sstables are corrupted
> and repair them in that event. It can be worth trying too.
>
> --
> Sylvain
>
>
> On Fri, Mar 4, 2011 at 7:04 PM, Benjamin Coverston <
> ben.coverston@datastax.com> wrote:
>
>>  Hi Terje,
>>
>> Can you attach the portion of your logs that shows the exceptions
>> indicating corruption? Which version are you on right now?
>>
>> Ben
>>
>>
>> On 3/4/11 10:42 AM, Terje Marthinussen wrote:
>>
>> We are seeing various other messages as well related to deserialization,
>> so this seems to be some random corruption somewhere, but so far it may seem
>> to be limited to supercolumns.
>>
>>  Terje
>>
>>  On Sat, Mar 5, 2011 at 2:26 AM, Terje Marthinussen <
>> tmarthinussen@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>  Did you get anywhere on this problem?
>>>
>>>  I am seeing similar errors unfortunately :(
>>>
>>>  I tried to add some quick error checking to the serialization, and it
>>> seems like the data is ok there.
>>>
>>>  Some indication that this occurs in compaction and maybe in hinted
>>> handoff, but no indication that it occurs for sstables with newly written
>>> data.
>>>
>>>  Terje
>>>
>>
>>
>> --
>> Ben Coverston
>> DataStax -- The Apache Cassandra Company
>> http://www.datastax.com/
>>
>>
>

Re: Argh: Data Corruption (LOST DATA) (0.7.0)

Posted by Sylvain Lebresne <sy...@datastax.com>.
Also, if you can, please be sure to try the new 0.7.3 release. We had a bug
with the compaction of superColumns, for instance, that is fixed there (
https://issues.apache.org/jira/browse/CASSANDRA-2104). It also ships with a
new scrub command that tries to detect whether your sstables are corrupted
and repair them in that event. It can be worth trying too.
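[Scrub in 0.7.3 is invoked through nodetool; a sketch with placeholder
keyspace/CF names, and argument details may vary (check `nodetool help`):]

```shell
# Take a snapshot first, since scrub rewrites sstables in place
# (the argument here is a snapshot name).
nodetool -h localhost snapshot pre-scrub

# Scrub the affected column family; unreadable rows are skipped and logged.
nodetool -h localhost scrub MyKeyspace MyCF
```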

--
Sylvain

On Fri, Mar 4, 2011 at 7:04 PM, Benjamin Coverston <
ben.coverston@datastax.com> wrote:

>  Hi Terje,
>
> Can you attach the portion of your logs that shows the exceptions
> indicating corruption? Which version are you on right now?
>
> Ben
>
>
> On 3/4/11 10:42 AM, Terje Marthinussen wrote:
>
> We are seeing various other messages as well related to deserialization, so
> this seems to be some random corruption somewhere, but so far it may seem to
> be limited to supercolumns.
>
>  Terje
>
>  On Sat, Mar 5, 2011 at 2:26 AM, Terje Marthinussen <
> tmarthinussen@gmail.com> wrote:
>
>> Hi,
>>
>>  Did you get anywhere on this problem?
>>
>>  I am seeing similar errors unfortunately :(
>>
>>  I tried to add some quick error checking to the serialization, and it
>> seems like the data is ok there.
>>
>>  Some indication that this occurs in compaction and maybe in hinted
>> handoff, but no indication that it occurs for sstables with newly written
>> data.
>>
>>  Terje
>>
>
>
> --
> Ben Coverston
> DataStax -- The Apache Cassandra Company
> http://www.datastax.com/
>
>

Re: Argh: Data Corruption (LOST DATA) (0.7.0)

Posted by Benjamin Coverston <be...@datastax.com>.
Hi Terje,

Can you attach the portion of your logs that shows the exceptions 
indicating corruption? Which version are you on right now?

Ben

On 3/4/11 10:42 AM, Terje Marthinussen wrote:
> We are seeing various other messages as well related to 
> deserialization, so this seems to be some random corruption somewhere, 
> but so far it may seem to be limited to supercolumns.
>
> Terje
>
> On Sat, Mar 5, 2011 at 2:26 AM, Terje Marthinussen 
> <tmarthinussen@gmail.com> wrote:
>
>     Hi,
>
>     Did you get anywhere on this problem?
>
>     I am seeing similar errors unfortunately :(
>
>     I tried to add some quick error checking to the serialization, and
>     it seems like the data is ok there.
>
>     Some indication that this occurs in compaction and maybe in hinted
>     handoff, but no indication that it occurs for sstables with newly
>     written data.
>
>     Terje
>
>

-- 
Ben Coverston
DataStax -- The Apache Cassandra Company
http://www.datastax.com/


Re: Argh: Data Corruption (LOST DATA) (0.7.0)

Posted by Terje Marthinussen <tm...@gmail.com>.
We are seeing various other deserialization-related messages as well, so this
seems to be some random corruption somewhere, but so far it seems to be
limited to supercolumns.

Terje

On Sat, Mar 5, 2011 at 2:26 AM, Terje Marthinussen
<tm...@gmail.com> wrote:

> Hi,
>
> Did you get anywhere on this problem?
>
> I am seeing similar errors unfortunately :(
>
> I tried to add some quick error checking to the serialization, and it seems
> like the data is ok there.
>
> Some indication that this occurs in compaction and maybe in hinted handoff,
> but no indication that it occurs for sstables with newly written data.
>
> Terje
>

Re: Argh: Data Corruption (LOST DATA) (0.7.0)

Posted by Terje Marthinussen <tm...@gmail.com>.
Hi,

Did you get anywhere on this problem?

I am seeing similar errors unfortunately :(

I tried to add some quick error checking to the serialization, and it seems
like the data is ok there.

Some indication that this occurs in compaction and maybe in hinted handoff,
but no indication that it occurs for sstables with newly written data.

Terje

Re: Argh: Data Corruption (LOST DATA) (0.7.0)

Posted by Ben Coverston <be...@datastax.com>.
Dan,

Do you have any more information on this issue? Have you been able to 
discover anything from exporting your SSTables to JSON?

Thanks,
Ben

On 1/29/11 12:45 PM, Dan Hendry wrote:
>
> I am once again having severe problems with my Cassandra cluster. This 
> time, I straight up cannot read sections of data (consistency level 
> ONE). Client side, I am seeing timeout exceptions. On the Cassandra 
> node, I am seeing errors as shown below. I don't understand what has 
> happened or how to fix it. I also don't understand how I am seeing 
> errors on only one node, using consistency level ONE with a rf=2 and 
> yet clients are failing. I have tried turning on debug logging but 
> that has been no help; the logs roll over (20 MB) in < 10 seconds (the 
> cluster is being used quite heavily).
>
> My cluster has been working fine for weeks, then suddenly I had a 
> corrupt SSTable which caused me all sorts of grief (outlined in 
> previous emails). I was able to solve the problem by turning down the 
> max compaction threshold then figuring out which SSTable was corrupt 
> by watching which minor compactions failed. After that, I straight up 
> deleted the on-disk data. Now I am having problems on a different node 
> (but adjacent in the ring) for what I am almost certain is the same 
> column family (presumably the same row/column). At this point, the 
> data is effectively lost as I know 1 of the 2 replicas was completely 
> deleted.
>
> Is there any advice going forward? My next course of action was going 
> to be exporting all of the sstables to JSON using the provided tool 
> and trying to look it over manually to see what the problem actually 
> is (if exporting will even work). I am not sure how useful this will 
> be as there is nearly 80 GB of data for this CF on a single node. What 
> is more concerning is that I have no idea how this problem initially 
> popped up. I have performed hardware tests and nothing seems to be 
> malfunctioning. Furthermore, the fact that these issues have 'jumped' 
> nodes is a strong indication to me this is a Cassandra problem.
>
> There is a Cassandra bug here somewhere, if only in the way corrupt 
> columns are dealt with.
>
> [stack traces snipped]
>
> Dan Hendry
>
> (403) 660-2297
>