Posted to user@cassandra.apache.org by Sylvain Lebresne <sy...@datastax.com> on 2012/06/14 10:09:35 UTC

Re: Urgent - IllegalArgumentException during compaction and memtable flush

On Thu, Jun 14, 2012 at 8:26 AM, Piavlo <lo...@gmail.com> wrote:
> I started looking for similar messages on other nodes and saw a SINGLE IllegalArgumentException on
> ValidationExecutor on the same node and on 2 other nodes (this is a 6-node cluster), which happened
> at almost the same time, on all nodes while flushing the same UserCompletions CF memtable. This
> happened about 12 hours before the IllegalArgumentException in CompactionExecutor.

This actually does not happen during a flush but during a validation
compaction, which happens during a repair.
The exception is basically saying there is an invalid composite column
name (you do use a composite comparator, right?).
I guess that could result from some on-disk corruption. Are you using
sstable compression on UserCompletions? (I am asking because
compressed sstables have checksums.)
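
For reference, a composite name is encoded as a sequence of
<2-byte length><component bytes><end-of-component byte> blocks, so a name
whose declared component length overruns the buffer fails in exactly this
way. A rough sketch of that layout check (not the actual Cassandra source):

    import java.nio.ByteBuffer;

    public class CompositeLayoutCheck {
        // Walks a composite name, assuming the 1.x layout of
        // <2-byte length><component bytes><1 end-of-component byte> per
        // component; throws if a declared length overruns the buffer.
        static void check(ByteBuffer name) {
            ByteBuffer bb = name.duplicate();
            int component = 0;
            while (bb.remaining() > 0) {
                if (bb.remaining() < 2)
                    throw new IllegalStateException(
                            "Truncated length header of component " + component);
                int length = bb.getShort() & 0xFFFF;
                if (bb.remaining() < length + 1)
                    throw new IllegalStateException(
                            "Not enough bytes to read value of component " + component);
                bb.position(bb.position() + length); // skip the component value
                bb.get();                            // consume the end-of-component byte
                component++;
            }
        }
    }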

> And an even bigger problem now is that running repairs on other CFs against
> different nodes does not have any effect; for example, running
> /usr/bin/nodetool -h dsc2b.internal -pr repair PRODUCTION UserDirectVendors
> does not trigger any repair activity, and there is nothing in the logs to
> indicate the start of a repair. And I have ~24 hours left to repair some CFs
> before the gc period ends :(

Does that happen on every node?
What can happen is that a failed repair may block others from
starting. One thing you can try is to run the method called
forceTerminateAllRepairSessions in JMX under
org.apache.cassandra.db->StorageService->Operations (I'm afraid there
is no nodetool hook, so you will have to use jconsole). After that, try
starting a repair again. If that doesn't work, it's worth trying to
restart the node.
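
If jconsole is inconvenient, the same operation can be invoked from a few
lines of plain JMX client code. A minimal sketch, assuming the default JMX
port 7199, no JMX authentication, and using the host name from your example:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class TerminateRepairSessions {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://dsc2b.internal:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // The StorageService MBean exposes the repair kill switch
                ObjectName ss = new ObjectName(
                        "org.apache.cassandra.db:type=StorageService");
                mbs.invoke(ss, "forceTerminateAllRepairSessions", null, null);
            } finally {
                jmxc.close();
            }
        }
    }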

--
Sylvain

Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Dave Brosius <db...@mebigfatguy.com>.
One of the column names on the row with key 353339332d3134363533393931 
failed to validate with the validator for the column.

If you really want to know which column is problematic, and are able to
build and run cassandra, you can add debugging info to Column.java:

     protected void validateName(CFMetaData metadata) throws MarshalException
     {
         try
         {
             AbstractType<?> nameValidator = metadata.cfType == ColumnFamilyType.Super
                                           ? metadata.subcolumnComparator
                                           : metadata.comparator;
             nameValidator.validate(name());
         }
         catch (MarshalException me)
         {
             // added: report the hex of the offending name
             throw new MarshalException("Failed validating name: "
                     + ByteBufferUtil.bytesToHex(name()), me);
         }
     }
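
With that patch in place, the MarshalException logged during scrub (or
thrown by sstable2json) should also carry the hex of the offending column
name, which you can then decode by hand against your composite layout.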

BTW, the 92668395684826132216160944211592988451 part is just the key's token.
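
Assuming RandomPartitioner, the token is just the absolute value of the
key's MD5 digest interpreted as a big-endian integer, which is easy to
verify with a throwaway sketch:

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class TokenOfKey {
        public static void main(String[] args) throws Exception {
            // The raw row key from the error message, given as hex
            byte[] key = hexToBytes("353339332d3134363533393931");
            // RandomPartitioner: token = abs(MD5(key))
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            System.out.println(new BigInteger(digest).abs());
        }

        static byte[] hexToBytes(String hex) {
            byte[] out = new byte[hex.length() / 2];
            for (int i = 0; i < out.length; i++)
                out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
            return out;
        }
    }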



On 06/14/2012 01:56 PM, Piavlo wrote:
>
> I was able to figure out that 353339332d3134363533393931 is the row key,
> but I have no idea what the 92668395684826132216160944211592988451 part is.
>
> sstable2json also fails with a validation error on this row key.
>
> Now, since I have lost data for this row - how do I find out what the
> root cause was?
>
> thanks
>
>     protected void validateName(CFMetaData metadata) throws MarshalException
>     {
>         AbstractType<?> nameValidator = metadata.cfType == ColumnFamilyType.Super
>                                       ? metadata.subcolumnComparator
>                                       : metadata.comparator;
>         nameValidator.validate(name());
>     }
>
> On 06/14/2012 06:17 PM, Piavlo wrote:
>> OK, I've run scrub on the 3 nodes, and the problematic row is:
>> Error validating row 
>> DecoratedKey(92668395684826132216160944211592988451, 
>> 353339332d3134363533393931)
>>
>> The full message is
>>
>>  WARN [CompactionExecutor:2700] 2012-06-14 14:26:42,041 
>> CompactionManager.java (line 582) Non-fatal error reading row 
>> (stacktrace follows)
>> java.io.IOError: java.io.IOException: Error validating row 
>> DecoratedKey(92668395684826132216160944211592988451, 
>> 353339332d3134363533393931)
>>         at 
>> org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:114)
>>         at 
>> org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:97)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:137)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:143)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:566)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:473)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionManager.access$200(CompactionManager.java:64)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionManager$3.perform(CompactionManager.java:213)
>>         at 
>> org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:183)
>>         at 
>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>         at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>         at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>         at java.lang.Thread.run(Thread.java:662)
>> Caused by: java.io.IOException: Error validating row 
>> DecoratedKey(92668395684826132216160944211592988451, 
>> 353339332d3134363533393931)
>>         at 
>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:241)
>>         at 
>> org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:110)
>>         ... 13 more
>> Caused by: org.apache.cassandra.db.marshal.MarshalException: Not 
>> enough bytes to read value of component 1
>>         at 
>> org.apache.cassandra.db.marshal.AbstractCompositeType.validate(AbstractCompositeType.java:240)
>>         at org.apache.cassandra.db.Column.validateName(Column.java:273)
>>         at 
>> org.apache.cassandra.db.Column.validateFields(Column.java:278)
>>         at 
>> org.apache.cassandra.db.ColumnFamily.validateColumnFields(ColumnFamily.java:372)
>>         at 
>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:237)
>>         ... 14 more
>>  WARN [CompactionExecutor:2700] 2012-06-14 14:26:42,085 
>> CompactionManager.java (line 624) Row at 4047368880 is unreadable; 
>> skipping to next
>>
>>
>> This happened on several sstables on each of the nodes - meaning
>> it was mutated several times
>>
>> dsc2b:
>>            
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-450-Data.db  at 
>> 4244390041
>>            
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-452-Data.db  at 
>> 9366462649
>>
>> dsc2c:
>>            
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-413-Data.db  at 
>> 4047368880
>>            
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-481-Data.db  at 
>> 3598063925
>>
>> dsc1a:
>>           
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-883-Data.db  at 
>> 271195463
>>           
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-733-Data.db  at 
>> 1089815977
>>           
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-895-Data.db  at 
>> 312147765
>>
>> How do I use this info to find out what is actually wrong with this
>> column, as I very much want to prevent it from being written again :)
>>
>> Thanks
>>
>> On 06/14/2012 05:02 PM, Sylvain Lebresne wrote:
>>>> Is there a way to make cassandra throw away the offending column?
>>> Running scrub should allow you to get rid of the row containing the
>>> problematic column. Unfortunately it will discard the whole row, not
>>> just the column.
>>> However, since scrub takes a snapshot anyway (and should tell you
>>> which sstable had the offending column), you could do that and keep
>>> the bad sstable around to investigate further.
>>>
>>>> maybe sstable2json can somehow help?
>>> It's worth a try, but there is a good chance sstable2json will throw an
>>> error too. It may have written enough at that point for you to find
>>> which row is wrong, though.
>>>
>>> -- 
>>> Sylvain
>>
>
>


Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Piavlo <lo...@gmail.com>.
I was able to figure out that 353339332d3134363533393931 is the row key,
but I have no idea what the 92668395684826132216160944211592988451 part is.
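
For what it's worth, that hex key decodes to printable ASCII, which may
help identify the offending record. A throwaway sketch:

    public class DecodeKey {
        public static void main(String[] args) {
            String hex = "353339332d3134363533393931"; // row key from the error
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hex.length(); i += 2)
                sb.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
            System.out.println(sb); // prints "5393-14653991"
        }
    }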

sstable2json also fails with a validation error on this row key.

Now, since I have lost data for this row - how do I find out what the
root cause was?

thanks

On 06/14/2012 06:17 PM, Piavlo wrote:
> OK, I've run scrub on the 3 nodes, and the problematic row is:
> Error validating row 
> DecoratedKey(92668395684826132216160944211592988451, 
> 353339332d3134363533393931)
>
> The full message is
>
>  WARN [CompactionExecutor:2700] 2012-06-14 14:26:42,041 
> CompactionManager.java (line 582) Non-fatal error reading row 
> (stacktrace follows)
> java.io.IOError: java.io.IOException: Error validating row 
> DecoratedKey(92668395684826132216160944211592988451, 
> 353339332d3134363533393931)
>         at 
> org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:114)
>         at 
> org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:97)
>         at 
> org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:137)
>         at 
> org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:143)
>         at 
> org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:566)
>         at 
> org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:473)
>         at 
> org.apache.cassandra.db.compaction.CompactionManager.access$200(CompactionManager.java:64)
>         at 
> org.apache.cassandra.db.compaction.CompactionManager$3.perform(CompactionManager.java:213)
>         at 
> org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:183)
>         at 
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Error validating row 
> DecoratedKey(92668395684826132216160944211592988451, 
> 353339332d3134363533393931)
>         at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:241)
>         at 
> org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:110)
>         ... 13 more
> Caused by: org.apache.cassandra.db.marshal.MarshalException: Not 
> enough bytes to read value of component 1
>         at 
> org.apache.cassandra.db.marshal.AbstractCompositeType.validate(AbstractCompositeType.java:240)
>         at org.apache.cassandra.db.Column.validateName(Column.java:273)
>         at org.apache.cassandra.db.Column.validateFields(Column.java:278)
>         at 
> org.apache.cassandra.db.ColumnFamily.validateColumnFields(ColumnFamily.java:372)
>         at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:237)
>         ... 14 more
>  WARN [CompactionExecutor:2700] 2012-06-14 14:26:42,085 
> CompactionManager.java (line 624) Row at 4047368880 is unreadable; 
> skipping to next
>
>
> This happened on several sstables on each of the nodes - meaning it
> was mutated several times
>
> dsc2b:
>            
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-450-Data.db  at 
> 4244390041
>            
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-452-Data.db  at 
> 9366462649
>
> dsc2c:
>            
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-413-Data.db  at 
> 4047368880
>            
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-481-Data.db  at 
> 3598063925
>
> dsc1a:
>           
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-883-Data.db  at 
> 271195463
>           
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-733-Data.db  at 
> 1089815977
>           
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-895-Data.db  at 
> 312147765
>
> How do I use this info to find out what is actually wrong with this
> column, as I very much want to prevent it from being written again :)
>
> Thanks
>
> On 06/14/2012 05:02 PM, Sylvain Lebresne wrote:
>>> Is there a way to make cassandra throw away the offending column?
>> Running scrub should allow you to get rid of the row containing the
>> problematic column. Unfortunately it will discard the whole row, not
>> just the column.
>> However, since scrub takes a snapshot anyway (and should tell you
>> which sstable had the offending column), you could do that and keep
>> the bad sstable around to investigate further.
>>
>>> maybe sstable2json can somehow help?
>> It's worth a try, but there is a good chance sstable2json will throw an
>> error too. It may have written enough at that point for you to find
>> which row is wrong, though.
>>
>> -- 
>> Sylvain
>


Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Piavlo <lo...@gmail.com>.
OK, I've run scrub on the 3 nodes, and the problematic row is:
Error validating row 
DecoratedKey(92668395684826132216160944211592988451, 
353339332d3134363533393931)

The full message is

  WARN [CompactionExecutor:2700] 2012-06-14 14:26:42,041 
CompactionManager.java (line 582) Non-fatal error reading row 
(stacktrace follows)
java.io.IOError: java.io.IOException: Error validating row 
DecoratedKey(92668395684826132216160944211592988451, 
353339332d3134363533393931)
         at 
org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:114)
         at 
org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:97)
         at 
org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:137)
         at 
org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:143)
         at 
org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:566)
         at 
org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:473)
         at 
org.apache.cassandra.db.compaction.CompactionManager.access$200(CompactionManager.java:64)
         at 
org.apache.cassandra.db.compaction.CompactionManager$3.perform(CompactionManager.java:213)
         at 
org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:183)
         at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Error validating row 
DecoratedKey(92668395684826132216160944211592988451, 
353339332d3134363533393931)
         at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:241)
         at 
org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:110)
         ... 13 more
Caused by: org.apache.cassandra.db.marshal.MarshalException: Not enough 
bytes to read value of component 1
         at 
org.apache.cassandra.db.marshal.AbstractCompositeType.validate(AbstractCompositeType.java:240)
         at org.apache.cassandra.db.Column.validateName(Column.java:273)
         at org.apache.cassandra.db.Column.validateFields(Column.java:278)
         at 
org.apache.cassandra.db.ColumnFamily.validateColumnFields(ColumnFamily.java:372)
         at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:237)
         ... 14 more
  WARN [CompactionExecutor:2700] 2012-06-14 14:26:42,085 
CompactionManager.java (line 624) Row at 4047368880 is unreadable; 
skipping to next


This happened on several sstables on each of the nodes - meaning it
was mutated several times

dsc2b:
            
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-450-Data.db  at 
4244390041
            
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-452-Data.db  at 
9366462649

dsc2c:
            
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-413-Data.db  at 
4047368880
            
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-481-Data.db  at 
3598063925

dsc1a:
           
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-883-Data.db  at 
271195463
           
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-733-Data.db  at 
1089815977
           
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-895-Data.db  at 
312147765

How do I use this info to find out what is actually wrong with this
column, as I very much want to prevent it from being written again :)

Thanks

On 06/14/2012 05:02 PM, Sylvain Lebresne wrote:
>> Is there a way to make cassandra throw away the offending column?
> Running scrub should allow you to get rid of the row containing the
> problematic column. Unfortunately it will discard the whole row, not
> just the column.
> However, since scrub takes a snapshot anyway (and should tell you
> which sstable had the offending column), you could do that and keep
> the bad sstable around to investigate further.
>
>> maybe sstable2json can somehow help?
> It's worth a try, but there is a good chance sstable2json will throw an
> error too. It may have written enough at that point for you to find
> which row is wrong, though.
>
> --
> Sylvain


Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Sylvain Lebresne <sy...@datastax.com>.
> Is there a way to make cassandra throw away the offending column?

Running scrub should allow you to get rid of the row containing the
problematic column. Unfortunately it will discard the whole row, not
just the column.
However, since scrub takes a snapshot anyway (and should tell you
which sstable had the offending column), you could do that and keep
the bad sstable around to investigate further.

> maybe sstable2json can somehow help?

It's worth a try, but there is a good chance sstable2json will throw an
error too. It may have written enough at that point for you to find
which row is wrong, though.

--
Sylvain

Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Piavlo <lo...@gmail.com>.
  Hi again,

  After some further investigation, I'm now in a situation where 3 nodes
(of the 6-node cluster) are all failing in the ValidationExecutor during
compaction, which is triggered by
"repair -pr PRODUCTION UserCompletions" against any node in the cluster
- the repair gets stuck and I have to use the
forceTerminateAllRepairSessions JMX method.
Other repairs on different CFs are working fine.

So AFAIU this must be just some BAD composite column name that
unfortunately passed mutation and now can't be read?
This is present on all replicas - which means I cannot restart cassandra
with deletion of the relevant sstables on all 3 replicas - since
I'll obviously lose data.
And I'm going to hit the end of the gc grace period really soon on this CF.

So any suggestions on what I can do?
Is there a way to make cassandra throw away the offending column?
Can I find which column is problematic? Maybe sstable2json can somehow help?

Thanks
Alex


On 06/14/2012 04:04 PM, Sylvain Lebresne wrote:
> On Thu, Jun 14, 2012 at 12:00 PM, Piavlo<lo...@gmail.com>  wrote:
>> What's the procedure to check if the compressed sstable is corrupted or not?
> Since you use compression, in theory that can't be disk bitrot, since in
> that case you would have gotten a checksum error instead. The fact that it
> happened on 3 nodes during a repair also suggests that this was not simple
> disk bitrot.
>
> Not sure why that happened though (you don't use the StorageProxy API by
> any chance?).
>
>> Is there a way to delete all UserCompletions sstables on the problematic node
>> and then run repair on this CF only?
>> Like disable thrift, drain the memtables so it does not read the commit log on
>> startup, then delete the sstables and start the node again - will it work?
> It should work and should at least fix the compaction problem. Though to save
> some time you may want to remove only those sstables that triggered the
> exception in the first place. Trusting the log above, that would be the ones
> mentioned here:
> INFO [CompactionExecutor:3335] 2012-06-14 05:14:02,088
> CompactionTask.java (line 115) Compacting
> [SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-838-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1038-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-846-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-949-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-895-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-883-Data.db')]
>
>> can it mean that the sstable is not corrupted but just some BAD column name was
>> inserted OK but cannot be read later by the ValidationExecutor on any of
>> the replica nodes?
> Well, the exception really only says that you have a bad column name.
> What produced it is unclear.
>
> --
> Sylvain
>
>> Check the relevant cassandra logs below
>>
>> dsc2b.internal/10.234.71.33
>> -----------------------
>>   INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,464
>> AntiEntropyService.java (line 658) [repair
>> #7ec142c0-b588-11e1-0000-f423231d3fff] new session: will sync
>> dsc2b.internal/10.234.71.33, /10.49.127.4, /10.58.249.118 on range
>> (85070591730234615865843651857942052864,113427455640312821154458202477256070485]
>> for PRODUCTION.[UserCompletions]
>>   INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,465
>> AntiEntropyService.java (line 837) [repair
>> #7ec142c0-b588-11e1-0000-f423231d3fff] requests for merkle tree sent for
>> UserCompletions (to [/10.49.127.4, /10.58.249.118,
>> dsc2b.internal/10.234.71.33])
>>   INFO [ValidationExecutor:129] 2012-06-13 18:49:24,466
>> ColumnFamilyStore.java (line 705) Enqueuing flush of
>> Memtable-UserCompletions@843906517(9952311/21343163 serialized/live bytes,
>> 41801 ops)
>>   INFO [FlushWriter:2563] 2012-06-13 18:49:24,467 Memtable.java (line 246)
>> Writing Memtable-UserCompletions@843906517(9952311/21343163 serialized/live
>> bytes, 41801 ops)
>>   INFO [FlushWriter:2563] 2012-06-13 18:49:24,828 Memtable.java (line 283)
>> Completed flushing
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-515-Data.db (1671566
>> bytes)
>> ERROR [ValidationExecutor:129] 2012-06-13 18:55:32,236
>> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
>> Thread[ValidationExecutor:129,1,main]
>> java.lang.IllegalArgumentException
>>         at java.nio.Buffer.limit(Buffer.java:249)
>> ....
>> -----------------------
>>
>> dsc1a.internal/10.49.127.4
>> -----------------------
>>   INFO [ValidationExecutor:125] 2012-06-13 18:49:24,457
>> ColumnFamilyStore.java (line 705) Enqueuing flush of
>> Memtable-UserCompletions@266077104(9047552/76151840 serialized/live bytes,
>> 38000 ops)
>>   INFO [FlushWriter:2670] 2012-06-13 18:49:24,466 Memtable.java (line 246)
>> Writing Memtable-UserCompletions@266077104(9047552/76151840 serialized/live
>> bytes, 38000 ops)
>>   INFO [FlushWriter:2670] 2012-06-13 18:49:24,969 Memtable.java (line 283)
>> Completed flushing
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db (1508368
>> bytes)
>>   INFO [CompactionExecutor:3299] 2012-06-13 18:49:24,971 CompactionTask.java
>> (line 115) Compacting
>> [SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1027-Data.db'),
>> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db'),
>> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1028-Data.db'),
>> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1029-Data.db')]
>>   INFO [CompactionExecutor:3299] 2012-06-13 18:50:03,554 CompactionTask.java
>> (line 223) Compacted to
>> [/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1031-Data.db,].
>>   23,417,251 to 23,832,802 (~101% of original) bytes for 116,956 keys at
>> 0.589102MB/s.  Time: 38,582ms.
>> ERROR [ValidationExecutor:125] 2012-06-13 18:56:58,961
>> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
>> Thread[ValidationExecutor:125,1,main]
>> java.lang.IllegalArgumentException
>>         at java.nio.Buffer.limit(Buffer.java:249)
>> ...
>> -------------------------
>>
>> dsc2c.internal/10.58.249.118
>> -------------------------
>>
>>   INFO [ValidationExecutor:119] 2012-06-13 18:49:24,305
>> ColumnFamilyStore.java (line 705) Enqueuing flush of
>> Memtable-UserCompletions@1279460811(19014066/66201229 serialized/live bytes,
>> 79838 ops)
>>   INFO [FlushWriter:2001] 2012-06-13 18:49:24,326 Memtable.java (line 246)
>> Writing Memtable-UserCompletions@1279460811(19014066/66201229
>> serialized/live bytes, 79838 ops)
>>   INFO [FlushWriter:2001] 2012-06-13 18:49:24,848 Memtable.java (line 283)
>> Completed flushing
>> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-548-Data.db (3177074
>> bytes)
>> ERROR [ValidationExecutor:119] 2012-06-13 18:55:50,387
>> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
>> Thread[ValidationExecutor:119,1,main]
>> java.lang.IllegalArgumentException
>>         at java.nio.Buffer.limit(Buffer.java:249)
>> ...
>> -------------------------
>>
>> Thanks for your help.
>>
>>
>>
>> On 06/14/2012 11:09 AM, Sylvain Lebresne wrote:
>>> On Thu, Jun 14, 2012 at 8:26 AM, Piavlo<lo...@gmail.com>    wrote:
>>>> I started looking for similar messages on other nodes and saw a SINGLE
>>>> IllegalArgumentException on
>>>> ValidationExecutor on the same node and on 2 other nodes (this is a 6-node
>>>> cluster), which happened
>>>> at almost the same time, on all nodes while flushing the same
>>>> UserCompletions CF memtable. This
>>>> happened about 12 hours before the IllegalArgumentException in
>>>> CompactionExecutor.
>>> This actually does not happen during a flush but during a validation
>>> compaction, which happens during a repair.
>>> The exception is basically saying there is an invalid composite column
>>> name (you do use a composite comparator, right?).
>>> I guess that could result from some on-disk corruption. Are you using
>>> sstable compression on UserCompletions? (I am asking because
>>> compressed sstables have checksums.)
>>>
>>>> And an even bigger problem now is that running repairs on other CFs against
>>>> different nodes does not have any effect; for example, running
>>>> /usr/bin/nodetool -h dsc2b.internal -pr repair PRODUCTION
>>>> UserDirectVendors
>>>> does not trigger any repair activity, and there is nothing in the logs to
>>>> indicate the start of a repair. And I have ~24 hours left to repair some
>>>> CFs before the gc period ends :(
>>> Does that happen on every node?
>>> What can happen is that a failed repair may block others from
>>> starting. One thing you can try is to run the method called
>>> forceTerminateAllRepairSessions in JMX under
>>> org.apache.cassandra.db->StorageService->Operations (I'm afraid there
>>> is no nodetool hook, so you will have to use jconsole). After that, try
>>> starting a repair again. If that doesn't work, it's worth trying to
>>> restart the node.
>>>
>>> --
>>> Sylvain
>>


Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Thu, Jun 14, 2012 at 12:00 PM, Piavlo <lo...@gmail.com> wrote:
> What's the procedure to check if the compressed sstable is corrupted or not?

Since you use compression, in theory that can't be disk bitrot, since in
that case you would have gotten a checksum error instead. The fact that it
happened on 3 nodes during a repair also suggests that this was not simple
disk bitrot.

Not sure why that happened though (you don't use the StorageProxy API by
any chance?).

> Is there a way to delete all UserCompletions sstables on the problematic node
> and then run repair on this CF only?
> Like disable thrift, drain the memtables so it does not read the commit log on
> startup, then delete the sstables and start the node again - will it work?

It should work and should at least fix the compaction problem. Though to save
some time you may want to remove only those sstables that triggered the
exception in the first place. Trusting the log above, that would be the ones
mentioned here:
INFO [CompactionExecutor:3335] 2012-06-14 05:14:02,088
CompactionTask.java (line 115) Compacting
[SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-838-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1038-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-846-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-949-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-895-Data.db'),
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-883-Data.db')]

> can it mean that the sstable is not corrupted but just some BAD column name was
> inserted OK but cannot be read later by the ValidationExecutor on any of
> the replica nodes?

Well, the exception really only says that you have a bad column name.
What produced it is unclear.

--
Sylvain

>
> Check the relevant cassandra logs below
>
> dsc2b.internal/10.234.71.33
> -----------------------
>  INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,464
> AntiEntropyService.java (line 658) [repair
> #7ec142c0-b588-11e1-0000-f423231d3fff] new session: will sync
> dsc2b.internal/10.234.71.33, /10.49.127.4, /10.58.249.118 on range
> (85070591730234615865843651857942052864,113427455640312821154458202477256070485]
> for PRODUCTION.[UserCompletions]
>  INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,465
> AntiEntropyService.java (line 837) [repair
> #7ec142c0-b588-11e1-0000-f423231d3fff] requests for merkle tree sent for
> UserCompletions (to [/10.49.127.4, /10.58.249.118,
> dsc2b.internal/10.234.71.33])
>  INFO [ValidationExecutor:129] 2012-06-13 18:49:24,466
> ColumnFamilyStore.java (line 705) Enqueuing flush of
> Memtable-UserCompletions@843906517(9952311/21343163 serialized/live bytes,
> 41801 ops)
>  INFO [FlushWriter:2563] 2012-06-13 18:49:24,467 Memtable.java (line 246)
> Writing Memtable-UserCompletions@843906517(9952311/21343163 serialized/live
> bytes, 41801 ops)
>  INFO [FlushWriter:2563] 2012-06-13 18:49:24,828 Memtable.java (line 283)
> Completed flushing
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-515-Data.db (1671566
> bytes)
> ERROR [ValidationExecutor:129] 2012-06-13 18:55:32,236
> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
> Thread[ValidationExecutor:129,1,main]
> java.lang.IllegalArgumentException
>        at java.nio.Buffer.limit(Buffer.java:249)
> ....
> -----------------------
>
> dsc1a.internal/10.49.127.4
> -----------------------
>  INFO [ValidationExecutor:125] 2012-06-13 18:49:24,457
> ColumnFamilyStore.java (line 705) Enqueuing flush of
> Memtable-UserCompletions@266077104(9047552/76151840 serialized/live bytes,
> 38000 ops)
>  INFO [FlushWriter:2670] 2012-06-13 18:49:24,466 Memtable.java (line 246)
> Writing Memtable-UserCompletions@266077104(9047552/76151840 serialized/live
> bytes, 38000 ops)
>  INFO [FlushWriter:2670] 2012-06-13 18:49:24,969 Memtable.java (line 283)
> Completed flushing
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db (1508368
> bytes)
>  INFO [CompactionExecutor:3299] 2012-06-13 18:49:24,971 CompactionTask.java
> (line 115) Compacting
> [SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1027-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1028-Data.db'),
> SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1029-Data.db')]
>  INFO [CompactionExecutor:3299] 2012-06-13 18:50:03,554 CompactionTask.java
> (line 223) Compacted to
> [/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1031-Data.db,].
>  23,417,251 to 23,832,802 (~101% of original) bytes for 116,956 keys at
> 0.589102MB/s.  Time: 38,582ms.
> ERROR [ValidationExecutor:125] 2012-06-13 18:56:58,961
> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
> Thread[ValidationExecutor:125,1,main]
> java.lang.IllegalArgumentException
>        at java.nio.Buffer.limit(Buffer.java:249)
> ...
> -------------------------
>
> dsc2c.internal/10.58.249.118
> -------------------------
>
>  INFO [ValidationExecutor:119] 2012-06-13 18:49:24,305
> ColumnFamilyStore.java (line 705) Enqueuing flush of
> Memtable-UserCompletions@1279460811(19014066/66201229 serialized/live bytes,
> 79838 ops)
>  INFO [FlushWriter:2001] 2012-06-13 18:49:24,326 Memtable.java (line 246)
> Writing Memtable-UserCompletions@1279460811(19014066/66201229
> serialized/live bytes, 79838 ops)
>  INFO [FlushWriter:2001] 2012-06-13 18:49:24,848 Memtable.java (line 283)
> Completed flushing
> /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-548-Data.db (3177074
> bytes)
> ERROR [ValidationExecutor:119] 2012-06-13 18:55:50,387
> AbstractCassandraDaemon.java (line 139) Fatal exception in thread
> Thread[ValidationExecutor:119,1,main]
> java.lang.IllegalArgumentException
>        at java.nio.Buffer.limit(Buffer.java:249)
> ...
> -------------------------
>
> Thanks for your help.
>
>
>
> On 06/14/2012 11:09 AM, Sylvain Lebresne wrote:
>>
>> On Thu, Jun 14, 2012 at 8:26 AM, Piavlo<lo...@gmail.com>  wrote:
>>>
>>> I started looking for similar messages on other nodes and saw a SINGLE
>>> IllegalArgumentException on
>>> ValidationExecutor on the same node and on 2 other nodes (this is a 6-node
>>> cluster), which happened
>>> at almost the same time, on all nodes while flushing the same
>>> UserCompletions CF memtable. This
>>> happened about 12 hours before the IllegalArgumentException in
>>> CompactionExecutor.
>>
>> This actually does not happen during a flush but during a validation
>> compaction, which happens during a repair.
>> The exception is basically saying there is an invalid composite column
>> name (you do use a composite comparator, right?).
>> I guess that could result from some on-disk corruption. Are you using
>> sstable compression on UserCompletions? (I am asking because
>> compressed sstables have checksums.)
>>
>>> And an even bigger problem now is that running repairs on other CFs against
>>> different nodes does not have any effect; for example, running
>>> /usr/bin/nodetool -h dsc2b.internal -pr repair PRODUCTION
>>> UserDirectVendors
>>> does not trigger any repair activity, and there is nothing in the logs to
>>> indicate the start of a repair. And I have ~24 hours left to repair some
>>> CFs before the gc period ends :(
>>
>> Does that happen on every node?
>> What can happen is that a failed repair may block others from
>> starting. One thing you can try is to run the method called
>> forceTerminateAllRepairSessions in JMX under
>> org.apache.cassandra.db->StorageService->Operations (I'm afraid there
>> is no nodetool hook, so you will have to use jconsole). After that, try
>> starting a repair again. If that doesn't work, it's worth trying to
>> restart the node.
>>
>> --
>> Sylvain
>
>

Re: Urgent - IllegalArgumentException during compaction and memtable flush

Posted by Piavlo <lo...@gmail.com>.
  Hi Sylvain,

Yes, this UserCompletions CF uses a composite comparator, and I do use
sstable compression.

What's the procedure to check if the compressed sstable is corrupted or not?
If it's corrupted, what can I do to fix the issue with minimal cluster
load impact?
Is there a way to delete all UserCompletions sstables on the problematic
node and then run repair on this CF only?
Like disable thrift, drain the memtables so it does not read the commit log on
startup, then delete the sstables and start the node again - will it work?


BUT since I saw this error on 3 nodes (and RF=3 too) in the
ValidationExecutor at almost the same time (at 3 different times -
probably due to 3 attempts at rerunning "repair -pr  UserCompletions
dsc2b.internal", which never returned from the blocked nodetool command -
and each time a repair finished, the new sstables triggered compactions
on all involved nodes),
can it mean that the sstable is not corrupted but just some BAD column name
was inserted OK but cannot be read later by the ValidationExecutor on
any of the replica nodes?

Check the relevant cassandra logs below

dsc2b.internal/10.234.71.33
-----------------------
  INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,464 
AntiEntropyService.java (line 658) [repair 
#7ec142c0-b588-11e1-0000-f423231d3fff] new session: will sync 
dsc2b.internal/10.234.71.33, /10.49.127.4, /10.58.249.118 on range 
(85070591730234615865843651857942052864,113427455640312821154458202477256070485] 
for PRODUCTION.[UserCompletions]
  INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,465 
AntiEntropyService.java (line 837) [repair 
#7ec142c0-b588-11e1-0000-f423231d3fff] requests for merkle tree sent for 
UserCompletions (to [/10.49.127.4, /10.58.249.118, 
dsc2b.internal/10.234.71.33])
  INFO [ValidationExecutor:129] 2012-06-13 18:49:24,466 
ColumnFamilyStore.java (line 705) Enqueuing flush of 
Memtable-UserCompletions@843906517(9952311/21343163 serialized/live 
bytes, 41801 ops)
  INFO [FlushWriter:2563] 2012-06-13 18:49:24,467 Memtable.java (line 
246) Writing Memtable-UserCompletions@843906517(9952311/21343163 
serialized/live bytes, 41801 ops)
  INFO [FlushWriter:2563] 2012-06-13 18:49:24,828 Memtable.java (line 
283) Completed flushing 
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-515-Data.db 
(1671566 bytes)
ERROR [ValidationExecutor:129] 2012-06-13 18:55:32,236 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[ValidationExecutor:129,1,main]
java.lang.IllegalArgumentException
         at java.nio.Buffer.limit(Buffer.java:249)
....
-----------------------

dsc1a.internal/10.49.127.4
-----------------------
  INFO [ValidationExecutor:125] 2012-06-13 18:49:24,457 
ColumnFamilyStore.java (line 705) Enqueuing flush of 
Memtable-UserCompletions@266077104(9047552/76151840 serialized/live 
bytes, 38000 ops)
  INFO [FlushWriter:2670] 2012-06-13 18:49:24,466 Memtable.java (line 
246) Writing Memtable-UserCompletions@266077104(9047552/76151840 
serialized/live bytes, 38000 ops)
  INFO [FlushWriter:2670] 2012-06-13 18:49:24,969 Memtable.java (line 
283) Completed flushing 
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db 
(1508368 bytes)
  INFO [CompactionExecutor:3299] 2012-06-13 18:49:24,971 
CompactionTask.java (line 115) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1027-Data.db'), 
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db'), 
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1028-Data.db'), 
SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1029-Data.db')]
  INFO [CompactionExecutor:3299] 2012-06-13 18:50:03,554 
CompactionTask.java (line 223) Compacted to 
[/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1031-Data.db,].  
23,417,251 to 23,832,802 (~101% of original) bytes for 116,956 keys at 
0.589102MB/s.  Time: 38,582ms.
ERROR [ValidationExecutor:125] 2012-06-13 18:56:58,961 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[ValidationExecutor:125,1,main]
java.lang.IllegalArgumentException
         at java.nio.Buffer.limit(Buffer.java:249)
...
-------------------------

dsc2c.internal/10.58.249.118
-------------------------
  INFO [ValidationExecutor:119] 2012-06-13 18:49:24,305 
ColumnFamilyStore.java (line 705) Enqueuing flush of 
Memtable-UserCompletions@1279460811(19014066/66201229 serialized/live 
bytes, 79838 ops)
  INFO [FlushWriter:2001] 2012-06-13 18:49:24,326 Memtable.java (line 
246) Writing Memtable-UserCompletions@1279460811(19014066/66201229 
serialized/live bytes, 79838 ops)
  INFO [FlushWriter:2001] 2012-06-13 18:49:24,848 Memtable.java (line 
283) Completed flushing 
/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-548-Data.db 
(3177074 bytes)
ERROR [ValidationExecutor:119] 2012-06-13 18:55:50,387 
AbstractCassandraDaemon.java (line 139) Fatal exception in thread 
Thread[ValidationExecutor:119,1,main]
java.lang.IllegalArgumentException
         at java.nio.Buffer.limit(Buffer.java:249)
...
-------------------------

Thanks for your help.


On 06/14/2012 11:09 AM, Sylvain Lebresne wrote:
> On Thu, Jun 14, 2012 at 8:26 AM, Piavlo<lo...@gmail.com>  wrote:
>> I started looking for similar messages on other nodes and saw a SINGLE IllegalArgumentException on
>> ValidationExecutor on the same node and on 2 other nodes (this is a 6-node cluster), which happened
>> at almost the same time, on all nodes while flushing the same UserCompletions CF memtable. This
>> happened about 12 hours before the IllegalArgumentException in CompactionExecutor.
> This actually does not happen during a flush but during a validation
> compaction, which happens during a repair.
> The exception is basically saying there is an invalid composite column
> name (you do use a composite comparator, right?).
> I guess that could result from some on-disk corruption. Are you using
> sstable compression on UserCompletions? (I am asking because
> compressed sstables have checksums.)
>
>> And an even bigger problem now is that running repairs on other CFs against
>> different nodes does not have any effect; for example, running
>> /usr/bin/nodetool -h dsc2b.internal -pr repair PRODUCTION UserDirectVendors
>> does not trigger any repair activity, and there is nothing in the logs to
>> indicate the start of a repair. And I have ~24 hours left to repair some CFs
>> before the gc period ends :(
> Does that happen on every node?
> What can happen is that a failed repair may block others from
> starting. One thing you can try is to run the method called
> forceTerminateAllRepairSessions in JMX under
> org.apache.cassandra.db->StorageService->Operations (I'm afraid there
> is no nodetool hook, so you will have to use jconsole). After that, try
> starting a repair again. If that doesn't work, it's worth trying to
> restart the node.
>
> --
> Sylvain