You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by aaron morton <aa...@thelastpickle.com> on 2013/04/14 22:26:34 UTC

Re: Rename failed while cassandra is starting up

> From the log messages, it looked like the table/keyspace was
> opened before the scrubDataDirectories was executed. This created a race
> condition between two threads.
Seems odd. 
AFAIK that startup is single threaded and the scrub runs before the tables are opened. See AbstractCassandraDaemon.setup()

> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
> (line 184) Creating new index :
> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_mobilityZoneUUID'}
> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db

Looks like a secondary index is being created at startup and there is an error renaming the file. 
OR
The node was shut down before the index was built and it's been rebuilt at startup.

Both of these are async operations and cause a race with scrubDirectories(). 

Probably not the log replaying because it looks like the sstables have not been opened. 

I *think* the way around this is to um…. 
* move all existing data and commit log out of the way 
* start with node with -Dcassandra.join_ring=false JVM option in cassandra-env.sh
* check that all indexes are built using nodetool cfstats
* shut it down
* put the commit log and data dirs back in place. 

All we want to do is get the system KS updated, but in 1.0 that's a serialised object and not easy to poke. 

Hope that helps. 
 
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 14/04/2013, at 3:50 PM, Boris Yen <yu...@gmail.com> wrote:

> Hi All,
> 
> Recently, we encountered an error on 1.0.12 that prevented cassandra from
> starting up. From the log messages, it looked like the table/keyspace was
> opened before the scrubDataDirectories was executed. This created a race
> condition between two threads. One was trying to rename files while the
> other was trying to remove tmp files. I was wondering if anyone could
> provide us some information or workaround for this.
> 
> INFO [MemoryMeter:1] 2013-04-09 02:49:39,868 Memtable.java (line 186)
> CFS(Keyspace='fmzd', ColumnFamily='alarm.fmzd_alarm_category') liveRatio is
> 3.7553409423470883 (just-counted was 3.1413828689370487).  calculation took
> 2ms for 265 columns
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,868 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-2 (83 bytes)
> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,868 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-1 (123 bytes)
> INFO [Creating index: alarm.fmzd_alarm_category] 2013-04-09 02:49:39,874
> ColumnFamilyStore.java (line 705) Enqueuing flush of
> Memtable-alarm.fmzd_alarm_category@413535513(14025/65835 serialized/live
> bytes, 275 ops)
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,877 SecondaryIndexManager.java
> (line 184) Creating new index : ColumnDefinition{name=6d65736853534944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_meshSSID'}
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,895 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-1 (122 bytes)
> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,896 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-2 (82 bytes)
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
> (line 184) Creating new index :
> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_mobilityZoneUUID'}
> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> at
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:375)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:302)
> at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:276)
> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:49)
> at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:299)
> at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> at
> org.apache.cassandra.utils.FBUtilities.renameWithConfirm(FBUtilities.java:355)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:371)
> ... 9 more
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,917 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_mobilityZoneUUID-hd-1 (312 bytes)
> INFO [FlushWriter:2] 2013-04-09 02:49:39,916 Memtable.java (line 246)
> Writing Memtable-alarm.fmzd_alarm_alarmCode@402202831(2958/22542
> serialized/live bytes, 58 ops)
> ERROR [main] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java (line
> 373) Exception encountered during startup
> java.io.IOError: java.io.IOException: Failed to delete
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
> at
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:372)
> at
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:415)
> at
> org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:193)
> at
> org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:356)
> at
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
> Caused by: java.io.IOException: Failed to delete
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
> at
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
> at
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
> at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:141)
> at
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:368)
> ... 4 more
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,923 SecondaryIndexManager.java
> (line 184) Creating new index : ColumnDefinition{name=6d6f64656c,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_model'}
> 
> Thanks and Regards,
> Boris


Re: Rename failed while cassandra is starting up

Posted by aaron morton <aa...@thelastpickle.com>.
I forgot to add I created a ticket for it https://issues.apache.org/jira/browse/CASSANDRA-5469

See that ticket for recent changes to the MeteredFlusher. 

IMHO this is not related to the metered flusher. Index rebuilds force a flush.

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/04/2013, at 1:57 PM, Boris Yen <yu...@gmail.com> wrote:

> Hi Aaron,
> 
> "startup is single threaded and the scrub runs before the tables are opened".
> 
> This is what I was thinking too. However, after using the debugger to trace the code, I realized that MeteredFlusher (see the "countFlushBytes" method) might open the sstables before the scrub is completed. I suppose this is the cause of the exceptions I saw.
> 
> My plan is to add a boolean flag named "scrubCompleted" at AbstractCassandraDaemon or StorageService. By default, it is false, after the scrub is completed the AbstractCassandraDaemon needs to set it to true. The MeterdFluster needs to make sure the scrub is completed by checking this boolean value and starts to do all the calculation.
> 
> Is this a good plan? or it might have side effects?
> 
> Thanks and Regards,
> Boris
> 
> 
> On Mon, Apr 15, 2013 at 4:26 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> From the log messages, it looked like the table/keyspace was
>> opened before the scrubDataDirectories was executed. This created a race
>> condition between two threads.
> Seems odd. 
> AFAIK that startup is single threaded and the scrub runs before the tables are opened. See AbstractCassandraDaemon.setup()
> 
>> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
>> (line 184) Creating new index :
>> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
>> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
>> index_name='fmzd_ap_mobilityZoneUUID'}
>> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
>> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
>> java.io.IOError: java.io.IOException: rename failed of
>> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> 
> Looks like a secondary index is being created at startup and there is an error renaming the file. 
> OR
> The node was shut down before the index was built and it's been rebuilt at startup.
> 
> Both of these are async operations and cause a race with scrubDirectories(). 
> 
> Probably not the log replaying because it looks like the sstables have not been opened. 
> 
> I *think* the way around this is to um…. 
> * move all existing data and commit log out of the way 
> * start with node with -Dcassandra.join_ring=false JVM option in cassandra-env.sh
> * check that all indexes are built using nodetool cfstats
> * shut it down
> * put the commit log and data dirs back in place. 
> 
> All we want to do is get the system KS updated, but in 1.0 that's a serialised object and not easy to poke. 
> 
> Hope that helps. 
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 14/04/2013, at 3:50 PM, Boris Yen <yu...@gmail.com> wrote:
> 
>> Hi All,
>> 
>> Recently, we encountered an error on 1.0.12 that prevented cassandra from
>> starting up. From the log messages, it looked like the table/keyspace was
>> opened before the scrubDataDirectories was executed. This created a race
>> condition between two threads. One was trying to rename files while the
>> other was trying to remove tmp files. I was wondering if anyone could
>> provide us some information or workaround for this.
>> 
>> INFO [MemoryMeter:1] 2013-04-09 02:49:39,868 Memtable.java (line 186)
>> CFS(Keyspace='fmzd', ColumnFamily='alarm.fmzd_alarm_category') liveRatio is
>> 3.7553409423470883 (just-counted was 3.1413828689370487).  calculation took
>> 2ms for 265 columns
>> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,868 SSTableReader.java (line
>> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-2 (83 bytes)
>> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,868 SSTableReader.java (line
>> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-1 (123 bytes)
>> INFO [Creating index: alarm.fmzd_alarm_category] 2013-04-09 02:49:39,874
>> ColumnFamilyStore.java (line 705) Enqueuing flush of
>> Memtable-alarm.fmzd_alarm_category@413535513(14025/65835 serialized/live
>> bytes, 275 ops)
>> INFO [OptionalTasks:1] 2013-04-09 02:49:39,877 SecondaryIndexManager.java
>> (line 184) Creating new index : ColumnDefinition{name=6d65736853534944,
>> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
>> index_name='fmzd_ap_meshSSID'}
>> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,895 SSTableReader.java (line
>> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-1 (122 bytes)
>> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,896 SSTableReader.java (line
>> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-2 (82 bytes)
>> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
>> (line 184) Creating new index :
>> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
>> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
>> index_name='fmzd_ap_mobilityZoneUUID'}
>> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
>> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
>> java.io.IOError: java.io.IOException: rename failed of
>> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
>> at
>> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:375)
>> at
>> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
>> at
>> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:302)
>> at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:276)
>> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:49)
>> at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:299)
>> at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>> Caused by: java.io.IOException: rename failed of
>> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
>> at
>> org.apache.cassandra.utils.FBUtilities.renameWithConfirm(FBUtilities.java:355)
>> at
>> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:371)
>> ... 9 more
>> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,917 SSTableReader.java (line
>> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_mobilityZoneUUID-hd-1 (312 bytes)
>> INFO [FlushWriter:2] 2013-04-09 02:49:39,916 Memtable.java (line 246)
>> Writing Memtable-alarm.fmzd_alarm_alarmCode@402202831(2958/22542
>> serialized/live bytes, 58 ops)
>> ERROR [main] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java (line
>> 373) Exception encountered during startup
>> java.io.IOError: java.io.IOException: Failed to delete
>> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
>> at
>> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:372)
>> at
>> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:415)
>> at
>> org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:193)
>> at
>> org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:356)
>> at
>> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
>> Caused by: java.io.IOException: Failed to delete
>> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
>> at
>> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
>> at
>> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
>> at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:141)
>> at
>> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:368)
>> ... 4 more
>> INFO [OptionalTasks:1] 2013-04-09 02:49:39,923 SecondaryIndexManager.java
>> (line 184) Creating new index : ColumnDefinition{name=6d6f64656c,
>> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
>> index_name='fmzd_ap_model'}
>> 
>> Thanks and Regards,
>> Boris
> 
> 


Re: Rename failed while cassandra is starting up

Posted by Boris Yen <yu...@gmail.com>.
Hi Aaron,

"startup is single threaded and the scrub runs before the tables are opened
".

This is what I was thinking too. However, after using the debugger to trace
the code, I realized that MeteredFlusher (see the "countFlushBytes" method)
might open the sstables before the scrub is completed. I suppose this is
the cause of the exceptions I saw.

My plan is to add a boolean flag named "scrubCompleted" at
AbstractCassandraDaemon or StorageService. By default, it is false, after
the scrub is completed the AbstractCassandraDaemon needs to set it to true.
The MeterdFluster needs to make sure the scrub is completed by checking
this boolean value and starts to do all the calculation.

Is this a good plan? or it might have side effects?

Thanks and Regards,
Boris


On Mon, Apr 15, 2013 at 4:26 AM, aaron morton <aa...@thelastpickle.com>wrote:

> From the log messages, it looked like the table/keyspace was
> opened before the scrubDataDirectories was executed. This created a race
> condition between two threads.
>
> Seems odd.
> AFAIK that startup is single threaded and the scrub runs before the tables
> are opened. See AbstractCassandraDaemon.setup()
>
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
>
> (line 184) Creating new index :
> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_mobilityZoneUUID'}
> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
>
>
> Looks like a secondary index is being created at startup and there is an
> error renaming the file.
> OR
> The node was shut down before the index was built and it's been rebuilt at
> startup.
>
> Both of these are async operations and cause a race with
> scrubDirectories().
>
> Probably not the log replaying because it looks like the sstables have not
> been opened.
>
> I *think* the way around this is to um….
> * move all existing data and commit log out of the way
> * start with node with -Dcassandra.join_ring=false JVM option in
> cassandra-env.sh
> * check that all indexes are built using nodetool cfstats
> * shut it down
> * put the commit log and data dirs back in place.
>
> All we want to do is get the system KS updated, but in 1.0 that's a
> serialised object and not easy to poke.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/04/2013, at 3:50 PM, Boris Yen <yu...@gmail.com> wrote:
>
> Hi All,
>
> Recently, we encountered an error on 1.0.12 that prevented cassandra from
> starting up. From the log messages, it looked like the table/keyspace was
> opened before the scrubDataDirectories was executed. This created a race
> condition between two threads. One was trying to rename files while the
> other was trying to remove tmp files. I was wondering if anyone could
> provide us some information or workaround for this.
>
> INFO [MemoryMeter:1] 2013-04-09 02:49:39,868 Memtable.java (line 186)
> CFS(Keyspace='fmzd', ColumnFamily='alarm.fmzd_alarm_category') liveRatio is
> 3.7553409423470883 (just-counted was 3.1413828689370487).  calculation took
> 2ms for 265 columns
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,868 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-2 (83 bytes)
> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,868 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-1 (123 bytes)
> INFO [Creating index: alarm.fmzd_alarm_category] 2013-04-09 02:49:39,874
> ColumnFamilyStore.java (line 705) Enqueuing flush of
> Memtable-alarm.fmzd_alarm_category@413535513(14025/65835 serialized/live
> bytes, 275 ops)
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,877 SecondaryIndexManager.java
> (line 184) Creating new index : ColumnDefinition{name=6d65736853534944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_meshSSID'}
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,895 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-1 (122 bytes)
> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,896 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-2 (82 bytes)
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
> (line 184) Creating new index :
> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_mobilityZoneUUID'}
> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:375)
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:302)
> at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:276)
> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:49)
> at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:299)
> at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> at
>
> org.apache.cassandra.utils.FBUtilities.renameWithConfirm(FBUtilities.java:355)
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:371)
> ... 9 more
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,917 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_mobilityZoneUUID-hd-1 (312
> bytes)
> INFO [FlushWriter:2] 2013-04-09 02:49:39,916 Memtable.java (line 246)
> Writing Memtable-alarm.fmzd_alarm_alarmCode@402202831(2958/22542
> serialized/live bytes, 58 ops)
> ERROR [main] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java (line
> 373) Exception encountered during startup
> java.io.IOError: java.io.IOException: Failed to delete
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
> at
>
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:372)
> at
>
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:415)
> at
>
> org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:193)
> at
>
> org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:356)
> at
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
> Caused by: java.io.IOException: Failed to delete
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
> at
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
> at
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
> at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:141)
> at
>
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:368)
> ... 4 more
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,923 SecondaryIndexManager.java
> (line 184) Creating new index : ColumnDefinition{name=6d6f64656c,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_model'}
>
> Thanks and Regards,
> Boris
>
>
>

Re: Rename failed while cassandra is starting up

Posted by Boris Yen <yu...@gmail.com>.
Hi Aaron,

"startup is single threaded and the scrub runs before the tables are opened
".

This is what I was thinking too. However, after using the debugger to trace
the code, I realized that MeteredFlusher (see the "countFlushBytes" method)
might open the sstables before the scrub is completed. I suppose this is
the cause of the exceptions I saw.

My plan is to add a boolean flag named "scrubCompleted" at
AbstractCassandraDaemon or StorageService. By default, it is false, after
the scrub is completed the AbstractCassandraDaemon needs to set it to true.
The MeterdFluster needs to make sure the scrub is completed by checking
this boolean value and starts to do all the calculation.

Is this a good plan? or it might have side effects?

Thanks and Regards,
Boris


On Mon, Apr 15, 2013 at 4:26 AM, aaron morton <aa...@thelastpickle.com>wrote:

> From the log messages, it looked like the table/keyspace was
> opened before the scrubDataDirectories was executed. This created a race
> condition between two threads.
>
> Seems odd.
> AFAIK that startup is single threaded and the scrub runs before the tables
> are opened. See AbstractCassandraDaemon.setup()
>
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
>
> (line 184) Creating new index :
> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_mobilityZoneUUID'}
> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
>
>
> Looks like a secondary index is being created at startup and there is an
> error renaming the file.
> OR
> The node was shut down before the index was built and it's been rebuilt at
> startup.
>
> Both of these are async operations and cause a race with
> scrubDirectories().
>
> Probably not the log replaying because it looks like the sstables have not
> been opened.
>
> I *think* the way around this is to um….
> * move all existing data and commit log out of the way
> * start with node with -Dcassandra.join_ring=false JVM option in
> cassandra-env.sh
> * check that all indexes are built using nodetool cfstats
> * shut it down
> * put the commit log and data dirs back in place.
>
> All we want to do is get the system KS updated, but in 1.0 that's a
> serialised object and not easy to poke.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/04/2013, at 3:50 PM, Boris Yen <yu...@gmail.com> wrote:
>
> Hi All,
>
> Recently, we encountered an error on 1.0.12 that prevented cassandra from
> starting up. From the log messages, it looked like the table/keyspace was
> opened before the scrubDataDirectories was executed. This created a race
> condition between two threads. One was trying to rename files while the
> other was trying to remove tmp files. I was wondering if anyone could
> provide us some information or workaround for this.
>
> INFO [MemoryMeter:1] 2013-04-09 02:49:39,868 Memtable.java (line 186)
> CFS(Keyspace='fmzd', ColumnFamily='alarm.fmzd_alarm_category') liveRatio is
> 3.7553409423470883 (just-counted was 3.1413828689370487).  calculation took
> 2ms for 265 columns
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,868 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-2 (83 bytes)
> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,868 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshRole-hd-1 (123 bytes)
> INFO [Creating index: alarm.fmzd_alarm_category] 2013-04-09 02:49:39,874
> ColumnFamilyStore.java (line 705) Enqueuing flush of
> Memtable-alarm.fmzd_alarm_category@413535513(14025/65835 serialized/live
> bytes, 275 ops)
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,877 SecondaryIndexManager.java
> (line 184) Creating new index : ColumnDefinition{name=6d65736853534944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_meshSSID'}
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,895 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-1 (122 bytes)
> INFO [SSTableBatchOpen:2] 2013-04-09 02:49:39,896 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_meshSSID-hd-2 (82 bytes)
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,900 SecondaryIndexManager.java
> (line 184) Creating new index :
> ColumnDefinition{name=6d6f62696c6974795a6f6e6555554944,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_mobilityZoneUUID'}
> ERROR [FlushWriter:1] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java
> (line 139) Fatal exception in thread Thread[FlushWriter:1,5,main]
> java.io.IOError: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:375)
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:302)
> at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:276)
> at org.apache.cassandra.db.Memtable.access$400(Memtable.java:49)
> at org.apache.cassandra.db.Memtable$4.runMayThrow(Memtable.java:299)
> at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: rename failed of
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-hd-21-Data.db
> at
>
> org.apache.cassandra.utils.FBUtilities.renameWithConfirm(FBUtilities.java:355)
> at
>
> org.apache.cassandra.io.sstable.SSTableWriter.rename(SSTableWriter.java:371)
> ... 9 more
> INFO [SSTableBatchOpen:1] 2013-04-09 02:49:39,917 SSTableReader.java (line
> 153) Opening /test/db/data/fmzd/ap.fmzd_ap_mobilityZoneUUID-hd-1 (312
> bytes)
> INFO [FlushWriter:2] 2013-04-09 02:49:39,916 Memtable.java (line 246)
> Writing Memtable-alarm.fmzd_alarm_alarmCode@402202831(2958/22542
> serialized/live bytes, 58 ops)
> ERROR [main] 2013-04-09 02:49:39,916 AbstractCassandraDaemon.java (line
> 373) Exception encountered during startup
> java.io.IOError: java.io.IOException: Failed to delete
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
> at
>
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:372)
> at
>
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:415)
> at
>
> org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:193)
> at
>
> org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:356)
> at
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
> Caused by: java.io.IOException: Failed to delete
> /test/db/data/fmzd/alarm.fmzd_alarm_alarmCode-tmp-hd-21-Statistics.db
> at
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
> at
> org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:44)
> at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:141)
> at
>
> org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:368)
> ... 4 more
> INFO [OptionalTasks:1] 2013-04-09 02:49:39,923 SecondaryIndexManager.java
> (line 184) Creating new index : ColumnDefinition{name=6d6f64656c,
> validator=org.apache.cassandra.db.marshal.UTF8Type, index_type=KEYS,
> index_name='fmzd_ap_model'}
>
> Thanks and Regards,
> Boris
>
>
>