Posted to user@cassandra.apache.org by Jake Maizel <ja...@soundcloud.com> on 2011/08/27 12:40:05 UTC

Repairing lost data

Hello,

In a cluster running 0.6.6, one node lost part of a data file due to an
operator error.  An older file was moved into place to bring Cassandra
up again.

Now we get lots of these in the log:

2011-08-27_10:30:55.26219 ERROR [ROW-READ-STAGE:4327] 10:30:55,258 CassandraDaemon.java:87 Uncaught exception in thread Thread[ROW-READ-STAGE:4327,5,main]
2011-08-27_10:30:55.26219 java.lang.ArrayIndexOutOfBoundsException
2011-08-27_10:30:55.26220     at org.apache.cassandra.io.util.BufferedRandomAccessFile.read(BufferedRandomAccessFile.java:326)
2011-08-27_10:30:55.26220     at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
2011-08-27_10:30:55.26221     at java.io.DataInputStream.readUTF(DataInputStream.java:592)
2011-08-27_10:30:55.26221     at java.io.RandomAccessFile.readUTF(RandomAccessFile.java:887)
2011-08-27_10:30:55.26222     at org.apache.cassandra.db.filter.SSTableSliceIterator$ColumnGroupReader.<init>(SSTableSliceIterator.java:125)
2011-08-27_10:30:55.26222     at org.apache.cassandra.db.filter.SSTableSliceIterator.<init>(SSTableSliceIterator.java:59)
2011-08-27_10:30:55.26223     at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:63)
2011-08-27_10:30:55.26223     at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:990)
2011-08-27_10:30:55.26224     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:901)
2011-08-27_10:30:55.26224     at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:870)
2011-08-27_10:30:55.26224     at org.apache.cassandra.db.Table.getRow(Table.java:382)
2011-08-27_10:30:55.26225     at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:59)
2011-08-27_10:30:55.26225     at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:70)
2011-08-27_10:30:55.26226     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:49)
2011-08-27_10:30:55.26226     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
2011-08-27_10:30:55.26227     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
2011-08-27_10:30:55.26227     at java.lang.Thread.run(Thread.java:662)

Is it possible to use nodetool repair to fix this with the current data set?

I issued a repair command and the other nodes seem to be doing the
right things, but I'm concerned by this: "Uncaught exception in thread
Thread[ROW-READ-STAGE:4327,5,main]"
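
For reference, the repair was kicked off roughly like this (the host name
is a placeholder, and this is from memory rather than a copy of the exact
command line):

  # ask the affected node to run anti-entropy repair over JMX
  nodetool -host cass-node-07.example.com repair

  # sanity check that the node still shows as Up while data streams in
  nodetool -host cass-node-07.example.com ring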

Will the affected node ever be able to function normally again?

Also, only the Data file was affected; the Index and Filter files are
still the originals.  Should I keep them or do anything else with
them?

My alternative is to delete all the data on the node and run repair
again, which I have done in the past and it works, but it takes a while
with a large data set.
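
For completeness, the full-rebuild alternative I've used before looks
roughly like this (keyspace name and paths are examples; the real data
directory is whatever <DataFileDirectory> points at in storage-conf.xml):

  # on the affected node, with Cassandra stopped:
  # move the application keyspace's data aside (keep it until the rebuild
  # is verified), but leave the system/ directory alone so the node keeps
  # its token
  mv /var/lib/cassandra/data/MyKeyspace /var/lib/cassandra/data/MyKeyspace.old

  # start Cassandra again, then stream the keyspace back from the replicas
  nodetool -host cass-node-07.example.com repair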

I am open to ideas and any suggestions are welcome.

-- 
Jake Maizel
Head of Network Operations
Soundcloud

Mail & GTalk: jake@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE

Re: Repairing lost data

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.
I'm pretty sure that was a bug fixed in a later 0.6.x release, so you might be able to upgrade and the exceptions might go away.  We run 0.6.13 with a minor mod to support data expiration, and we will probably do so indefinitely since there's no way to upgrade without shutting our site down :(

-Anthony

On Aug 27, 2011, at 3:40 AM, Jake Maizel <ja...@soundcloud.com> wrote:

> Hello,
> 
> In a cluster running 0.6.6 one node lost part of a data file due to an
> operator error.  An older file was moved in place to bring cassandra
> up again.
> 
> Now we get lots of these in the log:
> 
> [stack trace snipped; same trace as in the original message above]
> 
> Is it possible to use nodetool repair to fix this with the current data set?
> 
> I issued a repair command and the other nodes seem to be doing the
> correct things but I concerned  by this: "Uncaught exception in thread
> Thread[ROW-READ-STAGE:4327,5,main]"
> 
> Will the affect node ever be able to do anything?
> 
> Also, only Data file was affected, the index and Filter files are
> still the originals.  Should I keep these or do anything else with
> them?
> 
> My alternative is to delete all the data and run repair again which I
> have done in the past and it works but takes a while with a large data
> set.
> 
> I am open to ideas and any suggestions are welcome.
> 
> -- 
> Jake Maizel
> Head of Network Operations
> Soundcloud
> 
> Mail & GTalk: jake@soundcloud.com
> Skype: jakecloud
> 
> Rosenthaler strasse 13, 101 19, Berlin, DE

Re: Repairing lost data

Posted by Peter Schuller <pe...@infidyne.com>.
> Is it possible to use nodetool repair to fix this with the current data set?
>
> I issued a repair command and the other nodes seem to be doing the
> correct things but I concerned  by this: "Uncaught exception in thread
> Thread[ROW-READ-STAGE:4327,5,main]"
>
> Will the affect node ever be able to do anything?

Since it seems you're willing to keep the node up with the missing
data, I would remove (MOVE aside, just to be safe) the bf+index files
corresponding to the overwritten data. You definitely don't want
bf/index files that do not match the data.

After that, a repair will propagate the missing data from other nodes.

(Implicitly, you do this with the node turned off, not just
"live" while the node is running.)

As to whether or not the exception you're seeing is expected when you
have a bf/index that is out of sync with the data file: I don't know,
and one would have to check the 0.6.6 codebase to be sure, but it seems
like a plausible error to trigger under such conditions. That is
speaking solely from the context and the stack trace, not from looking
at the code.

But note: removing data from a node "under its feet" *will* violate
consistency, since the node will be missing data without "knowing" that
it's missing. So, for example (but not limited to), a read at CL.ONE
that goes to that node will fail to return the data, or may return old
data if the missing data files contained newer versions of data that
exists elsewhere in sstables on the node.

-- 
/ Peter Schuller (@scode on twitter)