Posted to user@cassandra.apache.org by Jake Maizel <ja...@soundcloud.com> on 2011/08/27 12:40:05 UTC
Repairing lost data
Hello,
In a cluster running 0.6.6 one node lost part of a data file due to an
operator error. An older file was moved in place to bring cassandra
up again.
Now we get lots of these in the log:
2011-08-27_10:30:55.26219 'ERROR [ROW-READ-STAGE:4327] 10:30:55,258
CassandraDaemon.java:87 Uncaught exception in thread
Thread[ROW-READ-STAGE:4327,5,main]
2011-08-27_10:30:55.26219 'java.lang.ArrayIndexOutOfBoundsException
2011-08-27_10:30:55.26220 at
org.apache.cassandra.io.util.BufferedRandomAccessFile.read(BufferedRandomAccessFile.java:326)
2011-08-27_10:30:55.26220 at
java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
2011-08-27_10:30:55.26221 at
java.io.DataInputStream.readUTF(DataInputStream.java:592)
2011-08-27_10:30:55.26221 at
java.io.RandomAccessFile.readUTF(RandomAccessFile.java:887)
2011-08-27_10:30:55.26222 at
org.apache.cassandra.db.filter.SSTableSliceIterator$ColumnGroupReader.<init>(SSTableSliceIterator.java:125)
2011-08-27_10:30:55.26222 at
org.apache.cassandra.db.filter.SSTableSliceIterator.<init>(SSTableSliceIterator.java:59)
2011-08-27_10:30:55.26223 at
org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:63)
2011-08-27_10:30:55.26223 at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:990)
2011-08-27_10:30:55.26224 at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:901)
2011-08-27_10:30:55.26224 at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:870)
2011-08-27_10:30:55.26224 at
org.apache.cassandra.db.Table.getRow(Table.java:382)
2011-08-27_10:30:55.26225 at
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:59)
2011-08-27_10:30:55.26225 at
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:70)
2011-08-27_10:30:55.26226 at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:49)
2011-08-27_10:30:55.26226 at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
2011-08-27_10:30:55.26227 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
2011-08-27_10:30:55.26227 at java.lang.Thread.run(Thread.java:662)
Is it possible to use nodetool repair to fix this with the current data set?
I issued a repair command and the other nodes seem to be doing the
correct things, but I'm concerned by this: "Uncaught exception in thread
Thread[ROW-READ-STAGE:4327,5,main]"
Will the affected node ever be able to recover?
Also, only the Data file was affected; the Index and Filter files are
still the originals. Should I keep these or do anything else with
them?
My alternative is to delete all the data and run repair again, which I
have done in the past; it works, but it takes a while with a large
data set.
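The wipe-and-repair alternative described above could be sketched as follows. This is a dry-run sketch only, using a placeholder directory under /tmp and made-up sstable names: on a real node the data directory is whatever Cassandra's configuration points at, the node must be stopped before deleting anything, and the last step would be running `nodetool -h <host> repair` once the node is back up.

```shell
#!/bin/sh
# Dry-run sketch of the wipe-and-repair alternative, under /tmp.
DATA_DIR=/tmp/cassandra-wipe-demo/data      # placeholder data directory

mkdir -p "$DATA_DIR"
: > "$DATA_DIR/Standard1-1-Data.db"         # simulate existing sstables
: > "$DATA_DIR/Standard1-1-Index.db"

# Stop the node first on a real system, then drop everything:
rm -rf "$DATA_DIR"
mkdir -p "$DATA_DIR"                        # empty dir for the node to restart into
# ...restart Cassandra, then: nodetool -h <host> repair
```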
I am open to ideas and any suggestions are welcome.
--
Jake Maizel
Head of Network Operations
Soundcloud
Mail & GTalk: jake@soundcloud.com
Skype: jakecloud
Rosenthaler strasse 13, 101 19, Berlin, DE
Re: Repairing lost data
Posted by Anthony Molinaro <an...@alumni.caltech.edu>.
I'm pretty sure that was a bug fixed in a later 0.6.x release, so you might be able to upgrade and the exceptions might go away. We run 0.6.13 with a minor mod to support data expiration, and we will probably do so indefinitely, since there's no way to upgrade without shutting our site down :(
-Anthony
On Aug 27, 2011, at 3:40 AM, Jake Maizel <ja...@soundcloud.com> wrote:
> [original message quoted in full; snipped]
Re: Repairing lost data
Posted by Peter Schuller <pe...@infidyne.com>.
> Is it possible to use nodetool repair to fix this with the current data set?
>
> I issued a repair command and the other nodes seem to be doing the
> correct things, but I'm concerned by this: "Uncaught exception in thread
> Thread[ROW-READ-STAGE:4327,5,main]"
>
> Will the affected node ever be able to recover?
Since it seems you're willing to keep the node up with the missing
data, I would remove (move, just to be safe) the bf+index files
corresponding to the over-written data. You definitely don't want
bf/index files that do not match the Data file.
After that, a repair will propagate the missing data from other nodes.
(Implicit is that you do this with the node turned off, not just
"live" while the node is running.)
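That sequence could be sketched roughly as below. This is a dry-run sketch only: it uses placeholder paths under /tmp and a made-up keyspace, column family, and sstable generation, and assumes the 0.6-era component naming `<ColumnFamily>-<generation>-<Component>.db`. On a real node you would point DATA_DIR at the actual keyspace data directory, run this only while Cassandra is stopped, and finish with `nodetool -h <host> repair` after restarting.

```shell
#!/bin/sh
# Dry-run sketch: quarantine the Index/Filter files that belong to a
# damaged Data file. Keyspace dir, CF name, and generation are made up.
DATA_DIR=/tmp/cassandra-demo/Keyspace1      # placeholder keyspace data dir
QUARANTINE=/tmp/cassandra-demo/quarantine
CF=Standard1                                # placeholder column family
GEN=42                                      # generation of the damaged sstable

mkdir -p "$DATA_DIR" "$QUARANTINE"
for part in Data Index Filter; do           # simulate the on-disk files
  : > "$DATA_DIR/$CF-$GEN-$part.db"
done

# The actual step: move the bloom filter and index aside, leave Data in place.
for part in Index Filter; do
  mv "$DATA_DIR/$CF-$GEN-$part.db" "$QUARANTINE/"
done
```

Moving rather than deleting keeps a way back if the wrong files get picked.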
As to whether the exception you're seeing is expected when the
bf/index is out of sync with the Data file: I don't know without
looking at the 0.6.6 codebase, but it seems like a plausible error to
trigger under such conditions. That's speaking solely from the context
and the stack trace, not from the code.
But note: removing data from a node "under its feet" *will* violate
consistency, since the node will be missing data without "knowing"
it's missing. So, for example (but not limited to), a read at CL.ONE
that goes to that node will fail to return data, or may return stale
data if the missing data files contained newer versions of data that
exists elsewhere in sstables on the node.
--
/ Peter Schuller (@scode on twitter)