Posted to user@accumulo.apache.org by Andrew Hulbert <ah...@ccri.com> on 2016/03/08 17:28:59 UTC
Recovery file versus directory
Hi folks,
We experienced a problem this morning with a recovery on 1.6.1 that went
something like this:
FileNotFoundException: File does not exist:
hdfs:///accumulo/recovery/<uuid>/failed/data
at Tablet.java:1410
at Tablet.java:1233
etc.
at TabletServer:2923
Interestingly enough, hdfs:///accumulo/recovery/<uuid>/failed was a
0-byte file, not a directory, and it was preventing tablets from getting
assigned. (I am not sure what caused the original failure, but I believe
a tserver node was going down: the master indicated it was trying to
shut down a tserver that was in such bad shape that someone just
rekicked the node.)
I looked through the fixes for 1.6.2, 1.6.3, 1.6.4, and 1.6.5 but didn't
see anything related on the release notes pages, though I haven't gone
through all the tickets yet. I haven't been able to get anyone to upgrade
to 1.6.5 yet, and perhaps it's already fixed.
Just wondering if that's something that has been seen before?
To fix it, I just deleted the failed file and it proceeded.
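For anyone checking for the same condition, here is a rough sketch of how one might spot it before deleting anything. It assumes the recursive listing from `hdfs dfs -ls -R` has been saved as text in the usual eight-column layout (permissions, replication, owner, group, size, date, time, path). The function name `find_file_markers` is hypothetical and not part of Accumulo or Hadoop. It only reports entries named `failed` that are regular files rather than directories; it does not delete anything.

```python
import posixpath


def find_file_markers(ls_output, name="failed"):
    """Return (path, size) for regular files called `name` in a saved
    `hdfs dfs -ls -R` listing (assumed eight-column format)."""
    hits = []
    for line in ls_output.splitlines():
        # perms, replication, owner, group, size, date, time, path
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[4].isdigit():
            continue  # skip "Found N items" headers and anything unparseable
        perms, size, path = fields[0], int(fields[4]), fields[7]
        # A leading "-" marks a regular file; directories start with "d".
        if perms.startswith("-") and posixpath.basename(path) == name:
            hits.append((path, size))
    return hits
```

A hit such as `/accumulo/recovery/<uuid>/failed` with size 0 would correspond to what we saw; a directory with that name would not be reported.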
Thanks!
Andrew
Re: Recovery file versus directory
Posted by Andrew Hulbert <ah...@ccri.com>.
Looks like the only things we have in the gc logs are:
DEBUG: deleted [hdfs://../accumulo/wal/<uuid> ...]
DEBUG: Removing sorted WAL hdfs://...<uuid>
I can't tell whether they were logged before or after I deleted the
file
hdfs://accumulo/wal/<uuid>/failed
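To line these entries up against the time of the manual delete, a small sketch like the following could pull out every gc log line that mentions a given uuid together with one of the messages discussed in this thread. The function name `matching_log_lines` is hypothetical, the default fragments are simply the message texts quoted in this thread, and no particular timestamp format is assumed: lines are returned in file order with their line numbers.

```python
def matching_log_lines(lines, uuid, fragments=(
        "Removing WAL for offline server",
        "Removing sorted WAL",
        "deleted")):
    """Return (line_number, text) for log lines that contain `uuid`
    and at least one of the message fragments."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if uuid in line and any(fragment in line for fragment in fragments):
            hits.append((lineno, line.rstrip("\n")))
    return hits
```

Passing an open file handle for the gc log as `lines` works, since the function only iterates over it once.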
Here's the other issue we were looking at:
https://issues.apache.org/jira/browse/ACCUMULO-3727
FYI, I originally increased the number of WALs to 8 to help batch-write
ingest. I've since applied that setting only to the tables that needed
it for ingest rather than the entire cluster, reset the cluster-wide
value back to 3, and haven't had any errors since (3 days). Not sure why
that would be a problem, except for the few times that the metadata
table was involved.
Andrew
On 03/18/2016 09:43 AM, Andrew Hulbert wrote:
> I'll tar them up and see what I can find! Thanks.
>
> On 03/17/2016 08:18 PM, Michael Wall wrote:
>> Andrew,
>>
>> Sounds a lot like
>> https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see
>> if what you describe could also happen with this bug. If you still
>> have the gc logs, can you look for a message like "Removing WAL for
>> offline server" with the uuid?
>>
>> Mike
Re: Recovery file versus directory
Posted by Andrew Hulbert <ah...@ccri.com>.
I'll tar them up and see what I can find! Thanks.
On 03/17/2016 08:18 PM, Michael Wall wrote:
> Andrew,
>
> Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157.
> I'll look to see if what you describe could also happen with this
> bug. If you still have the gc logs, can you look for a message like
> "Removing WAL for offline server" with the uuid?
>
> Mike
Re: Recovery file versus directory
Posted by Michael Wall <mj...@gmail.com>.
Andrew,
Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157.
I'll look to see if what you describe could also happen with this bug. If
you still have the gc logs, can you look for a message like "Removing WAL
for offline server" with the uuid?
Mike