Posted to user@accumulo.apache.org by Andrew Hulbert <ah...@ccri.com> on 2016/03/08 17:28:59 UTC

Recovery file versus directory

Hi folks,

We experienced a problem this morning with a recovery on 1.6.1 that went 
something like this:

FileNotFoundException: File does not exist: 
hdfs:///accumulo/recovery/<uuid>/failed/data

at Tablet.java:1410
at Tablet.java:1233
etc.
at TabletServer:2923

Interestingly enough, hdfs:///accumulo/recovery/<uuid>/failed was a 0-byte 
file, not a directory, and it was preventing tablets from being assigned. 
(I am not sure what caused the original failure, but I believe a tserver 
node was going down: the master indicated it was trying to shut down a 
tserver that was in such bad shape that someone just rekicked the node.)

I looked through the fixes for 1.6.2, 1.6.3, 1.6.4, and 1.6.5 but didn't 
see anything related on the release notes pages, though I haven't gone 
through all the tickets yet. I haven't been able to get anyone to upgrade 
to 1.6.5 yet, and perhaps it's already fixed.

Just wondering if that's something that has been seen before?

To fix it, I just deleted the "failed" file and recovery proceeded.

Thanks!

Andrew
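
For anyone hitting the same symptom, the file-versus-directory check 
described above can be scripted. This is only a rough sketch, not 
anything Accumulo ships: the function name classify_hdfs_path and its 
"run" parameter are made up for illustration. It assumes the "hdfs" 
command is on the PATH and relies on the exit status of 
"hdfs dfs -test" (0 when the test passes).

```python
import subprocess


def classify_hdfs_path(path, run=subprocess.run):
    """Return 'missing', 'directory', 'empty-file', or 'file' for an HDFS path.

    Hypothetical helper: it shells out to `hdfs dfs -test <flag> <path>` and
    treats exit status 0 as "test passed". The `run` parameter exists only so
    the command runner can be swapped out when no cluster is available.
    """
    def passes(flag):
        return run(["hdfs", "dfs", "-test", flag, path]).returncode == 0

    if not passes("-e"):      # -e: path exists
        return "missing"
    if passes("-d"):          # -d: path is a directory
        return "directory"
    if passes("-z"):          # -z: file is zero bytes long
        return "empty-file"
    return "file"
```

Calling classify_hdfs_path("hdfs:///accumulo/recovery/<uuid>/failed") 
with the real uuid filled in would report "empty-file" in the situation 
described above. Deleting the file is left as a manual step on purpose.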

Re: Recovery file versus directory

Posted by Andrew Hulbert <ah...@ccri.com>.
Looks like the only things we have in the gc logs are:

DEBUG: deleted [hdfs://../accumulo/wal/<uuid> ...]
DEBUG: Removing sorted WAL hdfs://...<uuid>

I can't tell whether they come before or after the time I deleted the 
file

hdfs://accumulo/wal/<uuid>/failed

Here's the other issue we were looking at:

https://issues.apache.org/jira/browse/ACCUMULO-3727

FYI, I originally increased the number of WALs to 8 to help batch-write 
ingest. I've now changed it to apply only to the tables that need 
ingest rather than the entire cluster, and reset the number of WALs for 
the cluster back to 3. I haven't had any errors since (3 days). Not sure 
why that would be a problem, except for the few times that the metadata 
table was involved.

Andrew

On 03/18/2016 09:43 AM, Andrew Hulbert wrote:
> I'll tar them up and see what I can find! Thanks.
>
> On 03/17/2016 08:18 PM, Michael Wall wrote:
>> Andrew,
>>
>> Sounds a lot like 
>> https://issues.apache.org/jira/browse/ACCUMULO-4157. I'll look to see 
>> if what you describe could also happen with this bug.  If you still 
>> have the gc logs, can you look for a message like "Removing WAL for 
>> offline server" with the uuid?
>>
>> Mike


Re: Recovery file versus directory

Posted by Andrew Hulbert <ah...@ccri.com>.
I'll tar them up and see what I can find! Thanks.

On 03/17/2016 08:18 PM, Michael Wall wrote:
> Andrew,
>
> Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157. 
> I'll look to see if what you describe could also happen with this 
> bug.  If you still have the gc logs, can you look for a message like 
> "Removing WAL for offline server" with the uuid?
>
> Mike


Re: Recovery file versus directory

Posted by Michael Wall <mj...@gmail.com>.
Andrew,

Sounds a lot like https://issues.apache.org/jira/browse/ACCUMULO-4157.
I'll look to see if what you describe could also happen with this bug.  If
you still have the gc logs, can you look for a message like "Removing WAL
for offline server" with the uuid?

Mike
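
The log search suggested above can be done with grep, or with a few 
lines of Python if the logs have already been collected somewhere. This 
is a minimal sketch: the function name find_wal_removals is hypothetical, 
and it only does plain substring matching because the exact gc log line 
format is not shown in this thread.

```python
def find_wal_removals(lines, uuid, message="Removing WAL for offline server"):
    """Return (line_number, text) pairs for lines containing both strings.

    `lines` is any iterable of log lines (for example an open file).
    Matching is plain substring matching; no log format is assumed.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if message in line and uuid in line:
            hits.append((number, line.rstrip("\n")))
    return hits
```

For example, with a gc log saved locally (the file name is an 
assumption), open it with open("gc.log") and pass the file object and 
the WAL uuid to find_wal_removals. An empty list means no matching line 
was found in that file.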
