Posted to issues@hbase.apache.org by "Tim Robertson (JIRA)" <ji...@apache.org> on 2018/09/12 13:45:00 UTC

[jira] [Commented] (HBASE-21183) loadIncrementalHFiles sometimes throws FileNotFoundException on retry

    [ https://issues.apache.org/jira/browse/HBASE-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612141#comment-16612141 ] 

Tim Robertson commented on HBASE-21183:
---------------------------------------

I was able to get partial access to the RegionServer (RS) logs for the time this occurred. Notably, several RSs renewed their TGT during the 60 seconds covered by the logs above, and the same window contains the following (presumably meaning that clients were reinitialized):

{code:bash}
WARN ... StandbyException ... Operation Category READ is not supported in state standby
WARN ... StandbyException ... Operation Category READ is not supported in state standby
WARN ... StandbyException ... Operation Category READ is not supported in state standby
INFO ... Trying to fail over immediately
{code}

The failover controller logs do not indicate any failover during this period, and the HDFS audit logs had already been rolled over.

> loadIncrementalHFiles sometimes throws FileNotFoundException on retry
> ---------------------------------------------------------------------
>
>                 Key: HBASE-21183
>                 URL: https://issues.apache.org/jira/browse/HBASE-21183
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Tim Robertson
>            Priority: Major
>
> On a nightly batch job which prepares hundreds of well-balanced HFiles of around 2 GB each, we see sporadic bulk load failures.
> I'm unable to paste the logs here (different network) but they show e.g. the following on a failing day:
> {code:java}
> Trying to load hfile... /my/input/path/...
> Attempt to bulk load region containing ... failed. This is recoverable and will be retried
> Attempt to bulk load region containing ... failed. This is recoverable and will be retried
> Attempt to bulk load region containing ... failed. This is recoverable and will be retried
> Split occurred while grouping HFiles, retry attempt 1 with 3 files remaining to group or split
> Trying to load hfile...
> IOException during splitting
> java.io.FileNotFoundException: File does not exist: /my/input/path/...
> {code}
> The exception is thrown from [this line|https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java#L685].
>   
>  I should note that this is a secure cluster (CDH 5.12.x).
> I've gone through the code and don't see an obvious race condition. I also don't see any related changes in the later 1.x versions, so I presume the issue still exists in 1.5.
> I have yet to get access to the NameNode audit logs for when this occurs, so that I can trace the rename() calls around these particular files.
> I don't see timeouts like those in HBASE-4030.


