Posted to common-dev@hadoop.apache.org by "Wei-Chiu Chuang (JIRA)" <ji...@apache.org> on 2018/05/12 05:43:00 UTC
[jira] [Resolved] (HADOOP-8640) DU thread transient failures propagate to callers
[ https://issues.apache.org/jira/browse/HADOOP-8640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang resolved HADOOP-8640.
-------------------------------------
Resolution: Won't Fix
Given that the refactor in HADOOP-12973 unintentionally eliminated this problem in 2.8.0 and above, I'll mark this as a won't fix.
> DU thread transient failures propagate to callers
> -------------------------------------------------
>
> Key: HADOOP-8640
> URL: https://issues.apache.org/jira/browse/HADOOP-8640
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs, io
> Affects Versions: 2.0.0-alpha, 1.2.1
> Reporter: Todd Lipcon
> Priority: Major
>
> When running some stress tests, I saw a failure where the DURefreshThread failed due to the filesystem changing underneath it:
> {code}
> org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/4/dfs/dn/current/BP-1928785663-172.20.90.20-1343880685858/current/rbw/blk_4637779214690837894': No such file or directory
> {code}
> (the block was probably finalized while the du process was running, which caused it to fail)
> The next block write then called {{getUsed()}}, and the exception was propagated, causing the write to fail. Since it was a pseudo-distributed cluster, the client was unable to pick a different node to write to, so the write failed.
> The current behavior of propagating the exception to the next (and only the next) caller doesn't seem well-thought-out.
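For illustration only, here is a minimal sketch of one way a background refresh could absorb a transient failure rather than hand it to the next caller. It is not the Hadoop DU class and not the change made in HADOOP-12973. The names UsageProbe and CachedUsage are hypothetical, and the probe stands in for whatever actually runs du. The assumption is that a slightly stale usage figure is preferable to a failed write.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the code that measures disk usage (for example, by running du). */
@FunctionalInterface
interface UsageProbe {
    long measure() throws IOException;
}

/** Illustrative sketch only; this is an assumption, not Hadoop's implementation. */
class CachedUsage {
    private final UsageProbe probe;
    private final AtomicLong used;
    // Most recent refresh failure, kept for diagnostics and never rethrown to callers.
    private volatile IOException lastFailure;

    CachedUsage(UsageProbe probe, long initialUsed) {
        this.probe = probe;
        this.used = new AtomicLong(initialUsed);
    }

    /** Intended to be called periodically by a background refresh thread. */
    void refresh() {
        try {
            used.set(probe.measure());
            lastFailure = null; // a successful refresh clears the recorded failure
        } catch (IOException e) {
            // Transient failure (e.g. a file vanished mid-scan): keep the last good value.
            lastFailure = e;
        }
    }

    /** Returns the last successfully measured value; never throws because of a failed refresh. */
    long getUsed() {
        return used.get();
    }

    /** Exposes the last failure so monitoring code can log or alert on it. */
    IOException getLastFailure() {
        return lastFailure;
    }
}
```

With this shape, a du run that fails because a block was finalized underneath it leaves getUsed() returning the previous value, and the next successful refresh replaces it. A caller that needs to know about persistent failures can check getLastFailure() explicitly.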