Posted to mapreduce-dev@hadoop.apache.org by "Peter Bacsko (JIRA)" <ji...@apache.org> on 2017/11/27 15:07:00 UTC
[jira] [Created] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded
Peter Bacsko created MAPREDUCE-7015:
---------------------------------------
Summary: Possible race condition in JHS if the job is not loaded
Key: MAPREDUCE-7015
URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: jobhistoryserver
Reporter: Peter Bacsko
Assignee: Peter Bacsko
There appears to be a race condition inside the JobHistoryServer (JHS). In our build environment, {{TestMRJobClient.testJobClient()}} failed with this exception:
{noformat}
java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
{noformat}
Root cause:
1. MapReduce job completes
2. CLI calls {{cluster.getJob(jobid)}}
3. The job is finished and the client side gets redirected to JHS
4. The job data is missing from CachedHistoryStorage so JHS tries to find the job
5. First it scans the intermediate directory and finds the job
6. A call to moveToDone() is scheduled for execution on a separate thread inside moveToDoneExecutor, but it does not get the chance to run immediately
7. The RPC invocation returns with the path pointing to /tmp/hadoop-yarn/staging/history/done_intermediate
8. The call to moveToDone() completes, which moves the contents of done_intermediate to done
9. The Hadoop CLI tries to download the config file from done_intermediate, but it is no longer there
Usually the moveToDone() call scheduled in step #6 completes before the RPC returns in step #7, but sometimes it lags behind, causing this race condition.
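The interleaving in steps 5-9 can be sketched outside Hadoop with plain java.nio and an executor. This is only an illustration under stated assumptions, not the JHS code: the class name, the file name job_0001_conf.xml, and the latch are hypothetical, and a local temp directory stands in for HDFS. The latch forces each ordering deterministically, so the failing case does not depend on thread timing.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-alone model of the race; none of these names come from Hadoop
// except the directory names and the executor name borrowed from the report.
public class MoveToDoneRaceSketch {

    /**
     * Simulates one lookup. Returns true if the client still finds the file
     * at the path that was handed back to it.
     */
    static boolean clientFindsFile(boolean moveRunsBeforeClientReads) throws Exception {
        Path root = Files.createTempDirectory("history");
        Path intermediate = Files.createDirectory(root.resolve("done_intermediate"));
        Path done = Files.createDirectory(root.resolve("done"));
        Path conf = Files.createFile(intermediate.resolve("job_0001_conf.xml"));

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch clientHasRead = new CountDownLatch(1);
        try {
            // Steps 5-6: the job is found in done_intermediate and the move is
            // scheduled on another thread instead of being run inline.
            Future<Path> move = moveToDoneExecutor.submit(() -> {
                if (!moveRunsBeforeClientReads) {
                    clientHasRead.await(); // hold the move back until the client has read
                }
                return Files.move(conf, done.resolve(conf.getFileName()));
            });

            // Step 7: the "RPC" returns the path that was valid at scan time.
            Path returnedToClient = conf;

            // Step 8: in the failing interleaving the move completes now.
            if (moveRunsBeforeClientReads) {
                move.get();
            }

            // Step 9: the client tries to use the path it was given.
            boolean found = Files.exists(returnedToClient);

            clientHasRead.countDown(); // let a delayed move proceed
            move.get();                // wait so the executor can shut down cleanly
            return found;
        } finally {
            moveToDoneExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("client reads before move: found=" + clientFindsFile(false));
        System.out.println("move runs before client reads: found=" + clientFindsFile(true));
    }
}
```

In this model the second call reports found=false, which corresponds to the FileNotFoundException in the stack trace above: the path was correct when it was returned and stale by the time it was used.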