Posted to mapreduce-dev@hadoop.apache.org by "Peter Bacsko (JIRA)" <ji...@apache.org> on 2017/11/27 15:07:00 UTC
[jira] [Created] (MAPREDUCE-7015) Possible race condition in JHS if the job is not loaded

Peter Bacsko created MAPREDUCE-7015:
---------------------------------------

             Summary: Possible race condition in JHS if the job is not loaded
                 Key: MAPREDUCE-7015
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7015
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobhistoryserver
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


There could be a race condition inside JHS. In our build environment, {{TestMRJobClient.testJobClient()}} failed with this exception:

{noformat}
java.io.FileNotFoundException: File does not exist: hdfs://localhost:32836/tmp/hadoop-yarn/staging/history/done_intermediate/jenkins/job_1509975084722_0001_conf.xml
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2123)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2092)
	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2068)
	at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:460)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.mapreduce.TestMRJobClient.runTool(TestMRJobClient.java:94)
	at org.apache.hadoop.mapreduce.TestMRJobClient.testConfig(TestMRJobClient.java:551)
	at org.apache.hadoop.mapreduce.TestMRJobClient.testJobClient(TestMRJobClient.java:167)
{noformat}

Root cause:
1. MapReduce job completes
2. CLI calls {{cluster.getJob(jobid)}}
3. The job is finished and the client side gets redirected to JHS
4. The job data is missing from CachedHistoryStorage so JHS tries to find the job
5. First it scans the intermediate directory and finds the job
6. A call to moveToDone() is scheduled on a separate thread inside moveToDoneExecutor, but it does not get the chance to run immediately
7. RPC invocation returns with the path pointing to /tmp/hadoop-yarn/staging/history/done_intermediate
8. The call to moveToDone() completes which moves the contents of done_intermediate to done
9. Hadoop CLI tries to download the config file from done_intermediate but it's no longer there

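Usually the moveToDone() call scheduled in step #6 finishes before the RPC invocation returns in step #7, so the client receives a path that is still valid. Sometimes it lags behind and only completes after the path has been handed out, which causes this race condition.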
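
The interleaving in steps 5 to 9 can be reproduced outside Hadoop with a minimal sketch. This is not the JHS code: the class name {{MoveToDoneRaceSketch}}, the method {{reproduce()}}, the file name {{job_conf.xml}} and the two latches are all hypothetical, and a local temp directory stands in for HDFS. The latches only force the unlucky ordering so that the failure is deterministic.

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone model of the race; none of these names come from Hadoop.
public class MoveToDoneRaceSketch {

    /** Returns true if the simulated client finds the file missing at the returned path. */
    static boolean reproduce() throws Exception {
        Path root = Files.createTempDirectory("jhs-race");
        Path intermediate = Files.createDirectory(root.resolve("done_intermediate"));
        Path done = Files.createDirectory(root.resolve("done"));
        // Step 5: the job's config file is found in the intermediate directory.
        Path conf = Files.write(intermediate.resolve("job_conf.xml"),
                "<configuration/>".getBytes());

        ExecutorService moveToDoneExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch pathReturned = new CountDownLatch(1);
        CountDownLatch moved = new CountDownLatch(1);
        try {
            // Step 6: the move is scheduled on another thread. The latch models
            // the task not getting a chance to run before the RPC returns.
            moveToDoneExecutor.submit(() -> {
                pathReturned.await();
                Files.move(conf, done.resolve(conf.getFileName()));
                moved.countDown();
                return null;
            });

            // Step 7: the "RPC" returns a path that still points at done_intermediate.
            Path returnedToClient = conf;
            pathReturned.countDown();

            // Step 8: the delayed move now completes.
            if (!moved.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("move did not finish in time");
            }

            // Step 9: the client reads from the stale path.
            try {
                Files.readAllBytes(returnedToClient);
                return false;
            } catch (NoSuchFileException e) {
                return true;
            }
        } finally {
            moveToDoneExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("client saw missing file: " + reproduce());
    }
}
```

Without the latches the outcome depends on thread scheduling, which matches the intermittent nature of the test failure. The sketch only illustrates the ordering problem; it does not suggest a particular fix.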



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
