Posted to yarn-issues@hadoop.apache.org by "omkar vinit joshi (JIRA)" <ji...@apache.org> on 2013/03/06 02:35:12 UTC
[jira] [Commented] (YARN-99) Jobs fail during resource localization when directories in file cache reaches to unix directory limit

    [ https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13594210#comment-13594210 ] 

omkar vinit joshi commented on YARN-99:
---------------------------------------

The problem of too many files (more than the maximum number of entries allowed per directory) can occur at the two locations marked with * below, and it needs to be fixed in both places. The directory structure shown is per local directory.
[For-any-local-dir]
    ---- [filecache *]
    ---- [usercache]
        ---- [userid]
            ---- [filecache *]
            ---- [appcache]
                ---- [appid]
                    ---- [filecache]
In the application-specific filecache it is highly improbable that we would hit that limit, whereas in the other two places it is highly likely.

Proposed solution: Add an internal parameter (not externally configurable) to "localResourcesTrackerImpl" to control the hierarchical behavior for different types of resources.
For hierarchical resources, manage the hierarchy as follows:
<orig-dir> = a path ending with "filecache"
<orig-dir>
    ---- (8192 / 8k localized files)
    ---- 26 directories (a-z) (created only if the directory limit is reached)
        ---- [a]
            ---- (8192 / 8k localized files)
            ---- 26 directories (a-z)
        .
        .
        .
Reasons for creating directories with single-character names:
1) On Windows we also have a maximum path length limit (~255 characters), so short directory names keep the localized paths short.
2) With this hierarchy we can accommodate >1M files with 3 levels, which is sufficient in practice.
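
To make the capacity claim concrete, here is a minimal sketch of the arithmetic and of one possible way to map a file's sequence number to a sub-directory. It is not the actual NodeManager code. The class name, method names, the breadth-first fill order, and counting <orig-dir> itself as level 1 are all assumptions made for illustration; only the numbers 8192 and 26 come from the proposal above.

```java
// Hypothetical sketch; names and fill order are illustrative assumptions.
class FileCacheHierarchySketch {
    static final int FILES_PER_DIR = 8192;   // localized files per directory (from the proposal)
    static final int SUBDIRS_PER_DIR = 26;   // sub-directories a-z (from the proposal)

    /** Total files that fit in a tree with the given number of levels (level 1 = <orig-dir>). */
    static long capacity(int levels) {
        long total = 0;
        long dirsAtLevel = 1;
        for (int i = 0; i < levels; i++) {
            total += dirsAtLevel * FILES_PER_DIR;
            dirsAtLevel *= SUBDIRS_PER_DIR;
        }
        return total;
    }

    /**
     * Directory, relative to <orig-dir>, for the n-th localized file (0-based).
     * Directories are filled breadth-first: "", a..z, a/a..a/z, b/a, and so on.
     */
    static String relativeDir(long n) {
        long dirIndex = n / FILES_PER_DIR;   // 0 = <orig-dir>, 1..26 = a..z, 27 = a/a, ...
        StringBuilder path = new StringBuilder();
        while (dirIndex > 0) {
            dirIndex--;                      // shift to 0-based so 'a' acts as a real digit
            path.insert(0, (char) ('a' + dirIndex % SUBDIRS_PER_DIR));
            dirIndex /= SUBDIRS_PER_DIR;
            if (dirIndex > 0) {
                path.insert(0, '/');
            }
        }
        return path.toString();
    }

    public static void main(String[] args) {
        System.out.println(capacity(3));           // 8192 * (1 + 26 + 676)
        System.out.println(relativeDir(8192));     // first file that spills into a sub-directory
        System.out.println(relativeDir(27L * 8192));
    }
}
```

Under these assumptions, three levels hold 8192 * (1 + 26 + 676) = 5,758,976 files, which is consistent with the ">1M files" estimate, and the deepest relative path is only three characters long ("a/a").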

Let me know if I need to look at any specific scenario or corner case, or if the design needs to be modified.

                
> Jobs fail during resource localization when directories in file cache reaches to unix directory limit
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-99
>                 URL: https://issues.apache.org/jira/browse/YARN-99
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>
> If we have multiple jobs that use the distributed cache with small files, the directory limit is reached before the cache size limit, and creating any further directories in the file cache fails. The jobs start failing with the exception below.
> {code:xml}
> java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
> 	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
> 	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
> 	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> 	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
> 	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
> 	at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
> 	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
> 	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
> 	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> {code}
> We should have a mechanism to clean up cached files once the number of directories crosses a specified limit, similar to the existing cache size limit.
