Posted to yarn-issues@hadoop.apache.org by "Omkar Vinit Joshi (JIRA)" <ji...@apache.org> on 2013/04/03 20:07:16 UTC

[jira] [Commented] (YARN-99) Jobs fail during resource localization when private distributed-cache hits unix directory limits

    [ https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621128#comment-13621128 ] 

Omkar Vinit Joshi commented on YARN-99:
---------------------------------------

Rebasing the patch, as YARN-467 is now committed.
This issue is closely related to YARN-467; the underlying problem and the proposed/implemented solution are described in detail here: [underlying problem and proposed/implemented solution | https://issues.apache.org/jira/browse/YARN-467?focusedCommentId=13615894&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13615894]

The only difference here is that the same problem exists in <local-dir>/usercache/<user-name>/filecache (the private user cache). We use LocalCacheDirectoryManager for the user cache but not for the app cache, since it is highly unlikely for a single application to localize that many files.
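To make the directory-manager idea concrete, here is a minimal sketch (hypothetical class and method names, not the actual LocalCacheDirectoryManager API) of how localized resources can be spread across nested sub-directories so that no single directory exceeds a fixed entry limit:

```java
// Sketch (hypothetical names): derive a relative sub-directory for the
// n-th localized resource so each directory holds at most `limit` files
// and at most `limit` sub-directories, keeping every level below the
// unix per-directory limit. The real LocalCacheDirectoryManager is
// similar in spirit but uses its own (base-36) path scheme.
public class CacheDirSketch {

    /** Relative directory ("" means the cache root) for the n-th resource. */
    public static String relPathFor(long n, int limit) {
        long dir = n / limit;              // index of the leaf directory
        StringBuilder sb = new StringBuilder();
        while (dir > 0) {                  // render the index as base-`limit` digits
            sb.insert(0, "/" + (dir % limit));
            dir /= limit;
        }
        return sb.length() == 0 ? "" : sb.substring(1); // drop leading '/'
    }

    public static void main(String[] args) {
        // With limit = 2: resources 0,1 land in the cache root,
        // 2,3 in "1", 4,5 in "1/0", and so on.
        for (long n = 0; n < 6; n++) {
            System.out.println(n + " -> \"" + relPathFor(n, 2) + "\"");
        }
    }
}
```

Because the leaf-directory index is rendered digit by digit, each directory level fans out instead of growing without bound, so the mkdir failure in the stack trace below cannot recur no matter how many small files are cached.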

In the earlier implementation, the localized path for the private cache was computed inside each ContainerLocalizer, i.e. in a separate process per container. To centralize this, the path computation has been moved into ResourceLocalizationService.LocalizerRunner, and the assigned path is communicated to each ContainerLocalizer as part of the heartbeat response. This lets us manage the LocalCacheDirectory state in one place.
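The centralized flow can be sketched as follows (hypothetical types and names, not the real YARN protocol classes): the NM-side runner is the single authority for cache paths, and each ContainerLocalizer simply reports its pending resources in a heartbeat and receives fully resolved destination directories back.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical names): one runner process assigns all cache
// directories, so concurrent localizer processes can never disagree
// about the layout or overfill a directory.
public class CentralizedLocalizationSketch {

    /** One download instruction returned in a heartbeat response. */
    record Assignment(String sourceUri, String destDir) {}

    static class RunnerSketch {
        private final String cacheRoot;    // e.g. the private filecache root
        private final int perDirLimit;
        private long next = 0;             // resources assigned so far

        RunnerSketch(String cacheRoot, int perDirLimit) {
            this.cacheRoot = cacheRoot;
            this.perDirLimit = perDirLimit;
        }

        /** Heartbeat handler: resolve a destination for every pending URI. */
        synchronized List<Assignment> onHeartbeat(List<String> pendingUris) {
            List<Assignment> out = new ArrayList<>();
            for (String uri : pendingUris) {
                long leaf = next++ / perDirLimit;   // spread across sub-dirs
                String dest = cacheRoot + (leaf == 0 ? "" : "/" + leaf);
                out.add(new Assignment(uri, dest));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        RunnerSketch runner =
            new RunnerSketch("/tmp/nm-local-dir/usercache/root/filecache", 2);
        // Two localizer heartbeats; both draw paths from the one manager.
        System.out.println(runner.onHeartbeat(List.of("hdfs:///job1/a", "hdfs:///job1/b")));
        System.out.println(runner.onHeartbeat(List.of("hdfs:///job2/c")));
    }
}
```

The `synchronized` handler stands in for whatever serialization the real LocalizerRunner uses; the point is that directory counting happens in exactly one place.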
                
> Jobs fail during resource localization when private distributed-cache hits unix directory limits
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-99
>                 URL: https://issues.apache.org/jira/browse/YARN-99
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Devaraj K
>            Assignee: Omkar Vinit Joshi
>         Attachments: yarn-99-20130324.patch
>
>
> If multiple jobs use the distributed cache with many small files, the unix per-directory limit is hit before the cache-size limit, and the node manager can no longer create directories in the file cache. Jobs then start failing with the exception below.
> {code}
> java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
> 	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
> 	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
> 	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> 	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
> 	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
> 	at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
> 	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
> 	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
> 	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> {code}
> We should have a mechanism to clean up the cache when it crosses a specified number of directories, analogous to the existing cache-size limit.
