Posted to yarn-issues@hadoop.apache.org by "Omkar Vinit Joshi (JIRA)" <ji...@apache.org> on 2013/03/28 00:47:15 UTC

[jira] [Commented] (YARN-467) Jobs fail during resource localization when public distributed-cache hits unix directory limits

    [ https://issues.apache.org/jira/browse/YARN-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615894#comment-13615894 ] 

Omkar Vinit Joshi commented on YARN-467:
----------------------------------------

The underlying problem here is that resource localization tries to place more files in a single directory than the underlying local file system allows.

Proposed Solution :- (For public resources - localized under <local-dirs>/filecache/)

We are going to maintain a hierarchical directory structure inside the local filecache directories, so the layout will look like this:

.../filecache/<default-~8192-files>
.../filecache/<36 directories (0-9 & a-z)>/<default-~8192-files>
.../filecache/<36 directories (0-9 & a-z)>/<36 directories (0-9 & a-z)>
.....................

So every directory will hold at most (8192-36) localized files plus up to 36 sub directories named 0-9 and a-z. These sub directories are created only when they are required, never in advance, and every sub directory recursively follows the same structure.
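As an illustrative sketch (not the attached patch; the class and method names here are hypothetical), the naming scheme above can be expressed as a mapping from a breadth-first directory counter to a relative path under filecache:

```java
// Hypothetical sketch: index 0 is the filecache root itself, indexes
// 1..36 map to "0".."z", index 37 onward to "0/0", "0/1", and so on.
class SubDirNaming {
    static final int DIRECTORIES_PER_LEVEL = 36;

    // digits 0-9 then a-z, single characters to stay within Windows path limits
    static char toChar(int digit) {
        return digit < 10 ? (char) ('0' + digit) : (char) ('a' + digit - 10);
    }

    static String relativePath(long index) {
        if (index == 0) return "";                       // the filecache root
        long levelSize = 1, total = 1;
        int level = 0;
        while (index >= total) {                         // find the tree level holding index
            level++;
            levelSize *= DIRECTORIES_PER_LEVEL;
            total += levelSize;
        }
        long offset = index - (total - levelSize);       // position within that level
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < level; i++) {                // one base-36 digit per level
            sb.insert(0, toChar((int) (offset % DIRECTORIES_PER_LEVEL)));
            offset /= DIRECTORIES_PER_LEVEL;
            if (i < level - 1) sb.insert(0, '/');
        }
        return sb.toString();
    }
}
```

With this mapping a sub directory is only ever opened after all directories on shallower levels are in use, which matches the "created only if required" rule above.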

Now, to manage the files and to cap the number of entries per directory at HierarchicalDirectory#PER_DIR_FILE_LIMIT (8192 in this case), the below classes / implementation are introduced.

* LocalResourcesTrackerImpl :-
** maintainHierarchicalDir :- a boolean flag. It should be set when this resource tracker should track resources with the hierarchical directory structure.
** directoryMap :- Map of <Path, HierarchicalDirectory>. It ensures that we have one HierarchicalDirectory for every localPath. (For example, if two local-dirs are configured then it will have 2 entries.)
** inProgressRsrcMap :- Map of <LocalResourceRequest, Path>. This is used while local resource is getting localized. This map helps in two ways
*** If the resource localization fails for that resource then we can retrieve the path and remove the file reservation (file count)
*** If the same LocalResourceRequest comes again for the same resource (which is highly unlikely in today's implementation) it can return the same path back.
** getPathForLocalResource :- This method should be called to retrieve the hierarchical directory path for the local-dir identified by localDirPath. Internally it adds the request and the returned path to inProgressRsrcMap and makes a reservation in the HierarchicalDirectory tracking that local-dir.
** decFileCountForHierarchicalPath :- It retrieves the localizedPath from either inProgressRsrcMap or from LocalizedResource and then reduces file count for the HierarchicalDirectory tracking it.
** localizationCompleted :- (Parameter - success) If success is true then it only updates inProgressRsrcMap; otherwise it updates inProgressRsrcMap and also calls decFileCountForHierarchicalPath to release the reservation.
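The reservation flow in the bullets above could be sketched roughly as follows. This is a minimal illustration, not the patch: strings stand in for LocalResourceRequest/Path, and a plain per-local-dir counter stands in for HierarchicalDirectory; the method names follow the proposal but the bodies are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the proposed LocalResourcesTrackerImpl changes.
class TrackerSketch {
    // one reservation counter per configured local-dir
    // (stand-in for the Map<Path, HierarchicalDirectory> directoryMap)
    private final Map<String, Integer> fileCount = new HashMap<>();
    // request -> reserved path while localization is in progress
    private final Map<String, String> inProgress = new HashMap<>();

    String getPathForLocalResource(String request, String localDir) {
        String existing = inProgress.get(request);
        if (existing != null) return existing;            // repeat request -> same path
        int n = fileCount.merge(localDir, 1, Integer::sum); // reserve a file slot
        String path = localDir + "/filecache/" + n;
        inProgress.put(request, path);
        return path;
    }

    void localizationCompleted(String request, boolean success) {
        String path = inProgress.remove(request);         // always clear in-progress entry
        if (!success && path != null) {
            // failure: release the reservation so the slot can be reused
            String localDir = path.substring(0, path.indexOf("/filecache/"));
            fileCount.merge(localDir, -1, Integer::sum);
        }
    }

    int reservedCount(String localDir) {
        return fileCount.getOrDefault(localDir, 0);
    }
}
```

The point of inProgress is exactly the two cases listed above: a failed localization can find its path again to undo the reservation, and a duplicate request gets the already-reserved path.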

* HierarchicalDirectory :- It just helps in managing hierarchical directories.
** PER_DIR_FILE_LIMIT :- It limits the number of files per directory (and each of its sub directories). It is configurable (YarnConfiguration.NM_LOCAL_CACHE_NUM_FILES_PER_DIRECTORY) but should not be set too low.
** DIRECTORIES_PER_LEVEL (constant 36) :- Every directory/sub-directory will have at most 36 sub directories (0-9 and a-z), created only if they are required. The reason behind using single-character names is the path length limit on Windows.
** vacantSubDirectories :- Queue<HierarchicalSubDirectory> :- at the beginning this will have the root of the HierarchicalDirectory as the only entry. If the queue becomes empty then a new sub directory will be created, starting with 0. Note :- this only creates internal tracking; it does not create an actual directory on the file system.
** knownSubDirectories :- Map of <String, HierarchicalSubDirectory> - The root directory is identified by the empty string "" and other sub directories by their relative paths, e.g. directory 0 by "0" and 0/a by "0/a".
** getHierarchicalPath :- (synchronized) This method returns the relative path of a vacant sub directory (one which has not reached its per-directory file limit). If no vacant sub directory is present then it will create one using totalSubDirectories.
** decFileCountForPath :- (synchronized) This method reduces the count for the HierarchicalSubDirectory representing the passed in relative path.
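The vacant-queue bookkeeping above could be sketched as below. This is an assumption-laden illustration, not the patch: it is simplified to a single level of sub directories (0-9, a-z) and takes the file limit as a constructor argument so small values can be exercised.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for HierarchicalDirectory / HierarchicalSubDirectory.
class HierarchicalDirectorySketch {
    static final int DIRECTORIES_PER_LEVEL = 36;
    private final int perDirFileLimit;

    static class SubDir {
        final String relPath;   // "" for root, "0".."z" for sub directories
        int fileCount;
        SubDir(String p) { relPath = p; }
    }

    private final Deque<SubDir> vacant = new ArrayDeque<>();
    private final Map<String, SubDir> known = new HashMap<>();
    private int nextSubDir = 0; // next name to open: 0..35 -> "0".."z"

    HierarchicalDirectorySketch(int perDirFileLimit) {
        this.perDirFileLimit = perDirFileLimit;
        SubDir root = new SubDir("");       // root identified by the empty string
        vacant.add(root);
        known.put("", root);
    }

    // each directory reserves 36 entries for its potential sub directories
    private int capacity() { return perDirFileLimit - DIRECTORIES_PER_LEVEL; }

    synchronized String getHierarchicalPath() {
        SubDir d = vacant.peek();
        if (d == null) {
            // internal tracking only: the real mkdir happens at download time
            int n = nextSubDir++;
            d = new SubDir(String.valueOf((char) (n < 10 ? '0' + n : 'a' + n - 10)));
            known.put(d.relPath, d);
            vacant.add(d);
        }
        d.fileCount++;
        if (d.fileCount >= capacity()) vacant.remove(d);   // directory is now full
        return d.relPath;
    }

    synchronized void decFileCountForPath(String relPath) {
        SubDir d = known.get(relPath);
        if (d == null) return;
        boolean wasFull = d.fileCount >= capacity();
        d.fileCount--;
        if (wasFull) vacant.add(d);        // has room again, back on the queue
    }
}
```

A full directory leaves the vacant queue and re-enters it when decFileCountForPath frees a slot, so lookups stay O(1) per reservation.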

                
> Jobs fail during resource localization when public distributed-cache hits unix directory limits
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-467
>                 URL: https://issues.apache.org/jira/browse/YARN-467
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.0.0, 2.0.0-alpha
>            Reporter: Omkar Vinit Joshi
>            Assignee: Omkar Vinit Joshi
>         Attachments: yarn-467-20130322.1.patch, yarn-467-20130322.2.patch, yarn-467-20130322.3.patch, yarn-467-20130322.patch, yarn-467-20130325.1.patch, yarn-467-20130325.path
>
>
> If we have multiple jobs that use the distributed cache with many small files, the per-directory limit is reached before the cache-size limit, and no new directories can be created in the PUBLIC file cache. The jobs start failing with the below exception.
> java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 failed
> 	at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
> 	at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
> 	at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
> 	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
> 	at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
> 	at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
> 	at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
> 	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
> 	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> We need a mechanism that creates a directory hierarchy and limits the number of files per directory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira