Posted to mapreduce-issues@hadoop.apache.org by "Eric Badger (Jira)" <ji...@apache.org> on 2021/04/27 18:54:00 UTC

[jira] [Assigned] (MAPREDUCE-6759) JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger reassigned MAPREDUCE-6759:
--------------------------------------

    Assignee: Christos Karampeazis-Papadakis

> JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives
> -----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6759
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6759
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: job submission
>            Reporter: Dennis Huo
>            Assignee: Christos Karampeazis-Papadakis
>            Priority: Major
>
> During job submission, {{JobResourceUploader}} currently uploads the {{-libjars}}, {{-files}}, and {{-archives}} resources sequentially, one file at a time in three consecutive for-loops. This can significantly slow down job startup when a large number of files need to be uploaded, especially when staging the files to a cloud object-store based FileSystem implementation such as S3, GCS, or WABS, where round-trip latencies may be higher than on HDFS even though throughput is good when uploads are parallelized:
> {code:title=JobResourceUploader.java}
>     if (files != null) {
>       FileSystem.mkdirs(jtFs, filesDir, mapredSysPerms);
>       String[] fileArr = files.split(",");
>       for (String tmpFile : fileArr) {
>         URI tmpURI = null;
>         try {
>           tmpURI = new URI(tmpFile);
>         } catch (URISyntaxException e) {
>           throw new IllegalArgumentException(e);
>         }
>         Path tmp = new Path(tmpURI);
>         Path newPath = copyRemoteFiles(filesDir, tmp, conf, replication);
>         try {
>           URI pathURI = getPathURI(newPath, tmpURI.getFragment());
>           DistributedCache.addCacheFile(pathURI, conf);
>         } catch (URISyntaxException ue) {
>           // should not throw a uri exception
>           throw new IOException("Failed to create uri for " + tmpFile, ue);
>         }
>       }
>     }
>     if (libjars != null) {
>       FileSystem.mkdirs(jtFs, libjarsDir, mapredSysPerms);
>       String[] libjarsArr = libjars.split(",");
>       for (String tmpjars : libjarsArr) {
>         Path tmp = new Path(tmpjars);
>         Path newPath = copyRemoteFiles(libjarsDir, tmp, conf, replication);
>         DistributedCache.addFileToClassPath(
>             new Path(newPath.toUri().getPath()), conf, jtFs);
>       }
>     }
>     if (archives != null) {
>       FileSystem.mkdirs(jtFs, archivesDir, mapredSysPerms);
>       String[] archivesArr = archives.split(",");
>       for (String tmpArchives : archivesArr) {
>         URI tmpURI;
>         try {
>           tmpURI = new URI(tmpArchives);
>         } catch (URISyntaxException e) {
>           throw new IllegalArgumentException(e);
>         }
>         Path tmp = new Path(tmpURI);
>         Path newPath = copyRemoteFiles(archivesDir, tmp, conf, replication);
>         try {
>           URI pathURI = getPathURI(newPath, tmpURI.getFragment());
>           DistributedCache.addCacheArchive(pathURI, conf);
>         } catch (URISyntaxException ue) {
>           // should not throw a uri exception
>           throw new IOException("Failed to create uri for " + tmpArchives, ue);
>         }
>       }
>     }
> {code}
> Parallelizing the upload of these files would reduce job submission time.
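
One possible shape for the parallel upload, as a minimal sketch rather than the actual Hadoop change: run the slow per-file copies on a thread pool, then collect the results in the original list order so that registration (the {{DistributedCache.add*}} calls in the snippet above) can still happen sequentially on the submitting thread. The names {{ParallelUploadSketch}}, {{Copier}}, and {{copyAll}} are hypothetical and not part of Hadoop; {{Copier}} stands in for {{copyRemoteFiles}}. The sketch assumes that only the copy step is safe to run concurrently and that the order of classpath entries should be preserved.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; not part of Hadoop. */
final class ParallelUploadSketch {

  /** Stand-in for the slow per-file copy (copyRemoteFiles in the snippet above). */
  interface Copier {
    String copy(String source) throws IOException;
  }

  /**
   * Copies every entry of a comma-separated resource list concurrently and
   * returns the destination paths in the same order as the input list.
   */
  static List<String> copyAll(String commaSeparated, Copier copier, int threads)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit all copies first so they can overlap their round-trip latency.
      List<Future<String>> pending = new ArrayList<>();
      for (String source : commaSeparated.split(",")) {
        Callable<String> task = () -> copier.copy(source);
        pending.add(pool.submit(task));
      }
      // Collect in submission order; the caller can then register each
      // destination sequentially, as the existing loops do.
      List<String> copied = new ArrayList<>();
      for (Future<String> future : pending) {
        try {
          copied.add(future.get());
        } catch (ExecutionException e) {
          Throwable cause = e.getCause();
          if (cause instanceof IOException) {
            throw (IOException) cause;
          }
          throw new IOException("Upload failed", cause);
        }
      }
      return copied;
    } finally {
      // Cancel any copies still running if one of them failed.
      pool.shutdownNow();
    }
  }
}
```

A caller would invoke {{copyAll}} once per resource type (files, libjars, archives) and then loop over the returned paths to register them. Keeping registration on a single thread avoids concurrent mutation of the job configuration, at the cost of not parallelizing that cheaper step.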



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
