Posted to mapreduce-issues@hadoop.apache.org by "Eric Badger (Jira)" <ji...@apache.org> on 2021/04/27 18:54:00 UTC
[jira] [Assigned] (MAPREDUCE-6759) JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives
[ https://issues.apache.org/jira/browse/MAPREDUCE-6759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Badger reassigned MAPREDUCE-6759:
--------------------------------------
Assignee: Christos Karampeazis-Papadakis
> JobSubmitter/JobResourceUploader should parallelize upload of -libjars, -files, -archives
> -----------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6759
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6759
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: job submission
> Reporter: Dennis Huo
> Assignee: Christos Karampeazis-Papadakis
> Priority: Major
>
> During job submission, {{JobResourceUploader}} currently uploads the {{-files}}, {{-libjars}}, and {{-archives}} resources one at a time, in sequential for-loops. This can significantly slow down job startup when a large number of files need to be uploaded, especially when staging the files to a cloud object-store-based FileSystem implementation such as S3, GCS, or WABS, where round-trip latencies may be higher than with HDFS even though throughput is good when uploads are parallelized:
> {code:title=JobResourceUploader.java}
> if (files != null) {
>   FileSystem.mkdirs(jtFs, filesDir, mapredSysPerms);
>   String[] fileArr = files.split(",");
>   for (String tmpFile : fileArr) {
>     URI tmpURI = null;
>     try {
>       tmpURI = new URI(tmpFile);
>     } catch (URISyntaxException e) {
>       throw new IllegalArgumentException(e);
>     }
>     Path tmp = new Path(tmpURI);
>     Path newPath = copyRemoteFiles(filesDir, tmp, conf, replication);
>     try {
>       URI pathURI = getPathURI(newPath, tmpURI.getFragment());
>       DistributedCache.addCacheFile(pathURI, conf);
>     } catch (URISyntaxException ue) {
>       // should not throw a URI exception
>       throw new IOException("Failed to create uri for " + tmpFile, ue);
>     }
>   }
> }
> if (libjars != null) {
>   FileSystem.mkdirs(jtFs, libjarsDir, mapredSysPerms);
>   String[] libjarsArr = libjars.split(",");
>   for (String tmpjars : libjarsArr) {
>     Path tmp = new Path(tmpjars);
>     Path newPath = copyRemoteFiles(libjarsDir, tmp, conf, replication);
>     DistributedCache.addFileToClassPath(
>         new Path(newPath.toUri().getPath()), conf, jtFs);
>   }
> }
> if (archives != null) {
>   FileSystem.mkdirs(jtFs, archivesDir, mapredSysPerms);
>   String[] archivesArr = archives.split(",");
>   for (String tmpArchives : archivesArr) {
>     URI tmpURI;
>     try {
>       tmpURI = new URI(tmpArchives);
>     } catch (URISyntaxException e) {
>       throw new IllegalArgumentException(e);
>     }
>     Path tmp = new Path(tmpURI);
>     Path newPath = copyRemoteFiles(archivesDir, tmp, conf, replication);
>     try {
>       URI pathURI = getPathURI(newPath, tmpURI.getFragment());
>       DistributedCache.addCacheArchive(pathURI, conf);
>     } catch (URISyntaxException ue) {
>       // should not throw a URI exception
>       throw new IOException("Failed to create uri for " + tmpArchives, ue);
>     }
>   }
> }
> {code}
> Parallelizing the upload of these files would improve job submission time.
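
For illustration only, one way the parallel upload described above might be sketched is with a fixed-size thread pool. This is not the patch attached to the issue. To keep it self-contained it uses no Hadoop classes: the {{Copier}} interface is a hypothetical stand-in for {{copyRemoteFiles}}, and the class name, method name, and thread count are likewise assumptions. Staged paths are returned in submission order so that the caller could still register them with {{DistributedCache}} sequentially on its own thread, as the loops above do.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelUploadSketch {

  /** Hypothetical stand-in for copyRemoteFiles(...): copies one resource, returns its staged path. */
  interface Copier {
    String copy(String source) throws IOException;
  }

  /**
   * Copies every source concurrently on a pool of the given size and returns
   * the staged paths in the same order as the input list.
   */
  static List<String> uploadAll(List<String> sources, int threads, Copier copier)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one copy task per resource; the copies run in parallel.
      List<Future<String>> pending = new ArrayList<>();
      for (String source : sources) {
        Callable<String> task = () -> copier.copy(source);
        pending.add(pool.submit(task));
      }
      // Collect results in submission order so later registration is deterministic.
      List<String> staged = new ArrayList<>();
      for (Future<String> future : pending) {
        try {
          staged.add(future.get());
        } catch (ExecutionException e) {
          // Surface the first failed copy as an IOException, like the sequential code.
          Throwable cause = e.getCause();
          if (cause instanceof IOException) {
            throw (IOException) cause;
          }
          throw new IOException("Upload failed", cause);
        }
      }
      return staged;
    } finally {
      // Stop any copies still running if we are leaving early because of an error.
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Fake copier for demonstration: pretends each jar was staged under a made-up directory.
    List<String> staged = uploadAll(
        Arrays.asList("a.jar", "b.jar", "c.jar"), 2, src -> "/staging/libjars/" + src);
    for (String path : staged) {
      System.out.println(path);
    }
  }
}
```

In this sketch only the copy step runs concurrently. Updating the job configuration would stay on the submitting thread after {{uploadAll}} returns, which sidesteps the question of whether concurrent updates to the configuration are safe.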