Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2013/07/25 20:07:54 UTC
[jira] [Updated] (MAPREDUCE-4421) Remove dependency on deployed MR jars

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4421:
----------------------------------

    Attachment: MAPREDUCE-4421.patch

Submitting a patch to try to move this forward.  We're very interested in the ability to patch issues in the MapReduce framework without having to bring down the cluster and/or push a new version to all nodes.

This patch adds a new config, {{mapreduce.application.framework.path}}, which defaults to being unset.  If set, it specifies a path to an archive containing the MR framework to use with the job.  Normally this would point to a public location within HDFS, and the archive would contain everything the framework needs: the MR jars, YARN client jars, HDFS client, common, and all of their dependencies.

This allows ops to deposit a single archive into HDFS that contains the MR framework and configure mapred-site.xml to use it.  That framework is then lazily deployed to the nodes.  To roll out a new version, ops can upload it to another path and update mapred-site.xml; all future jobs then run with the new version while all currently running jobs proceed with the previous version.  Alternatively, ops can avoid pushing the mapred-site.xml change out to all gateway/launcher boxes by pointing the config at a standard symlink path that always refers to the current version.  New versions can be deployed, the symlink moved to them, and jobs implicitly pick up the new version without a corresponding mapred-site.xml change.

I've tested this by taking the entire hadoop-3.0.0-SNAPSHOT.tar.gz file and placing it in HDFS under /mapred/.  Admittedly, this is not the most efficient deployment, but it does include everything necessary.  I then set {{mapreduce.application.framework.path}} to {{/mapred/hadoop-3.0.0-SNAPSHOT.tar.gz#mr-framework}} and {{mapreduce.application.classpath}} to:

{noformat}
$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop-3.0.0-SNAPSHOT/share/hadoop/hdfs/lib/*
{noformat}

The job then ran with my specified version of the MR framework instead of the one deployed to the nodes.  The application classpath is complicated because I used the standard distribution tarball.  I could have easily built a custom tarball with all the jars at the top directory and simply had a classpath of:

{noformat}
$PWD/mr-framework/*.jar
{noformat}
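
To make the relationship between the two settings concrete, here is a rough sketch of how the long classpath above could be derived from the framework path instead of being typed by hand.  It is not part of the patch: the class and method names are hypothetical, and it assumes the URI fragment of the framework path ({{mr-framework}} in my test) is the name under which the unpacked archive appears in the container's working directory, which is what the classpath entries above rely on.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not code from the patch.
final class FrameworkClasspath {

    // Returns the URI fragment of the framework path, e.g. "mr-framework".
    // Assumption: the archive is exposed under this name in $PWD.
    static String alias(String frameworkPath) {
        String fragment = URI.create(frameworkPath).getFragment();
        if (fragment == null || fragment.isEmpty()) {
            throw new IllegalArgumentException(
                "framework path has no #alias fragment: " + frameworkPath);
        }
        return fragment;
    }

    // Builds "<component>/*" and "<component>/lib/*" entries for each
    // component of a standard distribution tarball layout.
    static String classpath(String frameworkPath, String rootDir,
                            List<String> components) {
        String base = "$PWD/" + alias(frameworkPath) + "/" + rootDir
            + "/share/hadoop/";
        List<String> entries = new ArrayList<>();
        for (String component : components) {
            entries.add(base + component + "/*");
            entries.add(base + component + "/lib/*");
        }
        return String.join(":", entries);
    }

    public static void main(String[] args) {
        String path = "/mapred/hadoop-3.0.0-SNAPSHOT.tar.gz#mr-framework";
        // Prints the value to use for mapreduce.application.classpath.
        System.out.println(classpath(path, "hadoop-3.0.0-SNAPSHOT",
            Arrays.asList("mapreduce", "common", "yarn", "hdfs")));
    }
}
```

With the values from my test, this should reproduce the long classpath shown earlier.  A flat custom tarball would not need any of this, since its classpath is the single entry above.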

The framework is lazily deployed via the distributed cache, so nodes take a localization hit the first time they see a job with a specified framework path.  However, subsequent jobs with the same framework run quickly, and I saw no performance difference between jobs using a custom framework and jobs using the cluster-installed framework on nodes that had already localized the specified framework.

Note that there is still a dependency on deployed MR jars with respect to the shuffle service running on all the nodes.  With this patch, new MR versions can only be used when the old shuffle service on all nodes is compatible with the new version.  Fixing this requires the ability to specify auxiliary services with YARN application submissions and have those lazily deploy to nodes that are allocated for the application.  (Ideally, those services would then be refcounted and retired once no longer necessary.)
                
> Remove dependency on deployed MR jars
> -------------------------------------
>
>                 Key: MAPREDUCE-4421
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4421
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: MAPREDUCE-4421.patch
>
>
> Currently the MR AM depends on MR jars being deployed on all nodes via an implicit dependency on YARN_APPLICATION_CLASSPATH.
> We should stop adding mapreduce jars to YARN_APPLICATION_CLASSPATH and, probably, just rely on adding a shaded MR jar along with job.jar to the dist-cache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira