Posted to dev@oozie.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2017/03/11 00:41:04 UTC

[jira] [Commented] (OOZIE-2821) Using Hadoop Archives for Oozie ShareLib

    [ https://issues.apache.org/jira/browse/OOZIE-2821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905938#comment-15905938 ] 

Robert Kanter commented on OOZIE-2821:
--------------------------------------

Thanks for the interesting idea [~asasvari]!

The tricky thing here will be in creating the HAR files.  I dealt with a similar, though more problematic, issue with Yarn's Aggregated Log files in MAPREDUCE-6415, and some followups.  In a nutshell, the issue is that you end up with one small file per NM per Application, so on a large busy cluster, that adds up.  MAPREDUCE-6415 introduces a CLI tool that combines them into HAR files, so you get only one file per Application.  The tricky part is that it runs a Distributed Shell job which then runs MR jobs in local mode, one for each set of logs to process.  I don't think we want to run an MR job (local or otherwise) every time we install the sharelib.  AFAIK, the only way to create a HAR file is to run an MR job.  Maybe there's some other way?  Or perhaps it's enough to add a CLI argument to the sharelib upload script that will run an MR job to generate the HAR file after uploading the sharelib so users can decide if they want to run the MR job or not?
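For context, the standard way to build a HAR is the `hadoop archive` tool, which itself launches a MapReduce job behind the scenes; a rough sketch (all paths here are illustrative, not actual Oozie defaults) would be:

```shell
# Pack an uploaded sharelib directory into a single HAR.
# Note: `hadoop archive` launches a MapReduce job under the hood,
# which is exactly the cost being debated above.
hadoop archive -archiveName sharelib.har \
    -p /user/oozie/share/lib lib_20170311 \
    /user/oozie/share

# The contents are then addressable through the har:// scheme,
# e.g. (hypothetical jar path):
#   har:///user/oozie/share/sharelib.har/lib_20170311/pig/pig.jar
```

Since this always incurs an MR job, hooking it behind an opt-in flag on the sharelib upload script (as suggested above) seems like the least surprising option.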

> Using Hadoop Archives for Oozie ShareLib
> ----------------------------------------
>
>                 Key: OOZIE-2821
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2821
>             Project: Oozie
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> Oozie ShareLib is a collection of many jar files that are required by Oozie actions. Right now, these jars are uploaded one by one during Oozie ShareLib installation. There can be several hundred such jars, and many of them are pretty small, significantly smaller than an HDFS block. Storing a large number of small files in HDFS is inefficient (for example, the NameNode maintains an object in memory for each file, and the blocks containing the small files may be much bigger than the actual files). When an action is executed, these jar files are copied to the distributed cache.
> It would be worth investigating the possibility of using [Hadoop archives|http://hadoop.apache.org/docs/r2.6.5/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html] for handling Oozie ShareLib files, because it could result in better utilisation of HDFS. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)