Posted to user@pig.apache.org by Jeremy Hanna <je...@gmail.com> on 2011/07/16 14:48:28 UTC

Re: Hadoop Production Issue

One thing that we use is filecrush to merge small files below a threshold.  It works pretty well.
http://www.jointhegrid.com/hadoop_filecrush/index.jsp
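
If you'd rather do the merge yourself in that out-of-cluster Java process, a plain local concatenation before the upload is straightforward, and Hadoop's FileUtil.copyMerge can do the equivalent on the HDFS side. Below is a rough sketch of the local-merge step only; the class and path names are placeholders, not anything from filecrush:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: concatenate all small files in a local staging directory into
// one large file before copying it to the cluster. Names are hypothetical.
public class SmallFileMerger {

    /** Concatenates every regular file in srcDir (sorted by name) into dest.
     *  Returns the total number of bytes written. */
    public static long merge(Path srcDir, Path dest) throws IOException {
        List<Path> parts;
        try (Stream<Path> stream = Files.list(srcDir)) {
            parts = stream.filter(Files::isRegularFile)
                          .sorted()
                          .collect(Collectors.toList());
        }
        long total = 0;
        try (OutputStream out = Files.newOutputStream(dest,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path p : parts) {
                // Streams each part into the merged file in order.
                total += Files.copy(p, out);
            }
        }
        return total;
    }
}
```

The upload itself (hadoop fs -put, or FileSystem.copyFromLocalFile in the same process) can then move one big file instead of thousands of small ones.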

On Jul 16, 2011, at 1:17 AM, jagaran das wrote:

> Hi,
> 
> Due to requirements in our current production CDH3 cluster, we need to copy around 11,520 small files (12 GB total) to the cluster for one application.
> We have 20 such applications that would run in parallel.
> 
> So one set would have 11,520 files with a total size of 12 GB,
> and we would have 15 such sets running in parallel.
> 
> The total SLA for the whole pipeline, from copy through Pig aggregation to copy-to-local and SQL load, is 15 minutes.
> 
> What we do:
> 
> 1. Merge files so that we get rid of the small files. This step takes a lot of time; do we have any other option?
> 2. Copy to cluster
> 3. Execute Pig job
> 4. Copy to local
> 5. SQL loader
> 
> Can we perform the merge and the copy to the cluster from a host other than the NameNode?
> We want an out of cluster machine running a java process that would
> 1. Run periodically
> 2. Merge Files
> 3. Copy to Cluster 
> 
> Secondly, can we append to an existing file in the cluster?
> 
> Please provide your thoughts, as maintaining the SLA is becoming tough.
> 
> Regards,
> Jagaran