Posted to user@pig.apache.org by Jeremy Hanna <je...@gmail.com> on 2011/07/16 14:48:28 UTC
Re: Hadoop Production Issue
One thing that we use is filecrush to merge small files below a threshold. It works pretty well.
http://www.jointhegrid.com/hadoop_filecrush/index.jsp
On Jul 16, 2011, at 1:17 AM, jagaran das wrote:
>
>
> Hi,
>
> Due to requirements in our current production CDH3 cluster, we need to copy around 11520 small files (12 GB in total) to the cluster for one application.
> We have 20 such applications that would run in parallel.
>
> So one set would have 11520 files totalling 12 GB, and we would have 15 such sets running in parallel.
>
> The total SLA for the pipeline (copy to cluster, Pig aggregation, copy to local, SQL load) is 15 minutes.
>
> What we do:
>
> 1. Merge files so that we get rid of the small files - a huge time hit. Do we have any other option?
> 2. Copy to cluster
> 3. Execute the Pig job
> 4. Copy to local
> 5. SQL loader
>
> Can we perform the merge and the copy to the cluster from a host other than the Namenode?
> We want an out of cluster machine running a java process that would
> 1. Run periodically
> 2. Merge Files
> 3. Copy to Cluster
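
A minimal sketch of such an out-of-cluster merger, using only the plain JDK (paths and the 5-minute period are hypothetical). The upload itself would then be a single `hadoop fs -put` (or `FileSystem.copyFromLocalFile`) of the merged file, which any host with a Hadoop client configuration can run - it does not have to be the Namenode:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SmallFileMerger {

    // Concatenate every regular file in srcDir into a single file at dest.
    static void merge(Path srcDir, Path dest) throws IOException {
        try (OutputStream out = Files.newOutputStream(dest,
                 StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
             DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    Files.copy(f, out); // stream each small file into the big one
                }
            }
        }
    }

    public static void main(final String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // Run the merge periodically; after each pass the merged file would be
        // pushed to HDFS (e.g. `hadoop fs -put merged.dat /incoming/`), keeping
        // the merge work off the Namenode host. The period is illustrative.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    merge(Paths.get(args[0]), Paths.get(args[1]));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}
```

Since merging 11520 files is pure sequential local I/O, this step is usually disk-bound rather than CPU-bound; running one merger instance per application set in parallel may help stay inside the SLA.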
>
> Secondly, can we append to an existing file in the cluster?
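
For what it's worth, append in that generation of HDFS (0.20.x / CDH3) was still considered experimental, and where it is available at all it typically has to be switched on explicitly. A sketch of the relevant hdfs-site.xml fragment, assuming the stock `dfs.support.append` flag:

```xml
<!-- hdfs-site.xml: enable the (experimental, CDH3-era) append API -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```

Given its reliability caveats at the time, merging before upload is generally the safer path than appending in place.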
>
> Please provide your thoughts, as maintaining the SLA is becoming tough.
>
> Regards,
> Jagaran