Posted to hdfs-user@hadoop.apache.org by Time Less <ti...@gmail.com> on 2011/07/22 19:26:06 UTC

Merging Files in HDFS

Hello, List!

I have several files in HDFS in a single directory that I create throughout
the day. At the end of the day, I want to merge them together into one file.
How do you guys do this?

It seems this would do it:
hadoop fs -getmerge /hdfs/directory/allsource* mergefile ; cat mergefile |
hadoop fs -put - /hdfs/directory/merged ; rm mergefile ; hadoop fs -rm /hdfs/directory/allsource*

But I wonder if there's a command that can avoid writing to the local
filesystem then re-writing back into HDFS. I'm looking for an HDFS
equivalent to this Unix script:
cat /some/dir/allsource* > /some/dir/merged ; rm /some/dir/allsource*
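
One possible HDFS analogue of that Unix one-liner, as a rough sketch: pipe "hadoop fs -cat" into "hadoop fs -put -", which reads from stdin. The data still streams through the client machine, but nothing is written to the local disk. The names HADOOP_CMD and merge_hdfs_files and the example paths are made up for illustration; the glob is quoted so that Hadoop, not the local shell, expands it.

```shell
#!/bin/sh
# Hypothetical helper: merge HDFS files matching a glob into one HDFS file
# without a local temporary file. HADOOP_CMD is an assumed override hook.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

merge_hdfs_files() (
    src_glob="$1"   # e.g. '/hdfs/directory/allsource*'
    dest="$2"       # e.g. /hdfs/directory/merged

    # Where the shell supports it, make a failed -cat fail the whole pipeline.
    (set -o pipefail) 2>/dev/null && set -o pipefail

    # Stream the concatenated sources straight into a new HDFS file.
    "$HADOOP_CMD" fs -cat "$src_glob" | "$HADOOP_CMD" fs -put - "$dest" || exit 1

    # Delete the sources only after the merged file has been written.
    "$HADOOP_CMD" fs -rm "$src_glob"
)
```

Usage would look like: merge_hdfs_files '/hdfs/directory/allsource*' /hdfs/directory/merged. In a shell without pipefail, only the exit status of -put is checked, so it may be worth comparing sizes with "hadoop fs -du" before trusting the delete step.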

-- 
Tim Ellis
Data Architect, Riot Games

Re: Merging Files in HDFS

Posted by Joey Echeverria <jo...@cloudera.com>.
You could do it with streaming and a single reducer:

bin/hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar \
  -Dmapred.reduce.tasks=1 -reducer cat \
  -input /hdfs/directory/allsource* -output mergefile -verbose
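
Two things to keep in mind with this approach: the -output path is a directory, so with one reducer the merged data ends up in mergefile/part-00000, and the shuffle sorts the lines, so the result is not a plain concatenation in the original order. A small follow-up sketch for moving the result into place is below; publish_merged_output, HADOOP_CMD, and the destination path are hypothetical names, and -rmr is the recursive delete from the 0.20-era shell (newer releases use -rm -r).

```shell
#!/bin/sh
# Hypothetical helper: move the single reducer's output file to its final
# name, then remove the leftover job output directory.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

publish_merged_output() {
    job_output_dir="$1"   # e.g. mergefile (the -output path of the job)
    dest="$2"             # e.g. /hdfs/directory/merged

    # With one reducer, the old-API output file is named part-00000.
    "$HADOOP_CMD" fs -mv "$job_output_dir/part-00000" "$dest" || return 1

    # Clean up the rest of the job output directory (logs, markers).
    "$HADOOP_CMD" fs -rmr "$job_output_dir"
}
```

For example: publish_merged_output mergefile /hdfs/directory/merged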

-Joey

-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434