Posted to common-user@hadoop.apache.org by "W.P. McNeill" <bi...@gmail.com> on 2011/08/12 19:29:07 UTC
What is the most efficient way to copy a large number of .gz files
into HDFS?
I have a large number of gzipped web server logs on NFS that I need to pull
into HDFS for analysis by MapReduce. What is the most efficient way to do
this?
It seems like what I should do is:
hadoop fs -copyFromLocal *.gz /my/HDFS/directory
A couple of questions:
1. Is this a single process, or will the files be copied up in parallel?
2. Gzip is not a desirable compression format because it's not
splittable. What's the best way to get these files into a better format?
Should I run zcat | bzip2 before calling copyFromLocal, or write a Hadoop job?
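For concreteness, the recompression step I have in mind is something like
this (filenames illustrative):

    for f in *.gz; do
        zcat "$f" | bzip2 > "${f%.gz}.bz2"
    done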
Re: What is the most efficient way to copy a large number of .gz
files into HDFS?
Posted by "W.P. McNeill" <bi...@gmail.com>.
Am I better off using distcp instead of copyFromLocal, since the former will
be distributed?
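Something like this is what I have in mind, assuming the NFS mount is
visible at the same path on every node (the mount path is illustrative):

    hadoop distcp file:///mnt/nfs/logs hdfs:///my/HDFS/directory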
Re: What is the most efficient way to copy a large number of .gz
files into HDFS?
Posted by sridhar basam <sr...@basam.org>.
On Fri, Aug 12, 2011 at 1:29 PM, W.P. McNeill <bi...@gmail.com> wrote:
> I have a large number of gzipped web server logs on NFS that I need to pull
> into HDFS for analysis by MapReduce. What is the most efficient way to do
> this?
>
> It seems like what I should do is:
>
> hadoop fs -copyFromLocal *.gz /my/HDFS/directory
>
> A couple of questions:
>
> 1. Is this a single process, or will the files be copied up in parallel?
>
It will use a single process to do the copy. You could run multiple
-copyFromLocal or -moveFromLocal commands in parallel to improve speed.
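For example, something like this batches the files and runs a few copies
at once (the batch size and parallelism here are just illustrative):

    ls *.gz | xargs -n 20 -P 4 \
        sh -c 'hadoop fs -copyFromLocal "$@" /my/HDFS/directory' _

Each sh invocation copies a batch of 20 files, with up to 4 batches in
flight at a time.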
> 2. Gzip is not a desirable compression format because it's not
> splittable. What's the best way to get these files into a better format?
> Should I run zcat | bzip2 before calling copyFromLocal, or write a Hadoop
> job?
>
If you have LZO working, I would recommend it. Running MapReduce jobs over
LZO input was measurably quicker in my setup. While bzip2 provides better
compression ratios, it is far too CPU-intensive compared to LZO/gzip. If you
have multiple gzip files, you can still increase parallelism by having
multiple mappers run over the individual gzip files, though each file is
still limited to one mapper. I don't specifically recall whether gzip or
bzip2 came out ahead in my case.
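As a rough sketch of the gzip-to-LZO route (the paths and the hadoop-lzo
jar location are illustrative, and this assumes lzop and the hadoop-lzo
libraries are installed):

    zcat access.log.gz | lzop -c > access.log.lzo
    hadoop fs -copyFromLocal access.log.lzo /my/HDFS/directory
    # index the file so MapReduce can split it
    hadoop jar /path/to/hadoop-lzo.jar \
        com.hadoop.compression.lzo.DistributedLzoIndexer /my/HDFS/directory

Without the indexing step the .lzo files are still readable, just not
splittable.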
Sridhar