Posted to common-user@hadoop.apache.org by "W.P. McNeill" <bi...@gmail.com> on 2011/08/12 19:29:07 UTC

What is the most efficient way to copy a large number of .gz files into HDFS?

I have a large number of gzipped web server logs on NFS that I need to pull
into HDFS for analysis by MapReduce.  What is the most efficient way to do
this?

It seems like what I should do is:

hadoop fs -copyFromLocal *.gz /my/HDFS/directory

A couple of questions:

   1. Is this a single process, or will the files be copied up in parallel?
   2. Gzip is not a desirable compression format because it's not
   splittable. What's the best way to get these files into a better format?
   Should I run zcat | bzip2 before calling copyFromLocal, or write a Hadoop job?

Re: What is the most efficient way to copy a large number of .gz files into HDFS?

Posted by "W.P. McNeill" <bi...@gmail.com>.
Am I better off using distcp instead of copyFromLocal, since the former will
be distributed?

Re: What is the most efficient way to copy a large number of .gz files into HDFS?

Posted by sridhar basam <sr...@basam.org>.
On Fri, Aug 12, 2011 at 1:29 PM, W.P. McNeill <bi...@gmail.com> wrote:

> I have a large number of gzipped web server logs on NFS that I need to pull
> into HDFS for analysis by MapReduce.  What is the most efficient way to do
> this?
>
> It seems like what I should do is:
>
> hadoop fs -copyFromLocal *.gz /my/HDFS/directory
>
> A couple of questions:
>
>   1. Is this a single process, or will the files be copied up in parallel?
>

It will use a single process to do the copy. You could run multiple
-copyFromLocal or -moveFromLocal commands in parallel to improve speed.
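A rough sketch of fanning the copies out in parallel (this uses plain cp into a temporary directory as a local stand-in for hadoop fs -copyFromLocal, and assumes GNU xargs with -P; all paths and file counts are made up):

```shell
# Sketch: run one copy process per file, up to four at a time.
# SRC/DEST are throwaway local dirs standing in for NFS and HDFS.
SRC=$(mktemp -d); DEST=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
  echo "log $i" | gzip > "$SRC/log$i.gz"
done

# -P 4 runs up to four copy processes concurrently. Against a real
# cluster the cp would instead be:
#   hadoop fs -copyFromLocal {} /my/HDFS/directory
ls "$SRC"/*.gz | xargs -P 4 -I{} cp {} "$DEST"

echo "copied: $(ls "$DEST" | wc -l | tr -d ' ')"
```

The same pattern works by splitting the file list across a few shell jobs by hand; xargs -P just keeps a fixed number of copies in flight.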


>   2. Gzip is not a desirable compression format because it's not
>   splittable. What's the best way to get these files into a better format?
>   Should I run zcat | bzip2 before calling copyFromLocal, or write a Hadoop
> job?
>

If you have LZO working, I would recommend it. Running MapReduce jobs over
LZO-compressed input was measurably quicker in my setup. While bzip2 provides
better compression ratios, it is far too CPU-intensive compared to lzo/gzip.
If you have multiple gzip files, you can still increase parallelism by having
multiple mappers run over the individual gzip files, but it will still be
limited to one mapper per file. I don't specifically recall whether gzip or
bzip2 was better in my case.
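The zcat-into-bzip2 conversion mentioned in the question can be sketched like this before uploading (the work directory and sample log are fabricated for illustration; assumes gzip and bzip2 are installed locally):

```shell
# Recompress gzipped logs into bzip2, which MapReduce can split.
# Hypothetical data: a throwaway work dir with one sample .gz log.
WORK=$(mktemp -d)
printf 'line1\nline2\n' | gzip > "$WORK/access.log.gz"

for f in "$WORK"/*.gz; do
  # Decompress and immediately recompress into the splittable format.
  gzip -dc "$f" | bzip2 > "${f%.gz}.bz2"
done

# Verify the round trip before deleting the originals.
bzip2 -dc "$WORK/access.log.bz2"
```

The resulting .bz2 files can then be pushed up with -copyFromLocal. Doing the conversion in a Hadoop job instead would offload the CPU cost to the cluster, but the gzip files would have to be copied into HDFS first anyway.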

 Sridhar