Posted to mapreduce-user@hadoop.apache.org by Mapred Learn <ma...@gmail.com> on 2011/06/23 08:21:32 UTC

can we split a big gzipped file on HDFS ?

Hi,
If I have a big gzipped text file (~60 GB) in HDFS, can I split it into
smaller chunks (~1 GB) so that I can run a MapReduce job on those files
and finish faster than running the job on one big file?

Thanks,
-JJ

Re: can we split a big gzipped file on HDFS ?

Posted by Robert Evans <ev...@yahoo-inc.com>.
JJ,

If it is just gzipped, then no.  gzip does not allow splitting, because you cannot seek to an arbitrary point in the file, move forward to a sync point, and start reading records from there; a gzip file is a single compressed stream that must be decompressed from the beginning.  If it is a sequence file with gzip compression, then yes, because the sequence file format compresses the data in chunks with sync markers between them, not the entire file at once.
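One practical workaround (not a built-in Hadoop tool; this is a hedged sketch, and the function name and chunk size are assumptions) is to stream-decompress the big .gz file once and rewrite it as many smaller, independently gzipped files, each of which Hadoop can then hand to its own mapper.  Cutting on line boundaries keeps each text record whole within one chunk:

```python
# Sketch: re-split one large gzip file into smaller, independently
# gzipped chunks. Each output file is a complete gzip stream, so a
# MapReduce job can process the chunks in parallel. Names and the
# 1 GB threshold here are illustrative assumptions, not Hadoop APIs.
import gzip

CHUNK_BYTES = 1 * 1024 * 1024 * 1024  # ~1 GB of uncompressed text per chunk

def split_gzip(src, dst_prefix, chunk_bytes=CHUNK_BYTES):
    """Stream src (.gz) and write line-aligned chunks as dst_prefix-NNNNN.gz."""
    part, written, out = 0, 0, None
    with gzip.open(src, "rt") as fin:
        for line in fin:
            if out is None:
                out = gzip.open(f"{dst_prefix}-{part:05d}.gz", "wt")
            out.write(line)
            written += len(line)
            if written >= chunk_bytes:   # cut only after a full line
                out.close()
                out, part, written = None, part + 1, 0
    if out is not None:
        out.close()
```

You would run this once on a machine with access to the file (or via hadoop fs -cat piped through a similar script), put the chunks back into HDFS, and point the job at the directory of chunks.  Note this costs one full decompress/recompress pass; converting to a block-compressed sequence file avoids repeating that cost for future jobs.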

--Bobby Evans
