Posted to common-user@hadoop.apache.org by Arv Mistry <ar...@kindsight.net> on 2010/06/01 16:28:01 UTC

Writing compressed data to HDFS

Hi,

I have a Java process that writes compressed data to HDFS. I do this by
wrapping the FSDataOutputStream in a GZIPOutputStream and calling the
write() method, i.e. something like:

FSDataOutputStream out = fs.create(file);
gzip = new GZIPOutputStream(out);
gzip.write("sss".getBytes("UTF8"));

The file seems to get written ok. 

However, when I get the file out of HDFS and try to unzip it, gunzip
complains:

gunzip: cs_1_20100601_120000_1275396891183.cgz: unknown suffix -- ignored

When I run 'file' on it, it is recognized as 'gzip compressed data, from
FAT filesystem (MS-DOS, OS/2, NT)'

Any ideas? Appreciate any help.

Cheers Arv

Re: Writing compressed data to HDFS

Posted by Eric Sammer <es...@cloudera.com>.
This isn't really a Hadoop issue: gunzip refuses to decompress files
that don't have a well-known suffix. Rename the file so that it ends in
.gz and try again, or use the -S option to specify an alternate
suffix.
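
One way to confirm that the data itself is fine, and that only the file
name is the problem, is to round-trip the bytes in Java. The sketch below
is an illustration, not code from this thread: the class and method names
(GzipRoundTrip, compress, decompress, hasGzipMagic) are made up for the
example, and an in-memory ByteArrayOutputStream stands in for the
FSDataOutputStream returned by fs.create(file). It also closes the
GZIPOutputStream, which is what writes the gzip trailer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Compress text the same way as in the question, but into memory.
    // Assumption: the ByteArrayOutputStream stands in for the HDFS stream.
    public static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // try-with-resources closes the gzip stream, which writes the trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return sink.toByteArray();
    }

    // Decompress gzip bytes back into a string.
    public static String decompress(byte[] data) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                result.write(buffer, 0, n);
            }
        }
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }

    // A gzip stream starts with the two magic bytes 0x1f 0x8b. This is the
    // kind of content check that 'file' relies on, independent of the name.
    public static boolean hasGzipMagic(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = compress("sss");
        System.out.println("gzip magic present: " + hasGzipMagic(compressed));
        System.out.println("round trip: " + decompress(compressed));
    }
}
```

If the round trip succeeds, the .cgz file should be valid gzip data, and
the gunzip message is only about the suffix. In that case something like
"gunzip -S .cgz <file>" or renaming the file to end in .gz, as suggested
above, should be enough.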


-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com