Posted to user@hadoop.apache.org by xeon <xe...@gmail.com> on 2013/10/16 10:02:08 UTC

How to execute wordcount with compression?

Hi,


I want to execute the wordcount job on YARN with compression enabled, on a dir 
with several files, but for that I must compress the input first.

dir1/file1.txt
dir1/file2.txt
dir1/file3.txt
dir1/file4.txt
dir1/file5.txt

1 - Should I compress the whole dir or each file in the dir?

2 - Should I use gzip or bzip2?

3 - Do I need to setup any yarn configuration file?

4 - When the job is running, are the files decompressed before the 
mappers run and compressed again after the reducers finish?

-- 
Thanks,


Re: How to execute wordcount with compression?

Posted by Yanbo Liang <ya...@gmail.com>.
Compression is independent of YARN.
If you want to store files compressed, compress them as they are loaded
into HDFS.
Files on HDFS are decompressed using the codecs listed in the
"io.compression.codecs" parameter set in core-site.xml.
If you want to use a custom compression format, point the job at the
corresponding input-format class, such as
"com.hadoop.mapred.DeprecatedLzoTextInputFormat" (in Hive this is what
"STORED AS INPUTFORMAT" does).
1. Compress each file in the dir rather than the whole dir.
2. By compression ratio, bzip2 > gzip > lzo; decompression speed is in the
opposite order, so there is a trade-off. gzip is the popular choice as far
as I know, but note that bzip2 files are splittable while plain gzip files
are not, which matters for large inputs.
3. No extra YARN configuration is needed.
4. Yes: the input is decompressed transparently before the mappers run,
and the job output is compressed again only if output compression is
enabled (mapreduce.output.fileoutputformat.compress).
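A minimal sketch of points 1 and 2 above, assuming a standard Hadoop
install; the directory, file contents, and jar path are illustrative
examples, not taken from the thread:

```shell
# Compress each input file individually (point 1), then load into HDFS
# and run wordcount. Adjust paths for your cluster.
mkdir -p dir1
for i in 1 2 3 4 5; do echo "hello word count $i" > dir1/file$i.txt; done

# gzip each file in place; dir1 now holds file1.txt.gz ... file5.txt.gz
# and the uncompressed originals are removed.
gzip dir1/*.txt
ls dir1

# On a real cluster you would then upload and run the job, e.g.:
#   hdfs dfs -put dir1 /user/<you>/dir1
#   hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
#       wordcount /user/<you>/dir1 /user/<you>/out
```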


2013/10/16 xeon <xe...@gmail.com>

> Hi,
>
>
> I want execute the wordcount in yarn with compression enabled with a dir
> with several files, but for that I must compress the input.
>
> dir1/file1.txt
> dir1/file2.txt
> dir1/file3.txt
> dir1/file4.txt
> dir1/file5.txt
>
> 1 - Should I compress the whole dir or each file in the dir?
>
> 2 - Should I use gzip or bzip2?
>
> 3 - Do I need to setup any yarn configuration file?
>
> 4 - when the job is running, the files are decompressed before running the
> mappers and compressed again after reducers executed?
>
> --
> Thanks,
>
>
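Regarding question 4: input decompression is automatic, but reducer
output is only compressed if you ask for it. A sketch of the relevant
properties for mapred-site.xml (Hadoop 2.x names; they can also be
passed per-job with -D flags):

```xml
<!-- Compress job output with gzip. Older releases use the names
     mapred.output.compress / mapred.output.compression.codec. -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```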
