Posted to user@hadoop.apache.org by Young-Geun Park <yo...@gmail.com> on 2012/09/07 01:25:25 UTC

Lzo vs SequenceFile for big file

Hi, All

I have tested which of LZO and SequenceFile performs better for a BIG
file.

The file size is 10 GiB and the WordCount MR job is used.
The inputs to the WordCount MR job are: an LZO file indexed by
LzoIndexTool (lzo),
a sequence file compressed with block-level Snappy (seq), and
the uncompressed original file (none).
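
For reference, a block-level Snappy sequence file like the seq input can be written along
these lines (a minimal sketch; the key type, paths, and the conversion step itself are
assumptions, not the exact code used):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

// Sketch: convert a local text file into a SequenceFile on HDFS with
// block-level Snappy compression, one line per record (LongWritable offset key).
public class TextToSnappySeq {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]),
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new SnappyCodec());
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    long offset = 0;
    String line;
    while ((line = in.readLine()) != null) {
      writer.append(new LongWritable(offset), new Text(line));  // one line per record
      offset += line.length() + 1;
    }
    in.close();
    writer.close();
  }
}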

Map output is compressed except in the uncompressed (none) case. The MapReduce job
output is not compressed in any case.
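
These settings correspond roughly to the following CDH3-era mapred.* properties
(a sketch; the map-output codec shown is an assumption, it could equally have been LzoCodec):

import org.apache.hadoop.conf.Configuration;

// Sketch of the compression settings described above. Property names are the
// old MRv1 ("mapred.*") ones used by CDH3; the codec choice is an assumption.
public class CompressionSettings {
  public static Configuration apply(Configuration conf) {
    conf.setBoolean("mapred.compress.map.output", true);   // lzo and seq cases only
    conf.set("mapred.map.output.compression.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");
    conf.setBoolean("mapred.output.compress", false);      // job output, all cases
    return conf;
  }
}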

The following are the WordCount MR running times:
none       lzo         seq
248s      243s     1410s

- Test environment

   - OS : CentOS 5.6 (x64) (kernel = 2.6.18)
   - # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
   - RAM : 18GB
   - Java version : 1.6.0_26
   - Hadoop version : CDH3U2
   - # of datanode(tasktracker) :  8

According to the results, the running time of the SequenceFile case is much greater than
the others.
Before testing, I had expected that the results for SequenceFile and
LZO would be about the same.

I would like to know why the performance of the Snappy-compressed sequence file is
so poor.

Am I missing anything in these tests?


Regards,
Park

Re: Lzo vs SequenceFile for big file

Posted by "박영근 (Alex)" <al...@nexr.com>.
Ruslan,
Thanks for your reply.

The jobs' statistics are as follows:

case 1 : uncompressed data(none)
12/08/09 16:12:44 INFO mapred.JobClient: Job complete: job_201208021633_0049
12/08/09 16:12:44 INFO mapred.JobClient: Counters: 23
12/08/09 16:12:44 INFO mapred.JobClient:   Job Counters
12/08/09 16:12:44 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3623053
12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient:     Rack-local map tasks=1
12/08/09 16:12:44 INFO mapred.JobClient:     Launched map tasks=166
12/08/09 16:12:44 INFO mapred.JobClient:     Data-local map tasks=165
12/08/09 16:12:44 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=220786
12/08/09 16:12:44 INFO mapred.JobClient:   FileSystemCounters
12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_READ=1852424288
12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_READ=10644581454
12/08/09 16:12:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1894096220
12/08/09 16:12:44 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/08/09 16:12:44 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input groups=13661
12/08/09 16:12:44 INFO mapred.JobClient:     Combine output records=69055428
12/08/09 16:12:44 INFO mapred.JobClient:     Map input records=158156100
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce shuffle bytes=33143186
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce output records=13661
12/08/09 16:12:44 INFO mapred.JobClient:     Spilled Records=122916251
12/08/09 16:12:44 INFO mapred.JobClient:     Map output bytes=15704921900
12/08/09 16:12:44 INFO mapred.JobClient:     Combine input records=1332132129
12/08/09 16:12:44 INFO mapred.JobClient:     Map output records=1265248800
12/08/09 16:12:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=19716
12/08/09 16:12:44 INFO mapred.JobClient:     Reduce input records=2172099

case2 : lzo
12/08/09 15:58:11 INFO mapred.JobClient: Job complete: job_201208021633_0048
12/08/09 15:58:11 INFO mapred.JobClient: Counters: 23
12/08/09 15:58:11 INFO mapred.JobClient:   Job Counters
12/08/09 15:58:11 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3361287
12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient:     Rack-local map tasks=4
12/08/09 15:58:11 INFO mapred.JobClient:     Launched map tasks=65
12/08/09 15:58:11 INFO mapred.JobClient:     Data-local map tasks=61
12/08/09 15:58:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=183529
12/08/09 15:58:11 INFO mapred.JobClient:   FileSystemCounters
12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_READ=568178351
12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_READ=3860287251
12/08/09 15:58:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=576095398
12/08/09 15:58:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/08/09 15:58:11 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input groups=13661
12/08/09 15:58:11 INFO mapred.JobClient:     Combine output records=66734193
12/08/09 15:58:11 INFO mapred.JobClient:     Map input records=158156100
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce shuffle bytes=4752406
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce output records=13661
12/08/09 15:58:11 INFO mapred.JobClient:     Spilled Records=132612729
12/08/09 15:58:11 INFO mapred.JobClient:     Map output bytes=15704921900
12/08/09 15:58:11 INFO mapred.JobClient:     Combine input records=1331190655
12/08/09 15:58:11 INFO mapred.JobClient:     Map output records=1265248800
12/08/09 15:58:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=7366
12/08/09 15:58:11 INFO mapred.JobClient:     Reduce input records=792338

case3 : sequence file compressed block-level by snappy
12/09/05 18:33:00 INFO mapred.JobClient: Job complete: job_201209051652_0008
12/09/05 18:33:00 INFO mapred.JobClient: Counters: 23
12/09/05 18:33:00 INFO mapred.JobClient:   Job Counters
12/09/05 18:33:00 INFO mapred.JobClient:     Launched reduce tasks=1
12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5885897
12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient:     Rack-local map tasks=2
12/09/05 18:33:00 INFO mapred.JobClient:     Launched map tasks=68
12/09/05 18:33:00 INFO mapred.JobClient:     Data-local map tasks=66
12/09/05 18:33:00 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=1320075
12/09/05 18:33:00 INFO mapred.JobClient:   FileSystemCounters
12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_READ=3706936196
12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_READ=4419150507
12/09/05 18:33:00 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=4581439981
12/09/05 18:33:00 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=211440
12/09/05 18:33:00 INFO mapred.JobClient:   Map-Reduce Framework
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input groups=13661
12/09/05 18:33:00 INFO mapred.JobClient:     Combine output records=0
12/09/05 18:33:00 INFO mapred.JobClient:     Map input records=158156100
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce shuffle bytes=857964933
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce output records=13661
12/09/05 18:33:00 INFO mapred.JobClient:     Spilled Records=6232725043
12/09/05 18:33:00 INFO mapred.JobClient:     Map output bytes=15704921900
12/09/05 18:33:00 INFO mapred.JobClient:     Combine input records=0
12/09/05 18:33:00 INFO mapred.JobClient:     Map output records=1265248800
12/09/05 18:33:00 INFO mapred.JobClient:     SPLIT_RAW_BYTES=8382
12/09/05 18:33:00 INFO mapred.JobClient:     Reduce input records=1265248800
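
For reference, the job is essentially the stock WordCount example with the combiner enabled
(which is what the non-zero combine counters in cases 1 and 2 reflect). A minimal sketch using
the old mapred API follows; class names and the driver wiring are assumptions rather than the
exact code that was run:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Minimal WordCount sketch (old mapred API); not the exact classes that were run.
public class WordCount {

  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        out.collect(word, ONE);   // emit (token, 1) for every word in the line
      }
    }
  }

  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(TokenMapper.class);
    conf.setCombinerClass(SumReducer.class);  // combiner = reducer, as in the stock example
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
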
Regards,
Park

2012/9/7 Ruslan Al-Fakikh <ru...@jalent.ru>

> Hi,
>
> I would be interesting to see the jobs' statistics (counters).
>
> Thanks
>
> On Fri, Sep 7, 2012 at 3:25 AM, Young-Geun Park
> <yo...@gmail.com> wrote:
> > Hi, All
> >
> > I have tested which method is better between Lzo and SequenceFile for a
> BIG
> > file.
> >
> > File size is 10GiB and WordCount MR is used.
> > Inputs of WordCount MR are  lzo which would be indexed by
> LzoIndexTool(lzo),
> > sequence file which is compressed by block level snappy(seq)  , and
> > uncompressed original file(none).
> >
> > Map output  is compressed except of uncompressed file. mapreduce output
> is
> > not compressed for all cases.
> >
> > The following are wordcount MR running time;
> > none       lzo         seq
> > 248s      243s     1410s
> >
> > -Test Environments
> >
> > OS : CentOS 5.6 (x64) (kernel = 2.6.18)
> > # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
> > RAM : 18GB
> > Java version : 1.6.0_26
> > Hadoop version : CDH3U2
> > # of datanode(tasktracker) :  8
> >
> > According to the result, The running time of SequnceFile is much less
> than
> > the others.
> > Before testing, I had expected that the results of  both SequenceFile and
> > Lzo are about the same.
> >
> > I want to know why performance of the sequence file compressed by snappy
> is
> > so bad?
> >
> > do I miss anything in tests?
> >
> >
> > Regards,
> > Park
> >
> >
>
>
>
> --
> Best Regards,
> Ruslan Al-Fakikh
>

Re: Lzo vs SequenceFile for big file

Posted by Ruslan Al-Fakikh <ru...@jalent.ru>.
Hi,

It would be interesting to see the jobs' statistics (counters).

Thanks

On Fri, Sep 7, 2012 at 3:25 AM, Young-Geun Park
<yo...@gmail.com> wrote:
> Hi, All
>
> I have tested which method is better between Lzo and SequenceFile for a BIG
> file.
>
> File size is 10GiB and WordCount MR is used.
> Inputs of WordCount MR are  lzo which would be indexed by LzoIndexTool(lzo),
> sequence file which is compressed by block level snappy(seq)  , and
> uncompressed original file(none).
>
> Map output  is compressed except of uncompressed file. mapreduce output is
> not compressed for all cases.
>
> The following are wordcount MR running time;
> none       lzo         seq
> 248s      243s     1410s
>
> -Test Environments
>
> OS : CentOS 5.6 (x64) (kernel = 2.6.18)
> # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
> RAM : 18GB
> Java version : 1.6.0_26
> Hadoop version : CDH3U2
> # of datanode(tasktracker) :  8
>
> According to the result, The running time of SequnceFile is much less than
> the others.
> Before testing, I had expected that the results of  both SequenceFile and
> Lzo are about the same.
>
> I want to know why performance of the sequence file compressed by snappy is
> so bad?
>
> do I miss anything in tests?
>
>
> Regards,
> Park
>
>



-- 
Best Regards,
Ruslan Al-Fakikh

Re: Lzo vs SequenceFile for big file

Posted by Harsh J <ha...@cloudera.com>.
A few things:

Storing simple, singular text records in sequence files isn't
optimal, as you're just adding overhead for every line of text stored
as a Text type in it. If you have typed data and can benefit from
type-based serialization for each record, go for a container format
like SequenceFiles (with whatever serialization technique) or Avro
DataFiles (which have embedded schema support, among other niceties).

When comparing the result with LZO, also factor in the indexing time,
as that is part of what is required to make it parallel (I think the
newer libraries auto-index, but that is just what I heard was the plan;
I don't know if it is already available).
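
For example, a rough way to measure the indexing step on its own (a sketch; the class name
is assumed from the hadoop-lzo project, which is presumably what the original post calls
LzoIndexTool):

// Sketch: invoke the hadoop-lzo indexer programmatically and time it, so the
// one-time indexing cost can be added to the lzo case's end-to-end time.
// com.hadoop.compression.lzo.LzoIndexer is an assumption about the class behind
// "LzoIndexTool"; adjust it to whatever indexer your hadoop-lzo build ships.
public class TimeLzoIndexing {
  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    com.hadoop.compression.lzo.LzoIndexer.main(new String[] { args[0] });
    System.out.println("Indexing took " + (System.currentTimeMillis() - start) + " ms");
  }
}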

On Fri, Sep 7, 2012 at 4:55 AM, Young-Geun Park
<yo...@gmail.com> wrote:
> Hi, All
>
> I have tested which method is better between Lzo and SequenceFile for a BIG
> file.
>
> File size is 10GiB and WordCount MR is used.
> Inputs of WordCount MR are  lzo which would be indexed by LzoIndexTool(lzo),
> sequence file which is compressed by block level snappy(seq)  , and
> uncompressed original file(none).
>
> Map output  is compressed except of uncompressed file. mapreduce output is
> not compressed for all cases.
>
> The following are wordcount MR running time;
> none       lzo         seq
> 248s      243s     1410s
>
> -Test Environments
>
> OS : CentOS 5.6 (x64) (kernel = 2.6.18)
> # of Core  : 8 (cpu = Intel(R) Xeon(R) CPU E5504  @ 2.00GHz)
> RAM : 18GB
> Java version : 1.6.0_26
> Hadoop version : CDH3U2
> # of datanode(tasktracker) :  8
>
> According to the result, The running time of SequnceFile is much less than
> the others.
> Before testing, I had expected that the results of  both SequenceFile and
> Lzo are about the same.
>
> I want to know why performance of the sequence file compressed by snappy is
> so bad?
>
> do I miss anything in tests?
>
>
> Regards,
> Park
>
>



-- 
Harsh J
