Posted to common-dev@hadoop.apache.org by 李钰 <ca...@gmail.com> on 2010/06/09 11:59:29 UTC

Problem found while using LZO compression in Hadoop 0.20.1

Hi,

While using LZO compression to try to improve the performance of my cluster,
I found that compression didn't work. The job I ran is
"org.apache.hadoop.examples.Sort", with the input data generated by
"org.apache.hadoop.examples.RandomWriter".
I've made sure that I configured the LZO native library and jar files
correctly and set all compression-related parameters (such as
"mapred.compress.map.output", "mapred.output.compression.type",
"mapred.output.compression.codec", "mapred.output.compress" and
"map.output.compression.codec"), and according to the job logs the
tasktracker did compress the map/job output. But the output file is not
compressed at all!
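
For completeness, this is roughly how those settings translate into the job
driver code (just a sketch against the 0.20 JobConf API, using the property
names exactly as listed above; the codec class name
com.hadoop.compression.lzo.LzoCodec is the one from the hadoop-lzo package,
so adjust it if yours differs):

    // Sketch only: compression settings applied in a 0.20-style job driver,
    // before submitting the job with JobClient.runJob(conf).
    JobConf conf = new JobConf(Sort.class);
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.type", "BLOCK");
    conf.set("mapred.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");
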
Then I searched the internet and found from
http://wiki.apache.org/hadoop/SequenceFile that in the *SequenceFile Common
Header* there are two bytes that indicate whether compression and block
compression are turned on for the file. I checked the sequence file
generated by RandomWriter, and the result is as follows:

[hdpadmin@shihc008 rand-10mb]$ od -c part-00000 | head -n 15
0000000   S   E   Q 006   "   o   r   g   .   a   p   a   c   h   e   .
0000020   h   a   d   o   o   p   .   i   o   .   B   y   t   e   s   W
0000040   r   i   t   a   b   l   e   "   o   r   g   .   a   p   a   c
0000060   h   e   .   h   a   d   o   o   p   .   i   o   .   B   y   t
0000100   e   s   W   r   i   t   a   b   l   e  *\0  \0*  \0  \0  \0  \0
0000120 244   n   ! 177   L 316 030   q   g 035 351   L   ; 024 216 031
0000140  \0  \0  \t 234  \0  \0 001 305  \0  \0 001 301 207   v   5 255
0000160 220   ] 236   <  \b 367   &   9 241  \b   v 303   m 314 203 220
0000200 335  \0 241 325 232 035 037 267 303 360  \n 025   u   P 003 220
0000220   ^ 235 247 036   S 265 271 035   S 247   O   5 337   + 020   q
0000240 277   - 003 212   . 230 221   G 241   5   K   K 031 273 036 206
0000260   ( 317 303 367 351 214 364 262 340   S 211 230  \r 362   % 335
0000300   }   H   w   & 234   S   F 324 321 274   F 377   [ 344   [   h
0000320 204 001 265   ] 037   _   r   , 020 370 246 327 231 017 205 252
0000340 273 016 310   w 361 326 032 332 200   Y  \a   X 342  \r 016 364
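
(The same two flags can also be read back programmatically; below is a
minimal, self-contained sketch against the 0.20 SequenceFile.Reader API,
with the file path hard-coded just for illustration:)

    // Minimal check of a SequenceFile's compression flags (0.20 API sketch).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class SeqFileCompressionCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path; point this at the actual output file.
        Path part = new Path("rand-10mb/part-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        try {
          System.out.println("compressed:       " + reader.isCompressed());
          System.out.println("block compressed: " + reader.isBlockCompressed());
          System.out.println("codec:            " + reader.getCompressionCodec());
        } finally {
          reader.close();
        }
      }
    }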

I found that the two marked bytes are set to zero, which means compression
is turned off for the file. Since the value of both bytes is '\0', I guess
this may be a defect where we forgot to set these two bytes, so the
sequence file generated by RandomWriter cannot be compressed. I don't know
whether the same thing happens anywhere else.
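
As far as I understand, those two header bytes are written based on the
CompressionType (and codec) that the SequenceFile writer is created with,
roughly like the sketch below (an illustrative fragment only, not the
actual RandomWriter code; it reuses the conf/fs/imports from the previous
sketch plus BytesWritable, ReflectionUtils and the hadoop-lzo LzoCodec):

    // Sketch: the compress / block-compress header flags are derived from
    // the CompressionType passed to createWriter. Passing
    // CompressionType.NONE would leave both flag bytes at zero, which is
    // what the od output above shows.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("rand-10mb/part-compressed"),  // illustrative path
        BytesWritable.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK,
        ReflectionUtils.newInstance(LzoCodec.class, conf));
    writer.append(new BytesWritable("key".getBytes()),
                  new BytesWritable("value".getBytes()));
    writer.close();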

Is my understanding right? If not, does anybody know what causes
compression not to work? Looking forward to your reply!

Thanks and Best Regards,
Carp

Re: Problem found while using LZO compression in Hadoop 0.20.1

Posted by 李钰 <ca...@gmail.com>.
Hi Todd,

Thanks for your reply. I got the LZO libraries from exactly the same link
on github and built them successfully, so I don't think that is the cause.

Hi Guys,

Any other comments? Thanks.

Best Regards,
Carp
2010/6/9 Todd Lipcon <to...@cloudera.com>

> Hi,
>
> Where did you get the LZO libraries? The ones on Google Code are broken;
> please use the ones on github:
>
> http://github.com/toddlipcon/hadoop-lzo
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Problem found while using LZO compression in Hadoop 0.20.1

Posted by Todd Lipcon <to...@cloudera.com>.
Hi,

Where did you get the LZO libraries? The ones on Google Code are broken;
please use the ones on github:

http://github.com/toddlipcon/hadoop-lzo

Thanks
-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera