Posted to mapreduce-user@hadoop.apache.org by Pedro Costa <ps...@gmail.com> on 2011/02/15 11:35:48 UTC

compressed map intermediate files

Hi,

I ran two MR jobs with the same input files and with 3 Reduce tasks
defined. One job has the map-intermediate files compressed, and the
other leaves them uncompressed. Below are some debug lines that I
added to the code.

1 - With the uncompressed data, the raw length is always smaller than
the partition length, but with the compressed data it is not. Why is
the raw length bigger than the partition length in the compressed data?

2 - If we define the map-intermediate files as compressed, how are the
map-intermediate files distributed to the reduces? Since we can
split a compressed file, does this mean that each spill file is
compressed? For example, Compressed(Spill idx 0) goes to Reduce 0,
Compressed(Spill idx 1) goes to Reduce 1 and Compressed(Spill idx 2)
goes to Reduce 2.

Compressed data

Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459

Spill idx 0 - SegmentStart: 0 Part length: 10202 Raw length: 26785
Spill idx 1 - SegmentStart: 10202 Part length: 9932 Raw length: 26100
Spill idx 2 - SegmentStart: 20134 Part length: 9926 Raw length: 25821

Spill idx 0 - SegmentStart: 0 Part length: 9410 Raw length: 24503
Spill idx 1 - SegmentStart: 9410 Part length: 9849 Raw length: 25564
Spill idx 2 - SegmentStart: 19259 Part length: 9489 Raw length: 24716

Spill idx 0 - SegmentStart: 0 Part length: 1661 Raw length: 3440
Spill idx 1 - SegmentStart: 1661 Part length: 1527 Raw length: 3160
Spill idx 2 - SegmentStart: 3188 Part length: 1737 Raw length: 3750



Non-compressed data

Spill idx 0 - SegmentStart: 0 Part length: 27571 Raw length: 27567
Spill idx 1 - SegmentStart: 27571 Part length: 26007 Raw length: 26003
Spill idx 2 - SegmentStart: 53578 Part length: 26463 Raw length: 26459

Spill idx 0 - SegmentStart: 0 Part length: 26789 Raw length: 26785
Spill idx 1 - SegmentStart: 26789 Part length: 26104 Raw length: 26100
Spill idx 2 - SegmentStart: 52893 Part length: 25825 Raw length: 25821

Spill idx 0 - SegmentStart: 0 Part length: 24507 Raw length: 24503
Spill idx 1 - SegmentStart: 24507 Part length: 25568 Raw length: 25564
Spill idx 2 - SegmentStart: 50075 Part length: 24720 Raw length: 24716

Spill idx 0 - SegmentStart: 0 Part length: 3444 Raw length: 3440
Spill idx 1 - SegmentStart: 3444 Part length: 3164 Raw length: 3160
Spill idx 2 - SegmentStart: 6608 Part length: 3754 Raw length: 3750
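
The debug lines can be checked mechanically. The sketch below is not
Hadoop code: it parses the first group of each run (copied from the
lines above) and tests three things that can be read off the numbers.
Each SegmentStart is the running sum of the preceding part lengths, the
raw lengths are identical in both runs, and in the uncompressed run the
part length exceeds the raw length by a constant amount. The helper
names (parse, offsets_are_contiguous) are made up for illustration, and
reading "raw length" as the size before compression and "part length"
as the bytes on disk is an inference from the logs, not something
stated in this thread.

```python
import re

# Pattern for the debug lines shown in this message.
LINE = re.compile(
    r"Spill idx (\d+) - SegmentStart: (\d+) "
    r"Part length: (\d+) Raw length: (\d+)"
)

def parse(log):
    """Return a list of (idx, start, part_len, raw_len) integer tuples."""
    return [tuple(int(g) for g in m.groups()) for m in LINE.finditer(log)]

def offsets_are_contiguous(records):
    """True if every SegmentStart equals the sum of the earlier part lengths."""
    expected = 0
    for idx, start, part_len, _raw_len in records:
        if idx == 0:
            expected = 0  # a new group starts again at offset 0
        if start != expected:
            return False
        expected += part_len
    return True

compressed = parse("""
Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459
""")

uncompressed = parse("""
Spill idx 0 - SegmentStart: 0 Part length: 27571 Raw length: 27567
Spill idx 1 - SegmentStart: 27571 Part length: 26007 Raw length: 26003
Spill idx 2 - SegmentStart: 53578 Part length: 26463 Raw length: 26459
""")

contiguous = (offsets_are_contiguous(compressed)
              and offsets_are_contiguous(uncompressed))

# Raw lengths are the same whether or not compression is on.
same_raw = [r[3] for r in compressed] == [r[3] for r in uncompressed]

# Extra bytes on disk per segment when compression is off.
overhead = [part - raw for _idx, _start, part, raw in uncompressed]

# On-disk size relative to raw size when compression is on.
ratios = [round(part / raw, 3) for _idx, _start, part, raw in compressed]

print(contiguous, same_raw)
print(overhead)  # [4, 4, 4]
print(ratios)
```

The same checks hold for the other three groups if they are pasted in.
Where the constant 4 extra bytes come from is not established by these
logs.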


Thanks,

-- 
Pedro

Re: compressed map intermediate files

Posted by Pedro Costa <ps...@gmail.com>.
As I understand from the log lines that I posted, since we have 3
Reduces in the example, all spill 0 files will be merged to go to
Reduce 0, all spill 1 files will be merged to go to Reduce 1 and all
spill 2 files will be merged to go to Reduce 2.

Does this mean that, if we turn compression on, it is the merged files
that are compressed?
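
To make that reading concrete, here is a small sketch that groups the
part lengths from the "Compressed data" log lines by spill idx and sums
them per Reduce. It assumes that each blank-line-separated group in the
log is one file and that idx N is the segment destined for Reduce N.
That is only my interpretation of the logs; it is not confirmed
behaviour, and bytes_per_reduce is a made-up helper, not a Hadoop API.

```python
from collections import defaultdict

# Part lengths copied from the "Compressed data" log lines: one inner
# list per blank-line-separated group, position = "Spill idx".
groups = [
    [10560, 10029, 10142],
    [10202, 9932, 9926],
    [9410, 9849, 9489],
    [1661, 1527, 1737],
]

def bytes_per_reduce(groups):
    """Sum the part lengths that share the same idx across all groups."""
    totals = defaultdict(int)
    for group in groups:
        for idx, part_len in enumerate(group):
            totals[idx] += part_len
    return dict(totals)

totals = bytes_per_reduce(groups)
print(totals)  # {0: 31833, 1: 31337, 2: 31294}
```

Under that assumption each Reduce would receive roughly 31 KB of
compressed data from these four groups.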

Thanks,








-- 
Pedro