Posted to common-user@hadoop.apache.org by Billy Pearson <sa...@pearsonwholesale.com> on 2009/03/17 06:13:00 UTC

intermediate results not getting compressed

I am running a large streaming job that processes about 3TB of data, and I
am seeing large jumps in hard drive space usage in the reduce part of the
job. I tracked the problem down: the job is set to compress map outputs, but
looking at the intermediate files on the local drives, they are not getting
compressed during/after merges. I am going from having, say, 2GB of
mapfile.out files to one intermediate.X file that is 100-350% larger than
the map files. I have looked at one of the files and confirmed that it is
not getting compressed, as I can read the data in it. If it were only one
merge it would not be a problem, but when you are merging 70-100 of these
you use tons of GBs, and my tasks are starting to die as they run out of
hard drive space, which in the end kills the job.

I am running 0.19.1-dev, r744282. I have searched the issues but found 
nothing about the compression.
Shouldn't the intermediate results also be compressed if the map output
files are set to be compressed?
If not, then why do we have the map compression option? Just to save network
traffic?
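A quick way to tell a compressed file from a plain one without reading it all is to check its leading magic bytes. A minimal sketch, assuming a gzip-based codec (other codecs, e.g. LZO, have different signatures, and the sample record below is made up):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

def looks_gzip_compressed(data: bytes) -> bool:
    """True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

# A readable tab-separated record, like what `head` shows on an
# uncompressed intermediate file:
plain = b"key1\tvalue1\nkey2\tvalue2\n"
compressed = gzip.compress(plain)

print(looks_gzip_compressed(plain))       # False: readable text
print(looks_gzip_compressed(compressed))  # True: garbage to `head`
```

Running this on the first couple of bytes of a mapfile.out versus an intermediate.X file makes the "I can read the data in it" check mechanical.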



Re: intermediate results not getting compressed

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
Opened an issue:
https://issues.apache.org/jira/browse/HADOOP-5539

Billy


"Billy Pearson" <bi...@sbcglobal.net> 
wrote in message news:CECF0598D9CA40A08E777568361DE773@BillyPC...
> [quoted text snipped; it appears in full in the message below]



Re: intermediate results not getting compressed

Posted by Billy Pearson <bi...@sbcglobal.net>.

>
> How are you concluding that the intermediate output is compressed from 
> the map, but not in the reduce? -C

my hadoop-site.xml

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Should the map outputs be compressed?
  </description>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to be compressed as SequenceFiles,
               how should they be compressed? Should be one of NONE, RECORD
               or BLOCK.
  </description>
</property>


from the job.xml

mapred.output.compress = false // final output
mapred.compress.map.output = true // map output

Also, I can head the files from the command line and read the key/value
pairs in the reduce intermediate merge outputs, but not in the map.out files.
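To double-check what a job actually ran with, the two properties can be pulled straight out of the job.xml. A small sketch (the inline XML here is a stand-in for a real job.xml, not one taken from this job):

```python
import xml.etree.ElementTree as ET

# Stand-in for a real job.xml; only the two properties under discussion.
JOB_XML = """<?xml version="1.0"?>
<configuration>
  <property><name>mapred.compress.map.output</name><value>true</value></property>
  <property><name>mapred.output.compress</name><value>false</value></property>
</configuration>
"""

def get_prop(xml_text: str, wanted: str):
    """Return the value of a named property, or None if it is absent."""
    for prop in ET.fromstring(xml_text).findall("property"):
        if prop.findtext("name") == wanted:
            return prop.findtext("value")
    return None

print(get_prop(JOB_XML, "mapred.compress.map.output"))  # true  (map output)
print(get_prop(JOB_XML, "mapred.output.compress"))      # false (final output)
```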



Re: intermediate results not getting compressed

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I opened an issue here:
https://issues.apache.org/jira/browse/HADOOP-5539

If you would like to comment on it.

Billy

"Stefan Will" <st...@gmx.net> wrote in message 
news:C5E7DC6D.1840D%stefan.will@gmx.net...
>I noticed this too. I think the compression only applies to the final 
>mapper
> and reducer outputs, but not any intermediate files produced. The reducer
> will decompress the map output files after copying them, and then compress
> its own output only after it has finished.
>
> I wonder if this is by design, or just an oversight.
>
> -- Stefan
> [earlier messages snipped; they appear in full below]



Re: intermediate results not getting compressed

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
If CompressMapOutput is set, then it should carry all the way through the
reduce, including the map.out files and the intermediates.
I added some logging to the Merger; I have to wait until some more jobs
finish before I can rebuild and restart to see the logging,
but that will confirm whether or not the codec is null when it gets to line
432 and the writer is created for the intermediate files.

If it's null, I will open an issue.

Billy


"Stefan Will" <st...@gmx.net> wrote in message 
news:C5E7DC6D.1840D%stefan.will@gmx.net...
>I noticed this too. I think the compression only applies to the final 
>mapper
> and reducer outputs, but not any intermediate files produced. The reducer
> will decompress the map output files after copying them, and then compress
> its own output only after it has finished.
>
> I wonder if this is by design, or just an oversight.
>
> -- Stefan
> [earlier messages snipped; they appear in full below]



Re: intermediate results not getting compressed

Posted by Stefan Will <st...@gmx.net>.
I noticed this too. I think the compression only applies to the final mapper
and reducer outputs, but not any intermediate files produced. The reducer
will decompress the map output files after copying them, and then compress
its own output only after it has finished.

I wonder if this is by design, or just an oversight.

-- Stefan


> From: Billy Pearson <sa...@pearsonwholesale.com>
> Reply-To: <co...@hadoop.apache.org>
> Date: Wed, 18 Mar 2009 22:14:07 -0500
> To: <co...@hadoop.apache.org>
> Subject: Re: intermediate results not getting compressed
> 
> I can run head on the map.out files and get compressed garbage, but when I
> run head on an intermediate file I can read the data clearly, so the
> compression is not getting passed along, even though I am setting
> CompressMapOutput to true by default in my hadoop-site.xml file.
> 
> Billy
> [earlier message snipped; it appears in full below]



Re: intermediate results not getting compressed

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I can run head on the map.out files and get compressed garbage, but when I
run head on an intermediate file I can read the data clearly, so the
compression is not getting passed along, even though I am setting
CompressMapOutput to true by default in my hadoop-site.xml file.

Billy


"Billy Pearson" <sa...@pearsonwholesale.com> 
wrote in message news:gpscu3$66p$1@ger.gmane.org...
> [quoted message snipped; it appears in full in the message below]



Re: intermediate results not getting compressed

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
The intermediate.X files are not getting compressed for some reason, not
sure why.
I downloaded and built the latest branch for 0.19.

o.a.h.mapred.Merger.class line 432
new Writer<K, V>(conf, fs, outputFile, keyClass, valueClass, codec);

This seems to use the codec defined above, but for some reason it's not
working correctly: the compression is not carried from the map output files
through to the on-disk merge of the intermediate.X files.

tail task report from one server:

2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask: 
Interleaved on-disk merge complete: 1730 files left.
2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask: In-memory 
merge complete: 3 files left.
2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 3 
segments, 39835369 bytes in memory for intermediate, on-disk merge
2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging 
1730 files, 70359998581 bytes from disk
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 
segments, 0 bytes from memory into reduce
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733 
sorted segments
2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22 
intermediate segments out of a total of 1733
2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1712
2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1683
2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1654
2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1625
2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1596
2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1567
2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1538
2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1509
2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1480
2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1451
2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1422
2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1393
2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1364
2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1335
2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1306

The total size of the files is about ~70GB (70359998581 bytes), and these
are compressed. At this point it has gone from 1733 files to 1306 left to
merge, the intermediate.X files are already well over 200GB, and we are not
even close to done. If compression were working we should not see tasks
failing for lack of hard drive space at this point, since as we merge we
delete the merged files from the output folder.

I only see this happening when there are too many files left unmerged after
the shuffle stage and the reduce starts on-disk merging.
Tasks that complete their merges during the shuffle and keep the count below
the io.sort.factor (30 in my case) skip the on-disk merge and complete using
a normal amount of hard drive space.

Anyone care to take a look?
This job takes two or more days to get to this point, so it is getting to be
a pain to run it and watch the reduces fail and the job keep failing no
matter what.

I can post the tail of this task log when it fails, to show how far it gets
before it runs out of space. Before the reduce's on-disk merge starts, the
disks are about 35-40% used on 500GB drives, with two tasks running at the
same time.

Billy Pearson 
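The merge schedule in the log above is consistent with the usual merge-factor arithmetic: with N segments and factor f, the first pass merges fewer than f segments so that every later pass can merge exactly f, and each pass replaces k segments with 1. A sketch of that arithmetic (this reproduces the logged schedule from first principles; it is not Hadoop's actual Merger code, and the first-pass formula is the common (N-1) mod (f-1) rule inferred from the logged numbers):

```python
def first_pass_size(n: int, f: int) -> int:
    """Segments merged in the first pass so later passes can merge exactly f."""
    if f >= n:
        return n
    mod = (n - 1) % (f - 1)
    return f if mod == 0 else mod + 1

def merge_schedule(n: int, f: int):
    """Per-pass merge sizes until at most f segments remain."""
    passes = []
    while n > f:
        k = first_pass_size(n, f) if not passes else f
        passes.append(k)
        n = n - k + 1  # k segments in, 1 merged segment out
    return passes, n

# 1733 segments with io.sort.factor = 30, as in the log above:
passes, remaining = merge_schedule(1733, 30)
print(passes[:3], len(passes), remaining)  # [22, 30, 30] 59 30
```

Each of those 59 passes writes a new intermediate.X file, so if the merged outputs come out uncompressed while their inputs were compressed, every pass inflates the data on disk, which matches the 70GB-to-200GB+ growth described here.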



Re: intermediate results not getting compressed

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
Watching a second job with more reduce tasks running, it looks like the
in-memory merges are working correctly with compression.

The task I was watching failed and was run again; it shuffled all the map
output files and only started the merge after everything was copied, so none
were merged in memory before the on-disk merging started.
If it helps, the output files are named intermediate.X and are stored in the
folder mapred/local/job-taskname/intermediate.X, while the in-memory merge
outputs are stored in mapred/local/taskTracker/jobcache/job-name/taskname/.

The non-compressed ones are the intermediate.X files above.

Billy


"Chris Douglas" <ch...@yahoo-inc.com> wrote in 
message news:9BB78C3A-EFAB-45C3-8CC3-25AAB60DF914@yahoo-inc.com...
>> My problem is the output from merging the intermediate map output files
>> is not compressed, so I lose all the benefit of compressing the map file
>> output to save disk space, because the merged map output files are no
>> longer compressed.
>
> It should still be compressed, unless there's some bizarre regression. 
> More segments will be around simultaneously (since the segments not  yet 
> merged are still on disk), which clearly puts pressure on  intermediate 
> storage, but if the map outputs are compressed, then the  merged map 
> outputs at the reduce must also be compressed. There's no  place in the 
> intermediate format to store compression metadata, so  either all are or 
> none are. Intermediate merges should also follow the  compression spec of 
> the initiating merger, too (o.a.h.mapred.Merger: 447).
>
> How are you concluding that the intermediate output is compressed from 
> the map, but not in the reduce? -C
>
>>
>> ----- Original Message ----- From: "Chris Douglas" 
>> <chrisdo-ZXvpkYn067l8UrSeD/g0lQ@public.gmane.org
>> >
>> Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
>> To: 
>> <co...@public.gmane.org>
>> Sent: Tuesday, March 17, 2009 12:33 AM
>> Subject: Re: intermediate results not getting compressed
>>
>>
>>>> I am running 0.19.1-dev, r744282. I have searched the issues but 
>>>> found nothing about the compression.
>>>
>>> AFAIK, there are no open issues that prevent intermediate  compression 
>>> from working. The following might be useful:
>>>
>>> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression
>>>
>>>> Should the intermediate results not be compressed also if the map 
>>>> output files are set to be compressed?
>>>
>>> These are controlled by separate options.
>>>
>>> FileOutputFormat::setCompressOutput enables/disables compression  on 
>>> the final output
>>> JobConf::setCompressMapOutput enables/disables compression of the 
>>> intermediate output
>>>
>>>> If not then why do we have the map compression option just to save 
>>>> network traffic?
>>>
>>> That's part of it. Also to save on disk bandwidth and intermediate 
>>> space. -C
>>
>>
>
> 



Re: intermediate results not getting compressed

Posted by Chris Douglas <ch...@yahoo-inc.com>.
> My problem is that the output from merging the intermediate map output  
> files is not compressed, so I lose all the benefit of compressing the  
> map file output to save disk space, because the merged map output  
> files are no longer compressed.

It should still be compressed, unless there's some bizarre regression.  
More segments will be around simultaneously (since the segments not  
yet merged are still on disk), which clearly puts pressure on  
intermediate storage, but if the map outputs are compressed, then the  
merged map outputs at the reduce must also be compressed. There's no  
place in the intermediate format to store compression metadata, so  
either all are or none are. Intermediate merges should also follow the  
compression spec of the initiating merger, too (o.a.h.mapred.Merger: 
447).

How are you concluding that the intermediate output is compressed from  
the map, but not in the reduce? -C

>
> ----- Original Message ----- From: "Chris Douglas" <chrisdo-ZXvpkYn067l8UrSeD/g0lQ@public.gmane.org 
> >
> Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
> To: <co...@public.gmane.org>
> Sent: Tuesday, March 17, 2009 12:33 AM
> Subject: Re: intermediate results not getting compressed
>
>
>>> I am running 0.19.1-dev, r744282. I have searched the issues but   
>>> found nothing about the compression.
>>
>> AFAIK, there are no open issues that prevent intermediate  
>> compression from working. The following might be useful:
>>
>> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression
>>
>>> Should the intermediate results not be compressed also if the map   
>>> output files are set to be compressed?
>>
>> These are controlled by separate options.
>>
>> FileOutputFormat::setCompressOutput enables/disables compression  
>> on  the final output
>> JobConf::setCompressMapOutput enables/disables compression of the  
>> intermediate output
>>
>>> If not then why do we have the map compression option just to save  
>>> network traffic?
>>
>> That's part of it. Also to save on disk bandwidth and intermediate   
>> space. -C
>
>


Re: intermediate results not getting compressed

Posted by Billy Pearson <bi...@sbcglobal.net>.

I understand that. I have CompressMapOutput set and it works; the map outputs 
are compressed. But on the reduce end it downloads x files, then merges those x 
files into one intermediate file to keep the number of files to a minimum, 
<= io.sort.factor.

My problem is that the output from merging the intermediate map output files is 
not compressed, so I lose all the benefit of compressing the map file output 
to save disk space, because the merged map output files are no longer 
compressed.

Note there are two different types of intermediate files: the map outputs, and 
the one the reduce merges the map outputs into to meet the set io.sort.factor.
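
As a rough illustration of why those on-disk merges eat so much space, here is 
a simplified model (a sketch, not Hadoop's actual merge code; the real merge 
scheduling also trims the first pass so later passes come out even):

```python
def merge_passes(num_segments, factor):
    """Simplified model: each on-disk pass merges `factor` segments
    into one new intermediate file (so the segment count drops by
    factor - 1) until no more than `factor` segments remain."""
    passes = 0
    while num_segments > factor:
        num_segments -= factor - 1
        passes += 1
    return passes, num_segments

# e.g. 100 map outputs with io.sort.factor = 10:
print(merge_passes(100, 10))  # -> (10, 10)
```

Every pass rewrites already-merged data into a fresh intermediate.X file, so 
if those files come out uncompressed while the map outputs were compressed, 
the on-disk footprint can easily grow by the 100-350% described above.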

Billy



----- Original Message ----- 
From: "Chris Douglas" <ch...@public.gmane.org>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <co...@public.gmane.org>
Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed


>> I am running 0.19.1-dev, r744282. I have searched the issues but  found 
>> nothing about the compression.
>
> AFAIK, there are no open issues that prevent intermediate compression 
> from working. The following might be useful:
>
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression
>
>> Should the intermediate results not be compressed also if the map  output 
>> files are set to be compressed?
>
> These are controlled by separate options.
>
> FileOutputFormat::setCompressOutput enables/disables compression on  the 
> final output
> JobConf::setCompressMapOutput enables/disables compression of the 
> intermediate output
>
>> If not then why do we have the map compression option just to save 
>> network traffic?
>
> That's part of it. Also to save on disk bandwidth and intermediate 
>  space. -C
> 



Re: intermediate results not getting compressed

Posted by Chris Douglas <ch...@yahoo-inc.com>.
> I am running 0.19.1-dev, r744282. I have searched the issues but  
> found nothing about the compression.

AFAIK, there are no open issues that prevent intermediate compression  
from working. The following might be useful:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression

> Should the intermediate results not be compressed also if the map  
> output files are set to be compressed?

These are controlled by separate options.

FileOutputFormat::setCompressOutput enables/disables compression on  
the final output
JobConf::setCompressMapOutput enables/disables compression of the  
intermediate output
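
The same pair can also be set in configuration rather than code. A sketch 
using the 0.19-era property names as I recall them; confirm against your 
build's hadoop-default.xml before relying on them:

```xml
<!-- Sketch only: verify these property names for your Hadoop version. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Compress the intermediate (map) output.</description>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
  <description>Compress the final job output.</description>
</property>
```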

> If not then why do we have the map compression option just to save  
> network traffic?

That's part of it. Also to save on disk bandwidth and intermediate  
space. -C