Posted to mapreduce-dev@hadoop.apache.org by Ted Yu <yu...@gmail.com> on 2011/02/02 03:36:57 UTC

"Map input bytes" vs HDFS_BYTES_READ

In Hadoop 0.20.2, what's the relationship between "Map input bytes" and
HDFS_BYTES_READ?

<counter group="FileSystemCounters" name="HDFS_BYTES_READ">203446204073</counter>
<counter group="FileSystemCounters" name="HDFS_BYTES_WRITTEN">23413127561</counter>
<counter group="Map-Reduce Framework" name="Map input records">163502600</counter>
<counter group="Map-Reduce Framework" name="Spilled Records">0</counter>
<counter group="Map-Reduce Framework" name="Map input bytes">965922136488</counter>
<counter group="Map-Reduce Framework" name="Map output records">296754600</counter>

Thanks

Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Ravi Gummadi <gr...@yahoo-inc.com>.
Ted Yu wrote:
> Ravi:
> Can you illustrate the situation where map output doesn't fit in io.sort.mb
> ?
>
> Thanks
>   
Basically, map output is spilled to local disk in chunks of size 
io.sort.mb (i.e. the value of the config property io.sort.mb set for 
that job). So if the user's map() method in a map task outputs 500MB of 
data (say, reading X MB of input split) and io.sort.mb=128, then
the first 128MB out of this 500MB becomes the 1st spill to disk,
the 2nd 128MB out of this 500MB becomes the 2nd spill to disk,
and so on.

Anyway, all these spilled files are merged together to form the final 
single map output file for that map task. This merging can happen in 
multiple passes --- for example, if io.sort.factor=2, then only 2 spills 
are merged at a time, resulting in multiple merge passes with 
intermediate writes to local disk.

So depending on the number of spills and intermediate merges that 
happen, the local-file-bytes-read counter can be a lot bigger than the 
actual map input bytes.

The Spilled Records counter (the number of records spilled to disk in 
the map task) may help with your calculation to some extent. But it is 
a record count, not bytes.

-Ravi
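
[Editor's note: for illustration, here is a back-of-envelope sketch in Java of
the accounting Ravi describes, using hypothetical numbers (500MB of map output,
io.sort.mb=128, io.sort.factor=2). The real MapTask spill logic also depends on
io.sort.record.percent, io.sort.spill.percent, combiners and compression, so
treat this only as a rough estimate of why local file bytes can exceed map
input bytes.]

// Back-of-envelope estimate (not the actual MapTask code): how many spill
// files and merge passes a map task might produce, and roughly how much
// local disk I/O that implies. All numbers below are hypothetical.
public class SpillEstimate {
  public static void main(String[] args) {
    long mapOutputMB  = 500;  // bytes emitted by map(), in MB (example value)
    long ioSortMB     = 128;  // io.sort.mb
    int  ioSortFactor = 2;    // io.sort.factor

    // Each time the in-memory buffer fills up, one spill file is written.
    long numSpills = (mapOutputMB + ioSortMB - 1) / ioSortMB;       // -> 4

    // Spills are merged ioSortFactor at a time until one file remains;
    // each merge pass re-reads and re-writes (roughly) the whole output.
    long mergePasses = 0;
    long files = numSpills;
    while (files > 1) {
      files = (files + ioSortFactor - 1) / ioSortFactor;
      mergePasses++;
    }

    // Local bytes written ~= the initial spills plus one full rewrite per
    // merge pass (the last pass produces the final map output file).
    long approxLocalWriteMB = mapOutputMB * (1 + mergePasses);
    System.out.println("spills=" + numSpills
        + ", mergePasses=" + mergePasses
        + ", approxLocalWriteMB=" + approxLocalWriteMB);
  }
}

With these example numbers the map task writes roughly 3x its output size to
local disk, which is the amplification Ravi refers to.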


Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Ted Yu <yu...@gmail.com>.
Ravi:
Can you illustrate the situation where map output doesn't fit in io.sort.mb?

Thanks

On Thu, Feb 3, 2011 at 8:14 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:

> Ted Yu wrote:
>
>> From my limited experiment, I think "Map input bytes" reflects the number
>> of
>> bytes of local data file(s) when LocalJobRunner is used.
>>
>> Correct me if I am wrong.
>>
>>
> This is correct only if there is a single spill (and not multiple spills)
> i.e. all the map output fits in io.sort.mb.
>
> -Ravi

Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Ravi Gummadi <gr...@yahoo-inc.com>.
Ted Yu wrote:
> From my limited experiment, I think "Map input bytes" reflects the number of
> bytes of local data file(s) when LocalJobRunner is used.
>
> Correct me if I am wrong.
>   
This is correct only if there is a single spill (and not multiple 
spills), i.e. all of the map output fits in io.sort.mb.

-Ravi


Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Ted Yu <yu...@gmail.com>.
From my limited experiment, I think "Map input bytes" reflects the number of
bytes of the local data file(s) when LocalJobRunner is used.

Correct me if I am wrong.

On Tue, Feb 1, 2011 at 7:52 PM, Harsh J <qw...@gmail.com> wrote:

> Each task counts independently of its attempt/other tasks, thereby
> making the aggregates easier to control. Final counters are aggregated
> only from successfully committed tasks. During the job's run, however,
> counters are shown aggregated from the most successful attempts of a
> task thus far.

Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Ted Yu <yu...@gmail.com>.
Harsh:
When LocalJobRunner is used, only "Map input bytes" is calculated.

Can you comment on this case?

Thanks

On Tue, Feb 1, 2011 at 7:52 PM, Harsh J <qw...@gmail.com> wrote:

> Each task counts independently of its attempt/other tasks, thereby
> making the aggregates easier to control. Final counters are aggregated
> only from successfully committed tasks. During the job's run, however,
> counters are shown aggregated from the most successful attempts of a
> task thus far.

Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Harsh J <qw...@gmail.com>.
Each task attempt counts independently of its other attempts and of other
tasks, which makes the aggregates easier to control. Final counters are
aggregated only from successfully committed tasks. While the job is running,
however, counters are shown aggregated from the most successful attempt of
each task thus far.
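
[Editor's note: as a hedged illustration of how this looks from the client
side, the sketch below uses the old org.apache.hadoop.mapred client API
roughly as it existed in 0.20 (JobClient, RunningJob, TaskReport); the class
name and job id are made up. It prints the "Map input bytes" counter for each
map task and compares the sum with the job-level aggregate; for a finished job
the two should agree, since only committed task attempts contribute to the
final counters.]

// Minimal sketch (not from the thread): dump the "Map input bytes" counter
// per map task and compare the sum with the job-level aggregate.
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskReport;

public class CounterDump {

  // Look a counter up by the display names that appear in the job history
  // XML (e.g. group "Map-Reduce Framework", counter "Map input bytes").
  static long byDisplayName(Counters counters, String group, String name) {
    for (Counters.Group g : counters) {
      if (group.equals(g.getDisplayName())) {
        for (Counters.Counter c : g) {
          if (name.equals(c.getDisplayName())) {
            return c.getCounter();
          }
        }
      }
    }
    return 0L;
  }

  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    JobID id = JobID.forName(args[0]);   // e.g. job_201102020001_0042 (hypothetical)

    long sumOverTasks = 0;
    for (TaskReport report : client.getMapTaskReports(id)) {
      long bytes = byDisplayName(report.getCounters(),
          "Map-Reduce Framework", "Map input bytes");
      sumOverTasks += bytes;
      System.out.println(report.getTaskID() + "  Map input bytes = " + bytes);
    }

    RunningJob job = client.getJob(id);
    long jobTotal = byDisplayName(job.getCounters(),
        "Map-Reduce Framework", "Map input bytes");
    System.out.println("sum over map tasks = " + sumOverTasks
        + ", job aggregate = " + jobTotal);
  }
}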

On Wed, Feb 2, 2011 at 9:09 AM, Ted Yu <yu...@gmail.com> wrote:
> If map task(s) were retried (mapred.map.max.attempts times), how would these
> two counters be affected ?
>
> Thanks



-- 
Harsh J
www.harshj.com

Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Ted Yu <yu...@gmail.com>.
If map task(s) were retried (mapred.map.max.attempts times), how would these
two counters be affected?

Thanks

On Tue, Feb 1, 2011 at 7:31 PM, Harsh J <qw...@gmail.com> wrote:

> HDFS_BYTES_READ is a FileSystem interface counter. It directly deals
> with the FS read (lower level). Map input bytes is what the
> RecordReader has processed in number of bytes for records being read
> from the input stream.
>
> For plain text files, I believe both counters must report about the
> same value, were entire records being read with no operation performed
> on each line. But when you throw in a compressed file, you'll notice
> that the HDFS_BYTES_READ would be far lesser than Map input bytes
> since the disk read was low, but the total content stored in record
> terms was still the same as it would be for an uncompressed file.
>
> Hope this clears it.

Re: "Map input bytes" vs HDFS_BYTES_READ

Posted by Harsh J <qw...@gmail.com>.
HDFS_BYTES_READ is a FileSystem interface counter. It directly measures
the lower-level FS reads. Map input bytes is the number of bytes the
RecordReader has processed for the records read from the input stream.

For plain text files, I believe both counters should report about the
same value, provided entire records are read with no transformation
performed on each line. But when you throw in a compressed file, you'll
notice that HDFS_BYTES_READ is far lower than Map input bytes, since
the disk read was small but the total content, in record terms, is
still the same as it would be for an uncompressed file.

Hope this clears it.
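
[Editor's note: to make the comparison concrete, here is a small hedged sketch
(again the old 0.20 mapred client API, with a made-up class name and a job id
passed on the command line) that reads the two counters from a finished job and
prints their ratio. For the numbers posted at the top of this thread,
965922136488 / 203446204073 is roughly 4.7, consistent with compressed input
expanding as it is turned into records.]

// Sketch only: read HDFS_BYTES_READ and "Map input bytes" from a finished
// job and print their ratio. A ratio well above 1 usually means the map
// input was compressed on HDFS.
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class InputByteRatio {

  // Match counters by the display names used in the job history XML.
  static long byDisplayName(Counters counters, String group, String name) {
    for (Counters.Group g : counters) {
      if (group.equals(g.getDisplayName())) {
        for (Counters.Counter c : g) {
          if (name.equals(c.getDisplayName())) {
            return c.getCounter();
          }
        }
      }
    }
    return 0L;
  }

  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName(args[0]));
    Counters counters = job.getCounters();

    long hdfsRead = byDisplayName(counters, "FileSystemCounters", "HDFS_BYTES_READ");
    long mapInput = byDisplayName(counters, "Map-Reduce Framework", "Map input bytes");

    System.out.println("HDFS_BYTES_READ = " + hdfsRead);
    System.out.println("Map input bytes = " + mapInput);
    if (hdfsRead > 0) {
      // e.g. 965922136488 / 203446204073 ~ 4.7 for the job in this thread
      System.out.println("ratio = " + ((double) mapInput / hdfsRead));
    }
  }
}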

On Wed, Feb 2, 2011 at 8:06 AM, Ted Yu <yu...@gmail.com> wrote:
> In hadoop 0.20.2, what's the relationship between "Map input bytes" and
> HDFS_BYTES_READ ?
>
> <counter group="FileSystemCounters"
> name="HDFS_BYTES_READ">203446204073</counter>
> <counter group="FileSystemCounters"
> name="HDFS_BYTES_WRITTEN">23413127561</counter>
> <counter group="Map-Reduce Framework" name="Map input
> records">163502600</counter>
> <counter group="Map-Reduce Framework" name="Spilled Records">0</counter>
> <counter group="Map-Reduce Framework" name="Map input
> bytes">965922136488</counter>
> <counter group="Map-Reduce Framework" name="Map output
> records">296754600</counter>
>
> Thanks
>



-- 
Harsh J
www.harshj.com