Posted to mapreduce-user@hadoop.apache.org by Virajith Jalaparti <vi...@gmail.com> on 2011/06/29 11:29:18 UTC

Intermediate data size of Sort example

Hi,

I was running the Sort example in Hadoop 0.20.2 (hadoop-0.20.2-examples.jar)
over an input data size of 100GB (generated using randomwriter) with
800 mappers (I was using a 128MB HDFS block size) and 4 reducers on a
3-machine cluster with 2 slave nodes. While the input and output were 100GB, I
found that the intermediate data sent to each reducer was around 78GB,
making the total intermediate data around 310GB. I don't really understand
why there is an increase in data size, given that the Sort example just uses
the identity mapper and the identity reducer.
Could someone please help me out with this?

Thanks!!
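
(Aside, for context: the Sort example here is essentially a pass-through job over SequenceFiles; the mapper and reducer emit records unchanged, and the actual ordering is done by the framework's shuffle and merge. Below is a minimal sketch of an equivalent job in the 0.20-era "mapred" API, assuming randomwriter-style BytesWritable input and taking input/output paths as arguments; it is illustrative only, not the actual Sort.java source.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SortSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SortSketch.class);
    conf.setJobName("sort-sketch");

    // Identity map and reduce: records pass through unchanged; the ordering
    // comes entirely from the framework's sort/merge between the two phases.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // randomwriter produces SequenceFiles of BytesWritable key/value pairs.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(BytesWritable.class);
    conf.setOutputValueClass(BytesWritable.class);

    conf.setNumReduceTasks(4); // 4 reducers, as in the setup described above

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // randomwriter output
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // sorted output

    JobClient.runJob(conf);
  }
}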

Re: Intermediate data size of Sort example

Posted by Virajith Jalaparti <vi...@gmail.com>.
Great, that makes a lot of sense now! Thanks a lot Harsh!

A related question: what does REDUCE_SHUFFLE_BYTES represent? Is it the size
of the sorted output of the shuffle phase?

Thanks,
Virajith

On Wed, Jun 29, 2011 at 2:10 PM, Harsh J <ha...@cloudera.com> wrote:

> Virajith,
>
> FILE_BYTES_READ also counts all the reads of spilled records done while
> sorting/merging the various outputs between the map and reduce phases.
>
> On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti
> <vi...@gmail.com> wrote:
> > I would like to clarify my earlier question: I found that each reducer
> > reports FILE_BYTES_READ as around 78GB and HDFS_BYTES_WRITTEN as 25GB and
> > REDUCE_SHUFFLE_BYTES as 25GB. So, why is the FILE_BYTES_READ 78GB and not
> > just 25GB?
> >
> > Thanks,
> > Virajith
> >
> > On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti <virajith.j@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I was running the Sort example in Hadoop 0.20.2
> >> (hadoop-0.20.2-examples.jar) over an input data size of 100GB (generated
> >> using randomwriter) with 800 mappers (I was using a 128MB HDFS block size)
> >> and 4 reducers on a 3-machine cluster with 2 slave nodes. While the input
> >> and output were 100GB, I found that the intermediate data sent to each
> >> reducer was around 78GB, making the total intermediate data around 310GB. I
> >> don't really understand why there is an increase in data size, given that
> >> the Sort example just uses the identity mapper and the identity reducer.
> >> Could someone please help me out with this?
> >>
> >> Thanks!!
> >
> >
>
>
>
> --
> Harsh J
>
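
(On reading these counters programmatically: the per-job totals for FILE_BYTES_READ, HDFS_BYTES_WRITTEN, and REDUCE_SHUFFLE_BYTES can be pulled from the completed job, while the per-reducer figures quoted above come from the individual task pages. A rough sketch against the 0.20-era API follows; the group and counter name strings are what 0.20 shows in its job UI as best I recall, so verify them against your own job's counter listing.)

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterDump {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CounterDump.class);
    // ... configure the sort job here, as in the sketch earlier in the thread ...

    RunningJob job = JobClient.runJob(conf); // blocks until the job completes
    Counters counters = job.getCounters();   // job-level, aggregated over tasks

    // Group/counter names as reported by 0.20 (assumed here, not verified).
    long fileRead = counters.getGroup("FileSystemCounters")
                            .getCounter("FILE_BYTES_READ");
    long hdfsWritten = counters.getGroup("FileSystemCounters")
                               .getCounter("HDFS_BYTES_WRITTEN");
    long shuffled = counters.getGroup("org.apache.hadoop.mapred.Task$Counter")
                            .getCounter("REDUCE_SHUFFLE_BYTES");

    System.out.println("FILE_BYTES_READ      = " + fileRead);
    System.out.println("HDFS_BYTES_WRITTEN   = " + hdfsWritten);
    System.out.println("REDUCE_SHUFFLE_BYTES = " + shuffled);
  }
}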

Re: Intermediate data size of Sort example

Posted by Harsh J <ha...@cloudera.com>.
Virajith,

FILE_BYTES_READ also counts all the reads of spilled records done while
sorting/merging the various outputs between the map and reduce phases.

On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti
<vi...@gmail.com> wrote:
> I would like to clarify my earlier question: I found that each reducer
> reports FILE_BYTES_READ as around 78GB and HDFS_BYTES_WRITTEN as 25GB and
> REDUCE_SHUFFLE_BYTES as 25GB. So, why is the FILE_BYTES_READ 78GB and not
> just 25GB?
>
> Thanks,
> Virajith
>
> On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti <vi...@gmail.com>
> wrote:
>>
>> Hi,
>>
> >> I was running the Sort example in Hadoop 0.20.2
> >> (hadoop-0.20.2-examples.jar) over an input data size of 100GB (generated
> >> using randomwriter) with 800 mappers (I was using a 128MB HDFS block size)
> >> and 4 reducers on a 3-machine cluster with 2 slave nodes. While the input
> >> and output were 100GB, I found that the intermediate data sent to each
> >> reducer was around 78GB, making the total intermediate data around 310GB. I
> >> don't really understand why there is an increase in data size, given that
> >> the Sort example just uses the identity mapper and the identity reducer.
>> Could someone please help me out with this?
>>
>> Thanks!!
>
>



-- 
Harsh J
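
(To connect this to the numbers in the thread: each reducer shuffles about 25GB but reports roughly 78GB of FILE_BYTES_READ, i.e. about three times its shuffled input. A plausible reading, and it is only an assumption here, is that the shuffled map outputs are spilled to the reducer's local disk and then re-read during intermediate on-disk merge passes and once more by the final merge that feeds the reduce function, so the same bytes are read several times from local files. How many passes happen depends on the merge settings; below is a sketch of the 0.20-era knobs, with purely illustrative values.)

import org.apache.hadoop.mapred.JobConf;

public class MergeTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Number of on-disk segments merged in a single pass (used on both the
    // map and reduce side). A larger value means fewer intermediate merge
    // passes, so less spilled data is re-read and FILE_BYTES_READ drops.
    conf.setInt("io.sort.factor", 100); // 0.20 default is 10

    // Fraction of the reduce task's heap used to buffer map outputs during
    // the shuffle before spilling to disk (property name as I recall it for
    // 0.20; verify against mapred-default.xml).
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);

    System.out.println("io.sort.factor = " + conf.getInt("io.sort.factor", 10));
  }
}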

Re: Intermediate data size of Sort example

Posted by Virajith Jalaparti <vi...@gmail.com>.
I would like to clarify my earlier question: I found that each reducer
reports FILE_BYTES_READ as around 78GB and HDFS_BYTES_WRITTEN as 25GB and
REDUCE_SHUFFLE_BYTES as 25GB. So, why is the FILE_BYTES_READ 78GB and not
just 25GB?

Thanks,
Virajith

On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti
<vi...@gmail.com> wrote:

> Hi,
>
> I was running the Sort example in Hadoop 0.20.2
> (hadoop-0.20.2-examples.jar) over an input data size of 100GB (generated
> using randomwriter) with 800 mappers (I was using a 128MB HDFS block size)
> and 4 reducers on a 3-machine cluster with 2 slave nodes. While the input
> and output were 100GB, I found that the intermediate data sent to each
> reducer was around 78GB, making the total intermediate data around 310GB. I
> don't really understand why there is an increase in data size, given that
> the Sort example just uses the identity mapper and the identity reducer.
> Could someone please help me out with this?
>
> Thanks!!
>