Posted to mapreduce-user@hadoop.apache.org by Grandl Robert <rg...@yahoo.com> on 2012/07/08 03:37:00 UTC

Basic question on how reducer works

Hi,

I have some questions related to basic functionality in Hadoop. 

1. When a Mapper processes the intermediate output data, how does it know how many partitions to create (i.e., how many reducers there will be) and how much data should go into each partition for each reducer?

2. When the JobTracker assigns a task to a reducer, it will also specify the locations of the intermediate output data it should retrieve, right? But how does a reducer know which portion it has to retrieve from each remote location holding intermediate output?

Could somebody help me with these questions and point me to the Java code that does this? I am running Hadoop 1.0.3.


Thanks,
Robert

Re: Basic question on how reducer works

Posted by Subir S <su...@gmail.com>.
Just for the reference of others who might see this thread: the JIRA
corresponding to the reduce input limit parameter is MAPREDUCE-2324.


Re: Basic question on how reducer works

Posted by Harsh J <ha...@cloudera.com>.
Subir,

On Sat, Jul 14, 2012 at 5:30 PM, Subir S <su...@gmail.com> wrote:
> Harsh, thanks. I think this is what I was looking for. I have 3 related
> questions.
>
> 1.) Will this work in 0.20.2-cdh3u3?

Yes, will work. (Btw, best to ask CDH-specific questions on the
cdh-user@cloudera.org lists)

> 2.) What is the hard limit that you mean?

If a reducer gets more data than this value, due to the map's outputs
growing large (for any partition), the job will begin to fail.

> 3.) Can this be applied to streaming?

Yes, streaming is still MR and this property is for MR (applied during
scheduling, so not streaming/java specific).

-- 
Harsh J

Re: Basic question on how reducer works

Posted by Subir S <su...@gmail.com>.
Harsh, thanks. I think this is what I was looking for. I have 3 related
questions.

1.) Will this work in 0.20.2-cdh3u3?

2.) What is the hard limit that you mean?

3.) Can this be applied to streaming?

Thanks, Subir


Re: Basic question on how reducer works

Posted by Harsh J <ha...@cloudera.com>.
If you wish to impose a limit on the max reducer input to be allowed
in a job, you may set "mapreduce.reduce.input.limit" on your job, as
total bytes allowed per reducer.

But this is more of a hard limit, which I suspect your question wasn't
about. Your question is indeed better suited to Pig's user lists.
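
For anyone reading the archive, a minimal sketch of how this property
could be set on a 1.x-style JobConf; the 10 GB value is an arbitrary
example, not a recommendation:

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceInputLimitSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Cap the estimated map output allowed per reducer at ~10 GB.
        // A negative value (the usual default) disables the check.
        conf.setLong("mapreduce.reduce.input.limit",
            10L * 1024 * 1024 * 1024);
      }
    }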


-- 
Harsh J

Re: Basic question on how reducer works

Posted by Subir S <su...@gmail.com>.
Probably a wrong question in a wrong thread and wrong mailing list :)


Re: Basic question on how reducer works

Posted by Subir S <su...@gmail.com>.
Is there any property to convey the maximum amount of data each
reducer/partition may take for processing? Something like Pig's
bytes_per_reducer, so that the number of reducers can be controlled based
on the size of the intermediate map output?


Re: Basic question on how reducer works

Posted by Karthik Kambatla <ka...@cloudera.com>.
The partitioner is configurable. The default partitioner, from what I
remember, computes the partition as the key's hash code modulo the number
of reducers/partitions. For random input it is balanced, but some cases
can have a very skewed key distribution. Also, as you have pointed out,
the number of values per key can also vary. Together, both of these
determine the "weight" of each partition, as you call it.

Karthik


Re: Basic question on how reducer works

Posted by Grandl Robert <rg...@yahoo.com>.
Thanks Arun.

So, just for my clarification: the map will create partitions according to the number of reducers, such that each reducer gets roughly the same number of keys in its partition. However, each key can have a different number of values, so the "weight" of each partition will depend on that. Also, when a new <key, value> pair is added, is a hash computed on the key to find the corresponding partition?

Robert




Re: Basic question on how reducer works

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:

> Thanks a lot, guys, for the answers.
>
> Still I am not able to find the exact code for the following things:
>
> 1. The reducer reading only its partition from a map's output. I looked
> into ReduceTask#getMapOutput, which does the actual read in
> ReduceTask#shuffleInMemory, but I don't see where it specifies which
> partition (reduce ID) to read.

Look at TaskTracker.MapOutputServlet.

> 2. I still don't understand very well in which part of the code
> (MapTask.java) the intermediate data is written to which partition. So
> MapOutputBuffer is the one that actually writes the data to the buffer
> and spills it after the buffer is full. Could you please elaborate a bit
> on how the data is written to which partition?

Essentially you can think of the partition-id as the 'primary key' and the actual 'key' in the map-output of <key, value> as the 'secondary key'.

hth,
Arun
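
To make the 'primary/secondary key' idea concrete: within a spill, records
are ordered by partition first and by key within the partition. The record
type below is hypothetical (MapOutputBuffer stores this far more
compactly), but the composite ordering is the same idea:

    import java.util.Comparator;

    // Hypothetical record: a map output entry tagged with its partition.
    class TaggedRecord {
      final int partition;  // "primary key": which reducer gets this record
      final String key;     // "secondary key": the actual map output key
      TaggedRecord(int partition, String key) {
        this.partition = partition;
        this.key = key;
      }
    }

    class SpillOrder implements Comparator<TaggedRecord> {
      @Override
      public int compare(TaggedRecord a, TaggedRecord b) {
        // Partition first, so each reducer's data is contiguous on disk...
        if (a.partition != b.partition) {
          return Integer.compare(a.partition, b.partition);
        }
        // ...then key order within the partition.
        return a.key.compareTo(b.key);
      }
    }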


--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: Basic question on how reducer works

Posted by Grandl Robert <rg...@yahoo.com>.
Thanks a lot, guys, for the answers.


Still I am not able to find the exact code for the following things:

1. The reducer reading only its partition from a map's output. I looked into ReduceTask#getMapOutput, which does the actual read in ReduceTask#shuffleInMemory, but I don't see where it specifies which partition (reduce ID) to read.

2. I still don't understand very well in which part of the code (MapTask.java) the intermediate data is written to which partition. So MapOutputBuffer is the one that actually writes the data to the buffer and spills it after the buffer is full. Could you please elaborate a bit on how the data is written to which partition?

Thanks,
Robert




Re: Basic question on how reducer works

Posted by Karthik Kambatla <ka...@cloudera.com>.
Hi Manoj,

As Harsh said, we would almost always want multiple reducers. Since each
reduce is potentially executed on a different core (on the same machine or
a different one), in most cases we would want at least as many reduces as
there are cores, for maximum parallelism/performance.

Karthik


Re: Basic question on how reducer works

Posted by Manoj Babu <ma...@gmail.com>.
Hi Harsh,

Thanks for clarifying. I was earlier under the impression that the
Partitioner picks the reducer.

My cluster setup provides options for multiple reducers, so I want to know
when, and in which scenarios, we should go for multiple reducers.

Cheers!
Manoj.




Re: Basic question on how reducer works

Posted by Harsh J <ha...@cloudera.com>.
Manoj,

Think of it this way, and you shouldn't be confused: a reducer == a partition.

For (1) - Partitioners do not 'call' a reducer; they just write the data
with a proper partition ID. The reducer whose ID is the same as the
partition ID picks it up for itself later. This we have already explained
earlier.

For (2) - For what scenario would you _not_ want multiple reducers
handling each partition uniquely, when it is possible to scale that way?




-- 
Harsh J

Re: Basic question on how reducer works

Posted by Manoj Babu <ma...@gmail.com>.
Hi,

It would be helpful if you could give more details on the doubts below.

1. How does the partitioner know which reducer needs to be called?
2. When we use more than one reducer, the output gets split up. For what
scenarios should we go for multiple reducers?

Cheers!
Manoj.




Re: Basic question on how reducer works

Posted by Arun C Murthy <ac...@hortonworks.com>.
Robert,

On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:

> Hi,
>
> I have some questions related to basic functionality in Hadoop.
>
> 1. When a Mapper processes the intermediate output data, how does it know
> how many partitions to create (i.e., how many reducers there will be) and
> how much data should go into each partition for each reducer?
>
> 2. When the JobTracker assigns a task to a reducer, it will also specify
> the locations of the intermediate output data it should retrieve, right?
> But how does a reducer know which portion it has to retrieve from each
> remote location holding intermediate output?

To add to Harsh's comment: essentially the TT *knows* where the output of a given map-id/reduce-id pair is present via an output-file/index-file combination.

Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
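
A hedged sketch of the output-file/index-file pairing Arun mentions: each
map writes one data file plus an index file holding a fixed-size record
per partition, so a partition's bytes can be located with a single seek.
The three-longs-per-partition layout below is my reading of the 1.x spill
index, so treat it as illustrative rather than authoritative:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Sketch: read the index entry for one partition from a map's index
    // file. Each entry is three longs: where the partition's data starts
    // in the output file, its raw (uncompressed) length, and its on-disk
    // (possibly compressed) length.
    public class IndexRecordSketch {
      static final int RECORD_SIZE = 3 * 8; // three 8-byte longs

      public static void main(String[] args) throws IOException {
        String indexFile = args[0];                 // e.g. file.out.index
        int partition = Integer.parseInt(args[1]);  // reducer's partition ID
        try (DataInputStream in =
            new DataInputStream(new FileInputStream(indexFile))) {
          in.skipBytes(partition * RECORD_SIZE); // seek to this entry
          long startOffset = in.readLong();
          long rawLength = in.readLong();
          long partLength = in.readLong();
          System.out.printf("offset=%d raw=%d part=%d%n",
              startOffset, rawLength, partLength);
        }
      }
    }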



Re: Basic question on how reducer works

Posted by Pavan Kulkarni <pa...@gmail.com>.
Oh. Thanks a lot, Harsh.




-- 

--With Regards
Pavan Kulkarni

Re: Basic question on how reducer works

Posted by Harsh J <ha...@cloudera.com>.
Pavan,

This is covered in the MR tutorial doc:
http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#Task+Logs




-- 
Harsh J

Re: Basic question on how reducer works

Posted by Pavan Kulkarni <pa...@gmail.com>.
I too had similar problems.
I guess we should also set the debug mode for
that specific class in the log4j.properties file .Isn't it?

And I didn't quite get what you mean by task's userlogs?
where are these logs located ? In the logs directory I only see
logs for all the daemons.Thanks
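
For what it's worth, a minimal sketch of the kind of log4j.properties
lines being suggested here, assuming the stock log4j setup that ships with
Hadoop 1.x (the class names are just examples):

    # Raise the log level for the task classes being instrumented.
    log4j.logger.org.apache.hadoop.mapred.MapTask=DEBUG
    log4j.logger.org.apache.hadoop.mapred.ReduceTask=DEBUG

Whether this alone surfaces the messages is exactly what Harsh answers
below: task output lands in the per-task userlogs, not the daemon logs.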




-- 

--With Regards
Pavan Kulkarni

Re: Basic question on how reducer works

Posted by Grandl Robert <rg...@yahoo.com>.
I see. I was looking into the tasktracker log :).

Thanks a lot,
Robert




Re: Basic question on how reducer works

Posted by Harsh J <ha...@cloudera.com>.
The changes should appear in your Task's userlogs (not the TaskTracker
logs). Have you deployed your changed code properly (i.e. do you
generate a new tarball, or perhaps use the MRMiniCluster to do this)?

On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <rg...@yahoo.com> wrote:
> Hi Harsh,
>
> Your comments were extremely helpful.
>
> Still I am wondering why, if I add LOG.info entries into MapTask.java or
> ReduceTask.java in most of the functions (including Old/NewOutputCollector),
> the logs are not shown. This way it's hard for me to track which
> functions are called and which are not. Even more so in ReduceTask.java.
>
> Do you have any ideas?
>
> Thanks a lot for your answer,
> Robert



-- 
Harsh J

Re: Basic question on how reducer works

Posted by Harsh J <ha...@cloudera.com>.
Hi Robert,

Inline. (Answer is specific to Hadoop 1.x since you asked for that
alone, but certain things may vary for Hadoop 2.x).

On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <rg...@yahoo.com> wrote:
> Hi,
>
> I have some questions related to basic functionality in Hadoop.
>
> 1. When a Mapper processes the intermediate output data, how does it know
> how many partitions to create (i.e., how many reducers there will be) and
> how much data should go into each partition for each reducer?

The number of reducers is non-dynamic and is user-specified, and is
set in the job configuration. Hence the Partitioner knows about the
value it needs to use for its numPartitions (== numReduces for the
job).

For this one in 1.x code, look at MapTask.java, in the constructors of
internal classes OldOutputCollector (Stable API) and
NewOutputCollector (New API).
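
As a hypothetical illustration of both halves (not code from the thread):
the user fixes the reducer count before submission, and that same number
later reaches the Partitioner as numPartitions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NumReducersSketch {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sketch"); // 1.x-style ctor
        // The reducer count is user-specified and static for the job:
        job.setNumReduceTasks(4);
        // At map time the framework hands this same value to the
        // Partitioner, e.g.:
        //   int p = partitioner.getPartition(key, value, 4); // p in [0, 4)
      }
    }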

The data estimated to be going into a partition, for limit/scheduling
checks, is currently a naive computation, done by summing the estimated
output sizes of each map. See
ResourceEstimator#getEstimatedReduceInputSize for the overall estimation
across maps, and Task#calculateOutputSize for the per-map estimation code.

> 2. When the JobTracker assigns a task to a reducer, it will also specify
> the locations of the intermediate output data it should retrieve, right?
> But how does a reducer know which portion it has to retrieve from each
> remote location holding intermediate output?

The JT does not send the location information when a reduce is scheduled.
When the reducers begin their shuffle phase, they query the TaskTracker to
get the map completion events, via the TaskTracker#getMapCompletionEvents
protocol call. The TaskTracker itself calls
JobTracker#getTaskCompletionEvents underneath to get this info. The
returned structure carries the host that completed the map successfully,
which the Reduce's copier relies on to fetch the data from the right
host's TT.

The reduce merely asks each TT for the data assigned to it from the
specific completed maps. Note that a reduce task's ID is also its
partition ID, so it merely has to ask for the data for its own task ID,
and the TT serves, over HTTP, the right parts of the intermediate data
to it.
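
A reconstructed example of what that HTTP fetch looks like in 1.x; the
servlet parameters (job, map, reduce) and the default TT HTTP port are
from memory, so treat them as illustrative:

    GET http://tt-host:50060/mapOutput?job=job_201207080001_0001
        &map=attempt_201207080001_0001_m_000007_0&reduce=3

Here reduce=3 is the requesting reducer's partition ID; the TT consults
that map's index file to seek to partition 3's segment of the output file
and streams only those bytes back.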

Feel free to ping back if you need some more clarification! :)

-- 
Harsh J