Posted to common-user@hadoop.apache.org by Alberto Cordioli <co...@gmail.com> on 2013/03/28 09:47:52 UTC

Find reducer for a key

Hi everyone,

How can I know the keys that are associated with a particular reducer in
the setup method?
Let's assume that in the setup method I read from a file where each line
is a string that will become a key emitted from mappers.
For each of these lines I would like to know whether the string will be a
key associated with the current reducer or not.

I read something about mapred.task.partition and mapred.task.id, but I
didn't understand how to use them.


Thanks,
Alberto


--
Alberto Cordioli

Re: Find reducer for a key

Posted by Alberto Cordioli <co...@gmail.com>.
You understood the scenario correctly.
I see your rationale, and thanks for your suggestions.
To better explain the problem and my point of view, let me give an example.

I want to read two files. In the first one the rows have the following
form:
Airport_Id, User_Id, Time
and indicate user positions in airports at specific times. This file
is very large.

The second file contains tuples of this form:
Flight_Id,Airport_From,Airport_To,Time
and summarizes the flight timetable with the respective airports.

Now, I want a job that takes the first file as input and computes all
the possible flights a user may have taken.

My solution, according to what I wrote in the previous mails, would be
to emit tuples from the first file partitioned by Airport_Id.
Then, we know that all the tuples with the same Airport_Id go to the same
reducer, and we can perform an in-memory load of only the part of the second
file related to the airports whose keys this reducer is receiving.
I think this is much faster than performing an MR join, right?


Thanks,
Alberto



On 29 March 2013 04:47, Hemanth Yamijala <yh...@thoughtworks.com> wrote:
> Hi,
>
> The way I understand your requirement - you have a file that contains a set
> of keys. You want to read this file on every reducer and take only those
> entries of the set, whose keys correspond to the current reducer.
>
> If the above summary is correct, can I assume that you are potentially
> reading the entire intermediate output key space on every reducer? Would
> that even work (considering memory constraints, etc.)?
>
> It seemed to me that your solution is implementing what the framework can
> already do for you. That was the rationale behind my suggestion. Maybe you
> should try and implement both approaches to see which one works better for
> you.
>
> Thanks
> hemanth
>
>
> On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli
> <co...@gmail.com> wrote:
>>
>> Yes, that is a possible solution.
>> But since the MR job has another scope, the mappers already read other
>> files (very large) and output tuples.
>> You cannot control the number of mappers, and hence the risk is that a
>> lot of mappers will be created, each of them also reading the other
>> file, instead of a small number of reducers doing so.
>>
>> Do you think that the solution I proposed is not so elegant or efficient?
>>
>> Alberto
>>
>> On 28 March 2013 13:12, Hemanth Yamijala <yh...@thoughtworks.com>
>> wrote:
>> > Hmm. That feels like a join. Can't you read the input file on the map
>> > side and output those keys along with the original map output keys? That
>> > way the reducer would automatically get both together.
>> >
>> >
>> > On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
>> > <co...@gmail.com> wrote:
>> >>
>> >> Hi Hemanth,
>> >>
>> >> thanks for your reply.
>> >> Yes, this partially answered my question. I know how the hash
>> >> partitioner works and I guessed something similar.
>> >> The piece that I missed was that mapred.task.partition returns the
>> >> partition number of the reducer.
>> >> So, putting all the pieces together, I understand that for each key in
>> >> the file I have to call the HashPartitioner.
>> >> Then I have to compare the returned index with the one retrieved by
>> >> Configuration.getInt("mapred.task.partition").
>> >> If they are equal, then that key will be served by this reducer. Is this
>> >> correct?
>> >>
>> >>
>> >> To answer your question:
>> >> On the reduce side of an MR job, I want to load some data from a file
>> >> into an in-memory structure. Actually, I don't need to store the whole
>> >> file for each reducer, but only the lines that are related to the keys
>> >> a particular reducer will receive.
>> >> So, my intention is to know the keys in the setup method in order to
>> >> store only the needed lines.
>> >>
>> >> Thanks,
>> >> Alberto
>> >>
>> >>
>> >> On 28 March 2013 11:01, Hemanth Yamijala <yh...@thoughtworks.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> >> > Not sure if I am answering your question, but this is the background.
>> >> >> > Every MapReduce job has a partitioner associated with it. The default
>> >> >> > partitioner is a HashPartitioner. You can, as a user, write your own
>> >> >> > partitioner as well and plug it into the job. The partitioner is
>> >> >> > responsible for splitting the map output key space among the reducers.
>> >> >> >
>> >> >> > So, to know which reducer a key will go to, it is basically the value
>> >> >> > returned by the partitioner's getPartition method. For example, this is
>> >> >> > the code in the HashPartitioner:
>> >> >
>> >> >   public int getPartition(K2 key, V2 value,
>> >> >                           int numReduceTasks) {
>> >> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>> >> >   }
>> >> >
>> >> >> > mapred.task.partition is the key that defines the partition number of
>> >> >> > this reducer.
>> >> >> >
>> >> >> > I guess you can piece together these bits into what you'd want.
>> >> >> > However, I am interested in understanding why you want to know this.
>> >> >> > Can you share some info?
>> >> >
>> >> > Thanks
>> >> > Hemanth
>> >> >
>> >> >
>> >> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
>> >> > <co...@gmail.com> wrote:
>> >> >>
>> >> >> Hi everyone,
>> >> >>
>> >> >> >> How can I know the keys that are associated with a particular
>> >> >> >> reducer in the setup method?
>> >> >> >> Let's assume that in the setup method I read from a file where each
>> >> >> >> line is a string that will become a key emitted from mappers.
>> >> >> >> For each of these lines I would like to know whether the string will
>> >> >> >> be a key associated with the current reducer or not.
>> >> >> >>
>> >> >> >> I read something about mapred.task.partition and mapred.task.id, but
>> >> >> >> I didn't understand how to use them.
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >> Alberto
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Alberto Cordioli
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Alberto Cordioli
>> >
>> >
>>
>>
>>
>> --
>> Alberto Cordioli
>
>



-- 
Alberto Cordioli
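
The check discussed in the quoted messages above (apply the partitioner's
formula to each candidate key, then compare the result with this reducer's
partition number) could be sketched roughly as follows. This is only an
illustration in plain Java, not code from the thread: the class
PartitionFilter and its methods are hypothetical names, and String.hashCode()
stands in for the hash of the real map output key type. In an actual job the
hash must come from the same key class the mappers emit (for example, Hadoop's
Text does not hash like java.lang.String), and the job must really be using
the default HashPartitioner. The partition number would presumably be read in
setup() with something like
context.getConfiguration().getInt("mapred.task.partition", -1), since
Configuration.getInt takes a default value as its second argument, and the
number of reducers from context.getNumReduceTasks().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Hadoop.
final class PartitionFilter {

    // Same arithmetic as the HashPartitioner snippet quoted in the thread,
    // applied to an already computed hash code. Masking with
    // Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Keeps only the candidate keys that would be routed to the given partition.
    static List<String> keysForReducer(List<String> candidateKeys,
                                       int myPartition,
                                       int numReduceTasks) {
        List<String> mine = new ArrayList<>();
        for (String key : candidateKeys) {
            // Assumption: String.hashCode() stands in for the hash of the
            // real map output key class.
            if (partitionFor(key.hashCode(), numReduceTasks) == myPartition) {
                mine.add(key);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> airports = Arrays.asList("JFK", "LHR", "MXP", "FCO");
        int numReduceTasks = 3;
        for (int p = 0; p < numReduceTasks; p++) {
            System.out.println("reducer " + p + " -> "
                    + keysForReducer(airports, p, numReduceTasks));
        }
    }
}
```

Every key lands in exactly one partition, so each reducer would keep a
disjoint slice of the file, but each reducer still has to scan the whole file
once to find its slice.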

Re: Find reducer for a key

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

The way I understand your requirement - you have a file that contains a set
of keys. You want to read this file on every reducer and take only those
entries of the set, whose keys correspond to the current reducer.

If the above summary is correct, can I assume that you are potentially
reading the entire intermediate output key space on every reducer? Would
that even work (considering memory constraints, etc.)?

It seemed to me that your solution is implementing what the framework can
already do for you. That was the rationale behind my suggestion. Maybe you
should try and implement both approaches to see which one works better for
you.

Thanks
hemanth
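
For comparison, the alternative of letting the framework bring both data sets
together (a reduce-side join keyed on the airport id) might look roughly like
the plain-Java simulation below. It is a sketch under stated assumptions, not
code from this thread: the class TaggedJoinSketch, the "U:" and "F:" tags, and
the sample records are all hypothetical, the shuffle is imitated with a sorted
map, and the comparison on Time is left out. In a real job the two inputs
would be tagged by separate mappers (Hadoop's MultipleInputs is one way to
wire that up), and the reducer would buffer one side per key rather than the
whole file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a reduce-side join; not Hadoop API code.
final class TaggedJoinSketch {

    // Imitates the shuffle: groups tagged values by join key (airport id).
    // Each element of mapOutput is a {key, taggedValue} pair.
    static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Imitates one reduce call: pairs every user sighting ("U:" tag) with
    // every flight ("F:" tag) that shares the same airport key.
    static List<String> reduce(String airport, List<String> values) {
        List<String> users = new ArrayList<>();
        List<String> flights = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("U:")) {
                users.add(v.substring(2));
            } else if (v.startsWith("F:")) {
                flights.add(v.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        for (String u : users) {
            for (String f : flights) {
                out.add(airport + "," + u + "," + f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
                new String[] {"MXP", "U:alice"},
                new String[] {"MXP", "F:AZ100"},
                new String[] {"FCO", "U:bob"});
        for (Map.Entry<String, List<String>> e : shuffle(mapOutput).entrySet()) {
            System.out.println(e.getKey() + " -> " + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

With this layout no reducer needs to know in advance which keys it will
receive, at the cost of sending the flight records through the shuffle.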



Re: Find reducer for a key

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

The way I understand your requirement - you have a file that contains a set
of keys. You want to read this file on every reducer and take only those
entries of the set, whose keys correspond to the current reducer.

If the above summary is correct, can I assume that you are potentially
reading the entire intermediate output key space on every reducer. Would
that even work (considering memory constraints, etc).

It seemed to me that your solution is implementing what the framework can
already do for you. That was the rationale behind my suggestion. Maybe you
should try and implement both approaches to see which one works better for
you.

Thanks
hemanth


On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli <
cordioli.alberto@gmail.com> wrote:

> Yes, that is a possible solution.
> But since the MR job has another scope, the mappers already read other
> files (very large) and output tuples.
> You cannot control the number of mappers and hence the risk is that a
> lot of mappers will be created, and each of them read also the other
> file instead of a small number of reducers.
>
> Do you think that the solution I proposed is not so elegant or efficient?
>
> Alberto
>
> On 28 March 2013 13:12, Hemanth Yamijala <yh...@thoughtworks.com>
> wrote:
> > Hmm. That feels like a join. Can't you read the input file on the map
> side
> > and output those keys along with the original map output keys.. That way
> the
> > reducer would automatically get both together ?
> >
> >
> > On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
> > <co...@gmail.com> wrote:
> >>
> >> Hi Hemanth,
> >>
> >> thanks for your reply.
> >> Yes, this partially answered my question. I know how the hash
> >> partitioner works and I guessed something similar.
> >> The piece that I missed was that mapred.task.partition returns the
> >> partition number of the reducer.
> >> So, putting all the pieces together, I understand that for each key in
> >> the file I have to call the HashPartitioner.
> >> Then I have to compare the returned index with the one retrieved by
> >> Configuration.getInt("mapred.task.partition").
> >> If they are equal, that key will be served by this reducer. Is this
> >> correct?
> >>
> >>
> >> To answer your question:
> >> On the reduce side of an MR job, I want to load some data from a file
> >> into an in-memory structure. Actually, I don't need to store the whole
> >> file on each reducer, only the lines related to the keys that
> >> particular reducer will receive.
> >> So, my intention is to know the keys in the setup method and store only
> >> the needed lines.
> >>
> >> Thanks,
> >> Alberto
> >>
> >>
> >> On 28 March 2013 11:01, Hemanth Yamijala <yh...@thoughtworks.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > Not sure if I am answering your question, but this is the background.
> >> > Every MapReduce job has a partitioner associated with it. The default
> >> > partitioner is a HashPartitioner. As a user, you can also write your own
> >> > partitioner and plug it into the job. The partitioner is responsible for
> >> > splitting the map output key space among the reducers.
> >> >
> >> > So, the reducer a key will go to is basically the value returned by the
> >> > partitioner's getPartition method. For example, this is the code in the
> >> > HashPartitioner:
> >> >
> >> >   public int getPartition(K2 key, V2 value,
> >> >                           int numReduceTasks) {
> >> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
> >> >   }
> >> >
> >> > mapred.task.partition is the configuration property that holds the
> >> > partition number of this reducer.
> >> >
> >> > I guess you can piece together these bits into what you'd want.
> >> > However, I am interested in understanding why you want to know this.
> >> > Can you share some info?
> >> >
> >> > Thanks
> >> > Hemanth
> >> >
> >> >
> >> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
> >> > <co...@gmail.com> wrote:
> >> >>
> >> >> Hi everyone,
> >> >>
> >> >> How can I know the keys that are associated with a particular reducer
> >> >> in the setup method?
> >> >> Let's assume the setup method reads from a file where each line
> >> >> is a string that will become a key emitted by the mappers.
> >> >> For each of these lines I would like to know whether the string will be
> >> >> a key associated with the current reducer or not.
> >> >>
> >> >> I read something about mapred.task.partition and mapred.task.id, but I
> >> >> didn't understand the usage.
> >> >>
> >> >>
> >> >> Thanks,
> >> >> Alberto
> >> >>
> >> >>
> >> >> --
> >> >> Alberto Cordioli
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Alberto Cordioli
> >
> >
>
>
>
> --
> Alberto Cordioli
>

Re: Find reducer for a key

Posted by Alberto Cordioli <co...@gmail.com>.
Yes, that is a possible solution.
But the MR job has a different purpose: the mappers already read other
(very large) files and output tuples.
You cannot control the number of mappers, so the risk is that a lot of
mappers will be created and each of them will also read the other file,
instead of only a small number of reducers reading it.

Do you think the solution I proposed is inelegant or inefficient?

Alberto

On 28 March 2013 13:12, Hemanth Yamijala <yh...@thoughtworks.com> wrote:
> Hmm. That feels like a join. Can't you read the input file on the map side
> and output those keys along with the original map output keys? That way the
> reducer would automatically get both together.
>
>
> On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
> <co...@gmail.com> wrote:
>>
>> Hi Hemanth,
>>
>> thanks for your reply.
>> Yes, this partially answered my question. I know how the hash
>> partitioner works and I guessed something similar.
>> The piece that I missed was that mapred.task.partition returns the
>> partition number of the reducer.
>> So, putting all the pieces together, I understand that for each key in
>> the file I have to call the HashPartitioner.
>> Then I have to compare the returned index with the one retrieved by
>> Configuration.getInt("mapred.task.partition").
>> If they are equal, that key will be served by this reducer. Is this
>> correct?
>>
>>
>> To answer your question:
>> On the reduce side of an MR job, I want to load some data from a file
>> into an in-memory structure. Actually, I don't need to store the whole
>> file on each reducer, only the lines related to the keys that
>> particular reducer will receive.
>> So, my intention is to know the keys in the setup method and store only
>> the needed lines.
>>
>> Thanks,
>> Alberto
>>
>>
>> On 28 March 2013 11:01, Hemanth Yamijala <yh...@thoughtworks.com>
>> wrote:
>> > Hi,
>> >
>> > Not sure if I am answering your question, but this is the background.
>> > Every MapReduce job has a partitioner associated with it. The default
>> > partitioner is a HashPartitioner. As a user, you can also write your own
>> > partitioner and plug it into the job. The partitioner is responsible for
>> > splitting the map output key space among the reducers.
>> >
>> > So, the reducer a key will go to is basically the value returned by the
>> > partitioner's getPartition method. For example, this is the code in the
>> > HashPartitioner:
>> >
>> >   public int getPartition(K2 key, V2 value,
>> >                           int numReduceTasks) {
>> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>> >   }
>> >
>> > mapred.task.partition is the configuration property that holds the
>> > partition number of this reducer.
>> >
>> > I guess you can piece together these bits into what you'd want.
>> > However, I am interested in understanding why you want to know this.
>> > Can you share some info?
>> >
>> > Thanks
>> > Hemanth
>> >
>> >
>> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
>> > <co...@gmail.com> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> How can I know the keys that are associated with a particular reducer
>> >> in the setup method?
>> >> Let's assume the setup method reads from a file where each line
>> >> is a string that will become a key emitted by the mappers.
>> >> For each of these lines I would like to know whether the string will be
>> >> a key associated with the current reducer or not.
>> >>
>> >> I read something about mapred.task.partition and mapred.task.id, but I
>> >> didn't understand the usage.
>> >>
>> >>
>> >> Thanks,
>> >> Alberto
>> >>
>> >>
>> >> --
>> >> Alberto Cordioli
>> >
>> >
>>
>>
>>
>> --
>> Alberto Cordioli
>
>



-- 
Alberto Cordioli

Re: Find reducer for a key

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hmm. That feels like a join. Can't you read the input file on the map side
and output those keys along with the original map output keys? That way
the reducer would automatically get both together.
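
A rough sketch of the join suggested above, in plain Java with no Hadoop
dependency, might look like the following. Everything here is an
assumption made for illustration: TaggedJoinSketch, the "L:" and "S:"
tags and the sample records are hypothetical, and the TreeMap merely
stands in for the shuffle that groups values by key. In an actual job the
two inputs would probably be wired to separate mappers (MultipleInputs is
one way to do that), which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of a reduce-side join.
class TaggedJoinSketch {

    // "Map" step: emit a value under its key, tagged with its source.
    static void emit(Map<String, List<String>> shuffle,
                     String key, String taggedValue) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(taggedValue);
    }

    // "Reduce" step: pair every large-input value with every side-file value.
    static List<String> reduce(String key, List<String> values) {
        List<String> large = new ArrayList<>();
        List<String> side = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("L:")) {
                large.add(v.substring(2));
            } else if (v.startsWith("S:")) {
                side.add(v.substring(2));
            }
        }
        List<String> joined = new ArrayList<>();
        for (String l : large) {
            for (String s : side) {
                joined.add(key + "\t" + l + "\t" + s);
            }
        }
        return joined;
    }

    // Each record is a {key, value} pair.
    static List<String> join(String[][] largeInput, String[][] sideFile) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String[] rec : largeInput) {
            emit(shuffle, rec[0], "L:" + rec[1]);
        }
        for (String[] rec : sideFile) {
            emit(shuffle, rec[0], "S:" + rec[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] large = {{"k1", "a"}, {"k2", "b"}, {"k1", "c"}};
        String[][] side = {{"k1", "x"}, {"k3", "y"}};
        for (String line : join(large, side)) {
            System.out.println(line);
        }
    }
}
```

Keys present in only one of the two inputs produce no output in this
sketch, and the reduce step buffers all values for a key in memory, which
may or may not be acceptable depending on the data.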


On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli <
cordioli.alberto@gmail.com> wrote:

> Hi Hemanth,
>
> thanks for your reply.
> Yes, this partially answered my question. I know how the hash
> partitioner works and I guessed something similar.
> The piece that I missed was that mapred.task.partition returns the
> partition number of the reducer.
> So, putting all the pieces together, I understand that for each key in
> the file I have to call the HashPartitioner.
> Then I have to compare the returned index with the one retrieved by
> Configuration.getInt("mapred.task.partition").
> If they are equal, that key will be served by this reducer. Is this
> correct?
>
>
> To answer your question:
> On the reduce side of an MR job, I want to load some data from a file
> into an in-memory structure. Actually, I don't need to store the whole
> file on each reducer, only the lines related to the keys that
> particular reducer will receive.
> So, my intention is to know the keys in the setup method and store only
> the needed lines.
>
> Thanks,
> Alberto
>
>
> On 28 March 2013 11:01, Hemanth Yamijala <yh...@thoughtworks.com>
> wrote:
> > Hi,
> >
> > Not sure if I am answering your question, but this is the background.
> > Every MapReduce job has a partitioner associated with it. The default
> > partitioner is a HashPartitioner. As a user, you can also write your own
> > partitioner and plug it into the job. The partitioner is responsible for
> > splitting the map output key space among the reducers.
> >
> > So, the reducer a key will go to is basically the value returned by the
> > partitioner's getPartition method. For example, this is the code in the
> > HashPartitioner:
> >
> >   public int getPartition(K2 key, V2 value,
> >                           int numReduceTasks) {
> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
> >   }
> >
> > mapred.task.partition is the configuration property that holds the
> > partition number of this reducer.
> >
> > I guess you can piece together these bits into what you'd want.
> > However, I am interested in understanding why you want to know this.
> > Can you share some info?
> >
> > Thanks
> > Hemanth
> >
> >
> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
> > <co...@gmail.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> How can I know the keys that are associated with a particular reducer
> >> in the setup method?
> >> Let's assume the setup method reads from a file where each line
> >> is a string that will become a key emitted by the mappers.
> >> For each of these lines I would like to know whether the string will be
> >> a key associated with the current reducer or not.
> >>
> >> I read something about mapred.task.partition and mapred.task.id, but I
> >> didn't understand the usage.
> >>
> >>
> >> Thanks,
> >> Alberto
> >>
> >>
> >> --
> >> Alberto Cordioli
> >
> >
>
>
>
> --
> Alberto Cordioli
>

Re: Find reducer for a key

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hmm. That feels like a join. Can't you read the input file on the map side
and output those keys along with the original map output keys.. That way
the reducer would automatically get both together ?


On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli <
cordioli.alberto@gmail.com> wrote:

> Hi Hemanth,
>
> thanks for your reply.
> Yes, this partially answers my question. I know how the hash
> partitioner works and I guessed something similar.
> The piece that I missed was that mapred.task.partition holds the
> partition number of the reducer.
> So, putting all the pieces together, I understand that for each key in
> the file I have to call the HashPartitioner.
> Then I have to compare the returned index with the one retrieved by
> Configuration.getInt("mapred.task.partition", -1).
> If they are equal, that key will be served by this reducer. Is this
> correct?
>
>
> To answer your question:
> On the reduce side of an MR job, I want to load some data from a file
> into an in-memory structure. I don't need to store the whole file
> in each reducer, only the lines related to the keys that a
> particular reducer will receive.
> So my intention is to determine those keys in the setup method and
> store only the needed lines.
>
> Thanks,
> Alberto
>
>
> On 28 March 2013 11:01, Hemanth Yamijala <yh...@thoughtworks.com>
> wrote:
> > Hi,
> >
> > Not sure if I am answering your question, but this is the background.
> > Every MapReduce job has a partitioner associated with it. The default
> > partitioner is HashPartitioner. As a user, you can also write your own
> > partitioner and plug it into the job. The partitioner is responsible for
> > splitting the map output key space among the reducers.
> >
> > So the reducer a key will go to is simply the value returned by the
> > partitioner's getPartition method. For example, this is the code
> > in HashPartitioner:
> >
> >   public int getPartition(K2 key, V2 value,
> >                           int numReduceTasks) {
> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
> >   }
> >
> > mapred.task.partition is the key that defines the partition number of
> > this reducer.
> >
> > I guess you can piece together these bits into what you'd want. However,
> > I am interested in understanding why you want to know this. Can you share
> > some info?
> >
> > Thanks
> > Hemanth
> >
> >
> > On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
> > <co...@gmail.com> wrote:
> >>
> >> Hi everyone,
> >>
> >> how can i know the keys that are associated to a particular reducer in
> >> the setup method?
> >> Let's assume in the setup method to read from a file where each line
> >> is a string that will become a key emitted from mappers.
> >> For each of these lines I would like to know if the string will be a
> >> key associated with the current reducer or not.
> >>
> >> I read something about mapred.task.partition and mapred.task.id, but I
> >> didn't understand the usage.
> >>
> >>
> >> Thanks,
> >> Alberto
> >>
> >>
> >> --
> >> Alberto Cordioli
> >
> >
>
>
>
> --
> Alberto Cordioli
>

Re: Find reducer for a key

Posted by Alberto Cordioli <co...@gmail.com>.
Hi Hemanth,

thanks for your reply.
Yes, this partially answers my question. I know how the hash
partitioner works and I guessed something similar.
The piece that I missed was that mapred.task.partition holds the
partition number of the reducer.
So, putting all the pieces together, I understand that for each key in
the file I have to call the HashPartitioner.
Then I have to compare the returned index with the one retrieved by
Configuration.getInt("mapred.task.partition", -1).
If they are equal, that key will be served by this reducer. Is this correct?
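
A minimal sketch of that check, written as plain Java so it runs without
Hadoop. ReducerKeyFilter and its method names are hypothetical, and the
partition number and reducer count are hard-coded stand-ins for the values a
real reducer would read from its job configuration in setup (the
mapred.task.partition property and the configured number of reduce tasks; as
far as I know, newer Hadoop releases call the property
mapreduce.task.partition). One assumption matters a lot: the keys are hashed
here as java.lang.String, but the check is only valid if it hashes the same key
type the mappers emit. To my knowledge org.apache.hadoop.io.Text hashes its
bytes rather than delegating to String.hashCode(), so with Text keys the
candidates should be wrapped in Text before hashing. It also assumes the job
uses the default HashPartitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReducerKeyFilter {
    /** Same arithmetic as the HashPartitioner code quoted in this thread. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Keeps only the candidate keys whose partition equals this reducer's. */
    static List<String> keysForReducer(List<String> candidates,
                                       int myPartition,
                                       int numReduceTasks) {
        List<String> mine = new ArrayList<>();
        for (String key : candidates) {
            if (partitionFor(key, numReduceTasks) == myPartition) {
                mine.add(key);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        // Stand-ins for values taken from the job configuration in setup().
        int myPartition = 1;
        int numReduceTasks = 4;
        // Stand-in for the lines read from the side file.
        List<String> candidates = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(keysForReducer(candidates, myPartition, numReduceTasks));
    }
}
```

In setup the reducer would run the same filter over each line of the file and
load only the lines whose key passes.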


To answer your question:
On the reduce side of an MR job, I want to load some data from a file
into an in-memory structure. I don't need to store the whole file
in each reducer, only the lines related to the keys that a
particular reducer will receive.
So my intention is to determine those keys in the setup method and
store only the needed lines.

Thanks,
Alberto


On 28 March 2013 11:01, Hemanth Yamijala <yh...@thoughtworks.com> wrote:
> Hi,
>
> Not sure if I am answering your question, but this is the background. Every
> MapReduce job has a partitioner associated with it. The default partitioner
> is HashPartitioner. As a user, you can also write your own partitioner
> and plug it into the job. The partitioner is responsible for splitting the
> map output key space among the reducers.
>
> So the reducer a key will go to is simply the value returned by the
> partitioner's getPartition method. For example, this is the code
> in HashPartitioner:
>
>   public int getPartition(K2 key, V2 value,
>                           int numReduceTasks) {
>     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>   }
>
> mapred.task.partition is the configuration property that holds the partition
> number of this reducer.
>
> I guess you can piece these bits together into what you want. However, I
> am interested in understanding why you want to know this. Can you share
> some info?
>
> Thanks
> Hemanth
>
>
> On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli
> <co...@gmail.com> wrote:
>>
>> Hi everyone,
>>
>> how can i know the keys that are associated to a particular reducer in
>> the setup method?
>> Let's assume in the setup method to read from a file where each line
>> is a string that will become a key emitted from mappers.
>> For each of these lines I would like to know if the string will be a
>> key associated with the current reducer or not.
>>
>> I read something about mapred.task.partition and mapred.task.id, but I
>> didn't understand the usage.
>>
>>
>> Thanks,
>> Alberto
>>
>>
>> --
>> Alberto Cordioli
>
>



-- 
Alberto Cordioli

Re: Find reducer for a key

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Hi,

Not sure if I am answering your question, but this is the background. Every
MapReduce job has a partitioner associated with it. The default partitioner
is HashPartitioner. As a user, you can also write your own partitioner
and plug it into the job. The partitioner is responsible for splitting the
map output key space among the reducers.

So the reducer a key will go to is simply the value returned by the
partitioner's getPartition method. For example, this is the code
in HashPartitioner:

  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

mapred.task.partition is the configuration property that holds the partition
number of this reducer.
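
To make the arithmetic in getPartition concrete, here is a small standalone
sketch (plain Java, no Hadoop needed; the class PartitionMath and the method
names masked and naive are made up for illustration). It contrasts the masked
expression from the code above with a plain modulo. Masking with
Integer.MAX_VALUE clears the sign bit, so the result is always a valid index
in the range 0 to numReduceTasks - 1, even when hashCode() is negative.

```java
final class PartitionMath {
    /** The arithmetic from the getPartition method quoted above. */
    static int masked(int hash, int numReduceTasks) {
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Naive variant, shown only for contrast: it can return a negative index. */
    static int naive(int hash, int numReduceTasks) {
        return hash % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // A positive hash, a negative hash, and the most negative int.
        for (int hash : new int[] {97, -7, Integer.MIN_VALUE}) {
            System.out.println(hash + " -> masked=" + masked(hash, reducers)
                    + ", naive=" + naive(hash, reducers));
        }
    }
}
```

A reducer whose mapred.task.partition value is p receives exactly the keys for
which the masked expression evaluates to p.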

I guess you can piece these bits together into what you want. However, I
am interested in understanding why you want to know this. Can you share
some info?

Thanks
Hemanth


On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli <
cordioli.alberto@gmail.com> wrote:

> Hi everyone,
>
> how can i know the keys that are associated to a particular reducer in
> the setup method?
> Let's assume in the setup method to read from a file where each line
> is a string that will become a key emitted from mappers.
> For each of these lines I would like to know if the string will be a
> key associated with the current reducer or not.
>
> I read something about mapred.task.partition and mapred.task.id, but I
> didn't understand the usage.
>
>
> Thanks,
> Alberto
>
>
> --
> Alberto Cordioli
>
