Posted to user@hadoop.apache.org by Joey Krabacher <jk...@gmail.com> on 2012/12/05 00:37:10 UTC

Question on Key Grouping

Is there a way to group keys a second time before sending results to the
reducer within the same job? I thought a combiner would do this for me, but
it just acts like a reducer, so what I really need is an intermediate step
that acts like another mapper instead.

To visualize this, here is how I want it to work:

Map output:

<1, [{2, "John",""},{1, "",""},{1, "", "Doe"}]>

Combiner Output:

<1, [{1, "John",""},{1, "",""},{1, "", "Doe"}]>

Reduce Output:

<1, "John","Doe">


How it currently works:

Map output:

<1, [{2, "John",""},{1, "",""},{1, "", "Doe"}]>

Combiner Output:

<1, {1, "John",""}>
<1, {1, "",""}>
<1, {1, "", "Doe"}>

Reduce Output:

<1, "John","Doe">
<1, "John","Doe">
<1, "John","Doe">


So, basically, the issue is that even though the 2 in the first map record
should really be a 1, I still need to extract the value "John" and have it
included in the output for key 1.
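
For illustration only: if the tuples above are {count, first, last}, a value
type along these lines could carry them between the mapper and reducer. This
is a hypothetical sketch, not the actual record class used in this job.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical value type for the {count, first, last} tuples in the
// example; the real job's record class may differ.
public class NameFragment implements Writable {
  private int count;
  private String first = "";
  private String last = "";

  public NameFragment() {}   // no-arg constructor required by Hadoop

  public NameFragment(int count, String first, String last) {
    this.count = count;
    this.first = first;
    this.last = last;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(count);
    out.writeUTF(first);
    out.writeUTF(last);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    count = in.readInt();
    first = in.readUTF();
    last = in.readUTF();
  }

  @Override
  public String toString() {
    return "{" + count + ", \"" + first + "\", \"" + last + "\"}";
  }
}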

Hope this makes sense.

Thanks in advance,
/* Joey */

RE: Question on Key Grouping

Posted by David Parks <da...@yahoo.com>.
The first thing to be wary of is your use of the combiner. The combiner
*might* be run, it *might not* be run, and it *might be run multiple times*.
The combiner is only for reducing the amount of data going to the reducer,
and it will only be run *if and when* Hadoop deems it likely to be useful.
Don't use it for logic.
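
For reference, here is a minimal word-count-style driver (purely illustrative,
not your job) showing the usual way a combiner is wired in. It reuses the
reducer class, because summing partial counts is associative and commutative
and so is safe to apply zero, one, or many times. This assumes the Hadoop
2.x-style Job.getInstance factory; on 1.x you would use new Job(conf, ...) instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combiner-sketch");
    job.setJarByClass(CombinerSketch.class);
    job.setMapperClass(TokenCounterMapper.class);   // emits (word, 1)
    // The combiner is only an optimization hint: the framework may run it
    // zero, one, or several times on the map side, so it must never carry
    // logic the reducer depends on; here it just pre-sums counts.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}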

 

Although I didn't quite follow your example (it's not clear what your keys
and values are), I think what you need to do is just run two map/reduce
phases here. The first map/reduce phase groups on the first set of keys you
need, reduces, and writes its output to disk (probably HDFS); a second
map/reduce phase then reads that output and does the mapping you need. Most
even modestly complex applications go through multiple map/reduce phases to
accomplish their task. If you need two map phases, the first reduce phase
might just be the identity reducer (org.apache.hadoop.mapreduce.Reducer),
which writes the results of the first map phase straight out.
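
A minimal sketch of that two-phase layout, assuming the newer
org.apache.hadoop.mapreduce API; the first-phase mapper, the tab-separated
field parsing, and the paths are placeholders, not your actual job.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPhaseDriver {

  // Placeholder first-phase mapper: treat the first tab-separated field of
  // each line as the key and the rest as the value.
  public static class FirstPhaseMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      ctx.write(new Text(parts[0]), new Text(parts.length > 1 ? parts[1] : ""));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // intermediate HDFS directory
    Path output = new Path(args[2]);

    // Phase 1: group on the first set of keys. The stock Reducer class is
    // the identity reducer: it writes every (key, value) it receives
    // straight back out.
    Job phase1 = Job.getInstance(conf, "phase-1");
    phase1.setJarByClass(TwoPhaseDriver.class);
    phase1.setMapperClass(FirstPhaseMapper.class);
    phase1.setReducerClass(Reducer.class);          // identity reducer
    phase1.setOutputKeyClass(Text.class);
    phase1.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(phase1, input);
    FileOutputFormat.setOutputPath(phase1, intermediate);
    if (!phase1.waitForCompletion(true)) System.exit(1);

    // Phase 2: read the grouped output of phase 1 and do the second round
    // of mapping/grouping. The stock Mapper/Reducer here are identity
    // placeholders; a real job plugs in its own second-phase classes.
    Job phase2 = Job.getInstance(conf, "phase-2");
    phase2.setJarByClass(TwoPhaseDriver.class);
    phase2.setInputFormatClass(KeyValueTextInputFormat.class);
    phase2.setMapperClass(Mapper.class);
    phase2.setReducerClass(Reducer.class);
    phase2.setOutputKeyClass(Text.class);
    phase2.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(phase2, intermediate);
    FileOutputFormat.setOutputPath(phase2, output);
    System.exit(phase2.waitForCompletion(true) ? 0 : 1);
  }
}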

 

David

