Posted to common-user@hadoop.apache.org by Jianxin Wang <wa...@gmail.com> on 2011/08/03 05:50:17 UTC

how to get all different values for each key

Hi,
    I have many <key,value> pairs and want to get all the distinct values
for each key. Which way is most efficient for this?

   For example, input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
   output: <1,2/3/4> <2,1/2>

   Thanks!

walter

Re: how to get all different values for each key

Posted by Harsh J <ha...@cloudera.com>.
Secondary sort is the way to go; it is easier to dedup a sorted input set.
You can also filter duplicates in the map and combine phases, as far as is
safely possible (using sets, etc.), to speed up the process and reduce
data transfer.
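
A minimal sketch of why sorted input makes dedup easy, written in plain Java rather than against the Hadoop API. It assumes the values for one key already arrive in ascending order, as they would after a secondary sort; the class and method names are hypothetical. Only the previous value has to be remembered, so no set is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedDedup {
    // Returns each distinct value once. Assumes sortedValues is already in
    // ascending order, so equal values are always adjacent.
    static List<Long> uniqueFromSorted(Iterable<Long> sortedValues) {
        List<Long> unique = new ArrayList<>();
        Long previous = null;
        for (Long v : sortedValues) {
            // A value is new exactly when it differs from the one before it.
            if (previous == null || !v.equals(previous)) {
                unique.add(v);
                previous = v;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Values for key 1 from the example in this thread, already sorted.
        System.out.println(uniqueFromSorted(Arrays.asList(2L, 3L, 3L, 4L)));
        // prints [2, 3, 4]
    }
}
```

In a real reducer the same loop could write each new value as it is seen instead of collecting a list, which keeps memory use constant per key.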

On Wed, Aug 3, 2011 at 4:07 PM, Jianxin Wang <wa...@gmail.com> wrote:
> Thanks, Matthew!
>
>     How about using secondary sort to get <key,values> with the values sorted
> for every key, then traversing the sorted values to get all the unique values?
>
>     I am not sure which way is more efficient. I suspect HashSet is a
> complicated data structure.

-- 
Harsh J

Re: how to get all different values for each key

Posted by Jianxin Wang <wa...@gmail.com>.
Thanks, Matthew!

    How about using secondary sort to get <key,values> with the values sorted
for every key, then traversing the sorted values to get all the unique values?

    I am not sure which way is more efficient. I suspect HashSet is a
complicated data structure.

2011/8/3 Matthew John <tm...@gmail.com>

> Hey,
>
> I feel HashSet is a good method to dedup. To increase overall efficiency,
> you could also look into running the same Reducer code as a Combiner. That
> would mean less data in the sort-shuffle phase.
>
> Regards,
> Matthew

Re: how to get all different values for each key

Posted by Matthew John <tm...@gmail.com>.
Hey,

I feel HashSet is a good method to dedup. To increase overall efficiency,
you could also look into running the same Reducer code as a Combiner. That
would mean less data in the sort-shuffle phase.
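
A rough sketch of the combiner idea, as a plain Java simulation and not Hadoop code. One caveat: in Hadoop a combiner has to emit the same key and value types as the map output, so the reducer posted in this thread, which writes a LongsWritable, could probably not be reused unchanged. The sketch therefore keeps (key, value) pairs as its output type. The class and method names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LocalDedup {
    // Drops repeated (key, value) pairs from one map task's output, keeping
    // first-seen order. The result has the same pair type as the input.
    static List<Map.Entry<Long, Long>> combine(List<Map.Entry<Long, Long>> mapOutput) {
        Set<Map.Entry<Long, Long>> seen = new LinkedHashSet<>(mapOutput);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Output of a single (simulated) map task, with one duplicate pair.
        List<Map.Entry<Long, Long>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>(1L, 2L));
        mapOutput.add(new SimpleEntry<>(1L, 3L));
        mapOutput.add(new SimpleEntry<>(1L, 3L));
        mapOutput.add(new SimpleEntry<>(2L, 1L));
        System.out.println(combine(mapOutput));
        // prints [1=2, 1=3, 2=1]
    }
}
```

This only removes duplicates within one map task's output. The same pair can still arrive from different map tasks, so the reducer has to dedup again.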

Regards,
Matthew

On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang <wa...@gmail.com> wrote:

> Hi Harsh,
>     After the map phase I can get all the values for one key, but I want to
> dedup these values and keep only the unique ones. For now I do it as shown
> in the image.
>
>     I think the following code (using a HashSet to dedup) is not efficient.
> Thanks :)
>
> private static class MyReducer extends
>         Reducer<LongWritable, LongWritable, LongWritable, LongsWritable>
> {
>     HashSet<Long> uids = new HashSet<Long>();
>     LongsWritable unique_uids = new LongsWritable();
>
>     public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
>             throws IOException, InterruptedException
>     {
>         uids.clear();
>         for (LongWritable v : values)
>         {
>             uids.add(v.get());
>         }
>         int size = uids.size();
>         long[] l = new long[size];
>         int i = 0;
>         for (long uid : uids)
>         {
>             l[i] = uid;
>             i++;
>         }
>         unique_uids.Set(l);
>         context.write(key, unique_uids);
>     }
> }

Re: how to get all different values for each key

Posted by Jianxin Wang <wa...@gmail.com>.
Hi Harsh,
    After the map phase I can get all the values for one key, but I want to
dedup these values and keep only the unique ones. For now I do it as shown
in the image.

    I think the following code (using a HashSet to dedup) is not efficient.
Thanks :)

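import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

private static class MyReducer extends
        Reducer<LongWritable, LongWritable, LongWritable, LongsWritable>
{
    // Both objects are reused across reduce() calls instead of reallocated.
    HashSet<Long> uids = new HashSet<Long>();
    LongsWritable unique_uids = new LongsWritable();

    public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException
    {
        // Collect the distinct values for this key.
        uids.clear();
        for (LongWritable v : values)
        {
            uids.add(v.get());
        }
        // Copy the distinct values into a primitive array for output.
        int size = uids.size();
        long[] l = new long[size];
        int i = 0;
        for (long uid : uids)
        {
            l[i] = uid;
            i++;
        }
        unique_uids.Set(l);
        context.write(key, unique_uids);
    }
}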


2011/8/3 Harsh J <ha...@cloudera.com>

> Use MapReduce :)
>
> If map output: (key, value)
> Then reduce input becomes: (key, [iterator of values across all maps
> with (key, value)])
>
> I believe this is very similar to the wordcount example, but minus the
> summing. For a given key, you get all the values that carry that key
> in the reducer. Have you tried to run a simple program to achieve this
> before asking? Or is something specifically not working?

Re: how to get all different values for each key

Posted by Harsh J <ha...@cloudera.com>.
Use MapReduce :)

If map output: (key, value)
Then reduce input becomes: (key, [iterator of values across all maps
with (key, value)])

I believe this is very similar to the wordcount example, but minus the
summing. For a given key, you get all the values that carry that key
in the reducer. Have you tried to run a simple program to achieve this
before asking? Or is something specifically not working?
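
To make the (key, value) to (key, [values]) step concrete, here is a small plain Java simulation, not Hadoop code. A hypothetical groupByKey method stands in for the shuffle and a hypothetical reduce method stands in for the reducer, using the input from the original question. A TreeSet is one possible way to drop duplicates; it is an assumption of this sketch, not something the thread prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class GroupByKeyDemo {
    // Stand-in for the shuffle: collects every value under its key.
    static Map<Long, List<Long>> groupByKey(long[][] pairs) {
        Map<Long, List<Long>> grouped = new TreeMap<>();
        for (long[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return grouped;
    }

    // Stand-in for reduce(): keeps the distinct values, joined with '/'.
    static String reduce(long key, Iterable<Long> values) {
        TreeSet<Long> unique = new TreeSet<>();
        for (Long v : values) {
            unique.add(v);
        }
        String joined = unique.stream()
                .map(Object::toString)
                .collect(Collectors.joining("/"));
        return "<" + key + "," + joined + ">";
    }

    public static void main(String[] args) {
        long[][] pairs = {{1, 2}, {1, 3}, {1, 4}, {1, 3}, {2, 1}, {2, 2}};
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Long, List<Long>> e : groupByKey(pairs).entrySet()) {
            out.append(reduce(e.getKey(), e.getValue())).append(' ');
        }
        System.out.println(out.toString().trim());
        // prints <1,2/3/4> <2,1/2>
    }
}
```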

On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang <wa...@gmail.com> wrote:
> Hi,
>    I have many <key,value> pairs and want to get all the distinct values
> for each key. Which way is most efficient for this?
>
>   For example, input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
>   output: <1,2/3/4> <2,1/2>
>
>   Thanks!
>
> walter
>



-- 
Harsh J