Posted to common-user@hadoop.apache.org by Jianxin Wang <wa...@gmail.com> on 2011/08/03 05:50:17 UTC
how to get all different values for each key
Hi,
I have many <key,value> pairs and want to get all the distinct values
for each key. What is an efficient way to do this?
For example, input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
output: <1,2/3/4> <2,1/2>
Thanks!
walter
Re: how to get all different values for each key
Posted by Harsh J <ha...@cloudera.com>.
Secondary sort is the way to go: it is easier to dedup a sorted input set.
You can also filter out duplicates in the map and combine phases, as far
as is safely possible (using sets, etc.), to speed up the process and
reduce data transfer.
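
As a rough illustration of why sorted input is easy to dedup: once the values for a key arrive in sorted order, a value is new exactly when it differs from the one before it, so no set is needed. The sketch below is plain Java with no Hadoop dependency; the class and method names (SortedDedup, dedupSorted) are hypothetical, and it assumes a secondary sort has already ordered the values for the key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop. Assumes the values for one key
// have already been sorted (for example by a secondary sort).
class SortedDedup {
    static List<Long> dedupSorted(Iterable<Long> sortedValues) {
        List<Long> unique = new ArrayList<>();
        boolean first = true;
        long previous = 0L;
        for (long v : sortedValues) {
            // In sorted input, duplicates are adjacent, so comparing with
            // the previous value is enough to detect them.
            if (first || v != previous) {
                unique.add(v);
                previous = v;
                first = false;
            }
        }
        return unique;
    }
}
```

In a real reducer the same comparison could be applied while iterating over the values, keeping only the previous value in memory instead of a set of everything seen so far.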
--
Harsh J
Re: how to get all different values for each key
Posted by Jianxin Wang <wa...@gmail.com>.
Thanks, Matthew!
How about using a secondary sort to get <key,values>, so that the values
are sorted for every key, and then traversing the sorted values to collect
all the unique ones?
I am not sure which way is more efficient. I suspect HashSet is a
complicated data structure.
Re: how to get all different values for each key
Posted by Matthew John <tm...@gmail.com>.
Hey,
I feel a HashSet is a good way to dedup. To increase overall efficiency,
you could also look into running the same reducer code as a combiner. That
would reduce the amount of data in the sort-shuffle phase.
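
One caveat worth sketching: in Hadoop a combiner's output key/value types have to match the map output types, so a combiner for this job would emit each distinct value once as a (key, value) pair rather than a packed list. The plain-Java stand-in below shows that shape for a single combiner call; the names (CombinerSketch, combine) are hypothetical and nothing here uses the Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one combiner call: one key plus the values
// that a single map task produced for it.
class CombinerSketch {
    static List<long[]> combine(long key, Iterable<Long> values) {
        // Drop duplicates seen within this map task's output.
        Set<Long> seen = new LinkedHashSet<>();
        for (long v : values) {
            seen.add(v);
        }
        // Emit (key, value) pairs again, the same shape as the map output.
        List<long[]> out = new ArrayList<>();
        for (long v : seen) {
            out.add(new long[] {key, v});
        }
        return out;
    }
}
```

Because the framework may run a combiner zero, one, or several times, and each run only sees part of the data, the reducer still has to dedup across map tasks; the combiner only cuts down what is shuffled.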
Regards,
Matthew
Re: how to get all different values for each key
Posted by Jianxin Wang <wa...@gmail.com>.
Hi Harsh,
After the map phase I can get all the values for one key, but I want to
dedup these values and keep only the unique ones. At the moment I do it
with the code below, which I think is not efficient (it uses a HashSet to
dedup).
Thanks :)
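// Imports needed at the top of the enclosing file.
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// LongsWritable appears to be the poster's own Writable wrapping a long[];
// its definition is not shown in this thread.
private static class MyReducer extends
        Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {
    // Reused across reduce() calls to avoid reallocating for every key.
    private final HashSet<Long> uids = new HashSet<Long>();
    private final LongsWritable unique_uids = new LongsWritable();

    @Override
    public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Collect the distinct values for this key.
        uids.clear();
        for (LongWritable v : values) {
            uids.add(v.get());
        }
        // Copy the distinct values into a long[] for the output writable.
        long[] l = new long[uids.size()];
        int i = 0;
        for (long uid : uids) {
            l[i++] = uid;
        }
        unique_uids.Set(l);
        context.write(key, unique_uids);
    }
}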
Re: how to get all different values for each key
Posted by Harsh J <ha...@cloudera.com>.
Use MapReduce :)
If map output: (key, value)
Then reduce input becomes: (key, [iterator of values across all maps
with (key, value)])
I believe this is very similar to the wordcount example, but minus the
summing. For a given key, you get all the values that carry that key
in the reducer. Have you tried to run a simple program to achieve this
before asking? Or is something specifically not working?
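
To make the grouping concrete, here is a small plain-Java simulation of what the reduce side sees for the example input from the question. It is only a sketch with hypothetical names (GroupDistinct, group) and no Hadoop dependency: it groups the pairs by key, as the shuffle would, and keeps each value once per key.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical simulation of "group by key, keep distinct values".
class GroupDistinct {
    static Map<Long, SortedSet<Long>> group(long[][] pairs) {
        Map<Long, SortedSet<Long>> grouped = new TreeMap<>();
        for (long[] p : pairs) {
            // p[0] is the key, p[1] the value; the set drops repeated values.
            grouped.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        }
        return grouped;
    }
}
```

Calling group on <1,2> <1,3> <1,4> <1,3> <2,1> <2,2> yields key 1 with values 2, 3, 4 and key 2 with values 1, 2, which matches the output asked for in the question.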
--
Harsh J