Posted to user@spark.apache.org by Mihran Shahinian <sl...@gmail.com> on 2015/03/26 21:47:08 UTC

Fuzzy GroupBy

I would like to group records, but instead of grouping on an exact key I want
to compute the similarity of keys myself. Is there a recommended way of doing
this?

Here is my starting point:

final JavaRDD<pojo> records = spark.parallelize(getListofPojos()).cache();

class pojo {
  String prop1;
  String prop2;
}

During groupBy, I would like to compute the similarity between the prop1
values of the pojos.

Much appreciated,
Mihran

Re: Fuzzy GroupBy

Posted by Sean Owen <so...@cloudera.com>.

The grouping is determined by the POJO's equals() method (together with a
consistent hashCode()). You can also call groupBy() to group by some function
of the POJOs. For example, if you're grouping Doubles into nearly-equal
bunches, you could group by their .intValue().
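
As a rough sketch of the "group by some function" idea, the plain-Java example below maps each pojo to a normalized key and groups on that key. The similarityKey function (lower-casing and collapsing whitespace in prop1) is a made-up stand-in for whatever similarity rule you actually need, and Collectors.groupingBy is used here only so the example runs without a cluster. In Spark, the same function could presumably be passed to records.groupBy(...).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class FuzzyGroupBy {

    // Mirrors the pojo from the question.
    static class Pojo {
        final String prop1;
        final String prop2;

        Pojo(String prop1, String prop2) {
            this.prop1 = prop1;
            this.prop2 = prop2;
        }
    }

    // Hypothetical similarity key: prop1 values that differ only in case or in
    // surrounding/repeated whitespace map to the same key.
    static String similarityKey(Pojo p) {
        return p.prop1.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Local stand-in for grouping an RDD by the same key function.
    static Map<String, List<Pojo>> groupBySimilarityKey(List<Pojo> records) {
        return records.stream()
                .collect(Collectors.groupingBy(FuzzyGroupBy::similarityKey));
    }

    public static void main(String[] args) {
        List<Pojo> records = Arrays.asList(
                new Pojo("Apache Spark", "a"),
                new Pojo("  apache   spark ", "b"),
                new Pojo("Hadoop", "c"));

        Map<String, List<Pojo>> groups = groupBySimilarityKey(records);
        // Print each key with the number of records that fell into its group.
        groups.forEach((key, members) ->
                System.out.println(key + " -> " + members.size()));
    }
}
```

Note that this only works when similar records can be mapped to the same key. Values that are close but fall on opposite sides of a bucket boundary (1.9 and 2.1 under .intValue(), for instance) still end up in different groups.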

