You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Simone Robutti <si...@radicalbit.io> on 2016/05/25 08:22:18 UTC

Merging sets with common elements

Hello,

I'm implementing MinHash for reccomendation on Flink. I'm almost done but I
need an efficient way to merge sets of similar keys together (and later
join these sets of keys with more data).

The actual data structure is of the form DataSet[(Int,Set[Int])] where the
left element of the tuple is an ID for the right element, that is a set of
keys. I want to merge these sets together only if they share at least one
element.

I'm rather sure to have studied the efficient solution to this problem in a
local environment but I don't really know how to treat it in a distributed
environment. Any suggestion?

Thanks,

Simone

Re: Merging sets with common elements

Posted by Simone Robutti <si...@radicalbit.io>.

@Till:

A more meaningful example would be the following: from
{{1{1,2}},{2,{2,3}},{3,{4,5},{4{1,27}}}} the result should be
{1,2,3,27},{4,5} because the set #1,#2 and #4 have at least one element in
common.

If you see this as a graph where the elements of the sets are nodes and a
set express a full connection of all the elements of said set , you can see
the result as the connected components of the graph.


2016-05-25 11:42 GMT+02:00 Till Rohrmann <tr...@apache.org>:

> Hi Simone,
>
> could you elaborate a little bit on the actual operation you want to
> perform. Given a data set {(1, {1,2}), (2, {2,3})} what's the result of
> your operation? Is the result { ({1,2}, {1,2,3}) } because the 2 is
> contained in both sets?
>
> Cheers,
> Till
>
> On Wed, May 25, 2016 at 10:22 AM, Simone Robutti <
> simone.robutti@radicalbit.io> wrote:
>
>> Hello,
>>
>> I'm implementing MinHash for reccomendation on Flink. I'm almost done but
>> I need an efficient way to merge sets of similar keys together (and later
>> join these sets of keys with more data).
>>
>> The actual data structure is of the form DataSet[(Int,Set[Int])] where
>> the left element of the tuple is an ID for the right element, that is a set
>> of keys. I want to merge these sets together only if they share at least
>> one element.
>>
>> I'm rather sure to have studied the efficient solution to this problem in
>> a local environment but I don't really know how to treat it in a
>> distributed environment. Any suggestion?
>>
>> Thanks,
>>
>> Simone
>>
>
>

Re: Merging sets with common elements

Posted by Till Rohrmann <tr...@apache.org>.

Hi Simone,

could you elaborate a little bit on the actual operation you want to
perform. Given a data set {(1, {1,2}), (2, {2,3})} what's the result of
your operation? Is the result { ({1,2}, {1,2,3}) } because the 2 is
contained in both sets?

Cheers,
Till

On Wed, May 25, 2016 at 10:22 AM, Simone Robutti <
simone.robutti@radicalbit.io> wrote:

> Hello,
>
> I'm implementing MinHash for reccomendation on Flink. I'm almost done but
> I need an efficient way to merge sets of similar keys together (and later
> join these sets of keys with more data).
>
> The actual data structure is of the form DataSet[(Int,Set[Int])] where the
> left element of the tuple is an ID for the right element, that is a set of
> keys. I want to merge these sets together only if they share at least one
> element.
>
> I'm rather sure to have studied the efficient solution to this problem in
> a local environment but I don't really know how to treat it in a
> distributed environment. Any suggestion?
>
> Thanks,
>
> Simone
>

Re: Merging sets with common elements

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,
if I understood it correctly the "key" in that case would be a
fuzzy/probabilistic key. I'm not sure this can be computed using either the
sort-based or hash-based joinging/grouping strategies of Flink. Maybe we
can find something if you elaborate.

Cheers,
Aljoscha

On Wed, 25 May 2016 at 10:22 Simone Robutti <si...@radicalbit.io>
wrote:

> Hello,
>
> I'm implementing MinHash for reccomendation on Flink. I'm almost done but
> I need an efficient way to merge sets of similar keys together (and later
> join these sets of keys with more data).
>
> The actual data structure is of the form DataSet[(Int,Set[Int])] where the
> left element of the tuple is an ID for the right element, that is a set of
> keys. I want to merge these sets together only if they share at least one
> element.
>
> I'm rather sure to have studied the efficient solution to this problem in
> a local environment but I don't really know how to treat it in a
> distributed environment. Any suggestion?
>
> Thanks,
>
> Simone
>