You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Jan Nehring <ja...@dfki.de> on 2017/02/06 12:39:37 UTC

stream clustering in flink

Hi,

we want to cluster a stream of Tweets using Flink. Every incoming tweet 
is compared to the last 100 tweets. After this comparison, a cluster ID 
is assigned to the tweet. We try to find out the best approach how to 
solve this:

1. Using a stream window of the last tweets seems to be difficult 
because we would need to cross join this window with every incoming 
tweet. According to my research the Flink API does not support cross 
joins on stream windows.
2. We could also store the last 100 tweets in one operator with 
parallelism=1. This would work but it introduces a bottleneck.
3. We could share the last 100 tweets as a "shared state" among the 
operator that assigns the cluster. But every tweet changes the state so 
there would be a lot of synchronization effort between the operators.

Are you aware of other possible solutions? Currently solution #2 seems 
the most promising to me but I do not like the bottleneck.

Best regards Jan

Re: stream clustering in flink

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,
I think distributed stream clustering is still a somewhat open field. I'm
not aware of popular open source systems that have implementations for that
(except maybe Apache SAMOA). Maybe you will have some luck if you try to
search for "distributed stream clustering" papers.

Cheers,
Aljoscha

On Mon, 6 Feb 2017 at 13:39 Jan Nehring <ja...@dfki.de> wrote:

> Hi,
>
> we want to cluster a stream of Tweets using Flink. Every incoming tweet
> is compared to the last 100 tweets. After this comparison, a cluster ID
> is assigned to the tweet. We try to find out the best approach how to
> solve this:
>
> 1. Using a stream window of the last tweets seems to be difficult
> because we would need to cross join this window with every incoming
> tweet. According to my research the Flink API does not support cross
> joins on stream windows.
> 2. We could also store the last 100 tweets in one operator with
> parallelism=1. This would work but it introduces a bottleneck.
> 3. We could share the last 100 tweets as a "shared state" among the
> operator that assigns the cluster. But every tweet changes the state so
> there would be a lot of synchronization effort between the operators.
>
> Are you aware of other possible solutions? Currently solution #2 seems
> the most promising to me but I do not like the bottleneck.
>
> Best regards Jan
>