Posted to user@flink.apache.org by iñaki williams <ju...@gmail.com> on 2016/06/06 07:31:21 UTC

Custom keyBy(), look for similarities

Hi guys,

I am using Flink in my project and I have a question. (I am using Java.)

Is it possible to modify the keyBy method so that it keys by similarity
rather than by the exact name?

Example: I receive two DataStreams. In the first one, the value of the field
that I want to key by is "John Locke", while in the second DataStream the
field value is "John L". Can I use some Java library to measure the
similarity between strings and, if the similarity is high, key those
elements together?

Re: Custom keyBy(), look for similarities

Posted by Chesnay Schepler <ch...@apache.org>.
The idea behind key selectors is to extract a property on which you can
do equality comparisons.

Let's get one question out of the way first:
Is your scoring algorithm transitive? That is, if A==B and B==C, is it a
given that A==C? Because if not, there's
just no way to group (= partition) the data, since B would belong to 2
distinct groups.
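
To make the transitivity question concrete, here is a minimal sketch. The thread does not say which scoring algorithm is used, so this assumes a normalized Levenshtein similarity and the 0.80 threshold mentioned elsewhere in the thread; the class and method names are made up for illustration.

```java
// Sketch only: normalized Levenshtein similarity is an ASSUMPTION, the
// thread does not name the actual scoring algorithm.
final class TransitivityCheck {
    static final double THRESHOLD = 0.80;

    // Classic dynamic-programming edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means the strings are identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // "Same" means strictly above the threshold.
    static boolean similar(String a, String b) {
        return similarity(a, b) > THRESHOLD;
    }

    public static void main(String[] args) {
        String a = "Jon Locke";
        String b = "Jon Licke";
        String c = "Jan Licke";
        System.out.println("A~B: " + similar(a, b));
        System.out.println("B~C: " + similar(b, c));
        System.out.println("A~C: " + similar(a, c));
    }
}
```

With these three strings, A and B differ by one edit and so do B and C, but A and C differ by two edits out of nine characters, which falls below the threshold. B would therefore have to share a key with both A and C even though A and C should not share one.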

Even if it did work, one thing you have to realize is that this wouldn't
scale at all. For every element that
comes in you would have to compare it to every group you have
created so far.

What I would propose is the following: create a key selector that allows
a /rough/ grouping of your data,
something like "John L" => "J L". On that group (which is hopefully
relatively small) you can then run your
algorithm on all possible pairs to do whatever you want to do.
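
A minimal sketch of this two-stage idea in plain Java, without any Flink dependency. The rough key here is assumed to be the initials of each whitespace-separated token, which matches the "John L" => "J L" example; the class and method names are hypothetical. In a Flink job, the body of roughKey is what would go into the getKey method of a KeySelector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: names and the "initials" rule are illustrative assumptions.
final class RoughKeyGrouping {

    // Stage 1: deterministic rough key, e.g. "John Locke" and "John L" both give "J L".
    static String roughKey(String name) {
        StringBuilder key = new StringBuilder();
        for (String token : name.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(Character.toUpperCase(token.charAt(0)));
        }
        return key.toString();
    }

    // Groups names by rough key; this stands in for what keyBy would do.
    static Map<String, List<String>> group(List<String> names) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            groups.computeIfAbsent(roughKey(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    // Stage 2: all unordered pairs inside one group, to be fed to the
    // similarity algorithm of your choice.
    static List<String[]> candidatePairs(List<String> bucket) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < bucket.size(); i++) {
            for (int j = i + 1; j < bucket.size(); j++) {
                pairs.add(new String[] {bucket.get(i), bucket.get(j)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("John Locke");
        names.add("John L");
        names.add("Kate Austen");
        for (Map.Entry<String, List<String>> e : group(names).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue()
                    + ", pairs to score: " + candidatePairs(e.getValue()).size());
        }
    }
}
```

The rough key trades recall for scalability: names whose initials differ (for example a reordered "Locke, John") never meet in the same group, so the key has to be chosen loosely enough for the data at hand.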

On 07.06.2016 10:48, iñaki williams wrote:
> Thanks for your answer, Ufuk.
>
> However, I have been reading about KeySelector and I don't completely
> understand how it works with my idea.
>
> I am using an algorithm that gives me a score between different
> strings. My idea is: if the score is higher than 0.80, for example,
> then those two strings will be considered the same, and when I apply
> keyBy("name") those similar strings will be keyed as if they had the
> exact same name.


Re: Custom keyBy(), look for similarities

Posted by iñaki williams <ju...@gmail.com>.
Thanks for your answer, Ufuk.

However, I have been reading about KeySelector and I don't completely
understand how it works with my idea.

I am using an algorithm that gives me a score between different
strings. My idea is: if the score is higher than 0.80, for example, then
those two strings will be considered the same, and when I apply
keyBy("name") those similar strings will be keyed as if they had the exact
same name.


Re: Custom keyBy(), look for similarities

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Iñaki,

You can use the KeySelector as described here:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#specifying-keys

But you only have a local view of the current element, i.e. the library
you use to determine the similarity has to know the similarities
upfront.

– Ufuk
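
One way to read "has to know the similarities upfront" is a lookup table built before the job starts, which maps every known variant to a canonical name. The sketch below is plain Java with hypothetical names and no Flink dependency; in a Flink job, keyFor would be called from the getKey method of a KeySelector, which only ever sees one element at a time.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the alias table is ASSUMED to be computed ahead of time,
// for example by an offline run of the similarity algorithm.
final class CanonicalNameKey {
    private final Map<String, String> aliasToCanonical;

    CanonicalNameKey(Map<String, String> aliasToCanonical) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.aliasToCanonical = new HashMap<>(aliasToCanonical);
    }

    // Uses only the current element plus the precomputed table.
    // Unknown names fall back to themselves as the key.
    String keyFor(String name) {
        return aliasToCanonical.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        Map<String, String> aliases = new HashMap<>();
        aliases.put("John L", "John Locke");
        CanonicalNameKey keys = new CanonicalNameKey(aliases);
        System.out.println(keys.keyFor("John L"));
        System.out.println(keys.keyFor("John Locke"));
    }
}
```

Because the table is fixed, two elements get the same key only if their relationship was known in advance; a variant that first shows up at runtime simply keys by its own value.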

