Posted to dev@flink.apache.org by Martin Neumann <mn...@spotify.com> on 2014/07/27 12:56:47 UTC
how to split data-sets efficiently?
Hej,
I have a dataset of StringIDs and I want to map them to Longs by using a
hash function. I will use the LongIDs in a series of iterative
computations and then map back to StringIDs.
Currently I have a map operation that creates tuples with the string and
the long. I have another mapper that strips out the Strings.
Is there a way to do an operation that allows for more than one output set
(basically split a set into 2 sets)? This would reduce the complexity of
the code a lot.
Also, how does the optimizer deal with this case? Does it chain both map
operations together and actually run them as if they were a split?
cheers Martin
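
[Editor's sketch] The hashing step described above could look roughly like the following. This is only an illustration, not the poster's code: the class name StringIdHasher and the choice of 64-bit FNV-1a are assumptions, since the thread does not say which hash function is used. Any hash from strings to 64-bit values can collide, so the (string, long) pairs still have to be kept around to map back.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Flink: maps a string ID to a 64-bit value
// using FNV-1a. The hash function is an assumption; the thread names none.
public class StringIdHasher {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    public static long hash(String id) {
        long h = FNV_OFFSET_BASIS;
        for (byte b : id.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);   // mix in the next byte (treated as unsigned)
            h *= FNV_PRIME;    // multiplication wraps modulo 2^64, as FNV expects
        }
        return h;
    }

    public static void main(String[] args) {
        // The same string always yields the same long ID.
        for (String id : new String[] {"user-a", "user-b", "user-a"}) {
            System.out.println(id + " -> " + hash(id));
        }
    }
}
```

Inside a Flink job, a function like hash(...) would be called from the map function that emits the (string, long) tuples.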
Re: how to split data-sets efficiently?
Posted by Stephan Ewen <se...@apache.org>.
Hey!
A similar issue has arisen in a different context. We should solve both
problems in the same way.
Can you participate in the discussion here:
https://issues.apache.org/jira/browse/FLINK-87
Greetings,
Stephan
On Mon, Jul 28, 2014 at 3:42 PM, Stephan Ewen <se...@apache.org> wrote:
> Hi!
>
> "Splitting", in the sense that one function returns two different data
> sets, is currently not supported.
>
> I guess you have to go with Ufuk's suggestion. In your case, I guess it
> would look somewhat like this:
>
>
> DataSet<Tuple2<Long, String>> mapped = originalStrings.map(new HashIdMapper());
>
> DataSet<Long> ids = mapped.map(new ProjectTo2());
>
> DataSet<Long> result = ids.runTheGraphAlgorithm(...);
>
> result.join(mapped).where(...).equalTo(...).with(new MapBackToStrings());
>
>
> Greetings,
> Stephan
>
Re: how to split data-sets efficiently?
Posted by Stephan Ewen <se...@apache.org>.
Hi!
"Splitting", in the sense that one function returns two different data
sets, is currently not supported.
I guess you have to go with Ufuk's suggestion. In your case, I guess it
would look somewhat like this:
DataSet<Tuple2<Long, String>> mapped = originalStrings.map(new HashIdMapper());
DataSet<Long> ids = mapped.map(new ProjectTo2());
DataSet<Long> result = ids.runTheGraphAlgorithm(...);
result.join(mapped).where(...).equalTo(...).with(new MapBackToStrings());
Greetings,
Stephan
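
[Editor's sketch] The data flow in the pseudocode above (map to pairs, project out the longs, compute on the longs, join the result back against the pairs) can be mimicked with plain Java collections. This is an analogy only and does not use the Flink API. All names are hypothetical, String.hashCode() stands in for the hash function, and the "keep even IDs" step is a placeholder for the iterative graph computation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java analogy of the map -> project -> compute -> join-back flow.
// Not Flink code; every name here is made up for illustration.
public class SplitAndJoinBack {

    /** (longId, stringId) pair, standing in for Tuple2<Long, String>. */
    public static final class IdPair {
        public final long longId;
        public final String stringId;
        public IdPair(long longId, String stringId) {
            this.longId = longId;
            this.stringId = stringId;
        }
    }

    // Stand-in for HashIdMapper: emits one pair per input string.
    public static List<IdPair> mapToPairs(List<String> stringIds) {
        List<IdPair> out = new ArrayList<>();
        for (String s : stringIds) {
            out.add(new IdPair(s.hashCode(), s)); // hashCode() chosen for brevity only
        }
        return out;
    }

    // Stand-in for the second mapper: keeps only the long field.
    public static List<Long> projectIds(List<IdPair> mapped) {
        List<Long> out = new ArrayList<>();
        for (IdPair p : mapped) {
            out.add(p.longId);
        }
        return out;
    }

    // Placeholder for the iterative computation: keeps the even IDs.
    public static List<Long> runPlaceholderAlgorithm(List<Long> ids) {
        List<Long> out = new ArrayList<>();
        for (Long id : ids) {
            if (id % 2 == 0) {
                out.add(id);
            }
        }
        return out;
    }

    // Stand-in for the join: index the pairs by long ID, then probe with the result.
    // If two strings share a long ID (a collision), both strings are returned.
    public static List<String> joinBack(List<Long> result, List<IdPair> mapped) {
        Map<Long, List<String>> byLongId = new HashMap<>();
        for (IdPair p : mapped) {
            byLongId.computeIfAbsent(p.longId, k -> new ArrayList<>()).add(p.stringId);
        }
        List<String> out = new ArrayList<>();
        for (Long id : result) {
            out.addAll(byLongId.getOrDefault(id, Collections.<String>emptyList()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        Collections.addAll(input, "a", "b", "c");
        List<IdPair> mapped = mapToPairs(input);
        List<Long> result = runPlaceholderAlgorithm(projectIds(mapped));
        System.out.println(joinBack(result, mapped));
    }
}
```

The point of the analogy is that the pair set is produced once and then consumed twice: once by the projection and once by the final join.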
Re: how to split data-sets efficiently?
Posted by Chesnay Schepler <ch...@fu-berlin.de>.
I think this is what Martin is currently doing:
StringIDs --map-> (StringIDs,LongIDs) --map-> LongIDs
and he wants to use both the second and the third set. He is asking for a way
to replace the second map operation (since it seems unnecessary to write
an extra map function just for that).
I believe the appropriate way would be to use a projection instead of a
map operation. Something like:
mapped = stringIDs.map(...)
longids = mapped.project(1).types(Long.class)
You would end up with a set of Tuple1<Long> rather than plain Longs, though.
On 27.7.2014 13:21, Ufuk Celebi wrote:
> Hey Martin,
>
> On 27 Jul 2014, at 12:56, Martin Neumann <mn...@spotify.com> wrote:
>
>> Is there a way to do an operation that allows for more than one output set
>> (basically split a set into 2 sets)? This would reduce the complexity of
>> the code a lot.
> What exactly do you mean by split?
>
> I am not sure if this is what you want, but you can just apply two transformations to the same input data set.
>
> DataSet<String> input = ...;
>
> DataSet<String> firstSet = input.map(...);
>
> DataSet<String> secondSet = input.map(...);
>
> Does this help?
Re: how to split data-sets efficiently?
Posted by Ufuk Celebi <u....@fu-berlin.de>.
Hey Martin,
On 27 Jul 2014, at 12:56, Martin Neumann <mn...@spotify.com> wrote:
> Is there a way to do an operation that allows for more than one output set
> (basically split a set into 2 sets)? This would reduce the complexity of
> the code a lot.
What exactly do you mean by split?
I am not sure if this is what you want, but you can just apply two transformations to the same input data set.
DataSet<String> input = ...;
DataSet<String> firstSet = input.map(...);
DataSet<String> secondSet = input.map(...);
Does this help?