Posted to user@flink.apache.org by Magnus Vojbacke <ma...@gmail.com> on 2017/10/17 08:42:12 UTC

Split a dataset

I'm looking for something like DataStream.split(), but for DataSets. I'd like to split my streaming data so messages go to different parts of an execution graph, based on arbitrary logic.

DataStream.split() seems to be perfect, except that my source is a CSV file, and I have only found built-in functions for reading CSV files into a DataSet.

I've evaluated using DataSet.filter(), but as far as I can tell, that only allows me to emulate a yes/no split. That is too coarse for my use case; I would prefer a more fine-grained split.


Do you have any suggestions on how I can achieve my arbitrary splitting logic for a) DataSets in general, or b) CSV files?


Re: Split a dataset

Posted by Fabian Hueske <fh...@gmail.com>.
Unfortunately, it's not possible to bridge the gap between the DataSet and
DataStream APIs.

However, you can also use a CsvInputFormat in the DataStream API. Since the
DataStream API has no built-in method to configure CSV input, you would have
to create (and configure) the CsvInputFormat yourself.
Once you have the CsvInputFormat, you can create a DataStream using
StreamExecutionEnvironment.readFile(csvIF, filePath).

Hope this helps,
Fabian

2017-10-17 11:05 GMT+02:00 Magnus Vojbacke <ma...@gmail.com>:

> Thank you, Fabian! If batch semantics are not important to my use case, is
> there any way to "downgrade" or convert a DataSet to a DataStream?
>
> BR
> /Magnus

Re: Split a dataset

Posted by Magnus Vojbacke <ma...@gmail.com>.
Thank you, Fabian! If batch semantics are not important to my use case, is there any way to "downgrade" or convert a DataSet to a DataStream?

BR
/Magnus


Re: Split a dataset

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Magnus,

There is no split operator in the DataSet API.

As you said, this can be done using a FilterFunction. This also allows for
non-binary splits:

DataSet<X> setToSplit = ...
DataSet<X> firstSplit = setToSplit.filter(new SplitCondition1());
DataSet<X> secondSplit = setToSplit.filter(new SplitCondition2());
DataSet<X> thirdSplit = setToSplit.filter(new SplitCondition3());

where SplitCondition1, SplitCondition2, and SplitCondition3 are
FilterFunctions, each of which filters out all records that don't belong to
its split.

Best, Fabian
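
A minimal sketch of the filter-per-branch pattern from the reply above, written
against plain Java collections so that it runs without a Flink dependency. The
Reading type, its fields, the value thresholds, and the filter() helper are all
hypothetical. In a Flink job, each condition would instead implement
org.apache.flink.api.common.functions.FilterFunction<X> and be passed to
DataSet.filter().

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class SplitSketch {

    // Hypothetical record type standing in for the thread's X.
    static final class Reading {
        final String sensor;
        final int value;

        Reading(String sensor, int value) {
            this.sensor = sensor;
            this.value = value;
        }
    }

    // One condition per branch. With Flink, each class would implement
    // FilterFunction<Reading> and override filter(Reading) instead of test(Reading).
    static final class SplitCondition1 implements Predicate<Reading> {
        @Override
        public boolean test(Reading r) {
            return r.value < 10;
        }
    }

    static final class SplitCondition2 implements Predicate<Reading> {
        @Override
        public boolean test(Reading r) {
            return r.value >= 10 && r.value < 100;
        }
    }

    static final class SplitCondition3 implements Predicate<Reading> {
        @Override
        public boolean test(Reading r) {
            return r.value >= 100;
        }
    }

    // Stand-in for DataSet.filter(...): keeps only the records matching the condition.
    static List<Reading> filter(List<Reading> input, Predicate<Reading> condition) {
        return input.stream().filter(condition).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Reading> setToSplit = Arrays.asList(
                new Reading("a", 3),
                new Reading("b", 42),
                new Reading("c", 500),
                new Reading("d", 7));

        // Three branches derived from the same input, one filter each.
        List<Reading> firstSplit = filter(setToSplit, new SplitCondition1());
        List<Reading> secondSplit = filter(setToSplit, new SplitCondition2());
        List<Reading> thirdSplit = filter(setToSplit, new SplitCondition3());

        // prints: 2 1 1
        System.out.println(firstSplit.size() + " " + secondSplit.size() + " " + thirdSplit.size());
    }
}
```

Each filter evaluates the full input independently, so the branches form a true
partition only if the conditions are mutually exclusive and together cover
every record.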
