Posted to user@flink.apache.org by la...@posteo.de on 2016/12/08 15:39:08 UTC

conditional dataset output

Hi,

let's assume I have a dataset that, depending on the input data and the 
filter operations applied to it, can be empty. I want to output the 
dataset to HD, but files should only be created if the dataset is not 
empty. If the dataset is empty, I don't want any files. The default way, 
dataset.write(...), always creates as many files as the configured 
parallelism of the operator, so for an empty dataset all of these files 
would be empty as well. I thought about doing something like:

if (dataset.count() > 0) {
    dataset.write(...)
}

but I don't think that's the way to go, because dataset.count() triggers 
an execution of the (sub)program.

Is there a simple way to avoid creating empty files for empty datasets?

Regards,

Lars

Re: conditional dataset output

Posted by la...@posteo.de.

Hi Chesnay,

I actually thought about the same, but like you said, it seems a bit 
hacky ;-). Anyway, thank you!

Regards,

Lars

On 08.12.2016 16:47, Chesnay Schepler wrote:
> Hello Lars,
> 
> The only other way I can think of to do this is to wrap the output
> format you use in a custom format that calls open() on the wrapped
> output format only when it receives the first record.
> 
> This should work, but it is quite hacky, as it interferes with the
> format life-cycle.
> 
> Regards,
> Chesnay

Re: conditional dataset output

Posted by Chesnay Schepler <ch...@apache.org>.

Hello Lars,

The only other way I can think of to do this is to wrap the output 
format you use in a custom format that calls open() on the wrapped 
output format only when it receives the first record.

This should work, but it is quite hacky, as it interferes with the 
format life-cycle.
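
A rough sketch of what such a wrapper could look like, in plain Java so 
that it runs without Flink on the classpath. SimpleOutputFormat is a 
hypothetical, simplified stand-in for the open/writeRecord/close 
life-cycle and is not Flink's actual OutputFormat interface; 
LazyOpenOutputFormat and RecordingFormat are likewise made-up names for 
illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for an output format life-cycle
// (open -> writeRecord* -> close). This is NOT the real Flink interface.
interface SimpleOutputFormat<T> {
    void open(int taskNumber, int numTasks) throws IOException;
    void writeRecord(T record) throws IOException;
    void close() throws IOException;
}

// Wrapper that postpones open() on the wrapped format until the first
// record arrives, so a task that sees no records never opens anything.
class LazyOpenOutputFormat<T> implements SimpleOutputFormat<T> {
    private final SimpleOutputFormat<T> wrapped;
    private int taskNumber;
    private int numTasks;
    private boolean opened;

    LazyOpenOutputFormat(SimpleOutputFormat<T> wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        // Only remember the arguments; the wrapped format is not opened yet.
        this.taskNumber = taskNumber;
        this.numTasks = numTasks;
        this.opened = false;
    }

    @Override
    public void writeRecord(T record) throws IOException {
        if (!opened) {
            // First record: now perform the deferred open.
            wrapped.open(taskNumber, numTasks);
            opened = true;
        }
        wrapped.writeRecord(record);
    }

    @Override
    public void close() throws IOException {
        // Close the wrapped format only if it was ever opened.
        if (opened) {
            wrapped.close();
            opened = false;
        }
    }
}

// Hypothetical in-memory format that just records which calls it received.
class RecordingFormat implements SimpleOutputFormat<String> {
    final List<String> events = new ArrayList<>();

    @Override
    public void open(int taskNumber, int numTasks) {
        events.add("open " + taskNumber + "/" + numTasks);
    }

    @Override
    public void writeRecord(String record) {
        events.add("write " + record);
    }

    @Override
    public void close() {
        events.add("close");
    }
}
```

Wrapping a RecordingFormat and calling only open() and close() on the 
wrapper leaves its event list empty, while writing at least one record 
forwards open, the writes and close in order. In a real job the wrapper 
would have to implement Flink's own OutputFormat interface and 
presumably forward configure(...) as well. Whether that alone 
suppresses every empty file has not been verified here.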

Regards,
Chesnay