Posted to user@flink.apache.org by bastien dine <ba...@gmail.com> on 2018/11/07 13:02:14 UTC

Counting DataSet in DataFlow

Hello,

I would like a way to count a dataset to check whether it is empty or not.
But .count() triggers a job execution, and I do not want a separate job
execution plan, as this will trigger multiple reads of the source.
I would like to have something like..

Source -> map -> count -> if 0   -> do something
                          if not -> do something else


More concretely, I would like to check whether one of my datasets is empty
before doing a cross operation.

Thanks,
Bastien

Re: Counting DataSet in DataFlow

Posted by bastien dine <ba...@gmail.com>.
Hi Fabian,
Thanks for the response, I am going to use the second solution!
Regards,
Bastien

------------------

Bastien DINE
Data Architect / Software Engineer / Sysadmin
bastiendine.io



Re: Counting DataSet in DataFlow

Posted by Fabian Hueske <fh...@gmail.com>.
Another option for certain tasks is to work with broadcast variables [1].
The value can be used to configure two filters.

DataSet<Data> input = ...;
DataSet<Long> count = input.map(x -> 1L).reduce((a, b) -> a + b);
// each filter reads the count via the broadcast variable "cnt"
DataSet<OutCntZero> zeroOut = input.filter(/* keep if cnt == 0 */)
    .withBroadcastSet(count, "cnt")
    ... // doSomething
DataSet<OutCntNonZero> nonZeroOut = input.filter(/* keep if cnt != 0 */)
    .withBroadcastSet(count, "cnt")
    ... // doSomethingElse

Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#broadcast-variables
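To make the pseudo-filters above concrete: in the DataSet API, a filter reads a broadcast variable through a RichFilterFunction. A rough, untested sketch of the zero-count branch (the class name EmptyInputFilter and the Data type are placeholders; flip the comparison for the non-empty branch):

```java
import java.util.List;

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;

// Passes records only when the broadcast count "cnt" is zero.
public class EmptyInputFilter extends RichFilterFunction<Data> {
    private long cnt;

    @Override
    public void open(Configuration parameters) {
        List<Long> counts = getRuntimeContext().getBroadcastVariable("cnt");
        // counting an empty DataSet produces no record at all,
        // so the broadcast list itself can be empty
        cnt = counts.isEmpty() ? 0L : counts.get(0);
    }

    @Override
    public boolean filter(Data value) {
        return cnt == 0;
    }
}
```

It would be attached as input.filter(new EmptyInputFilter()).withBroadcastSet(count, "cnt").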



Re: Counting DataSet in DataFlow

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

Counting always requires a job to be executed.
Not sure if this is what you want to do, but if you want to prevent getting
an empty result due to an empty cross input, you can use a mapPartition()
with parallelism 1 to emit a special record in case the
MapPartitionFunction didn't see any data.

Best, Fabian
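The core of that trick, stripped of the Flink runtime for illustration: in a real job this body would live inside a MapPartitionFunction applied with setParallelism(1), and the sentinel value here is a made-up placeholder.

```java
import java.util.ArrayList;
import java.util.List;

class EmptyInputGuard {
    // Forward every record; if the (single) partition held no data at
    // all, emit one special sentinel record instead, so a downstream
    // cross() never receives a completely empty input.
    static List<String> forwardOrSentinel(Iterable<String> partition, String sentinel) {
        List<String> out = new ArrayList<>();
        boolean sawData = false;
        for (String record : partition) {
            out.add(record);
            sawData = true;
        }
        if (!sawData) {
            out.add(sentinel);
        }
        return out;
    }
}
```

Downstream operators then only need to recognize and handle the sentinel record.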
