You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2015/05/04 14:43:54 UTC

filter().project() vs flatMap()

Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a
flatMap to filter tuples and do some projection after the filter.
Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio

Re: filter().project() vs flatMap()

Posted by Fabian Hueske <fh...@gmail.com>.
That might help with cardinality estimation for cost-based optimization.
For example when deciding about join strategies (broadcast vs. repartition,
build-side of a hash join).
However, as Stephan said, there are many cases where it does not make a
difference, e.g. if the input cardinality of the filter (or the size of the
other join input) is unknown.

I think, chances are low that it makes a difference.


2015-05-04 14:53 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:

> Thanks Sebastian and Fabian for the feedback, just one last question:
> what does change from the system point of view to know that the  output
> tuples is <= the number of input tuples?
> Is there any optimization that Flink can apply to the pipeline?
>
> On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <fh...@gmail.com> wrote:
>
>> It should not make a difference. I think its just personal taste.
>>
>> If your filter condition is simple enough, I'd go with Flink's Table API
>> because it does not require to define a Filter or FlatMapFunction.
>>
>>
>> 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:
>>
>>> Hi Flinkers,
>>>
>>> I'd like to know whether it's better to perform a filter.project or a
>>> flatMap to filter tuples and do some projection after the filter.
>>> Functionally they are equivalent but maybe I'm ignoring something..
>>>
>>> Thanks in advance,
>>> Flavio
>>>
>>
>>
>
>

Re: filter().project() vs flatMap()

Posted by Sebastian <ss...@apache.org>.
If the system has to decide data shipping strategies for a join (e.g., 
broadcasting one side) it helps to have good estimates of the input sizes.

On 04.05.2015 14:53, Flavio Pompermaier wrote:
> Thanks Sebastian and Fabian for the feedback, just one last question:
> what does change from the system point of view to know that the output
> tuples is <= the number of input tuples?
> Is there any optimization that Flink can apply to the pipeline?
>
> On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <fhueske@gmail.com
> <ma...@gmail.com>> wrote:
>
>     It should not make a difference. I think its just personal taste.
>
>     If your filter condition is simple enough, I'd go with Flink's Table
>     API because it does not require to define a Filter or FlatMapFunction.
>
>
>     2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it
>     <ma...@okkam.it>>:
>
>         Hi Flinkers,
>
>         I'd like to know whether it's better to perform a filter.project
>         or a flatMap to filter tuples and do some projection after the
>         filter. Functionally they are equivalent but maybe I'm ignoring
>         something..
>
>         Thanks in advance,
>         Flavio
>
>
>
>

Re: filter().project() vs flatMap()

Posted by Flavio Pompermaier <po...@okkam.it>.
Thanks Sebastian and Fabian for the feedback, just one last question:
what does change from the system point of view to know that the  output
tuples is <= the number of input tuples?
Is there any optimization that Flink can apply to the pipeline?

On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <fh...@gmail.com> wrote:

> It should not make a difference. I think its just personal taste.
>
> If your filter condition is simple enough, I'd go with Flink's Table API
> because it does not require to define a Filter or FlatMapFunction.
>
>
> 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:
>
>> Hi Flinkers,
>>
>> I'd like to know whether it's better to perform a filter.project or a
>> flatMap to filter tuples and do some projection after the filter.
>> Functionally they are equivalent but maybe I'm ignoring something..
>>
>> Thanks in advance,
>> Flavio
>>
>
>

Re: filter().project() vs flatMap()

Posted by Stephan Ewen <se...@apache.org>.
Hah, interesting to see how opinions differ ;-)

Sebastian has a point, that filter + project is more transparent to the
system. In some situations, this knowledge can help the optimizer, but
often, it will not matter.

Greetings,
Stephan


On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <fh...@gmail.com> wrote:

> It should not make a difference. I think its just personal taste.
>
> If your filter condition is simple enough, I'd go with Flink's Table API
> because it does not require to define a Filter or FlatMapFunction.
>
>
> 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:
>
>> Hi Flinkers,
>>
>> I'd like to know whether it's better to perform a filter.project or a
>> flatMap to filter tuples and do some projection after the filter.
>> Functionally they are equivalent but maybe I'm ignoring something..
>>
>> Thanks in advance,
>> Flavio
>>
>
>

Re: filter().project() vs flatMap()

Posted by Fabian Hueske <fh...@gmail.com>.
It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API
because it does not require to define a Filter or FlatMapFunction.


2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <po...@okkam.it>:

> Hi Flinkers,
>
> I'd like to know whether it's better to perform a filter.project or a
> flatMap to filter tuples and do some projection after the filter.
> Functionally they are equivalent but maybe I'm ignoring something..
>
> Thanks in advance,
> Flavio
>

Re: filter().project() vs flatMap()

Posted by Sebastian <ss...@apache.org>.
filter + project is easier to understand for the system, as the number 
of output tuples is guaranteed to be <= the number of input tuples. With 
flatMap, the system cannot know an upper bound.

--sebastian

On 04.05.2015 14:43, Flavio Pompermaier wrote:
> Hi Flinkers,
>
> I'd like to know whether it's better to perform a filter.project or a
> flatMap to filter tuples and do some projection after the filter.
> Functionally they are equivalent but maybe I'm ignoring something..
>
> Thanks in advance,
> Flavio