Posted to dev@flink.apache.org by Etienne Chauchot <ec...@apache.org> on 2021/09/29 12:38:54 UTC

[Flink blogs]

Hi all,

I'm the author of a blog: https://echauchot.blogspot.com/

I have been thinking about the subjects I dealt with lately in my Flink
PRs, and I need to find one that suits a blog post. I'd like to have
your opinion on these subjects:

  * metrics (https://github.com/apache/flink/pull/14510): it was
    dealing with delimiters. I think it is a bit low level for a blog post?

  * migration of pipelines from DataSet API to DataStream API: it is
    already discussed on the Flink website

  * accumulators (https://github.com/apache/flink/pull/14558): it was
    about an asynchronous get, once again a bit too low level for a blog
    post?

  * FileInputFormat, mainly parquet improvements and fixes
    (https://github.com/apache/flink/pull/15725,
    https://github.com/apache/flink/pull/15172,
    https://github.com/apache/flink/pull/15156): interesting, but as this
    API is being decommissioned, it might not be a good subject?

  * doing a manual join in DataStream API in batch mode with
    KeyedCoProcessFunction (https://issues.apache.org/jira/browse/FLINK-22587).
    As the target is more Flink Table/SQL for this kind of thing, the
    same deprecation comment as above applies.

=> Maybe a blog post on back pressure in checkpointing
(https://github.com/apache/flink/pull/13040). WDYT?


Best

Etienne Chauchot

Re: [Flink blogs]

Posted by Etienne Chauchot <ec...@apache.org>.

Hi all,

Thanks a lot for your feedback, guys! Special thanks to Fabian, Till and
Arvid (in a private discussion)!

The consensus seems to go toward the blog post on migrating a batch
pipeline from the DataSet API to the DataStream API. For the record, it is
linked to work I did lately (unfortunately not public; let's see if I can
make it public in the future) of testing the TPC-DS performance benchmark
on Flink. I know there is already an implementation in the repo using
Flink SQL, but I wanted to implement it at a lower level using the DataSet
API and later the DataStream API. It uses parquet (so the old format). The
query I implemented is TPC-DS query 3. That is the use case for this
future blog post. Indeed, as Fabian and Till said, it can easily become a
series.


The second topic, which received less consensus: a manual join with
KeyedCoProcessFunction in the DataStream API (thanks Till!). I will of
course add a pointer to the new target for users, the Table/SQL API, as
Fabian reminded me.
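
To illustrate the kind of recipe such a post could walk through, here is a
minimal sketch of the buffering logic behind a manual inner equi-join on two
keyed inputs. It is plain Java, not Flink code and not the implementation
discussed in FLINK-22587: the class ManualJoinSketch and its methods onLeft
and onRight are hypothetical names, and the two in-memory maps are an
assumption standing in for the per-key state that a KeyedCoProcessFunction
would keep, with onLeft and onRight playing the roles of its
processElement1 and processElement2 callbacks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Plain-Java sketch of a manual inner equi-join on two keyed inputs.
 * Hypothetical illustration only: it does not use any Flink API.
 */
class ManualJoinSketch {
    // Rows seen so far on each side, grouped by join key
    // (assumed stand-in for per-key state in a keyed two-input function).
    private final Map<Integer, List<String>> leftByKey = new HashMap<>();
    private final Map<Integer, List<String>> rightByKey = new HashMap<>();

    /** Buffers a left row, then pairs it with every right row already seen for the key. */
    List<String> onLeft(int key, String row) {
        leftByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        List<String> joined = new ArrayList<>();
        for (String right : rightByKey.getOrDefault(key, Collections.emptyList())) {
            joined.add(row + "|" + right);
        }
        return joined;
    }

    /** Buffers a right row, then pairs it with every left row already seen for the key. */
    List<String> onRight(int key, String row) {
        rightByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        List<String> joined = new ArrayList<>();
        for (String left : leftByKey.getOrDefault(key, Collections.emptyList())) {
            joined.add(left + "|" + row);
        }
        return joined;
    }
}
```

Each matching pair is emitted exactly once, when the later of its two rows
arrives, so the joined result does not depend on the order in which the two
inputs are consumed. Rows whose key never appears on the other side are
buffered but never emitted, which is what makes this an inner join.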


Another blog post could be about performance: during this benchmark, I
observed the cost of SQL translation compared to the lower-level APIs, the
performance improvement in DataStream, and the performance improvement
brought by the Blink planner. That could also be a good blog post. Also, I
tend not to compare performance with other Apache big data projects such
as Spark, because they all have their strengths and their tricky
parameters, and in the end we often end up comparing things that are not
100% comparable.


Regarding the other topics, as I wrote, I doubted that they would be of
interest, mainly because of the deprecation of formats, the steering of
users toward the Table/SQL API, or because the topics are too low level.
Thanks for confirming my doubts!


Best

Etienne

On 30/09/2021 15:51, Till Rohrmann wrote:
> Hi Etienne,
>
> Great to see that you want to write about one of the topics you have worked
> on! Spreading the word about changes/improvements/new features is always
> super important.
>
> As a general recommendation I think you should write about the topic you
> are most excited about. This usually results in an interesting blog post
> for others. If you have multiple favourites, then I would think about what
> topic could be most interesting for users. In my experience, blog posts
> that deal with a user problem (e.g. how to solve xyz) get more attention
> than technical blog posts. Having said this, I think the following topics
> would be good candidates:
>
> - migration of pipelines from DataSet API to DataStream API
> As Fabian said, this could easily become a series of blog posts. Maybe this
> could also become part of the documentation.
>
> - doing a manual join in DataStream API in batch mode with
> KeyedCoProcessFunction
> I could see that this is a nice blog post about a concrete recipe on how to
> solve a certain set of problems with the DataStream API. Fabian is right
> that in the future we will try to steer people towards the Table API but
> maybe the join condition cannot be easily expressed with SQL so that people
> would naturally switch to the DataStream API for it.
>
> - back pressure in checkpointing
> Improving the understanding of Flink operations is always a good and
> worthwhile idea imo.
>
> Cheers,
> Till

Re: [Flink blogs]

Posted by Till Rohrmann <tr...@apache.org>.

Hi Etienne,

Great to see that you want to write about one of the topics you have worked
on! Spreading the word about changes/improvements/new features is always
super important.

As a general recommendation I think you should write about the topic you
are most excited about. This usually results in an interesting blog post
for others. If you have multiple favourites, then I would think about what
topic could be most interesting for users. In my experience, blog posts
that deal with a user problem (e.g. how to solve xyz) get more attention
than technical blog posts. Having said this, I think the following topics
would be good candidates:

- migration of pipelines from DataSet API to DataStream API
As Fabian said, this could easily become a series of blog posts. Maybe this
could also become part of the documentation.

- doing a manual join in DataStream API in batch mode with
KeyedCoProcessFunction
I could see that this is a nice blog post about a concrete recipe on how to
solve a certain set of problems with the DataStream API. Fabian is right
that in the future we will try to steer people towards the Table API but
maybe the join condition cannot be easily expressed with SQL so that people
would naturally switch to the DataStream API for it.

- back pressure in checkpointing
Improving the understanding of Flink operations is always a good and
worthwhile idea imo.

Cheers,
Till

On Thu, Sep 30, 2021 at 10:19 AM Fabian Paul <fa...@ververica.com>
wrote:

> Hi Etienne,
>
> Thanks for reaching out. I think your list already looks very appealing.
>
> > * metrics (https://github.com/apache/flink/pull/14510): it was
> >   dealing with delimiters. I think it is a bit low level for a blog post?
>
> I am also unsure whether this is a good fit to present. I can only imagine
> showing what kind of use case it supports.
>
> > * migration of pipelines from DataSet API to DataStream API: it is
> >   already discussed on the Flink website
>
> This is definitely something I’d like to see. In my opinion, it can also
> become a series because the topic has a lot of aspects. If you want to
> write a post about it, it would be great to show the migration of a more
> complex pipeline (e.g. old formats, incompatible types, ...). Many users
> will eventually face this, so it has a big impact. FYI, Flink 1.13 is
> probably the last version with full DataSet support.
>
> > * accumulators (https://github.com/apache/flink/pull/14558): it was
> >   about an asynchronous get, once again a bit too low level for a blog
> >   post?
>
> To me, accumulators are a kind of internal concept, but maybe you can
> provide the use case which drove this change? Explaining their semantics
> is probably already complicated.
>
> > * FileInputFormat, mainly parquet improvements and fixes
> >   (https://github.com/apache/flink/pull/15725,
> >   https://github.com/apache/flink/pull/15172,
> >   https://github.com/apache/flink/pull/15156): interesting, but as this
> >   API is being decommissioned, it might not be a good subject?
>
> You have already summarized it: it is being deprecated, and a much more
> interesting topic is the migration from DataSet to the DataStream API in
> case these old formats are used.
>
> > * doing a manual join in DataStream API in batch mode with
> >   KeyedCoProcessFunction (https://issues.apache.org/jira/browse/FLINK-22587).
> >   As the target is more Flink Table/SQL for this kind of thing, the
> >   same deprecation comment as above applies.
>
> I would tend not to cover this topic, because my recommendation would be
> to use the Table API directly and not build your own join in the
> DataStream API ;)
>
> > => maybe a blog post on back pressure in checkpointing
> > (https://github.com/apache/flink/pull/13040). WDYT?
>
> This is also an interesting topic, but we constantly work on improving the
> situation, and I am unsure whether the blog post would still be up to date
> by the time it is released.
>
>
> Please let me know what you think. I am also happy to give more detailed
> feedback on any of the topics if you need it.
>
> Best,
> Fabian

Re: [Flink blogs]

Posted by Fabian Paul <fa...@ververica.com>.

Hi Etienne,

Thanks for reaching out. I think your list already looks very appealing.

> * metrics (https://github.com/apache/flink/pull/14510): it was
>   dealing with delimiters. I think it is a bit low level for a blog post?

I am also unsure whether this is a good fit to present. I can only imagine showing what kind of use case it supports.


> * migration of pipelines from DataSet API to DataStream API: it is
>   already discussed on the Flink website

This is definitely something I’d like to see. In my opinion, it can also become a series because the topic has a lot of aspects. If you want to write a
post about it, it would be great to show the migration of a more complex pipeline (e.g. old formats, incompatible types, ...). Many users will
eventually face this, so it has a big impact. FYI, Flink 1.13 is probably the last version with full DataSet support.

> * accumulators (https://github.com/apache/flink/pull/14558): it was
>   about an asynchronous get, once again a bit too low level for a blog
>   post?

To me, accumulators are a kind of internal concept, but maybe you can provide the use case which drove this change? Explaining their
semantics is probably already complicated.


> * FileInputFormat, mainly parquet improvements and fixes
>   (https://github.com/apache/flink/pull/15725,
>   https://github.com/apache/flink/pull/15172,
>   https://github.com/apache/flink/pull/15156): interesting, but as this
>   API is being decommissioned, it might not be a good subject?

You have already summarized it: it is being deprecated, and a much more interesting topic is the migration from DataSet to the DataStream API in
case these old formats are used.


> * doing a manual join in DataStream API in batch mode with
>   KeyedCoProcessFunction (https://issues.apache.org/jira/browse/FLINK-22587).
>   As the target is more Flink Table/SQL for this kind of thing, the
>   same deprecation comment as above applies.

I would tend not to cover this topic, because my recommendation would be to use the Table API directly and not build your own join in the DataStream API ;)

> => maybe a blog post on back pressure in checkpointing (https://github.com/apache/flink/pull/13040). WDYT?

This is also an interesting topic, but we constantly work on improving the situation, and I am unsure whether the blog post would still be up to date
by the time it is released.


Please let me know what you think. I am also happy to give more detailed feedback on any of the topics if you need it.

Best,
Fabian