Posted to dev@flink.apache.org by yuxia <lu...@alumni.sjtu.edu.cn> on 2023/04/03 01:37:42 UTC

Re: [blog article] Howto create a batch source with the new Source framework

Thanks Etienne for the detailed explanation.

Best regards,
Yuxia

----- Original Message -----
From: "Etienne Chauchot" <ec...@apache.org>
To: "dev" <de...@flink.apache.org>
Sent: Friday, March 31, 2023 9:08:36 PM
Subject: Re: [blog article] Howto create a batch source with the new Source framework

Hi Yuxia,

Thanks for your feedback.

Comments inline


On 31/03/2023 at 04:21, yuxia wrote:
> Hi, Etienne.
>
> Thanks Etienne for sharing this article. I really like it and learned a lot from it.
=> Glad it was useful, that was precisely the point :)
>
> I'd like to raise some questions about implementing a batch source. Devs are welcome to share their insights.
>
> The first question is how to generate splits:
> As the article mentioned:
> "Whenever possible, it is preferable to generate the splits lazily, meaning that each time a reader asks the enumerator for a split, the enumerator generates one on demand and assigns it to the reader."
> I think this may not hold for all cases. In some cases, generating a split may be time-consuming, so it may be better to generate a batch of splits on demand to amortize the expense.
> But that raises another question: how many splits should be generated in a batch? Too many may well cause OOM, too few may not make good use of batched split generation.
> To solve this, maybe we can provide a configuration option that lets the user configure how many splits should be generated in a batch.
> What's your opinion on this? Have you ever encountered this problem in your implementation?

=> I agree, lazy splits are not the only way. I've mentioned in the 
article that batch generation is another option in case of high split 
generation cost, thanks for the suggestion. During the implementation I 
didn't have this problem, as generating a split was not costly; the only 
costly processing was the splits preparation. It was run asynchronously 
and only once, then each split generation was straightforward. That 
being said, during development I had OOM risks in both the size and the 
number of splits. For the number of splits, lazy generation solved it as 
no list of splits was stored in the enumerator apart from the splits to 
reassign. For the size of a split, I used a user-provided max split 
memory size, similar to what you suggest here. In the batch generation 
case, we could allow the user to set a max memory size for the batch: a 
number of splits per batch looks more dangerous to me if we don't know 
the size of a split, but if we are talking about storing the split 
objects and not their content, then that is ok. IMO, a memory size is 
clearer for the user as it is linked to the memory of a task manager.
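To make the trade-off above concrete, here is a minimal, self-contained sketch of the two approaches; the class and method names are illustrative only and do not correspond to Flink's actual SplitEnumerator API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch (not Flink's real API): splits are generated lazily,
// and the optional batched path is capped by an estimated memory budget
// rather than by a fixed split count.
class BudgetedSplitGenerator {
    record Split(String id, long estimatedBytes) {}

    private final Deque<Split> pending; // prepared once, asynchronously

    BudgetedSplitGenerator(List<Split> prepared) {
        this.pending = new ArrayDeque<>(prepared);
    }

    /** Lazy path: one split per reader request, nothing else buffered. */
    Split nextSplit() {
        return pending.poll(); // null when all splits are exhausted
    }

    /** Batched path: fill a batch until the memory budget is reached. */
    List<Split> nextBatch(long maxBatchBytes) {
        List<Split> batch = new ArrayList<>();
        long used = 0;
        while (!pending.isEmpty()) {
            Split next = pending.peek();
            // always hand out at least one split, even if it exceeds the budget
            if (!batch.isEmpty() && used + next.estimatedBytes() > maxBatchBytes) {
                break;
            }
            batch.add(pending.poll());
            used += next.estimatedBytes();
        }
        return batch;
    }
}
```

With a 900-byte budget and three 400-byte splits, for example, `nextBatch` hands out two splits and leaves the third for the next request, so the enumerator never buffers more than the budget allows.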

>
>
> The second question is how to assign splits:
> What's your split assignment strategy?
=> the naïve one: a reader asks for a split, the enumerator receives the 
request, generates a split and assigns it to the demanding reader.
> In Flink, we provide `LocalityAwareSplitAssigner` to make use of locality when assigning splits to readers.
=> well, it is only of interest when the data-backend cluster nodes can be 
co-located with Flink task managers, right? That would rarely be the 
case, as the clusters seem to be separated most of the time to use the 
maximum available CPU (at least for CPU-bound workloads), no?
> But it may not be perfect for the case of failover,
=> Agree: it would require costly shuffle to keep the co-location after 
restoration and this cost would not be balanced by the gain raised by 
co-locality (mainly avoiding network use) I think.
> for which we intend to introduce another split assignment strategy [1].
> But I do think it should be configurable, to enable advanced users to decide which assignment strategy to use.

=> when you say the "user", I guess you mean the user of the source, not 
the user of the dev framework (the implementor of the source). I think 
it should indeed be configurable, as the user is the one who knows the 
distribution of the partitions in the backend data.
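As a sketch of what a configurable assignment strategy could look like (the interface and class names below are hypothetical, not the FLINK-31065 design): the enumerator delegates each request to a pluggable assigner, so the source user can pick the naive request-based strategy or a locality-aware one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical pluggable assigner interface; splits are encoded as
// "host:splitId" strings for brevity.
interface SplitAssigner {
    Optional<String> assign(String readerHost, Deque<String> unassigned);
}

// Naive strategy: hand the next available split to whoever asks.
class RequestBasedAssigner implements SplitAssigner {
    public Optional<String> assign(String readerHost, Deque<String> unassigned) {
        return Optional.ofNullable(unassigned.poll());
    }
}

// Locality-aware strategy: prefer a split hosted on the reader's node,
// falling back to any split when no local one is left.
class LocalityAwareAssigner implements SplitAssigner {
    public Optional<String> assign(String readerHost, Deque<String> unassigned) {
        for (String split : unassigned) {
            if (split.startsWith(readerHost + ":")) {
                unassigned.remove(split);
                return Optional.of(split);
            }
        }
        return Optional.ofNullable(unassigned.poll());
    }
}
```

Making the assigner a constructor parameter of the source (or a configuration key resolved to one of these classes) would let the source user switch strategies without touching the enumerator logic.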

Best

Etienne

>
>
> Other devs are welcome to share their opinions.
>
> [1]: https://issues.apache.org/jira/browse/FLINK-31065
>
>
>
>
>
>
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "Etienne Chauchot" <ec...@apache.org>
> To: "dev" <de...@flink.apache.org>
> Cc: "Chesnay Schepler" <ch...@apache.org>
> Sent: Thursday, March 30, 2023 10:36:39 PM
> Subject: [blog article] Howto create a batch source with the new Source framework
>
> Hi all,
>
> After creating the Cassandra source connector (thanks Chesnay for the
> review!), I wrote a blog article about how to create a batch source with
> the new Source framework [1]. It gives field feedback on how to
> implement the different components.
>
> I felt it could be useful to people interested in contributing or
> migrating connectors.
>
> => Can you give me your opinion?
>
> => I think it could be useful to post the article to the Flink official blog
> as well, if you agree.
>
> => Same remark for my previous article [2]: what about publishing it to the
> Flink official blog?
>
>
> [1] https://echauchot.blogspot.com/2023/03/flink-howto-create-batch-source-with.html
>
> [2] https://echauchot.blogspot.com/2022/11/flink-howto-migrate-real-life-batch.html
>
>
> Best
>
> Etienne

Re: [blog article] Howto create a batch source with the new Source framework

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

I just published the last article of the series about creating a batch 
source with the new Source framework [1]. This one is about testing the source.

Can you tell me if you think it would make sense to publish both 
articles to the official Flink blog, as they could serve as detailed 
documentation?

Thanks

Etienne

[1] 
https://echauchot.blogspot.com/2023/04/flink-howto-test-batch-source-with-new.html
