Posted to issues@flink.apache.org by "luoyuxia (Jira)" <ji...@apache.org> on 2022/10/13 09:07:00 UTC

[jira] [Comment Edited] (FLINK-29617) Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster

    [ https://issues.apache.org/jira/browse/FLINK-29617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616855#comment-17616855 ] 

luoyuxia edited comment on FLINK-29617 at 10/13/22 9:06 AM:
------------------------------------------------------------

[~dangshazi] Thanks for raising this and for the detailed explanation. I'd much appreciate it if you could take the ticket. If you don't have time, I can take it.

I'm fine with both suggestions, but I prefer suggestion 2, since suggestion 1 introduces a new option that users may hardly know about.

I have one question: have you tried these suggestions? If so, how much improvement did each of them bring?

Btw, the uploaded images cannot be displayed. Could you please upload them again?


was (Author: luoyuxia):
[~dangshazi] Thanks for raising this and for the detailed explanation. I'd much appreciate it if you could take the ticket.

I'm fine with both suggestions, but I prefer suggestion 2, since suggestion 1 introduces a new option that users may hardly know about.

I have one question: have you tried these suggestions? If so, how much improvement did each of them bring?

Btw, the uploaded images cannot be displayed. Could you please upload them again?

> Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-29617
>                 URL: https://issues.apache.org/jira/browse/FLINK-29617
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, Runtime / Coordination
>    Affects Versions: 1.15.2
>            Reporter: LI Mingkun
>            Priority: Major
>              Labels: coordination, file-system
>
> h1. Scenario:
> Our user uses Flink batch to compact the small files of one day. Flink version: 1.15
> He splits the pipeline into 24, one per hour, so there are 24 sources.
>  
> I find that it takes too much time to start the SourceCoordinator of the hdfsFileSource when the JobMaster starts,
> as follows:
>  
> !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.1&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9SVAoAslMUGQdVQJ_ccmEf4LxhaONYKJvS_V8nvijvT3JXw_VlyRBAEE9EQhTtWdYPa4TLCO5rxjXGrTDK2_PGHX4RZDPTQTJ0LwKXAUr4BYlMhYZsjcrY9eo&disp=emb&realattid=ii_l95bh7qy0|width=542,height=260!
>  
> h1. Root Cause:
> After checking, I found the root cause:
>  # AbstractFileSource calls enumerateSplits when createEnumerator is invoked
>  # NonSplittingRecursiveEnumerator needs to get the block locations of every file block, which is a heavy I/O operation
> !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.3&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ8AoT071eCNMb_q3uJtcbrUmZnYbg3ucnDelMlRRPn7WLlXOBGj650srQk9vhqKyJEANvpOWoxHuH6jNHt7g6go8JkeRUZKc81yqT0yzzz7tbBciTe-YnRVQ7w&disp=emb&realattid=ii_l95bp1832|width=542,height=456! !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.2&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9phsX1nauTsx3xWje_YJM4uUaOLXKHcXKsm7WJquPQQGC7bQTni3OhQB5HtGYVOvrD-3Kbp9LURfUj6OiIUgsZU1AImSL0vj27cnDcf7HpVpLpaqdADtpoABU&disp=emb&realattid=ii_l95bjh1g1|width=526,height=542!
>  
> h1. Suggestion
>  # Add an option to FileSource to disable the location fetcher
>  # Move the location fetcher into the IOExecutor
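
For illustration only, the two suggestions above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the FileSource implementation: the class LocationFetchSketch, the methods enumerate and enumerateAsync, and the fetchLocations flag are all hypothetical names, and the lookup function merely stands in for the per-file block location call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are Flink APIs. */
class LocationFetchSketch {

    /** A split: a file path plus the hosts holding its blocks (possibly empty). */
    static final class Split {
        final String path;
        final List<String> hosts;

        Split(String path, List<String> hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /**
     * Suggestion 1 (assumed shape): when fetchLocations is false, the expensive
     * lookup is never called and every split gets an empty host list.
     */
    static List<Split> enumerate(
            List<String> files,
            boolean fetchLocations,
            Function<String, List<String>> locationLookup) {
        List<Split> splits = new ArrayList<>();
        for (String file : files) {
            List<String> hosts =
                    fetchLocations ? locationLookup.apply(file) : Collections.<String>emptyList();
            splits.add(new Split(file, hosts));
        }
        return splits;
    }

    /**
     * Suggestion 2 (assumed shape): run the same enumeration, lookups included,
     * on an I/O executor so the calling thread is not blocked while it runs.
     */
    static CompletableFuture<List<Split>> enumerateAsync(
            List<String> files,
            Function<String, List<String>> locationLookup,
            ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(
                () -> enumerate(files, true, locationLookup), ioExecutor);
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        try {
            // Made-up paths and a fake lookup; a real lookup would hit the file system.
            List<String> files = Arrays.asList("/data/hour=00/part-0", "/data/hour=01/part-0");
            Function<String, List<String>> lookup =
                    file -> Arrays.asList("datanode-for-" + file);

            CompletableFuture<List<Split>> pending = enumerateAsync(files, lookup, ioExecutor);
            // The calling thread is free to continue here while the lookups run.
            pending.thenAccept(splits -> System.out.println(splits.size() + " splits ready"))
                    .join();
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

The trade-off matches the discussion in this thread: the flag avoids the lookups entirely but adds a setting users have to discover, while the executor variant keeps the locations and only moves the waiting off the startup path.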


