Posted to dev@flink.apache.org by "LI Mingkun (Jira)" <ji...@apache.org> on 2022/10/13 07:02:00 UTC

[jira] [Created] (FLINK-29617) Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster

LI Mingkun created FLINK-29617:
----------------------------------

             Summary: Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster
                 Key: FLINK-29617
                 URL: https://issues.apache.org/jira/browse/FLINK-29617
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem, Runtime / Coordination
    Affects Versions: 1.15.2
            Reporter: LI Mingkun


h1. Scenario:
One of our users runs a Flink batch job (Flink version: 1.15) to compact one day's worth of small files.
He split the pipeline into 24 parts, one per hour, so the job has 24 sources.
 
I found that starting the SourceCoordinator of the hdfsFileSource takes too much time when the JobMaster starts,
 
 as follows:
 
!https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.1&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9SVAoAslMUGQdVQJ_ccmEf4LxhaONYKJvS_V8nvijvT3JXw_VlyRBAEE9EQhTtWdYPa4TLCO5rxjXGrTDK2_PGHX4RZDPTQTJ0LwKXAUr4BYlMhYZsjcrY9eo&disp=emb&realattid=ii_l95bh7qy0|width=542,height=260!
 
h1. Root Cause:
After investigating, I found the root cause:
 # AbstractFileSource calls enumerateSplits in createEnumerator, so all splits are enumerated while the SourceCoordinator is starting
 # NonSplittingRecursiveEnumerator needs to fetch the block locations of every file, which is a heavy I/O operation

!https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.3&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ8AoT071eCNMb_q3uJtcbrUmZnYbg3ucnDelMlRRPn7WLlXOBGj650srQk9vhqKyJEANvpOWoxHuH6jNHt7g6go8JkeRUZKc81yqT0yzzz7tbBciTe-YnRVQ7w&disp=emb&realattid=ii_l95bp1832|width=542,height=456! !https://mail.google.com/mail/u/0?ui=2&ik=488d9ac3dd&attid=0.2&permmsgid=msg-a:r-3013789195315215531&th=183cb292e567fd9f&view=fimg&fur=ip&sz=s0-l75-ft&attbid=ANGjdJ9phsX1nauTsx3xWje_YJM4uUaOLXKHcXKsm7WJquPQQGC7bQTni3OhQB5HtGYVOvrD-3Kbp9LURfUj6OiIUgsZU1AImSL0vj27cnDcf7HpVpLpaqdADtpoABU&disp=emb&realattid=ii_l95bjh1g1|width=526,height=542!
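
The pattern described in the root cause can be sketched roughly as follows. This is only an illustration in plain Java: CountingFileSystem, Split, the paths, and the figure of 1000 files per source are hypothetical stand-ins, not Flink's actual classes or measured values. It shows that one block location lookup is issued per file, all on the thread that creates the enumerator.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; these types are hypothetical, not Flink classes. */
public class SequentialEnumerationSketch {

    /** Hypothetical file system facade that counts block location lookups. */
    static class CountingFileSystem {
        int locationLookups = 0;

        String[] getBlockHosts(String path) {
            locationLookups++; // each call models one remote round trip
            return new String[] {"host-1"};
        }
    }

    /** Hypothetical split: a file path plus the hosts that store it. */
    static class Split {
        final String path;
        final String[] hosts;

        Split(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /** Issues one lookup per file, sequentially, on the calling thread. */
    static List<Split> enumerateSplits(CountingFileSystem fs, List<String> files) {
        List<Split> splits = new ArrayList<>();
        for (String file : files) {
            splits.add(new Split(file, fs.getBlockHosts(file)));
        }
        return splits;
    }

    public static void main(String[] args) {
        CountingFileSystem fs = new CountingFileSystem();
        int sources = 24;          // one source per hour, as in the scenario
        int filesPerSource = 1000; // assumed figure, for illustration only
        for (int s = 0; s < sources; s++) {
            List<String> files = new ArrayList<>();
            for (int f = 0; f < filesPerSource; f++) {
                files.add("/data/hour=" + s + "/part-" + f); // hypothetical paths
            }
            enumerateSplits(fs, files);
        }
        System.out.println("block location lookups: " + fs.locationLookups);
    }
}
```

With many small files, the number of lookups grows linearly with the file count, and every source pays that cost before the job can start.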
 
h1. Suggestion
 # Add an option to FileSource to disable the location fetcher
 # Move the location fetcher into the IOExecutor
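
A minimal sketch of what the two suggestions could look like, in plain Java. All names here (enumerate, enumerateAsync, the fetchLocations flag, the locationFetcher function) are assumptions made for illustration and do not correspond to an existing Flink option or API. The first method skips the lookup when the flag is off; the second runs the lookups on a separate executor and returns a future instead of blocking the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Illustrative sketch only; names and signatures are hypothetical, not Flink's. */
public class LocationFetchOptionsSketch {

    static final String[] NO_HOSTS = new String[0];

    /** Hypothetical split: a file path plus the hosts that store it. */
    static class Split {
        final String path;
        final String[] hosts;

        Split(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /** Suggestion 1: when fetchLocations is false, no lookup is made at all. */
    static List<Split> enumerate(
            List<String> files, boolean fetchLocations, Function<String, String[]> locationFetcher) {
        List<Split> splits = new ArrayList<>();
        for (String file : files) {
            String[] hosts = fetchLocations ? locationFetcher.apply(file) : NO_HOSTS;
            splits.add(new Split(file, hosts));
        }
        return splits;
    }

    /** Suggestion 2: run each lookup on an I/O executor and complete asynchronously. */
    static CompletableFuture<List<Split>> enumerateAsync(
            List<String> files, Function<String, String[]> locationFetcher, ExecutorService ioExecutor) {
        List<CompletableFuture<Split>> pending = new ArrayList<>();
        for (String file : files) {
            pending.add(CompletableFuture.supplyAsync(
                    () -> new Split(file, locationFetcher.apply(file)), ioExecutor));
        }
        return CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<Split> splits = new ArrayList<>();
                    for (CompletableFuture<Split> future : pending) {
                        splits.add(future.join()); // already completed at this point
                    }
                    return splits;
                });
    }

    public static void main(String[] args) {
        Function<String, String[]> fetcher = path -> new String[] {"host-1"}; // stand-in lookup
        List<String> files = new ArrayList<>();
        files.add("/data/part-0"); // hypothetical paths
        files.add("/data/part-1");

        System.out.println("hosts without fetching: " + enumerate(files, false, fetcher).get(0).hosts.length);

        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        try {
            List<Split> splits = enumerateAsync(files, fetcher, ioExecutor).join();
            System.out.println("splits enumerated asynchronously: " + splits.size());
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

Skipping the lookup gives up locality-aware split assignment, which may be acceptable when storage and compute are not co-located; the asynchronous variant keeps locality information but takes the I/O off the thread that starts the coordinator.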



--
This message was sent by Atlassian Jira
(v8.20.10#820010)