Posted to dev@flink.apache.org by "LI Mingkun (Jira)" <ji...@apache.org> on 2022/10/13 07:02:00 UTC
[jira] [Created] (FLINK-29617) Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster
LI Mingkun created FLINK-29617:
----------------------------------
Summary: Cost too much time to start SourceCoordinator of hdfsFileSource when start JobMaster
Key: FLINK-29617
URL: https://issues.apache.org/jira/browse/FLINK-29617
Project: Flink
Issue Type: Improvement
Components: Connectors / FileSystem, Runtime / Coordination
Affects Versions: 1.15.2
Reporter: LI Mingkun
h1. Scenario:
One of our users runs a Flink batch job (Flink 1.15) to compact one day's worth of small files.
He split the pipeline into 24 parts, one per hour, so the job has 24 sources.
When the JobMaster starts, starting the SourceCoordinator of each HDFS file source takes too much time.
[Screenshot not available in this plain-text version.]
h1. Root Cause:
After investigating, I found the root cause:
# AbstractFileSource calls enumerateSplits when createEnumerator is invoked.
# NonSplittingRecursiveEnumerator needs to fetch the file block locations of every file, which is a heavy I/O operation.
[Two screenshots not available in this plain-text version.]
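To illustrate why this is slow, here is a minimal sketch of the pattern, not the actual Flink code. The class EagerEnumerationSketch, its Split type, and the blockLocationLookup function are hypothetical stand-ins. The point is only that eager enumeration performs one location lookup per file, so the startup cost grows linearly with the number of small files, and all of it happens before the coordinator has finished starting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical illustration only; none of these types are Flink or Hadoop APIs. */
final class EagerEnumerationSketch {

    /** Hypothetical split: a file path plus the hosts that store its blocks. */
    static final class Split {
        final String path;
        final String[] hosts;

        Split(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts;
        }
    }

    /**
     * Builds one split per file and resolves its hosts immediately.
     * On a remote file system each lookup is assumed to be one round trip.
     */
    static List<Split> enumerateSplits(
            List<String> paths, Function<String, String[]> blockLocationLookup) {
        List<Split> splits = new ArrayList<>(paths.size());
        for (String path : paths) {
            splits.add(new Split(path, blockLocationLookup.apply(path)));
        }
        return splits;
    }

    public static void main(String[] args) {
        AtomicInteger lookups = new AtomicInteger();
        // Fake lookup that only counts calls; a real one would contact the NameNode.
        Function<String, String[]> lookup = path -> {
            lookups.incrementAndGet();
            return new String[] {"host-a"};
        };

        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            paths.add("/data/hour=00/part-" + i);
        }

        List<Split> splits = enumerateSplits(paths, lookup);
        System.out.println(splits.size() + " splits required " + lookups.get() + " lookups");
    }
}
```

With 24 sources each listing many small files, these per-file lookups add up during JobMaster startup.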
h1. Suggestion
# Add an option to FileSource to disable fetching block locations.
# Move the block location fetching into the IOExecutor.
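A rough sketch of what the two suggestions could look like, using plain Java only. The class LocationFetchSketch, the fetchLocations flag, and the ioExecutor parameter are hypothetical names for illustration; this is not an existing FileSource option or the real Flink executor wiring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical illustration only; names and signatures are assumptions. */
final class LocationFetchSketch {

    /**
     * Suggestion 1: when fetchLocations is false, skip the lookup entirely
     * and return an empty host list for every file.
     */
    static List<String[]> resolveHosts(
            List<String> paths, boolean fetchLocations, Function<String, String[]> lookup) {
        List<String[]> hosts = new ArrayList<>(paths.size());
        for (String path : paths) {
            hosts.add(fetchLocations ? lookup.apply(path) : new String[0]);
        }
        return hosts;
    }

    /**
     * Suggestion 2: run the lookups on a separate I/O executor so the
     * calling thread is not blocked while they complete.
     */
    static CompletableFuture<List<String[]>> resolveHostsAsync(
            List<String> paths, Function<String, String[]> lookup, ExecutorService ioExecutor) {
        return CompletableFuture.supplyAsync(() -> resolveHosts(paths, true, lookup), ioExecutor);
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = List.of("/data/hour=00/part-0", "/data/hour=00/part-1");
        Function<String, String[]> lookup = path -> new String[] {"host-a"};

        // Option disabled: no lookups are made, every file gets zero hosts.
        System.out.println(resolveHosts(paths, false, lookup).get(0).length);

        // Lookups offloaded to an I/O thread pool; the caller waits only when it needs the result.
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        try {
            System.out.println(resolveHostsAsync(paths, lookup, ioExecutor).get().size());
        } finally {
            ioExecutor.shutdown();
        }
    }
}
```

The first variant trades locality information for startup time, which may be acceptable when splits are not assigned by host. The second keeps locality but moves the blocking calls off the thread that starts the coordinator.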