Posted to issues@flink.apache.org by "Huyen Levan (JIRA)" <ji...@apache.org> on 2018/07/25 04:21:00 UTC
[jira] [Created] (FLINK-9940) File source continuous monitoring mode: S3 files sometimes missed
Huyen Levan created FLINK-9940:
----------------------------------
Summary: File source continuous monitoring mode: S3 files sometimes missed
Key: FLINK-9940
URL: https://issues.apache.org/jira/browse/FLINK-9940
Project: Flink
Issue Type: Bug
Components: Streaming
Affects Versions: 1.5.1
Environment: Flink 1.5, EMRFS
Reporter: Huyen Levan
When using StreamExecutionEnvironment.readFile() with FileProcessingMode.PROCESS_CONTINUOUSLY to monitor an S3 prefix, the directory monitoring process might miss some files if a large number of files are added or modified at the same time. The number of missed files depends on the monitoring interval.
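Cause: Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. So when multiple files have the same last-modified timestamp and only some of them have been picked up, any remaining file with that timestamp that appears in a later directory scan can be treated as already processed and skipped.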
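To illustrate the cause described above, here is a minimal, self-contained sketch of max-timestamp tracking. It is a simplified model written for this report, not Flink's actual code: the class MaxTimestampMonitor, its scan() method, and the file names and timestamps are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of "remember only the largest modification time".
// This is NOT the Flink implementation; it only reproduces the failure mode.
final class MaxTimestampMonitor {
    private long maxModTime = Long.MIN_VALUE;

    /** Returns the files of one directory listing that are treated as new. */
    List<String> scan(Map<String, Long> listing) {
        List<String> picked = new ArrayList<>();
        long newMax = maxModTime;
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            // Only files strictly newer than the remembered maximum are picked up.
            if (e.getValue() > maxModTime) {
                picked.add(e.getKey());
                newMax = Math.max(newMax, e.getValue());
            }
        }
        maxModTime = newMax;
        return picked;
    }

    /** Two scans; file "b" has the same timestamp as "a" but becomes visible later. */
    static List<String> demo() {
        MaxTimestampMonitor monitor = new MaxTimestampMonitor();

        Map<String, Long> firstListing = new LinkedHashMap<>();
        firstListing.put("a", 1000L);
        monitor.scan(firstListing); // picks up "a", remembers 1000

        Map<String, Long> secondListing = new LinkedHashMap<>();
        secondListing.put("a", 1000L);
        secondListing.put("b", 1000L); // same timestamp as "a"
        return monitor.scan(secondListing); // "b" is not newer than 1000, so it is skipped
    }

    public static void main(String[] args) {
        System.out.println("Picked up in second scan: " + demo());
    }
}
```

In this model the second scan returns an empty list, so file "b" is never processed.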
Suggested solution (thanks to Fabian Hueske, http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25): a hybrid approach that keeps the names of all files whose modification timestamp is larger than the maximum modification time minus an offset. Relevant class: org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction
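The suggested hybrid approach could look roughly like the following sketch. It is only one possible reading of the suggestion and not a patch against ContinuousFileMonitoringFunction: the class HybridFileTracker, the offsetMillis parameter, and the scan() method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: remember the max modification time AND the names of all
// files whose modification time lies within (max - offset, max].
final class HybridFileTracker {
    private final long offsetMillis;
    private long maxModTime = Long.MIN_VALUE;
    // File name -> modification time of the version that was already picked up.
    private final Map<String, Long> recentFiles = new HashMap<>();

    HybridFileTracker(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    private long threshold() {
        // Guard against underflow before any file has been seen.
        return maxModTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxModTime - offsetMillis;
    }

    /** Returns the files of one directory listing that should be processed. */
    List<String> scan(Map<String, Long> listing) {
        List<String> picked = new ArrayList<>();
        long threshold = threshold();
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            String name = e.getKey();
            long modTime = e.getValue();
            if (modTime <= threshold) {
                continue; // outside the tracked window: assumed to be processed already
            }
            Long seen = recentFiles.get(name);
            if (seen != null && seen.longValue() == modTime) {
                continue; // this exact version was already picked up
            }
            picked.add(name);
            recentFiles.put(name, modTime);
            maxModTime = Math.max(maxModTime, modTime);
        }
        // Forget names that have fallen out of the window to bound the state size.
        long newThreshold = threshold();
        recentFiles.values().removeIf(t -> t <= newThreshold);
        return picked;
    }

    public static void main(String[] args) {
        HybridFileTracker tracker = new HybridFileTracker(5000L);
        Map<String, Long> listing = new HashMap<>();
        listing.put("a", 1000L);
        System.out.println(tracker.scan(listing));
        listing.put("b", 1000L); // same timestamp as "a", visible only now
        System.out.println(tracker.scan(listing));
    }
}
```

With this sketch a file that shares a timestamp with an already processed file is still picked up, as long as it appears within the offset window. A file that becomes visible only after the window has moved past its timestamp would still be missed, so the offset would have to be chosen with the expected listing delay in mind.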
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)