Posted to user@flink.apache.org by Kirti Dhar Upadhyay K via user <us...@flink.apache.org> on 2023/04/25 08:28:46 UTC

File Source Limitations

Hi Community,

I am planning to use FileSource (with S3) in my application. While reading the documentation, I came across the limitations below:
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations



  1.  Watermarking does not work very well for large backlogs of files. This is because watermarks eagerly advance within a file, and the next file might contain data later than the watermark.
Ques: Is there an ideal use case, setting, or configuration in which this problem does not arise, or can it otherwise be avoided?
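
To make the first limitation concrete, here is a minimal sketch in plain Java. It is not Flink code and does not use any Flink API: the class name, the helper countLateRecords, and the timestamps are all made up for illustration, and the watermark rule (largest timestamp seen minus a fixed bound) is a simplifying assumption. It only shows why a backlog read out of event-time order can leave records behind the watermark.

```java
import java.util.List;

public class WatermarkBacklogSketch {

    // Hypothetical helper: reads the files one after another and counts the
    // records whose timestamp is already behind the current watermark.
    // Assumed rule: watermark = largest timestamp seen - maxOutOfOrderness.
    public static int countLateRecords(List<long[]> filesInReadOrder, long maxOutOfOrderness) {
        long watermark = Long.MIN_VALUE;
        long maxSeen = Long.MIN_VALUE;
        int late = 0;
        for (long[] file : filesInReadOrder) {
            for (long ts : file) {
                if (ts < watermark) {
                    late++; // the watermark advanced past this record in an earlier file
                }
                maxSeen = Math.max(maxSeen, ts);
                // The watermark advances eagerly within a file and never moves back.
                watermark = Math.max(watermark, maxSeen - maxOutOfOrderness);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        long[] newerFile = {100, 110, 120}; // made-up event timestamps
        long[] olderFile = {10, 20, 30};

        // Newer file first: every record of the older file is behind the watermark.
        System.out.println(countLateRecords(List.of(newerFile, olderFile), 5));
        // Older file first: nothing is late.
        System.out.println(countLateRecords(List.of(olderFile, newerFile), 5));
    }
}
```

In this toy model the problem disappears only when the files happen to be read in event-time order and their time ranges do not overlap, which is what my question above is trying to pin down for the real FileSource.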


  2.  For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can, in some cases, grow rather large.
Ques: As a workaround for this problem, what if I configure a state backend (say RocksDBStateBackend) with a TTL so that older entries are deleted automatically? Are there any repercussions to this?
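
The repercussion I am worried about can be sketched with a toy model in plain Java. This is not Flink's enumerator and makes no claim about how Flink stores this state or whether a state backend TTL would apply to it at all; the class, its methods, and the S3 path are hypothetical. It assumes only that processed paths are remembered with a timestamp and forgotten after a TTL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProcessedPathsTtlSketch {

    // path -> time (ms) at which the path was first handed out for processing
    private final Map<String, Long> processed = new LinkedHashMap<>();
    private final long ttlMillis;

    public ProcessedPathsTtlSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns the paths in the listing that are treated as new at time `now`.
    public List<String> discover(List<String> listing, long now) {
        // Assumed TTL behaviour: forget every path remembered for ttlMillis or longer.
        processed.values().removeIf(seenAt -> now - seenAt >= ttlMillis);

        List<String> fresh = new ArrayList<>();
        for (String path : listing) {
            if (processed.putIfAbsent(path, now) == null) {
                fresh.add(path); // not remembered, so it looks like a new file
            }
        }
        return fresh;
    }

    public int rememberedPaths() {
        return processed.size();
    }

    public static void main(String[] args) {
        ProcessedPathsTtlSketch enumerator = new ProcessedPathsTtlSketch(1_000);
        List<String> listing = List.of("s3://example-bucket/in/a.csv"); // hypothetical path

        System.out.println(enumerator.discover(listing, 0));     // a.csv is new
        System.out.println(enumerator.discover(listing, 500));   // still remembered
        System.out.println(enumerator.discover(listing, 1_500)); // expired, looks new again
    }
}
```

In this model the remembered set stays bounded, but any file that is still present in the monitored directory after its entry expires is picked up a second time. So unless old files are moved or deleted before the TTL elapses, I would expect duplicate processing, and I would like to know whether the real source behaves similarly.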


Regards,
Kirti Dhar