Posted to user@flink.apache.org by Kirti Dhar Upadhyay K via user <us...@flink.apache.org> on 2023/04/13 12:27:23 UTC

Queries/Help regarding limitations on File source

Hi,

I am using the DataStream file source connector in one of my use cases.
While going through the documentation, I found the limitations below:

https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations

  1.  Watermarking does not work very well for large backlogs of files. This is because watermarks eagerly advance within a file, and the next file might contain data later than the watermark.
Queries:
Is there any FLIP/design document to better understand the impact of these limitations?
Also, is there any ongoing work to address these limitations in future Flink releases? If so, please point me to any related documents.




  2.  For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can, in some cases, grow rather large.
Query:
       What data does the file source keep per file as part of its checkpointed state?
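
To make the second limitation concrete, here is a toy model of an enumerator that remembers every path it has already handed out. It is only a sketch under stated assumptions: ToyEnumerator, discover() and snapshot() are hypothetical names invented for illustration, not Flink classes or methods, and the real checkpoint contents may differ. It only shows why remembering all processed paths makes the state grow with the number of files seen.

```python
from typing import Iterable, List, Set


class ToyEnumerator:
    """Hypothetical stand-in for a continuously monitoring file enumerator.

    This is NOT Flink's API; it only models "remember every processed path".
    """

    def __init__(self) -> None:
        # Paths that were already discovered and handed out for reading.
        self.processed_paths: Set[str] = set()

    def discover(self, listing: Iterable[str]) -> List[str]:
        """Return the paths in `listing` that have not been seen before."""
        new_paths = sorted(p for p in set(listing) if p not in self.processed_paths)
        self.processed_paths.update(new_paths)
        return new_paths

    def snapshot(self) -> List[str]:
        """What this toy model would have to persist in a checkpoint."""
        return sorted(self.processed_paths)


enumerator = ToyEnumerator()
for scan in range(3):
    # Each scan lists the whole directory, which has gained two files.
    listing = [f"/data/part-{i}.csv" for i in range((scan + 1) * 2)]
    new_paths = enumerator.discover(listing)
    print(len(new_paths), len(enumerator.snapshot()))
```

Each scan finds only two new files, but the snapshot keeps growing (2, 4, 6 entries) because nothing is ever forgotten.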

Appreciate any help!

Regards,
Kirti Dhar

Re: Queries/Help regarding limitations on File source

Posted by Shammon FY <zj...@gmail.com>.
Hi Kirti

For the watermark problem, I think the description in the documentation
mainly refers to out-of-order data across multiple files. This results in
a large number of late events [1], which in turn generate a large number of
retract events, and late events that arrive beyond the allowed lateness are
discarded.

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/#lateness
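
As a rough illustration of the effect, the sketch below reads files one after another and advances a watermark to the highest timestamp seen so far. It is a simplified model, not Flink code: count_late_events and its allowed_lateness parameter are hypothetical names, and the "late" rule used here (timestamp below watermark minus allowed lateness) is an assumption chosen to keep the example short.

```python
from typing import Iterable, List


def count_late_events(files: Iterable[List[int]], allowed_lateness: int = 0) -> int:
    """Count events that arrive behind the watermark when files are read in order.

    Assumption: the watermark eagerly follows the maximum timestamp seen so far,
    and an event is late if its timestamp is below watermark - allowed_lateness.
    """
    watermark = float("-inf")
    late = 0
    for events in files:
        for ts in events:
            if ts < watermark - allowed_lateness:
                late += 1
            # The watermark advances within the file, before the next file is read.
            watermark = max(watermark, ts)
    return late


# Two files that are sorted internally but cover overlapping time ranges.
file_a = [1, 5, 9]
file_b = [2, 6, 10]
print(count_late_events([file_a, file_b]))  # prints 2
```

After file_a is read, the watermark is already at 9, so the events with timestamps 2 and 6 in file_b count as late. With a large backlog of overlapping files the same thing happens at every file boundary.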

Best,
Shammon FY

