You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Meghajit Mazumdar <me...@gojek.com> on 2022/01/19 10:05:20 UTC

FileSource Usage

Hello,

We are using FileSource
<https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/> to
process Parquet Files and had a few doubts around it. Would really
appreciate if somebody can help answer them:

1. For a given file, does FileSource read the contents inside it in order ?
In other words, what is the order in which the file splits are generated
from the contents of the file ?

2. We want to provide a GCS Bucket URL to the FileSource so that it can
read parquet files from there. The bucket has multiple parquet files.
Wanted to know, what is the order in which the files will be picked and
processed by this FileSource ? Can we provide an order strategy ourselves,
say, process according to creation time ?

3. Is it possible/good practice to apply checkpointing and watermarking for
a bounded source like FileSource ?

-- 
*Regards,*
*Meghajit*

Re: FileSource Usage

Posted by Guowei Ma <gu...@gmail.com>.
Hi, Meghajit

Thanks Meghajit for sharing your user case.
I found a workaround way that you could try to name your file in a
timestamp style. More details could be found here[1].
Another little concern is that Flink is a distributed system, which means
that we could not assume any order even if we list the file in the created
order.

[1]
https://stackoverflow.com/questions/49045725/gsutil-gcloud-storage-file-listing-sorted-date-descending
Best,
Guowei


On Thu, Jan 20, 2022 at 11:11 PM Meghajit Mazumdar <
meghajit.mazumdar@gojek.com> wrote:

> Hi Guowei,
>
> Thanks for your answer. Regarding your question,
> *> Currently there is no such public interface ,which you could extend to
> implement your own strategy. Would you like to share the specific problem
> you currently meet?*
>
> The GCS bucket that we are trying to read from is periodically populated
> with parquet files by another service. This can be daily or even hourly.
> For an already pre-populated bucket, we would like to read the files
> created from, say, day *T* till day *T+10*.  Order matters here and hence
> we would like to read the oldest files first, and then the new ones.  Would
> you know how I can enforce a reading order here ?
>
> Thanks,
> Meghajit
>
>
>
>
> On Thu, Jan 20, 2022 at 2:29 PM Guowei Ma <gu...@gmail.com> wrote:
>
>> Hi, Meghajit
>>
>> 1. From the implementation [1] the order of split depends on the
>> implementation of the FileSystem.
>>
>> 2. From the implementation [2] the order of the file also depends on the
>> implementation of the FileSystem.
>>
>> 3. Currently there is no such public interface ,which you could extend to
>> implement your own strategy. Would you like to share the specific problem
>> you currently meet?
>>
>> 3. `FileSource` supports checkpoints. I think the watermark is a general
>> mechanism so you could read the related documentation[3].
>>
>> [1]
>> https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141
>>
>> [2]
>> https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102
>>
>> [3]
>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
>> Best,
>> Guowei
>>
>>
>> On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar <
>> meghajit.mazumdar@gojek.com> wrote:
>>
>>> Hello,
>>>
>>> We are using FileSource
>>> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/>
>>> to process Parquet Files and had a few doubts around it. Would really
>>> appreciate if somebody can help answer them:
>>>
>>> 1. For a given file, does FileSource read the contents inside it in
>>> order ? In other words, what is the order in which the file splits are
>>> generated from the contents of the file ?
>>>
>>> 2. We want to provide a GCS Bucket URL to the FileSource so that it can
>>> read parquet files from there. The bucket has multiple parquet files.
>>> Wanted to know, what is the order in which the files will be picked and
>>> processed by this FileSource ? Can we provide an order strategy ourselves,
>>> say, process according to creation time ?
>>>
>>> 3. Is it possible/good practice to apply checkpointing and watermarking
>>> for a bounded source like FileSource ?
>>>
>>> --
>>> *Regards,*
>>> *Meghajit*
>>>
>>
>
> --
> *Regards,*
> *Meghajit*
>

Re: FileSource Usage

Posted by Meghajit Mazumdar <me...@gojek.com>.
Hi Guowei,

Thanks for your answer. Regarding your question,
*> Currently there is no such public interface ,which you could extend to
implement your own strategy. Would you like to share the specific problem
you currently meet?*

The GCS bucket that we are trying to read from is periodically populated
with parquet files by another service. This can be daily or even hourly.
For an already pre-populated bucket, we would like to read the files
created from, say, day *T* till day *T+10*.  Order matters here and hence
we would like to read the oldest files first, and then the new ones.  Would
you know how I can enforce a reading order here ?

Thanks,
Meghajit




On Thu, Jan 20, 2022 at 2:29 PM Guowei Ma <gu...@gmail.com> wrote:

> Hi, Meghajit
>
> 1. From the implementation [1] the order of split depends on the
> implementation of the FileSystem.
>
> 2. From the implementation [2] the order of the file also depends on the
> implementation of the FileSystem.
>
> 3. Currently there is no such public interface ,which you could extend to
> implement your own strategy. Would you like to share the specific problem
> you currently meet?
>
> 3. `FileSource` supports checkpoints. I think the watermark is a general
> mechanism so you could read the related documentation[3].
>
> [1]
> https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141
>
> [2]
> https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102
>
> [3]
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
> Best,
> Guowei
>
>
> On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar <
> meghajit.mazumdar@gojek.com> wrote:
>
>> Hello,
>>
>> We are using FileSource
>> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/>
>> to process Parquet Files and had a few doubts around it. Would really
>> appreciate if somebody can help answer them:
>>
>> 1. For a given file, does FileSource read the contents inside it in order
>> ? In other words, what is the order in which the file splits are generated
>> from the contents of the file ?
>>
>> 2. We want to provide a GCS Bucket URL to the FileSource so that it can
>> read parquet files from there. The bucket has multiple parquet files.
>> Wanted to know, what is the order in which the files will be picked and
>> processed by this FileSource ? Can we provide an order strategy ourselves,
>> say, process according to creation time ?
>>
>> 3. Is it possible/good practice to apply checkpointing and watermarking
>> for a bounded source like FileSource ?
>>
>> --
>> *Regards,*
>> *Meghajit*
>>
>

-- 
*Regards,*
*Meghajit*

Re: FileSource Usage

Posted by Guowei Ma <gu...@gmail.com>.
Hi, Meghajit

1. From the implementation [1] the order of split depends on the
implementation of the FileSystem.

2. From the implementation [2] the order of the file also depends on the
implementation of the FileSystem.

3. Currently there is no such public interface ,which you could extend to
implement your own strategy. Would you like to share the specific problem
you currently meet?

3. `FileSource` supports checkpoints. I think the watermark is a general
mechanism so you could read the related documentation[3].

[1]
https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141

[2]
https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102

[3]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
Best,
Guowei


On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar <
meghajit.mazumdar@gojek.com> wrote:

> Hello,
>
> We are using FileSource
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/> to
> process Parquet Files and had a few doubts around it. Would really
> appreciate if somebody can help answer them:
>
> 1. For a given file, does FileSource read the contents inside it in order
> ? In other words, what is the order in which the file splits are generated
> from the contents of the file ?
>
> 2. We want to provide a GCS Bucket URL to the FileSource so that it can
> read parquet files from there. The bucket has multiple parquet files.
> Wanted to know, what is the order in which the files will be picked and
> processed by this FileSource ? Can we provide an order strategy ourselves,
> say, process according to creation time ?
>
> 3. Is it possible/good practice to apply checkpointing and watermarking
> for a bounded source like FileSource ?
>
> --
> *Regards,*
> *Meghajit*
>