Posted to user@flink.apache.org by Kevin Lam <ke...@shopify.com> on 2022/03/14 14:40:21 UTC

Reading FileSource Files in a particular order

Hi all,

We're interested in using a FileSource
<https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/org/apache/flink/connector/file/src/FileSource.html>
to read from a Google Cloud Storage (GCS) archive of messages from a Kafka
topic, roughly in order.

Our GCS archive is partitioned into folders by time. However, when we read
it using a FileSource, the messages are processed in a random order. We'd
like to control the order in which the files are read and take advantage
of the clear ordering our GCS archive provides.

What is the best way to achieve this? Would it be possible to write a
custom FileEnumerator
<https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/org/apache/flink/connector/file/src/enumerate/FileEnumerator.html>
that sorts the directories and returns the splits in order?

Any help would be greatly appreciated!

Thanks,
Kevin

Re: Reading FileSource Files in a particular order

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

Are you running a batch job or a streaming job? For batch jobs, just use
the ORDER BY keyword in SQL to sort the records. For streaming jobs, I'm
afraid it is hard to do. A custom FileEnumerator might help. However, if
the parallelism of your file system source is greater than one, different
parallel subtasks may read files at different speeds, so the output of the
file source can end up out of order again.
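
As a rough illustration of the sorting step a custom FileEnumerator could
apply, here is a minimal sketch in plain Java. It uses no Flink classes: the
class name SplitOrdering, the method sortByPath, and the gs://archive/...
folder layout are all hypothetical, and it assumes the time partition is
encoded in zero-padded folder names so that lexicographic order matches time
order. In Flink, similar logic would presumably sit inside an
enumerateSplits implementation that sorts the splits returned by an existing
enumerator. Sorting there only controls the order in which splits are
listed; the split assigner and parallel readers can still interleave them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Ordering logic only; no Flink classes are used here. */
public class SplitOrdering {

    /** Returns a new list with the paths in ascending lexicographic order. */
    public static List<String> sortByPath(List<String> paths) {
        List<String> sorted = new ArrayList<>(paths);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical layout: one folder per day and hour, zero-padded.
        List<String> listed = List.of(
                "gs://archive/dt=2022-03-14/hr=09/part-0",
                "gs://archive/dt=2022-03-13/hr=23/part-0",
                "gs://archive/dt=2022-03-14/hr=01/part-1",
                "gs://archive/dt=2022-03-14/hr=01/part-0");

        // Prints the paths oldest folder first.
        sortByPath(listed).forEach(System.out::println);
    }
}
```

With a source parallelism of one, reading splits in this listed order would
give roughly time-ordered output; with higher parallelism, the caveat above
still applies.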
