Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/01/06 13:03:10 UTC

[GitHub] [flink] MartijnVisser opened a new pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

MartijnVisser opened a new pull request #18288:
URL: https://github.com/apache/flink/pull/18288


   ## What is the purpose of the change
   
   * This PR adds documentation for the File Source
   
   ## Brief change log
   
   * This PR continues the work of https://github.com/apache/flink/pull/18026
   * It renames the File Sink page to FileSystem and adds the File Source information at the top
   * It also introduces an overview page for Formats, so the File Source documentation can link to it
   * It removes references to the Streaming File Sink
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? docs 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   ## CI report:
   
   * c870affe72e6960ca0e86d6a60cd3a117e21134a Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708) 
   * 97186a28d79b18a5f4499bce9b4f57869ea7b7fa UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] infoverload commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
infoverload commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r786116646



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.

Review comment:
       ```suggestion
   connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed to provide exactly-once semantics for `STREAMING` execution.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems

Review comment:
       ```suggestion
   This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to file systems
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.

Review comment:
       ```suggestion
   with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g. Avro, CSV, Parquet) and
   produces a stream or records.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.

Review comment:
       ```suggestion
   A bounded `File Source` lists all files (via SplitEnumerator - a recursive directory list with filtered-out hidden files) and reads them all.
   
   An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
   In this case, the `SplitEnumerator` will enumerate like the bounded case but, after a certain interval, repeats the enumeration.
   For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
   ```
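
For context on the bounded vs. unbounded behavior corrected above, a minimal sketch of both modes (paths, intervals and the `SomePojo` type are placeholders; the builder calls are the ones shown later in this diff):

```java
import java.time.Duration;

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.csv.CsvReaderFormat;

// Bounded: the SplitEnumerator lists the files under the path once and the source reads them all (default mode).
FileSource<SomePojo> bounded =
        FileSource.forRecordStreamFormat(
                        CsvReaderFormat.forPojo(SomePojo.class), new Path("s3://bucket/input"))
                .build();

// Unbounded: the enumeration is repeated every 10 seconds and only newly detected files are handed to the readers.
FileSource<SomePojo> unbounded =
        FileSource.forRecordStreamFormat(
                        CsvReaderFormat.forPojo(SomePojo.class), new Path("s3://bucket/input"))
                .monitorContinuously(Duration.ofSeconds(10))
                .build();
```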

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:

Review comment:
       ```suggestion
   You can start building a File Source via one of the following API calls:
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade of simplicity of implementation and flexibility/efficiency.

Review comment:
       ```suggestion
   The interfaces are a tradeoff between simplicity of implementation and flexibility/efficiency.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade of simplicity of implementation and flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lined that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in some cases grow rather large.
+The future will be planned to add a compressed form of tracking already processed files in the future (for example by keeping modification timestamps lower boundaries).

Review comment:
       ```suggestion
   For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can, in some cases, grow rather large.
   There are plans to add a compressed form of tracking already processed files in the future (for example, by keeping modification timestamps below boundaries).
   ```
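
As a side note on the low-level `forSchema` factory quoted in this hunk, a hedged usage sketch (the separator and the `SomePojo` type are illustrative; `CsvMapper`/`CsvSchema` come from Jackson's CSV dataformat and `TypeInformation.of` is the standard Flink helper):

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.formats.csv.CsvReaderFormat;

import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

CsvMapper mapper = new CsvMapper();
// Derive the schema from the POJO, then adjust the parsing options explicitly.
CsvSchema schema = mapper.schemaFor(SomePojo.class).withColumnSeparator(';');

CsvReaderFormat<SomePojo> csvFormat =
        CsvReaderFormat.forSchema(mapper, schema, TypeInformation.of(SomePojo.class));
```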

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.

Review comment:
       ```suggestion
   * File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File `SourceReader`.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.

Review comment:
       ```suggestion
   You will need to combine the `File Source` with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}), which allows you to parse CSV, decode AVRO, or read Parquet columnar files.
   ```
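
Building on the sentence reworded above, a minimal sketch of wiring a File Source (combined with some format) into a DataStream job; the concrete `StreamFormat` is left as a placeholder, and the source is attached with `fromSource()` since it is a new-style unified source:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.StreamFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Any StreamFormat works here, e.g. a text-line or CSV reader.
StreamFormat<String> format = ...;

FileSource<String> source =
        FileSource.forRecordStreamFormat(format, new Path("/path/to/input")).build();

// New-style sources are attached with fromSource() rather than addSource().
DataStream<String> lines =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
```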

##########
File path: docs/content/docs/deployment/filesystems/s3.md
##########
@@ -70,16 +70,15 @@ Both implementations are self-contained with no dependency footprint, so there i
   
      For example, Hadoop has a `fs.s3a.connection.maximum` configuration key. If you want to change it, you need to put `s3.connection.maximum: xyz` to the `flink-conf.yaml`. Flink will internally translate this back to `fs.s3a.connection.maximum`. There is no need to pass configuration parameters using Hadoop's XML configuration files.
   
-    It is the only S3 file system with support for the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
+    It is the only S3 file system with support for the [FileSystem]({{< ref "docs/connectors/datastream/filesystem" >}}).
   
 
 Both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` register default FileSystem
 wrappers for URIs with the *s3://* scheme, `flink-s3-fs-hadoop` also registers
 for *s3a://* and `flink-s3-fs-presto` also registers for *s3p://*, so you can
 use this to use both at the same time.
-For example, the job uses the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which only supports Hadoop, but uses Presto for checkpointing.
-In this case, it is advised to explicitly use *s3a://* as a scheme for the sink (Hadoop) and *s3p://* for checkpointing (Presto). The same holds for the 
-[FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
+For example, the job uses the [FileSystem]({{< ref "docs/connectors/datastream/filesystem" >}}) which only supports Hadoop, but uses Presto for checkpointing.
+In this case, it is advised to explicitly use *s3a://* as a scheme for the sink (Hadoop) and *s3p://* for checkpointing (Presto).

Review comment:
       ```suggestion
   In this case, you should explicitly use *s3a://* as a scheme for the sink (Hadoop) and *s3p://* for checkpointing (Presto).
   ```
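
To make the recommended scheme split concrete, a hedged sketch (bucket names and the checkpoint interval are placeholders; both `flink-s3-fs-hadoop` and `flink-s3-fs-presto` are assumed to be installed as plugins):

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);

// Checkpoints go through the Presto S3 file system (s3p://)...
env.getCheckpointConfig().setCheckpointStorage("s3p://my-bucket/checkpoints");

// ...while the file sink writes its output through the Hadoop S3 file system (s3a://).
FileSink<String> sink =
        FileSink.forRowFormat(new Path("s3a://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
                .build();
```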

##########
File path: docs/content/docs/connectors/table/filesystem.md
##########
@@ -161,8 +161,8 @@ CREATE TABLE MyUserTableWithFilepath (
 
 ## Streaming Sink
 
-The file system connector supports streaming writes, based on Flink's [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}})
-to write records to file. Row-encoded Formats are csv and json. Bulk-encoded Formats are parquet, orc and avro.
+The file system connector supports streaming writes, based on Flink's [FileSystem]({{< ref "docs/connectors/datastream/filesystem" >}})
+to write records to file. Row-encoded Formats are csv and json. Bulk-encoded Formats are Parquet, ORC and Avro.

Review comment:
       ```suggestion
   The file system connector supports streaming writes, based on Flink's [FileSystem]({{< ref "docs/connectors/datastream/filesystem" >}}),
   to write records to file. Row-encoded Formats are CSV and JSON. Bulk-encoded Formats are Parquet, ORC and Avro.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.

Review comment:
       ```suggestion
   This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the File Source.
   
   For the bounded/batch case, the File Source processes all files under the given path(s). 
   For the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
   
   When you start creating a File Source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods), 
   the source is in bounded/batch mode by default. You can call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
   to put the source into continuous streaming mode.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade of simplicity of implementation and flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lined that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.

Review comment:
       ```suggestion
   This is a simple version of `StreamFormat` for formats that are not splittable.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       ```suggestion
   It is divided into the following two parts: File `SplitEnumerator` and File `SourceReader`. 
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detect files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade of simplicity of implementation and flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lined that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:

Review comment:
       ```suggestion
   An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized like this:
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream or records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. This is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
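+
+A minimal sketch of wiring a text-line reader into a `FileSource` could look like the following
+(the format class is named `TextLineInputFormat` in recent Flink versions and `TextLineFormat` in older ones; `env` and `path` are assumed to already exist):
+
+```java
+// read a text file line by line and turn it into a DataStream of Strings
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(new TextLineInputFormat(), path).build();
+
+final DataStream<String> lines =
+        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+```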
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
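+
+For illustration, a hypothetical `SomePojo` matching a three-column CSV file might look like this (the field names here are made up):
+
+```java
+@JsonPropertyOrder({"id", "name", "price"})
+public static class SomePojo {
+    public long id;
+    public String name;
+    public double price;
+
+    // Jackson needs a no-argument constructor to instantiate the POJO
+    public SomePojo() {}
+}
+```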
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
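+
+A sketch of this variant, assuming the same hypothetical `SomePojo` and a pipe-separated file with a header row, could look like this:
+
+```java
+CsvMapper mapper = new CsvMapper();
+CsvSchema schema = mapper.schemaFor(SomePojo.class)
+        .withColumnSeparator('|')
+        .withSkipFirstDataRow(true);
+
+CsvReaderFormat<SomePojo> csvFormat =
+        CsvReaderFormat.forSchema(mapper, schema, TypeInformation.of(SomePojo.class));
+
+FileSource<SomePojo> source =
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```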
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
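+
+The adapted format can then be passed to the bulk variant of the builder, for example:
+
+```java
+final FileSource<SomePojo> source =
+        FileSource.forBulkFileFormat(bulkFormat, Path.fromLocalFile(...)).build();
+```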
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
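+
+To use a custom enumerator with a plain `FileSource`, it can be set on the builder. A rough sketch, assuming the builder exposes a `setFileEnumerator(FileEnumerator.Provider)` method (`streamFormat`, `path` and `MyFileEnumerator` are placeholders):
+
+```java
+// plug a custom FileEnumerator.Provider into the source builder
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(streamFormat, path)
+                .setFileEnumerator(new MyFileEnumerator.Provider())
+                .build();
+```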
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+

Review comment:
       ```suggestion
   Watermarking does not work very well for large backlogs of files. This is because watermarks eagerly advance within a file, and the next file might contain data later than the watermark.
   
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lined that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.

Review comment:
       ```suggestion
   #### TextLine Format
   
   A `StreamFormat` reader formats text lines from a file.
   The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
   various supported charset encodings.
   This format does not support optimized recovery from checkpoints. On recovery, it will re-read
   and discard the number of lines that were processed before the last checkpoint. This is due to
   the fact that the offsets of lines in the file cannot be tracked through the charset decoders
   with their internal buffering of stream input and charset decoder state.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. This is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in some cases grow rather large.
+There are plans to add a compressed form of tracking already processed files in the future (for example, by keeping lower bounds of the modification timestamps).
+
+### Behind the Scene
+{{< hint info >}}
+If you are interested in how File source works under the design of new data source API, you may
+want to read this part as a reference. For details about the new data source API,
+[documentation of data source]({{< ref "docs/dev/datastream/sources.md" >}}) and
+<a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
+provide more descriptive discussions.
+{{< /hint >}}

Review comment:
       ```suggestion
   ### Behind the Scenes
   {{< hint info >}}
   If you are interested in how File Source works through the new data source API design, you may
   want to read this part as a reference. For details about the new data source API, check out the
   [documentation on data sources]({{< ref "docs/dev/datastream/sources.md" >}}) and
   <a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
   for more descriptive discussions.
   {{< /hint >}}
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785739924



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. This is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate watermarks based more on global information.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in some cases grow rather large.
+There are plans to add a compressed form of tracking already processed files in the future (for example, by keeping lower bounds of the modification timestamps).

Review comment:
       Do you want to mention that in the docs (with a link to a Jira ticket)? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6be59be6902b7b454563cce5c49a6955f315ca94 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6be59be6902b7b454563cce5c49a6955f315ca94 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533) 
   * 22e070b0a4f84c7d0561479ef91fab820728bc2d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707) 
   * c870affe72e6960ca0e86d6a60cd3a117e21134a Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MonsterChenzhuo commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MonsterChenzhuo commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014683099


   > > @MartijnVisser I was wondering whether we could take the content of the [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/formats/ ](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/formats/) page and move it into the current PR. The biggest point of this PR right now is guiding users through the API; if it were supplemented with examples of the various types, it would be much more suitable for beginners.
   > 
   > I'm not sure that's a good idea, because formats can also be used by multiple connectors. You can read/write Avro from the FileSystem, but you can also use it with Kafka or Pulsar. I don't think we should mix those together. I think the natural concept is that you first start from where you want to consume the data (which is a connector) and then think about which data you want to use and how (which relates to the format).
   
   
   You're right. But I think it is very necessary to add examples of various types of usage.
   Reference:
   ![image](https://user-images.githubusercontent.com/60029759/149801295-a11f768b-72dd-409b-be45-0fafc22fbee4.png)
   ![image](https://user-images.githubusercontent.com/60029759/149801390-4689e626-c438-4955-9276-58315d3928cc.png)
   OrcColumnarRowInputFormat
   AvroInputFormat
   RowCsvInputFormat
   HiveInputFormat
   ParquetColumnarRowInputFormat


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785742437



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. This is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add the `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition, with the field order exactly matching that of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the lower-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
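+
+As a sketch, a semicolon-separated file could be mapped to a hypothetical `CityRow` POJO as follows
+(the column names, separator and POJO are made up for illustration):
+
+```java
+CsvMapper mapper = new CsvMapper();
+CsvSchema schema = CsvSchema.builder()
+        .addColumn("city")             // first CSV column
+        .addNumberColumn("population") // second CSV column
+        .setColumnSeparator(';')
+        .build();
+
+CsvReaderFormat<CityRow> csvFormat =
+        CsvReaderFormat.forSchema(mapper, schema, TypeInformation.of(CityRow.class));
+```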
+
+#### Bulk Format
+
+The `BulkFormat` reads and decodes batches of records at a time. Examples of bulk formats
+are ORC and Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
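+
+The resulting `BulkFormat` can then be handed to the bulk variant of the builder, for example
+(a sketch with a placeholder local path):
+
+```java
+final FileSource<SomePojo> source =
+        FileSource.forBulkFileFormat(bulkFormat, Path.fromLocalFile(new File("/path/to/csv-dir")))
+        .build();
+```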
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
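+
+A custom enumerator can also be plugged into a regular `FileSource`. The following is only a sketch:
+it assumes the builder exposes a `setFileEnumerator(FileEnumerator.Provider)` hook and uses a
+hypothetical `MyFileEnumerator` implementation:
+
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("/path/to/files"))
+        .setFileEnumerator(MyFileEnumerator::new) // constructor reference acts as the Provider
+        .build();
+```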
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers the paths of all already-processed files, which is state that can in some cases grow rather large.
+There are plans to add a compressed form of tracking already-processed files in the future (for example, by keeping lower boundaries of the modification timestamps).
+
+### Behind the Scenes
+{{< hint info >}}
+If you are interested in how the File Source works under the design of the new data source API, you may
+want to read this part as a reference. For details about the new data source API, the
+[data source documentation]({{< ref "docs/dev/datastream/sources.md" >}}) and
+<a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
+provide more descriptive discussions.
+{{< /hint >}}
+
+The `File Source` is divided into the following two parts: the File `SplitEnumerator` and the File `SourceReader`.

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785828434



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       Since the similar part for internal components has been removed, should we also delete this part?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 77ef5bba4cf42894132bd06988b96317d3e06760 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 77ef5bba4cf42894132bd06988b96317d3e06760 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047) 
   * 601f0e004b818fc1e97d5c25698b703759096989 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 77ef5bba4cf42894132bd06988b96317d3e06760 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479) 
   * 6be59be6902b7b454563cce5c49a6955f315ca94 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c870affe72e6960ca0e86d6a60cd3a117e21134a Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 77ef5bba4cf42894132bd06988b96317d3e06760 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1007195956


   @infoverload Can you also have a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 77ef5bba4cf42894132bd06988b96317d3e06760 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047) 
   * 601f0e004b818fc1e97d5c25698b703759096989 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824",
       "triggerID" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2202b6f37a9c0db7f9105c28e61ba42553000f96",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29830",
       "triggerID" : "2202b6f37a9c0db7f9105c28e61ba42553000f96",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97186a28d79b18a5f4499bce9b4f57869ea7b7fa Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824) 
   * 2202b6f37a9c0db7f9105c28e61ba42553000f96 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29830) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014291236


   > Overall it looks great thanks for doing this effort. I am only wondering if we have any plans to split the documentation between user perspective and developer perspective i.e. how to use the FileSource, how to develop a format/enumerator?
   
   Good point. I'm leaning towards these docs being user focused, with either JavaDocs or Markdown files in the repository tailored towards the developers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785741849



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility and efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers the paths of all already-processed files, which is state that can in some cases grow rather large.
+There are plans to add a compressed form of tracking already-processed files in the future (for example, by keeping lower boundaries of the modification timestamps).
+
+### Behind the Scene
+{{< hint info >}}
+If you are interested in how File source works under the design of new data source API, you may
+want to read this part as a reference. For details about the new data source API,
+[documentation of data source]({{< ref "docs/dev/datastream/sources.md" >}}) and
+<a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
+provide more descriptive discussions.
+{{< /hint >}}
+
+The `File Source` is divided in the following two parts: File SplitEnumerator and File SourceReader.
+
+The Source of a file system is divided into the following two parts: the File Split Enumerator and the File Source Reader.
+The File Split Enumerator (a Split is an abstraction of how the external file system data is split up)
+is responsible for discovering splits in the external system and assigning them to the File Source Reader;
+it also manages the global watermark to ensure that the consumption rate is approximately the same across the different File Source Readers.

Review comment:
       Removed it. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785749360



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility and efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers the paths of all already-processed files, which is state that can in some cases grow rather large.
+There are plans to add a compressed form of tracking already-processed files in the future (for example, by keeping lower boundaries of the modification timestamps).

Review comment:
       Created https://issues.apache.org/jira/browse/FLINK-25672 for this 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r786076589



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       True. There is a section about "Customizing File Enumeration" below. I am afraid that users might ask questions about the relationship between the FileEnumerator and the SplitEnumerator.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r786098299



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       That's a good point. Perhaps that should actually be removed. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 22e070b0a4f84c7d0561479ef91fab820728bc2d Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707) 
   * c870affe72e6960ca0e86d6a60cd3a117e21134a Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824",
       "triggerID" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2202b6f37a9c0db7f9105c28e61ba42553000f96",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2202b6f37a9c0db7f9105c28e61ba42553000f96",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 97186a28d79b18a5f4499bce9b4f57869ea7b7fa Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824) 
   * 2202b6f37a9c0db7f9105c28e61ba42553000f96 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479) 
   * 6be59be6902b7b454563cce5c49a6955f315ca94 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785738308



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the `SplitEnumerator` will enumerate like in the bounded case, but repeat the enumeration after the configured interval.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility and efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
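+
+A minimal sketch of reading text files line by line with this format is shown below. It assumes the `TextLineInputFormat` implementation shipped with `flink-connector-files` (the exact class name can differ between Flink versions) and an existing `StreamExecutionEnvironment` named `env`:
+
+```java
+// Read the lines of all files under the given path as a bounded stream of Strings.
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(
+                new TextLineInputFormat(),
+                new Path("file:///path/to/my/files"))
+        .build();
+
+final DataStream<String> lines =
+        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+```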
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of arrays or files can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
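+        // ArrayReader is a custom StreamFormat.Reader implementation (not shown here)
+        // that deserializes byte arrays from the input stream.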
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
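+
+For illustration, a hypothetical `SomePojo` for a CSV file with the columns `name` and `age` (the field names here are assumptions, not part of the connector API) might look like this:
+
+```java
+import com.fasterxml.jackson.annotation.JsonPropertyOrder;
+
+@JsonPropertyOrder({"name", "age"})
+public class SomePojo {
+    public String name;
+    public int age;
+
+    // Flink's POJO serialization requires a public no-argument constructor.
+    public SomePojo() {}
+}
+```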
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
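+
+As a sketch (the column names, types and separator below are illustrative assumptions), a schema can be assembled with Jackson's `CsvMapper` and `CsvSchema` and passed to `forSchema`:
+
+```java
+CsvMapper mapper = new CsvMapper();
+CsvSchema schema = CsvSchema.builder()
+        .addColumn("name", CsvSchema.ColumnType.STRING)
+        .addColumn("age", CsvSchema.ColumnType.NUMBER)
+        .setColumnSeparator(';')
+        .build();
+
+CsvReaderFormat<SomePojo> csvFormat =
+        CsvReaderFormat.forSchema(mapper, schema, TypeInformation.of(SomePojo.class));
+```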
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
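+
+The resulting `BulkFormat` can then be passed to the bulk variant of the source builder, for example (the path is an illustrative assumption):
+
+```java
+final FileSource<SomePojo> source =
+        FileSource.forBulkFileFormat(bulkFormat, new Path("file:///path/to/csv-dir"))
+        .build();
+```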
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking into ways to base watermark generation more on global information.

Review comment:
       This was a copy & paste from Stephan's input. I'll remove it

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the `SplitEnumerator` will enumerate like in the bounded case, but repeat the enumeration after the configured interval.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility and efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of arrays or files can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking into ways to base watermark generation more on global information.
+
+For Unbounded File Sources, the enumerator currently remembers the paths of all already processed files, which is state that can in some cases grow rather large.

Review comment:
       Thanks, fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785828434



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       Since the similar part for internal components has been deleted, should we also delete this part?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578171


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 77ef5bba4cf42894132bd06988b96317d3e06760 (Thu Jan 06 13:09:00 UTC 2022)
   
   **Warnings:**
    * Documentation files were touched, but no `docs/content.zh/` files: Update Chinese documentation or file Jira ticket.
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MonsterChenzhuo commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MonsterChenzhuo commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006697551


   hi @MartijnVisser I'm glad to see your reply. I couldn't agree with you more that we can improve it further. I've been learning about connector-related content lately and will add some new insights to #18288 in the next few days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MonsterChenzhuo commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MonsterChenzhuo commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014696511


   > > You're right. But I think it is very necessary to add examples of various types of usage.
   > 
   > Yes, but those should be done after various improvements to Formats have been done. For example CSV will be addressed with #17598 and that should also be the moment to include/update examples. It should not be done via this PR, but as separate PRs
   
   Oh, understood.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014686193


   > You're right. But I think it is very necessary to add examples of various types of usage.
   
   Yes, but those should be done after various improvements to Formats have been done. For example CSV will be addressed with https://github.com/apache/flink/pull/17598 and that should also be the moment to include/update examples. It should not be done via this PR, but as separate PRs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   ## CI report:
   
   * 97186a28d79b18a5f4499bce9b4f57869ea7b7fa Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] MartijnVisser commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r786034317



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       I think we can keep some high-level overview in, but I'm curious what @fapaul thinks. We also mention the enumerator and reader on https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/kafka/#behind-the-scene, for example.
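   
       For context, what that overview boils down to for users is essentially one builder call; a rough sketch (assuming the text-line stream format — its exact class name differs between Flink versions):
   
       ```java
       import java.time.Duration;
   
       import org.apache.flink.connector.file.src.FileSource;
       import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
       import org.apache.flink.core.fs.Path;
   
       // Bounded/batch mode (default): the SplitEnumerator lists all files under the path once.
       FileSource<String> bounded =
               FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/in"))
                       .build();
   
       // Continuous/streaming mode: the SplitEnumerator re-lists the path periodically and
       // hands only newly discovered files to the SourceReaders.
       FileSource<String> unbounded =
               FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("/data/in"))
                       .monitorContinuously(Duration.ofSeconds(30))
                       .build();
       ```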







[GitHub] [flink] MonsterChenzhuo commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MonsterChenzhuo commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014630813


   ![image](https://user-images.githubusercontent.com/60029759/149792486-9a2fb125-852f-4da2-8e71-e9c46dff1f74.png)
   





[GitHub] [flink] fapaul commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
fapaul commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r785706129



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in come cases grow rather large.

Review comment:
       ```suggestion
   For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in some cases grow rather large.
   ```

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.

Review comment:
       I cannot recall seeing any planned effort to fix this behavior.
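   
       In the meantime the docs could at least hint at the usual mitigation: a generous bounded-out-of-orderness watermark strategy on the stream created from the source. A rough sketch (assuming `source` and `env` are the FileSource and environment from the example above, and `SomePojo` with a `getEventTime()` accessor is a placeholder record type):
   
       ```java
       import java.time.Duration;
   
       import org.apache.flink.api.common.eventtime.WatermarkStrategy;
       import org.apache.flink.streaming.api.datastream.DataStream;
   
       // Allow generous out-of-orderness, because the next file may carry timestamps
       // earlier than the point to which the watermark already advanced within the previous file.
       WatermarkStrategy<SomePojo> watermarks =
               WatermarkStrategy.<SomePojo>forBoundedOutOfOrderness(Duration.ofMinutes(15))
                       .withTimestampAssigner((record, previousTimestamp) -> record.getEventTime());
   
       DataStream<SomePojo> stream = env.fromSource(source, watermarks, "file-source");
       ```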

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in come cases grow rather large.
+The future will be planned to add a compressed form of tracking already processed files in the future (for example by keeping modification timestamps lower boundaries).

Review comment:
       The Delta Lake source also hit this problem; we probably have to fix this rather soon.
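   
       Until a proper fix lands, a workaround worth mentioning might be a custom enumerator that never turns old files into splits, so the remembered-path set stays bounded. A rough sketch along the lines of the HiveSourceFileEnumerator example above (the cut-off logic is left as a placeholder, and I'm assuming the builder's `setFileEnumerator(FileEnumerator.Provider)` hook plus a lambda for the provider):
   
       ```java
       import java.io.IOException;
       import java.util.ArrayList;
       import java.util.Collection;
   
       import org.apache.flink.connector.file.src.FileSource;
       import org.apache.flink.connector.file.src.FileSourceSplit;
       import org.apache.flink.connector.file.src.enumerate.FileEnumerator;
       import org.apache.flink.core.fs.Path;
   
       // Only enumerates files modified after a fixed cut-off, so the set of
       // already-processed paths the enumerator has to remember stays small.
       public class RecentFilesEnumerator implements FileEnumerator {
   
           private final long minModificationTime;
   
           public RecentFilesEnumerator(long minModificationTime) {
               this.minModificationTime = minModificationTime;
           }
   
           @Override
           public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
                   throws IOException {
               // Placeholder: a real implementation would list the paths via FileSystem and
               // create FileSourceSplits only for files newer than minModificationTime.
               return new ArrayList<>();
           }
       }
   
       // Hypothetical wiring through the builder; `format` and `cutOffTimestamp` are placeholders:
       FileSource<String> source =
               FileSource.forRecordStreamFormat(format, new Path("/data/in"))
                       .setFileEnumerator(() -> new RecentFilesEnumerator(cutOffTimestamp))
                       .build();
       ```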

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in come cases grow rather large.
+The future will be planned to add a compressed form of tracking already processed files in the future (for example by keeping modification timestamps lower boundaries).
+
+### Behind the Scene
+{{< hint info >}}
+If you are interested in how File source works under the design of new data source API, you may
+want to read this part as a reference. For details about the new data source API,
+[documentation of data source]({{< ref "docs/dev/datastream/sources.md" >}}) and
+<a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
+provide more descriptive discussions.
+{{< /hint >}}
+
+The `File Source` is divided in the following two parts: File SplitEnumerator and File SourceReader.

Review comment:
       I'd remove everything from here on because the internals might be outdated or wrong and, on the other hand, confusing for users.

##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,243 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 
+
+* File `SplitEnumerator` is responsible for discovering and identifying the files to read and assigns them to the File SourceReader.
+* File `SourceReader` requests the files it needs to process and reads the file from the filesystem. 
+
+You'll need to combine the File Source with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}). This allows you to
+parse CSV, decode AVRO or read Parquet columnar files.
+
+#### Bounded and Unbounded Streams
+
+A bounded `File Source` lists all files (via SplitEnumerator, for example a recursive directory list with filtered-out hidden files) and reads them all.
+
+An unbounded `File Source` is created when configuring the enumerator for periodic file discovery.
+In that case, the SplitEnumerator will enumerate like the bounded case but after a certain interval repeats the enumeration.
+For any repeated enumeration, the `SplitEnumerator` filters out previously detected files and only sends new ones to the `SourceReader`.
+
+### Usage
+
+You start building a file source via one of the following calls:
+
+{{< tabs "FileSourceUsage" >}}
+{{< tab "Java" >}}
+```java
+// reads the contents of a file from a file stream. 
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads batches of records from a file at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+This creates a `FileSource.FileSourceBuilder` on which you can configure all the properties of the file source.
+
+For the bounded/batch case, the file source processes all files under the given path(s). 
+In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
+
+When you start creating a file source (via the `FileSource.FileSourceBuilder` created through one of the above-mentioned methods) 
+the source is by default in bounded/batch mode. Call `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` 
+to put the source into continuous streaming mode.
+
+{{< tabs "FileSourceBuilder" >}}
+{{< tab "Java" >}}
+```java
+final FileSource<String> source =
+        FileSource.forRecordStreamFormat(...)
+        .monitorContinuously(Duration.ofMillis(5))  
+        .build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Format Types
+
+The reading of each file happens through file readers defined by file formats. 
+These define the parsing logic for the contents of the file. There are multiple classes that the source supports. 
+Their interfaces trade off simplicity of implementation against flexibility/efficiency.
+
+* A `StreamFormat` reads the contents of a file from a file stream. It is the simplest format to implement, 
+and provides many features out-of-the-box (like checkpointing logic) but is limited in the optimizations it can apply 
+(such as object reuse, batching, etc.).
+
+* A `BulkFormat` reads batches of records from a file at a time. 
+It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
+
+#### TextLine format
+
+A `StreamFormat` reader format that reads text lines from a file.
+The reader uses Java's built-in `InputStreamReader` to decode the byte stream using
+various supported charset encodings.
+This format does not support optimized recovery from checkpoints. On recovery, it will re-read
+and discard the number of lines that were processed before the last checkpoint. That is due to
+the fact that the offsets of lines in the file cannot be tracked through the charset decoders
+with their internal buffering of stream input and charset decoder state.
+
+#### SimpleStreamFormat Abstract Class
+
+A simple version of `StreamFormat` for formats that are not splittable.
+Custom reads of Array or File can be done by implementing `SimpleStreamFormat`:
+
+{{< tabs "SimpleStreamFormat" >}}
+{{< tab "Java" >}}
+```java
+private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {
+    private static final long serialVersionUID = 1L;
+
+    @Override
+    public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
+            throws IOException {
+        return new ArrayReader(stream);
+    }
+
+    @Override
+    public TypeInformation<byte[]> getProducedType() {
+        return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
+    }
+}
+
+final FileSource<byte[]> source =
+                FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+An example of a `SimpleStreamFormat` is `CsvReaderFormat`. It can be initialized as follows:
+```java
+CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);
+FileSource<SomePojo> source = 
+        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();
+```
+
+The schema for CSV parsing, in this case, is automatically derived based on the fields of the `SomePojo` class using the `Jackson` library. (Note: you might need to add `@JsonPropertyOrder({field1, field2, ...})` annotation to your class definition with the fields order exactly matching those of the CSV file columns).
+
+If you need more fine-grained control over the CSV schema or the parsing options, use the more low-level `forSchema` static factory method of `CsvReaderFormat`:
+
+```java
+CsvReaderFormat<T> forSchema(CsvMapper mapper, 
+                             CsvSchema schema, 
+                             TypeInformation<T> typeInformation) 
+```
+
+#### Bulk Format
+
+The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats
+are formats like ORC or Parquet.
+The outer `BulkFormat` class acts mainly as a configuration holder and factory for the
+reader. The actual reading is done by the `BulkFormat.Reader`, which is created in the
+`BulkFormat#createReader(Configuration, FileSourceSplit)` method. If a bulk reader is
+created based on a checkpoint during checkpointed streaming execution, then the reader is
+re-created in the `BulkFormat#restoreReader(Configuration, FileSourceSplit)` method.
+
+A `SimpleStreamFormat` can be turned into a `BulkFormat` by wrapping it in a `StreamFormatAdapter`:
+```java
+BulkFormat<SomePojo, FileSourceSplit> bulkFormat = 
+        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));
+```
+
+### Customizing File Enumeration
+
+{{< tabs "CustomizingFileEnumeration" >}}
+{{< tab "Java" >}}
+```java
+/**
+ * A FileEnumerator implementation for hive source, which generates splits based on 
+ * HiveTablePartition.
+ */
+public class HiveSourceFileEnumerator implements FileEnumerator {
+    
+    // reference constructor
+    public HiveSourceFileEnumerator(...) {
+        ...
+    }
+
+    /***
+     * Generates all file splits for the relevant files under the given paths. The {@code
+     * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
+     * exploit parallelism properly.
+     */
+    @Override
+    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
+            throws IOException {
+        // createInputSplits:splitting files into fragmented collections
+        return new ArrayList<>(createInputSplits(...));
+    }
+
+    ...
+
+    /***
+     * A factory to create HiveSourceFileEnumerator.
+     */
+    public static class Provider implements FileEnumerator.Provider {
+
+        ...
+        @Override
+        public FileEnumerator create() {
+            return new HiveSourceFileEnumerator(...);
+        }
+    }
+}
+// use the customizing file enumeration
+new HiveSource<>(
+        ...,
+        new HiveSourceFileEnumerator.Provider(
+        partitions != null ? partitions : Collections.emptyList(),
+        new JobConfWrapper(jobConf)),
+       ...);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Current Limitations
+
+Watermarking doesn't work particularly well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.
+We are looking at ways to generate the watermarks more based on global information.
+
+For Unbounded File Sources, the enumerator currently remembers paths of all already processed files, which is a state that can in come cases grow rather large.
+The future will be planned to add a compressed form of tracking already processed files in the future (for example by keeping modification timestamps lower boundaries).
+
+### Behind the Scene
+{{< hint info >}}
+If you are interested in how File source works under the design of new data source API, you may
+want to read this part as a reference. For details about the new data source API,
+[documentation of data source]({{< ref "docs/dev/datastream/sources.md" >}}) and
+<a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">FLIP-27</a>
+provide more descriptive discussions.
+{{< /hint >}}
+
+The `File Source` is divided in the following two parts: File SplitEnumerator and File SourceReader.
+
+The Source of a file system is divided into the following two parts: File Split Enumerator and File Source Reader,
+File Split Enumerator (Split is an abstraction of the external file system data splitting)
+It is responsible for discovering the split in the external system and assigning it to the File SourceReader,
+and it also manages the global water level to ensure that the consumption rate is approximately the same between our different File Source Readers.

Review comment:
       I think this sentence is partially a lie because afaik watermark alignment is not supported by any source yet.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "635fca6fbaa195fd5de673b7810b754a160e8b5d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "635fca6fbaa195fd5de673b7810b754a160e8b5d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * 635fca6fbaa195fd5de673b7810b754a160e8b5d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1013356057


   I've squashed the commits while co-authoring them. @fapaul can you merge it? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 601f0e004b818fc1e97d5c25698b703759096989 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 601f0e004b818fc1e97d5c25698b703759096989 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084) 
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6be59be6902b7b454563cce5c49a6955f315ca94 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533) 
   * 22e070b0a4f84c7d0561479ef91fab820728bc2d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6be59be6902b7b454563cce5c49a6955f315ca94 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533) 
   * 22e070b0a4f84c7d0561479ef91fab820728bc2d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1007195956


   @infoverload Can you also have a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824",
       "triggerID" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2202b6f37a9c0db7f9105c28e61ba42553000f96",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29830",
       "triggerID" : "2202b6f37a9c0db7f9105c28e61ba42553000f96",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2202b6f37a9c0db7f9105c28e61ba42553000f96 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29830) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824",
       "triggerID" : "97186a28d79b18a5f4499bce9b4f57869ea7b7fa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c870affe72e6960ca0e86d6a60cd3a117e21134a Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29708) 
   * 97186a28d79b18a5f4499bce9b4f57869ea7b7fa Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29824) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29479",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533",
       "triggerID" : "6be59be6902b7b454563cce5c49a6955f315ca94",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707",
       "triggerID" : "22e070b0a4f84c7d0561479ef91fab820728bc2d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c870affe72e6960ca0e86d6a60cd3a117e21134a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6be59be6902b7b454563cce5c49a6955f315ca94 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29533) 
   * 22e070b0a4f84c7d0561479ef91fab820728bc2d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29707) 
   * c870affe72e6960ca0e86d6a60cd3a117e21134a UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 601f0e004b818fc1e97d5c25698b703759096989 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084) 
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] fapaul commented on a change in pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
fapaul commented on a change in pull request #18288:
URL: https://github.com/apache/flink/pull/18288#discussion_r786117367



##########
File path: docs/content/docs/connectors/datastream/filesystem.md
##########
@@ -25,12 +27,227 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# File Sink
+# FileSystem
 
-This connector provides a unified Sink for `BATCH` and `STREAMING` that writes partitioned files to filesystems
+This connector provides a unified Source and Sink for `BATCH` and `STREAMING` that reads or writes (partitioned) files to filesystems
 supported by the [Flink `FileSystem` abstraction]({{< ref "docs/deployment/filesystems/overview" >}}). This filesystem
-connector provides the same guarantees for both `BATCH` and `STREAMING` and it is an evolution of the 
-existing [Streaming File Sink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) which was designed for providing exactly-once semantics for `STREAMING` execution.
+connector provides the same guarantees for both `BATCH` and `STREAMING` and is designed for providing exactly-once semantics for `STREAMING` execution.
+
+The connector supports reading and writing a set of files from any (distributed) file system (e.g. POSIX, S3, HDFS)
+with a [format]({{< ref "docs/connectors/datastream/formats/overview" >}}) (e.g., Avro, CSV, Parquet),
+producing a stream of records.
+
+## File Source
+
+The `File Source` is based on the [Source API]({{< ref "docs/dev/datastream/sources" >}}#the-data-source-api), 
+a unified data source that reads files - both in batch and in streaming mode. 
+It is divided into the following two parts: File SplitEnumerator and File SourceReader. 

Review comment:
       I guess we are mixing various technical concepts here. The interface to build generally applicable enumerators for the FLIP-27 source API is called `SplitEnumerator` and we have another abstraction that is the `FileEnumerator`. Probably `FileEnumerator` is not a good name and it should be more like `FileEnumerationStrategy`.
   
    I agree that only seeing both terms is very confusing. I tend to use `SplitEnumerator` wherever possible and maybe have a special section about customizing the file enumeration strategy.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014653823


   > @MartijnVisser I have an idea whether it is possible to take the content of the https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/formats/ link page and move it to the current pr. Because the biggest point of the current pr is to guide users through the API, and it would be more beginner-friendly to have various types of examples as a supplement.
   
   I'm not sure that's a good idea, because Formats can also be used by multiple connectors. You can read/write Avro from/to FileSystem, but you can also use it with Kafka or Pulsar. I don't think we should mix them up. I think a natural concept is that you first start with where you want to use data from (which is a connector) and then you think about what data you want to use and how (which is more format related). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MonsterChenzhuo commented on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
MonsterChenzhuo commented on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1014628097


   @MartijnVisser I have an idea whether it is possible to take the content of the https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/formats/ link page and move it to the current pr. Because the biggest point of the current pr is to guide users through the API, and it would be more beginner-friendly to have various types of examples as a supplement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18288:
URL: https://github.com/apache/flink/pull/18288#issuecomment-1006578426


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29047",
       "triggerID" : "77ef5bba4cf42894132bd06988b96317d3e06760",
       "triggerType" : "PUSH"
     }, {
       "hash" : "601f0e004b818fc1e97d5c25698b703759096989",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29084",
       "triggerID" : "601f0e004b818fc1e97d5c25698b703759096989",
       "triggerType" : "PUSH"
     }, {
       "hash" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107",
       "triggerID" : "1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ff48692aa97724d0c88e9f6f9fb3d87e44b728a9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1ff05c08e7b9ebf3e3ae028c692a82dbf55e4786 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29107) 
   * ff48692aa97724d0c88e9f6f9fb3d87e44b728a9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] fapaul merged pull request #18288: [FLINK-20188][Connectors][Docs][FileSystem] Added documentation for File Source

Posted by GitBox <gi...@apache.org>.
fapaul merged pull request #18288:
URL: https://github.com/apache/flink/pull/18288


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org