Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/03/14 14:11:11 UTC

[GitHub] [flink] JingGe opened a new pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

JingGe opened a new pull request #19083:
URL: https://github.com/apache/flink/pull/19083


   ## What is the purpose of the change
   
   Update the documentation based on feedback from the release test task FLINK-26301.
   
   
   ## Brief change log
   
     - remove redundant content about bounded and unbounded data
     - add a description of the Maven dependency required for reading Avro records
     - add a description of using the namespace in the schema for reading reflect records (see the sketch below)
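
   As a hedged illustration of the reflect-record point above (the `Datum` class, its package, and the printed namespace are assumptions for this sketch, not taken from the PR): Avro's reflection support derives the schema namespace from the Java package of the record class, so a hand-written schema used for reading reflect records must carry the same namespace or the class cannot be resolved.

   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.reflect.ReflectData;

   // Hypothetical POJO living in the package "org.example.records":
   // package org.example.records;
   // public class Datum { public int a; public double b; }

   public class ReflectNamespaceSketch {
       public static void main(String[] args) {
           // Reflection derives the schema from the class; the schema's
           // namespace equals the Java package of the class.
           Schema schema = ReflectData.get().getSchema(org.example.records.Datum.class);
           System.out.println(schema.getNamespace()); // prints "org.example.records"
       }
   }
   ```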
   
   
   ## Verifying this change
   
   This change is a trivial document update.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826136692



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+
 ```
 
-#### Unbounded data example
+From now on, this document will only show you examples for bounded data. You can add a call of

Review comment:
       sounds better! Thanks! I will change it.







[GitHub] [flink] zentol commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826843387



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       Neither fastutil nor hadoop-client (or any hadoop dependency, for that matter) is marked as `optional` in `flink-parquet`; they are marked as `provided`.
    fastutil is provided by flink-dist, while for hadoop we expect users to provide it themselves.
   
   These exclusions don't really make sense to me.
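
    For context, a sketch of the user-side counterpart (the `${hadoop.version}` property is a placeholder, not from this PR): a `provided` dependency is compiled against but expected to already be on the runtime classpath, e.g. supplied by the Hadoop installation or by flink-dist.

    ```xml
    <!-- Sketch of a user pom.xml entry; ${hadoop.version} is a placeholder. -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
        <!-- provided: visible at compile time, not bundled into the user jar -->
        <scope>provided</scope>
    </dependency>
    ```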







[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8cb420726ff9df76774507dad6e0fc11bfacd86f Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034) 
   * 05abc4a2a309046476534f2b5612c9a0d6f55381 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3377a70809ea0a641b8b582fc17217badb159a4b",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33160",
       "triggerID" : "3377a70809ea0a641b8b582fc17217badb159a4b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 253391270af9893af12b05fcfb4b2046d2f9e843 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128) 
   * 3377a70809ea0a641b8b582fc17217badb159a4b Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33160) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   * 253391270af9893af12b05fcfb4b2046d2f9e843 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826180052



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+
 ```
 
-#### Unbounded data example
+From now on, this document will only show you examples for bounded data. You can add a call of

Review comment:
       hmmm, I was checking the doc and found a conflict between this concept and the current implementation, since a File Source is created by default in the bounded mode.
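
    To make the default concrete, a minimal sketch (the `schema` variable and the input path are assumptions): the builder yields a bounded source unless `monitorContinuously` is called.

    ```java
    import java.time.Duration;

    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.AvroParquetReaders;

    // Bounded by default: lists the files under the path once, reads them, finishes.
    FileSource<GenericRecord> bounded =
            FileSource.forRecordStreamFormat(
                            AvroParquetReaders.forGenericRecord(schema), // "schema" is a hypothetical Avro schema
                            new Path("/tmp/input"))
                    .build();

    // Unbounded: the same builder, additionally switched to monitoring the path.
    FileSource<GenericRecord> unbounded =
            FileSource.forRecordStreamFormat(
                            AvroParquetReaders.forGenericRecord(schema),
                            new Path("/tmp/input"))
                    .monitorContinuously(Duration.ofSeconds(1))
                    .build();
    ```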







[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] MartijnVisser commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r827026626



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**

Review comment:
       ```suggestion
   **Vectorized reader**
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
 
-## Flink RowData
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-#### Bounded data example
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+**Batch mode**
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
 ```
 
-#### Unbounded data example
+**Streaming mode** 
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas that will
-infinitely grow as new files are added to the directory. It will monitor for new files each second.
-The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
+```java
+
+// reads bounded data of records from files at a time

Review comment:
       ```suggestion
   // Parquet rows are decoded as records
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
 
-## Flink RowData
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-#### Bounded data example
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.

Review comment:
       ```suggestion
   {{< hint info >}}
   When you start a File Source it is configured for bounded data by default. 
   To configure the File Source for unbounded data, you must additionally call 
   `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
   {{< /hint >}}
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
 
-## Flink RowData
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-#### Bounded data example
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+**Batch mode**
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
 ```
 
-#### Unbounded data example
+**Streaming mode** 
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas that will
-infinitely grow as new files are added to the directory. It will monitor for new files each second.
-The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths

Review comment:
       ```suggestion
   // Monitor the Paths to read data as unbounded data
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 

Review comment:
       ```suggestion
   **Avro Parquet reader** 
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -329,22 +297,94 @@ final DataStream<GenericRecord> stream =
         env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
 ```
 
-#### Unbounded data example
+#### Prerequisite of Parquet files

Review comment:
       ```suggestion
   #### Prerequisite for Parquet files
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
 
-## Flink RowData
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-#### Bounded data example
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+**Batch mode**
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
 ```
 
-#### Unbounded data example
+**Streaming mode** 
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas that will
-infinitely grow as new files are added to the directory. It will monitor for new files each second.
-The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+
+```
+
+From now on, this document will only show you examples for bounded data. You can add a call of
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` for unbounded data.

Review comment:
       ```suggestion
   {{< hint info >}}
   The following examples are all configured for bounded data. 
   To configure the File Source for unbounded data, you must additionally call 
   `AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
   {{< /hint >}}
   ```







[GitHub] [flink] flinkbot commented on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "04540d4a62fb3d0784c9c718a52553bf3d76a127",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "04540d4a62fb3d0784c9c718a52553bf3d76a127",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 04540d4a62fb3d0784c9c718a52553bf3d76a127 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826180052



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+
 ```
 
-#### Unbounded data example
+From now on, this document will only show you examples for bounded data. You can add a call of

Review comment:
       hmmm, I was checking the doc and found a conflict between this concept and the current implementation, since a File Source is created **by default** in the bounded mode.







[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3377a70809ea0a641b8b582fc17217badb159a4b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "3377a70809ea0a641b8b582fc17217badb159a4b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 253391270af9893af12b05fcfb4b2046d2f9e843 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128) 
   * 3377a70809ea0a641b8b582fc17217badb159a4b UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] zentol commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826848366



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       The fastutil exclusion in flink-parquet, or rather the `provided` dependency on it, doesn't make sense because fastutil isn't actually bundled by flink-dist.







[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826135126



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode. To turn the source into the continuous unbounded mode, you can additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**

Review comment:
       True, and this case is intended to describe the execution mode.
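
    For reference, the execution mode is a separate knob from source boundedness and is set on the environment; a minimal sketch:

    ```java
    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // BATCH execution requires all sources to be bounded;
    // STREAMING works with bounded and unbounded sources alike.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);
    ```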







[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8cb420726ff9df76774507dad6e0fc11bfacd86f Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] JingGe commented on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1068887597


   FYI: I've copied the content to the content.zh/.../formats/parquet.md and squashed commits.





[GitHub] [flink] zentol commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826846224



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       Ah now I see what you mean. flink-parquet contains an optional dependency on parquet-avro, and it has this set of exclusions.
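
    In POM terms, the shape being referred to is roughly (a sketch, not the actual flink-parquet pom):

    ```xml
    <!-- Sketch: an optional dependency is used to build flink-parquet itself but
         is not pulled in transitively; users who need Avro support add it themselves. -->
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <optional>true</optional>
        <!-- plus the exclusions quoted in the diff above -->
    </dependency>
    ```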







[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 05abc4a2a309046476534f2b5612c9a0d6f55381 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045) 
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826061449



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 

Review comment:
       ```suggestion
   **Unbounded data** 
   ```
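
   For reference, a minimal sketch of the bounded vs. continuous `FileSource` setup under discussion; the class name and the `format` and `dir` parameters are illustrative placeholders, not taken from the PR:

   ```java
   import java.time.Duration;

   import org.apache.flink.connector.file.src.FileSource;
   import org.apache.flink.connector.file.src.reader.StreamFormat;
   import org.apache.flink.core.fs.Path;

   public class FileSourceModes {

       public static FileSource<String> bounded(StreamFormat<String> format, Path dir) {
           // Default mode: list the files under dir once and read them all (bounded).
           return FileSource.forRecordStreamFormat(format, dir).build();
       }

       public static FileSource<String> unbounded(StreamFormat<String> format, Path dir) {
           // monitorContinuously() turns the same source into a continuous, unbounded one
           // that keeps watching the path for newly appearing files.
           return FileSource.forRecordStreamFormat(format, dir)
                   .monitorContinuously(Duration.ofSeconds(5))
                   .build();
       }
   }
   ```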




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   * 253391270af9893af12b05fcfb4b2046d2f9e843 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826047567



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -30,7 +30,7 @@ under the License.
 
 Flink supports reading [Parquet](https://parquet.apache.org/) files, 
 producing {{< javadoc file="org/apache/flink/table/data/RowData.html" name="Flink RowData">}} and producing [Avro](https://avro.apache.org/) records.
-To use the format you need to add the Flink Parquet dependency to your project:
+To use the format you need to add the flink-parquet dependency to your project for reading Flink RowData:

Review comment:
       ```suggestion
   To use the format you need to add the `flink-parquet` dependency to your project:
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       Are these exclusions only needed when using this in combination with `flink-parquet`? Or can we already exclude these by default?

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forRecordStreamFormat(StreamFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+
 ```
 
-#### Unbounded data example
+From now on, this document will only show you examples for bounded data. You can add a call of

Review comment:
       If we only show examples for either bounded or unbounded data, I think I would prefer to have the examples show unbounded data and explain how you could apply them to bounded data, since batch is a special case of streaming from a Flink perspective.

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -329,22 +296,94 @@ final DataStream<GenericRecord> stream =
         env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
 ```
 
-#### Unbounded data example
+#### Prerequisite of Parquet files
+
+In order to support reading Avro Reflect records, the parquet file must contain specific meta information.
+The Avro schema used for creating the parquet data must contain a `namespace`, 
+which will be used by the program to identify the concrete Java class for the reflection process.
+
+The following example shows the User schema used previously. But this time it contains a namespace 
+pointing to the location(in this case the package), where the User class for the reflection could be found.
+
+```java
+// avro schema with namespace
+final String schema = 
+                    "{\"type\": \"record\", "
+                        + "\"name\": \"User\", "
+                        + "\"namespace\": \"org.apache.flink.formats.parquet.avro\", "
+                        + "\"fields\": [\n"
+                        + "        {\"name\": \"name\", \"type\": \"string\" },\n"
+                        + "        {\"name\": \"favoriteNumber\",  \"type\": [\"int\", \"null\"] },\n"
+                        + "        {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"] }\n"
+                        + "    ]\n"
+                        + "    }";
+
+```
+
+Parquet files created with this schema will contain meta information like:
 
-This example, similar to the bounded batch example, uses the same POJO Java class `Datum`
-and monitors for the new files every second to read Avro Reflect records from Parquet files
-infinitely as new files are added to the directory.
+```text
+creator:        parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
+extra:          parquet.avro.schema =
+{"type":"record","name":"User","namespace":"org.apache.flink.formats.parquet.avro","fields":[{"name":"name","type":"string"},{"name":"favoriteNumber","type":["int","null"]},{"name":"favoriteColor","type":["string","null"]}]}
+extra:          writer.model.name = avro
+
+file schema:    org.apache.flink.formats.parquet.avro.User
+--------------------------------------------------------------------------------
+name:           REQUIRED BINARY L:STRING R:0 D:0
+favoriteNumber: OPTIONAL INT32 R:0 D:1
+favoriteColor:  OPTIONAL BINARY L:STRING R:0 D:1
+
+row group 1:    RC:3 TS:143 OFFSET:4
+--------------------------------------------------------------------------------
+name:            BINARY UNCOMPRESSED DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:PLAIN,BIT_PACKED ST:[min: Jack, max: Tom, num_nulls: 0]
+favoriteNumber:  INT32 UNCOMPRESSED DO:0 FPO:51 SZ:41/41/1.00 VC:3 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 1, max: 3, num_nulls: 0]
+favoriteColor:   BINARY UNCOMPRESSED DO:0 FPO:92 SZ:55/55/1.00 VC:3 ENC:RLE,PLAIN,BIT_PACKED ST:[min: green, max: yellow, num_nulls: 0]
+
+```
+
+With the User class defined in the package org.apache.flink.formats.parquet.avro:

Review comment:
       ```suggestion
   With the `User` class defined in the package org.apache.flink.formats.parquet.avro:
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-#### Bounded data example
+**Batch mode**

Review comment:
       I think Flink still uses Bounded and Unbounded data. Batch mode is referring to the execution mode.

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:

Review comment:
       ```suggestion
   To read Avro records, you will need to add the `parquet-avro` dependency:
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 

Review comment:
       ```suggestion
   **Unbounded data** 
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -329,22 +296,94 @@ final DataStream<GenericRecord> stream =
         env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
 ```
 
-#### Unbounded data example
+#### Prerequisite of Parquet files
+
+In order to support reading Avro Reflect records, the parquet file must contain specific meta information.
+The Avro schema used for creating the parquet data must contain a `namespace`, 
+which will be used by the program to identify the concrete Java class for the reflection process.
+
+The following example shows the User schema used previously. But this time it contains a namespace 
+pointing to the location(in this case the package), where the User class for the reflection could be found.

Review comment:
       ```suggestion
   In order to support reading Avro reflect records, the Parquet file must contain specific meta information.
   The Avro schema used for creating the Parquet data must contain a `namespace`, 
   which will be used by the program to identify the concrete Java class for the reflection process.
   
   The following example shows the `User` schema used previously. But this time it contains a namespace 
   pointing to the location (in this case the package), where the `User` class for the reflection could be found.
   ```
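
   A hedged sketch of what such a `User` class could look like; the class body is illustrative and only mirrors the schema fields quoted in the diff above (name, favoriteNumber, favoriteColor), plus the no-argument constructor that Avro reflection needs:

   ```java
   package org.apache.flink.formats.parquet.avro;

   // Illustrative POJO mirroring the Avro schema quoted above. Avro reflection
   // resolves it via the schema's namespace + record name, so the package must
   // match the namespace, and a no-argument constructor must be present.
   public class User {
       private String name;
       private Integer favoriteNumber;
       private String favoriteColor;

       public User() {}

       public User(String name, Integer favoriteNumber, String favoriteColor) {
           this.name = name;
           this.favoriteNumber = favoriteNumber;
           this.favoriteColor = favoriteColor;
       }
   }
   ```

   Reading such records back would then presumably go through the reflect reader, `AvroParquetReaders.forReflectRecord(User.class)`, as in the surrounding examples.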




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826843387



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       Neither fastutil nor hadoop-client (or any hadoop dependency for that matter) is marked as `optional` in `flink-parquet`, but `provided`.
   fastutil is supposedly provided by flink-dist, while for hadoop we expect users to provide it.
   
   These exclusions don't really make sense to me.





-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   * 253391270af9893af12b05fcfb4b2046d2f9e843 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8cb420726ff9df76774507dad6e0fc11bfacd86f UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826135126



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-#### Bounded data example
+**Batch mode**

Review comment:
       True, and this case is intended to describe the execution mode. Please take a look at the inline comments.
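
   For context, a minimal sketch of how the execution mode is selected in Flink, independently of whether the source itself is bounded (class name is illustrative):

   ```java
   import org.apache.flink.api.common.RuntimeExecutionMode;
   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

   public class ExecutionModeSketch {
       public static void main(String[] args) {
           StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
           // BATCH / STREAMING / AUTOMATIC pick the execution mode; boundedness
           // (bounded vs. unbounded) is a property of the source itself.
           env.setRuntimeMode(RuntimeExecutionMode.BATCH);
       }
   }
   ```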




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826874274



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       Thanks for the suggestion. Beyond the reasons you mentioned above, my additional understanding of the exclusions is that `flink-parquet` does not call any APIs and does not consume any functionality of the excluded direct dependencies `hadoop-client` and `fastutil`. Therefore, it is up to the user whether to define them as `provided` or with any other scope in their own pom.xml. The Flink documentation just points out how to declare the `parquet-avro` dependency. In this case, keeping the exclusions makes more sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
MartijnVisser commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r827019131



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:

Review comment:
       ```suggestion
   This format is compatible with the new Source that can be used in both batch and streaming execution modes.
   Thus, you can use this format for two kinds of data:
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
 
-## Flink RowData
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-#### Bounded data example
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+**Batch mode**
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time
+FileSource.forBulkFileFormat(BulkFormat,Path...)
+        
+// reads unbounded data of records from files by monitoring the Paths

Review comment:
       ```suggestion
   // Monitor the Paths to read data as unbounded data
   ```

##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,72 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+To read Avro records, you will need to add the `parquet-avro` dependency:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
 
-## Flink RowData
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-#### Bounded data example
+By default, a File Source is created in the bounded mode, to turn the source into the continuous unbounded mode you can call 
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)` additionally .
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+**Batch mode**
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// reads bounded data of records from files at a time

Review comment:
       ```suggestion
   // Parquet rows are decoded in batches
   ```
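
   To make the removed example self-explanatory, here is the same call annotated argument by argument (a fragment-style sketch, as in the docs' own examples; the values are taken from the removed lines above and the comments restate the old prose):

   ```java
   import org.apache.flink.connector.file.src.FileSourceSplit;
   import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
   import org.apache.flink.table.types.logical.DoubleType;
   import org.apache.flink.table.types.logical.IntType;
   import org.apache.flink.table.types.logical.LogicalType;
   import org.apache.flink.table.types.logical.RowType;
   import org.apache.flink.table.types.logical.VarCharType;
   import org.apache.hadoop.conf.Configuration;

   final LogicalType[] fieldTypes =
           new LogicalType[] {new DoubleType(), new IntType(), new VarCharType()};

   final ParquetColumnarRowInputFormat<FileSourceSplit> format =
           new ParquetColumnarRowInputFormat<>(
                   new Configuration(),                                      // Hadoop configuration
                   RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}), // projected fields
                   500,                                                      // rows per read batch
                   false,                                                    // interpret timestamp columns as UTC?
                   true);                                                    // case-sensitive field names?
   ```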




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ae62b7794136ed7686b8bdb1f982eb1ab6930df6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ae62b7794136ed7686b8bdb1f982eb1ab6930df6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 05abc4a2a309046476534f2b5612c9a0d6f55381 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045) 
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   * ae62b7794136ed7686b8bdb1f982eb1ab6930df6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 05abc4a2a309046476534f2b5612c9a0d6f55381 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045) 
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826139610



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, parquet-avro dependency is required additionally:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       The exclusions are defined in the pom.xml of the flink-parquet submodule, where the dependency is declared as optional. I think there were some known conflicts.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1068528948


   Thanks @MartijnVisser for your effort! The content has been updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 253391270af9893af12b05fcfb4b2046d2f9e843 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 05abc4a2a309046476534f2b5612c9a0d6f55381 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826843387



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       Neither fastutil nor hadoop-client (or any hadoop dependency for that matter) is marked as `optional`, but `provided`.
   fastutil is provided by flink-dist, while for hadoop we expect users to provide it.
   
   These exclusions don't really make sense to me.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] MartijnVisser merged pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
MartijnVisser merged pull request #19083:
URL: https://github.com/apache/flink/pull/19083


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826840345



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       @zentol may I have your advice for this case? Thanks. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   * 253391270af9893af12b05fcfb4b2046d2f9e843 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7b4f937bbcaaa3da29acddb8f176905f26b20711 Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826852181



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       As for the hadoop-client exclusion in flink-parquet, it's a bit odd, and the only explanation I can offer is that it exists for dependency convergence.
   An exclusion in the user application somewhat makes sense because we don't want it to be bundled, but imo it would be more appropriate to set it to provided somehow.
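
   To illustrate the suggestion, a sketch of what a user application's pom.xml could declare instead of the exclusion (the `hadoop.version` property is a placeholder; this is not the wording that ended up in the docs):

```xml
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>${flink.format.parquet.version}</version>
</dependency>
<!-- provided: expected on the cluster classpath, not bundled into the user jar -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <scope>provided</scope>
</dependency>
```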




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     }, {
       "hash" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33126",
       "triggerID" : "7b4f937bbcaaa3da29acddb8f176905f26b20711",
       "triggerType" : "PUSH"
     }, {
       "hash" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33128",
       "triggerID" : "253391270af9893af12b05fcfb4b2046d2f9e843",
       "triggerType" : "PUSH"
     }, {
       "hash" : "3377a70809ea0a641b8b582fc17217badb159a4b",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33160",
       "triggerID" : "3377a70809ea0a641b8b582fc17217badb159a4b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 3377a70809ea0a641b8b582fc17217badb159a4b Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33160) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1068887597


   FYI: I've copied the content to content.zh/.../formats/parquet.md and squashed the commits. The PR is ready to be merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826136692



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>
+</dependency>
+```
+
 This format is compatible with the new Source that can be used in both batch and streaming modes.
 Thus, you can use this format for two kinds of data:
-- Bounded data
-- Unbounded data: monitors a directory for new files that appear 
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors a directory for new files that appear.
 
-## Flink RowData
+By default, a File Source is created in bounded mode; to turn the source into continuous unbounded mode, additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 
-#### Bounded data example
+**Batch mode**
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+```java
+
+// bounded: reads all records from the given files once
+FileSource.forBulkFileFormat(BulkFormat, Path...)
+        .build();
+
+// unbounded: keeps monitoring the given paths and reads new files as they appear
+FileSource.forBulkFileFormat(BulkFormat, Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+```
+
+**Streaming mode** 
 
 ```java
-final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
 
-final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
-final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
-final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+// bounded: reads all records from the given files once
+FileSource.forRecordStreamFormat(StreamFormat, Path...)
+        .build();
+
+// unbounded: keeps monitoring the given paths and reads new files as they appear
+FileSource.forRecordStreamFormat(StreamFormat, Path...)
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
+
+
 ```
 
-#### Unbounded data example
+From now on, this document will only show you examples for bounded data. You can add a call of

Review comment:
       sounds better! Thanks!
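
   For reference, the bounded vs. continuous switch discussed in the diff above can be captured in two small helpers (a sketch only; the class and method names are made up here, and the 5 ms interval is just the value used in the quoted example):

```java
import java.time.Duration;

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.StreamFormat;
import org.apache.flink.core.fs.Path;

class FileSourceModes {
    // bounded: list and read all files under the given path once, then finish
    static <T> FileSource<T> bounded(StreamFormat<T> format, Path path) {
        return FileSource.forRecordStreamFormat(format, path).build();
    }

    // unbounded: additionally monitor the path for new files every 5 ms
    static <T> FileSource<T> continuous(StreamFormat<T> format, Path path) {
        return FileSource.forRecordStreamFormat(format, path)
                .monitorContinuously(Duration.ofMillis(5L))
                .build();
    }
}
```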




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] JingGe commented on a change in pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
JingGe commented on a change in pull request #19083:
URL: https://github.com/apache/flink/pull/19083#discussion_r826139610



##########
File path: docs/content/docs/connectors/datastream/formats/parquet.md
##########
@@ -39,46 +39,71 @@ To use the format you need to add the Flink Parquet dependency to your project:
 	<version>{{< version >}}</version>
 </dependency>
 ```
- 
+
+For reading Avro records, the parquet-avro dependency is additionally required:
+
+```xml
+<dependency>
+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet-avro</artifactId>
+    <version>${flink.format.parquet.version}</version>
+    <optional>true</optional>
+    <exclusions>
+        <exclusion>
+            <groupId>org.apache.hadoop</groupId>
+            <artifactId>hadoop-client</artifactId>
+        </exclusion>
+        <exclusion>
+            <groupId>it.unimi.dsi</groupId>
+            <artifactId>fastutil</artifactId>
+        </exclusion>
+    </exclusions>

Review comment:
       The exclusions are defined in the pom.xml as optional in the flink-parquet sub-module. I think there were some known conflicts.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #19083: [FLINK-26604][doc] add more information for Avro records support and clean up redundant content of bounded and unbounded data

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #19083:
URL: https://github.com/apache/flink/pull/19083#issuecomment-1066928647


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034",
       "triggerID" : "8cb420726ff9df76774507dad6e0fc11bfacd86f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045",
       "triggerID" : "05abc4a2a309046476534f2b5612c9a0d6f55381",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 8cb420726ff9df76774507dad6e0fc11bfacd86f Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33034) 
   * 05abc4a2a309046476534f2b5612c9a0d6f55381 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=33045) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org