Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/01/24 08:26:38 UTC

[GitHub] [flink] xintongsong commented on a change in pull request #18430: [FLINK-25577][docs] Update GCS documentation

xintongsong commented on a change in pull request #18430:
URL: https://github.com/apache/flink/pull/18430#discussion_r790487706



##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).

Review comment:
       These two references are invalid, as CI complained. They were removed / renamed in FLINK-20188. I think we should now refer to "docs/content/docs/connectors/datastream/filesystem.md".
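
   For reference, a corrected link in gcs.md could look roughly like the sketch below. The exact page path and anchor ("docs/connectors/datastream/filesystem" and "#file-sink") are assumptions about the renamed page, not something confirmed in this thread.

   ```markdown
   <!-- Assumed target page and anchor; verify against the renamed filesystem connector page. -->
   This file system supports the [FileSink]({{< ref "docs/connectors/datastream/filesystem" >}}#file-sink).
   ```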

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink, i.e.
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the configurations to your `flink-conf.yaml`.

Review comment:
       Correct me if I'm wrong, but I think the precise description should be:
   ```suggestion
   The underlying Hadoop file system can be [configured using gcs-connector's Hadoop configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the configurations to your `flink-conf.yaml`.
   ```
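
   As a small illustration of how one of those keys ends up in flink-conf.yaml, the snippet below uses `fs.gs.project.id`, a documented gcs-connector key, purely as an example; the value is a placeholder.

   ```yaml
   # Illustrative only: gcs-connector documents the key as fs.gs.project.id;
   # in flink-conf.yaml the leading "fs." is dropped, so it is written as gs.project.id.
   gs.project.id: my-gcp-project
   ```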

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink, i.e.
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the configurations to your `flink-conf.yaml`.
 
-  ```xml
-  <configuration>
-    <property>
-      <name>google.cloud.auth.service.account.enable</name>
-      <value>true</value>
-    </property>
-    <property>
-      <name>google.cloud.auth.service.account.json.keyfile</name>
-      <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
-    </property>
-  </configuration>
-  ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If you want to change it, you need to set `gs.http.connect-timeout: xyz` in `flink-conf.yaml`. Flink will internally translate this back to `fs.gs.http.connect-timeout`. There is no need to pass configuration parameters using Hadoop's XML configuration files.

Review comment:
       ```suggestion
   For example, gcs-connector has a `fs.gs.http.connect-timeout` configuration key. If you want to change it, you need to set `gs.http.connect-timeout: xyz` in `flink-conf.yaml`. Flink will internally translate this back to `fs.gs.http.connect-timeout`. There is no need to pass configuration parameters using Hadoop's XML configuration files.
   ```
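
   Spelled out as a flink-conf.yaml entry, the example from that paragraph would look roughly like this; the timeout value is only a placeholder.

   ```yaml
   # Forwarded by Flink to gcs-connector's fs.gs.http.connect-timeout; the value is illustrative.
   gs.http.connect-timeout: 30000
   ```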

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive and you can use GCS in other places as well, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-rocksdbstatebackend); everywhere that Flink expects a FileSystem URI.
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained with no dependency footprint, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink, i.e.
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the configurations to your `flink-conf.yaml`.
 
-  ```xml
-  <configuration>
-    <property>
-      <name>google.cloud.auth.service.account.enable</name>
-      <value>true</value>
-    </property>
-    <property>
-      <name>google.cloud.auth.service.account.json.keyfile</name>
-      <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
-    </property>
-  </configuration>
-  ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key. If you want to change it, you need to set `gs.http.connect-timeout: xyz` in `flink-conf.yaml`. Flink will internally translate this back to `fs.gs.http.connect-timeout`. There is no need to pass configuration parameters using Hadoop's XML configuration files.
 
-  You would need to add the following to `flink-conf.yaml`
+`flink-gs-fs-hadoop` can also be configured by setting the following options in `flink-conf.yaml`:
 
-  ```yaml
-  flinkConfiguration:
-    fs.hdfs.hadoopconf: <DIRECTORY PATH WHERE core-site.xml IS SAVED>
-  ```
+| Key                                       | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+|-------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| gs.writer.temporary.bucket.name           | If this property is not set, temporary blobs for in-progress writes via `RecoverableWriter` will be written to same bucket as the final file being written, prefixed with `.inprogress/`. <br><br>Set this property to choose a different bucket to hold the temporary blobs. It is recommended to choose a separate bucket in order to [assign it a TTL](https://cloud.google.com/storage/docs/lifecycle), to provide a mechanism to clean up orphaned blobs that can occur when restoring from check/savepoints.<br><br>If you do use a separate bucket with a TTL for temporary blobs, attempts to restart jobs from check/savepoints after the TTL interval expires may fail. 

Review comment:
       Maybe swap the first and second paragraphs?
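
   Independent of the paragraph order, a minimal flink-conf.yaml sketch for this option might look like the following; the bucket name is made up, and the TTL itself would be configured on the bucket in GCS, not in Flink.

   ```yaml
   # Hypothetical bucket: in-progress RecoverableWriter blobs are written here instead of the
   # destination bucket, so a lifecycle/TTL rule on this bucket can clean up orphaned blobs.
   gs.writer.temporary.bucket.name: my-flink-temp-bucket
   ```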




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org