Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/01/25 00:19:32 UTC

[GitHub] [flink] galenwarren commented on a change in pull request #18430: [FLINK-25577][docs] Update GCS documentation

galenwarren commented on a change in pull request #18430:
URL: https://github.com/apache/flink/pull/18430#discussion_r791256107



##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive; you can use GCS anywhere that Flink expects a FileSystem URI, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-embeddedrocksdbstatebackend).
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained and bundles all of its dependencies, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).

Review comment:
       Updated in https://github.com/apache/flink/pull/18430/commits/4d4b6ef9bf6c75d3dbbe3d64b58eb5ffcc218ebf.
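
For readers who want to see the `FileSink` support mentioned in this hunk in action, here is a minimal sketch of writing a stream of strings to GCS. It assumes the `flink-gs-fs-hadoop` plugin is installed as described in the following hunk, and `gs://my-bucket/output` is a placeholder path, not a real bucket:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GcsSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing is required for the FileSink to commit in-progress files.
        env.enableCheckpointing(10_000);

        DataStream<String> stream = env.fromElements("a", "b", "c");

        // gs://my-bucket/output is a placeholder; any gs:// URI works
        // once flink-gs-fs-hadoop is in the plugins directory.
        FileSink<String> sink = FileSink
                .forRowFormat(new Path("gs://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        stream.sinkTo(sink);
        env.execute("gcs-sink-example");
    }
}
```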

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive; you can use GCS anywhere that Flink expects a FileSystem URI, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-embeddedrocksdbstatebackend).
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained and bundles all of its dependencies, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink:
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the corresponding keys to your `flink-conf.yaml`.

Review comment:
       Yes, that's better. Good catch. Updated in https://github.com/apache/flink/pull/18430/commits/4d4b6ef9bf6c75d3dbbe3d64b58eb5ffcc218ebf.
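
As an illustration of the two points in this hunk (GCS anywhere Flink expects a FileSystem URI, and configuration via `flink-conf.yaml`), a minimal sketch of a `flink-conf.yaml` fragment; the bucket names are placeholders:

```yaml
# Placeholder bucket; any gs:// URI works once the plugin is installed.
state.checkpoints.dir: gs://my-bucket/checkpoints
high-availability.storageDir: gs://my-bucket/ha
```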

##########
File path: docs/content/docs/deployment/filesystems/gcs.md
##########
@@ -50,57 +48,40 @@ env.getCheckpointConfig().setCheckpointStorage("gs://<bucket>/<endpoint>");
 
 ```
 
-### Libraries
+Note that these examples are *not* exhaustive; you can use GCS anywhere that Flink expects a FileSystem URI, including your [high availability setup]({{< ref "docs/deployment/ha/overview" >}}) or the [EmbeddedRocksDBStateBackend]({{< ref "docs/ops/state/state_backends" >}}#the-embeddedrocksdbstatebackend).
 
-You must include the following jars in Flink's `lib` directory to connect Flink with gcs:
+### GCS File System plugin
 
-```xml
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-shaded-hadoop2-uber</artifactId>
-  <version>${flink.shared_hadoop_latest_version}</version>
-</dependency>
+Flink provides the `flink-gs-fs-hadoop` file system to write to GCS.
+This implementation is self-contained and bundles all of its dependencies, so there is no need to add Hadoop to the classpath to use it.
 
-<dependency>
-  <groupId>com.google.cloud.bigdataoss</groupId>
-  <artifactId>gcs-connector</artifactId>
-  <version>hadoop2-2.2.0</version>
-</dependency>
-```
+`flink-gs-fs-hadoop` registers a `FileSystem` wrapper for URIs with the *gs://* scheme. It uses Google's [gcs-connector](https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector) Hadoop library to access GCS. It also uses Google's [google-cloud-storage](https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage) library to provide `RecoverableWriter` support. 
 
-We have tested with `flink-shared-hadoop2-uber` version >= `2.8.5-1.8.3`.
-You can track the latest version of the [gcs-connector hadoop 2](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar).
+This file system supports the [StreamingFileSink]({{< ref "docs/connectors/datastream/streamfile_sink" >}}) and the [FileSink]({{< ref "docs/connectors/datastream/file_sink" >}}).
 
-### Authentication to access GCS
+To use `flink-gs-fs-hadoop`, copy the JAR file from the `opt` directory to the `plugins` directory of your Flink distribution before starting Flink:
 
-Most operations on GCS require authentication. Please see [the documentation on Google Cloud Storage authentication](https://cloud.google.com/storage/docs/authentication) for more information.
+```bash
+mkdir ./plugins/gs-fs-hadoop
+cp ./opt/flink-gs-fs-hadoop-{{< version >}}.jar ./plugins/gs-fs-hadoop/
+```
+
+### Configuration
 
-You can use the following method for authentication
-* Configure via core-site.xml
-  You would need to add the following properties to `core-site.xml`
+The underlying Hadoop file system can be [configured using Hadoop's gs configuration keys](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md) by adding the corresponding keys to your `flink-conf.yaml`.
 
-  ```xml
-  <configuration>
-    <property>
-      <name>google.cloud.auth.service.account.enable</name>
-      <value>true</value>
-    </property>
-    <property>
-      <name>google.cloud.auth.service.account.json.keyfile</name>
-      <value><PATH TO GOOGLE AUTHENTICATION JSON></value>
-    </property>
-  </configuration>
-  ```
+For example, Hadoop has a `fs.gs.http.connect-timeout` configuration key; to change it, set `gs.http.connect-timeout: xyz` in `flink-conf.yaml`, and Flink will internally translate it back to `fs.gs.http.connect-timeout`. There is no need to pass configuration parameters using Hadoop's XML configuration files.

Review comment:
       Updated in https://github.com/apache/flink/pull/18430/commits/4d4b6ef9bf6c75d3dbbe3d64b58eb5ffcc218ebf.
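
To make the key translation in this hunk concrete, a minimal `flink-conf.yaml` sketch; the timeout value is an arbitrary example:

```yaml
# Hadoop's fs.gs.http.connect-timeout key, written without the "fs." prefix.
# The value is in milliseconds and chosen arbitrarily for illustration.
gs.http.connect-timeout: 20000
```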




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org