Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2018/11/08 13:28:51 UTC

[GitHub] kl0u closed pull request #7046: [FLINK-10803] Update the documentation to include changes to the S3 connector.

URL: https://github.com/apache/flink/pull/7046

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/docs/dev/connectors/streamfile_sink.md b/docs/dev/connectors/streamfile_sink.md
index aea66c3cc48..8f50675ccbc 100644
--- a/docs/dev/connectors/streamfile_sink.md
+++ b/docs/dev/connectors/streamfile_sink.md
@@ -24,16 +24,25 @@ under the License.
 -->
 
 This connector provides a Sink that writes partitioned files to filesystems
-supported by the Flink `FileSystem` abstraction. Since in streaming the input
-is potentially infinite, the streaming file sink writes data into buckets. The
-bucketing behaviour is configurable but a useful default is time-based
+supported by the [Flink `FileSystem` abstraction]({{ site.baseurl}}/ops/filesystems.html).
+
+<span class="label label-danger">Important Note</span>: For S3, the `StreamingFileSink` 
+supports only the [Hadoop-based](https://hadoop.apache.org/) FileSystem implementation, not
+the implementation based on [Presto](https://prestodb.io/). If your job uses the
+`StreamingFileSink` to write to S3 but you want to use the Presto-based implementation for checkpointing,
+it is advised to explicitly use *"s3a://"* (for Hadoop) as the scheme for the target path of
+the sink and *"s3p://"* (for Presto) for checkpointing. Using *"s3://"* for both the sink
+and checkpointing may lead to unpredictable behavior, as both implementations "listen" to that scheme.
+
+Since in streaming the input is potentially infinite, the streaming file sink writes data
+into buckets. The bucketing behaviour is configurable but a useful default is time-based
 bucketing where we start writing a new bucket every hour and thus get
 individual files that each contain a part of the infinite output stream.
 
 Within a bucket, we further split the output into smaller part files based on a
 rolling policy. This is useful to prevent individual bucket files from getting
 too big. This is also configurable but the default policy rolls files based on
-file size and a timeout, i.e if no new data was written to a part file. 
+file size and a timeout, *i.e.* if no new data was written to a part file.
 
 The `StreamingFileSink` supports both row-wise encoding formats and
 bulk-encoding formats, such as [Apache Parquet](http://parquet.apache.org).
diff --git a/docs/ops/filesystems.md b/docs/ops/filesystems.md
index 416302e49ca..77757b00dcf 100644
--- a/docs/ops/filesystems.md
+++ b/docs/ops/filesystems.md
@@ -22,32 +22,38 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-This page provides details on setting up and configuring distributed file systems for use with Flink.
+This page provides details on setting up and configuring different file systems for use with Flink.
+We start by describing how to use and configure the file systems that Flink supports
+out-of-the-box, before describing the steps necessary to add support for other/custom
+file system implementations.
 
 ## Flink's File System support
 
-Flink uses file systems both as a source and sink in streaming/batch applications, and as a target for checkpointing.
+Flink uses file systems both as *sources* and *sinks* in streaming/batch applications and as a target for *checkpointing*.
 These file systems can for example be *Unix/Windows file systems*, *HDFS*, or even object stores like *S3*.
 
 The file system used for a specific file is determined by the file URI's scheme. For example `file:///home/user/text.txt` refers to
 a file in the local file system, while `hdfs://namenode:50010/data/user/text.txt` refers to a file in a specific HDFS cluster.
 
 File systems are represented via the `org.apache.flink.core.fs.FileSystem` class, which captures the ways to access and modify
-files and objects in that file system. FileSystem instances are instantiates once per process and then cached / pooled, to
-avoid configuration overhead per stream creation, and to enforce certain constraints, like connection/stream limits.
+files and objects in that file system. FileSystem instances are instantiated once per process and then cached / pooled, to
+avoid configuration overhead per stream creation and to enforce certain constraints, such as connection/stream limits.
 
 ### Built-in File Systems
 
-Flink directly implements the following file systems:
+Flink ships with support for most of the popular file systems, namely *local*, *hadoop-compatible*, *S3*, *MapR FS*
+and *OpenStack Swift FS*. Each of these is identified by the scheme included in the URI of the provided file path.
+
+Flink ships with implementations for the following file systems:
 
   - **local**: This file system is used when the scheme is *"file://"*, and it represents the file system of the local machine, 
 including any NFS or SAN that is mounted into that local file system.
 
   - **S3**: Flink directly provides file systems to talk to Amazon S3. There are two alternative implementations, `flink-s3-fs-presto`
-    and `flink-s3-fs-hadoop`. Both implementations are self-contained with no dependency footprint, there is no need to add Hadoop to
+    and `flink-s3-fs-hadoop`. Both implementations are self-contained with no dependency footprint. There is no need to add Hadoop to
     the classpath to use them. Both internally use some Hadoop code, but "shade away" all classes to avoid any dependency conflicts.
 
-    - `flink-s3-fs-presto`, registered under the scheme *"s3://"*, is based on code from the [Presto project](https://prestodb.io/).
+    - `flink-s3-fs-presto`, registered under the schemes *"s3://"* and *"s3p://"*, is based on code from the [Presto project](https://prestodb.io/).
       You can configure it the same way you can [configure the Presto file system](https://prestodb.io/docs/0.185/connector/hive.html#amazon-s3-configuration).
       
     - `flink-s3-fs-hadoop`, registered under *"s3://"* and *"s3a://"*, based on code from the [Hadoop Project](https://hadoop.apache.org/).
@@ -56,7 +62,13 @@ including any NFS or SAN that is mounted into that local file system.
     To use those file systems when using Flink as a library, add the respective maven dependency (`org.apache.flink:flink-s3-fs-presto:{{ site.version }}`
     or `org.apache.flink:flink-s3-fs-hadoop:{{ site.version }}`). When starting a Flink application from the Flink binaries, copy or move
     the respective jar file from the `opt` folder to the `lib` folder. See also [AWS setup](deployment/aws.html) for additional details.
-
+    
+    <span class="label label-danger">Attention</span>: As described above, both Hadoop and Presto "listen" to paths with the scheme *"s3://"*. This is
+    convenient for switching between implementations (Hadoop or Presto), but it can lead to non-determinism when both
+    implementations are required. This can happen when, for example, the job uses the [StreamingFileSink]({{ site.baseurl}}/dev/connectors/streamfile_sink.html),
+    which only supports Hadoop, but uses Presto for checkpointing. In this case, it is advised to explicitly use *"s3a://"*
+    as the scheme for the sink (Hadoop) and *"s3p://"* for checkpointing (Presto).
+    
   - **MapR FS**: The MapR file system *"maprfs://"* is automatically available when the MapR libraries are in the classpath.
   
   - **OpenStack Swift FS**: Flink directly provides a file system to talk to the OpenStack Swift file system, registered under the scheme *"swift://"*. 
@@ -64,9 +76,9 @@ including any NFS or SAN that is mounted into that local file system.
   To use it when using Flink as a library, add the respective maven dependency (`org.apache.flink:flink-swift-fs-hadoop:{{ site.version }}`
   When starting a Flink application from the Flink binaries, copy or move the respective jar file from the `opt` folder to the `lib` folder.
 
-### HDFS and Hadoop File System support 
+#### HDFS and Hadoop File System support 
 
-For all schemes where Flink cannot find a directly supported file system, Flink will try to use Hadoop to instantiate a file system for the respective scheme.
+For all schemes where it cannot find a directly supported file system, Flink will try to use Hadoop to instantiate a file system for the respective scheme.
 All Hadoop file systems are automatically available once `flink-runtime` and the Hadoop libraries are in classpath.
 
 That way, Flink seamlessly supports all of Hadoop file systems, and all Hadoop-compatible file systems (HCFS), for example:
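For illustration, a minimal Java sketch (not part of the PR) of the scheme split the note above recommends: checkpoints go through the Presto-based filesystem via "s3p://" while the StreamingFileSink writes through the Hadoop-based one via "s3a://". The bucket names and paths are hypothetical, and the sketch assumes both the flink-s3-fs-presto and flink-s3-fs-hadoop jars have been moved to the lib folder:

// A minimal sketch (not from the PR): checkpoint through the Presto-based
// S3 filesystem ("s3p://") while the StreamingFileSink writes through the
// Hadoop-based one ("s3a://"). Bucket names and paths are hypothetical.
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class S3SchemeSplitExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(60_000L); // checkpoint every minute
        // Checkpoints target "s3p://", so only the Presto implementation picks them up.
        env.setStateBackend(new FsStateBackend("s3p://my-bucket/checkpoints"));

        // The sink targets "s3a://", so only the Hadoop implementation picks it up.
        StreamingFileSink<String> sink = StreamingFileSink
            .forRowFormat(new Path("s3a://my-bucket/output"),
                          new SimpleStringEncoder<String>("UTF-8"))
            .build();

        env.fromElements("a", "b", "c").addSink(sink);
        env.execute("s3 scheme split example");
    }
}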


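Likewise, for the rolling-policy paragraph in streamfile_sink.md, a sketch of configuring the size/timeout-based rolling explicitly; the thresholds are illustrative and the sketch assumes the DefaultRollingPolicy builder API available around the time of this PR:

import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class RollingPolicySketch {
    public static void main(String[] args) {
        // Roll a part file when it reaches 128 MB, every 15 minutes at the latest,
        // or after 5 minutes without new data (the "timeout" mentioned in the docs).
        DefaultRollingPolicy<String, String> policy = DefaultRollingPolicy
            .create()
            .withMaxPartSize(128L * 1024L * 1024L)
            .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
            .build();
        System.out.println(policy);
    }
}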
 

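And for the scheme-resolution paragraph in filesystems.md, a small sketch of how the URI scheme selects the FileSystem implementation through org.apache.flink.core.fs.FileSystem.get(URI); the HDFS lookup assumes Hadoop libraries on the classpath, and the paths are the illustrative ones from the docs:

import java.net.URI;
import org.apache.flink.core.fs.FileSystem;

public class SchemeLookupSketch {
    public static void main(String[] args) throws Exception {
        // The scheme ("file", "hdfs", "s3a", ...) determines which FileSystem
        // implementation Flink instantiates and then caches per process.
        FileSystem local = FileSystem.get(URI.create("file:///home/user/text.txt"));
        System.out.println(local.getClass().getName());

        // Requires Hadoop libraries (or a bundled filesystem jar) on the classpath.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:50010/data/user/text.txt"));
        System.out.println(hdfs.getClass().getName());
    }
}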
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services