Posted to issues@flink.apache.org by "Piotr Nowojski (JIRA)" <ji...@apache.org> on 2017/10/20 12:18:00 UTC

[jira] [Comment Edited] (FLINK-5789) Make Bucketing Sink independent of Hadoop's FileSystem

    [ https://issues.apache.org/jira/browse/FLINK-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212540#comment-16212540 ] 

Piotr Nowojski edited comment on FLINK-5789 at 10/20/17 12:17 PM:
------------------------------------------------------------------

After a short discussion with [~aljoscha] about https://issues.apache.org/jira/browse/FLINK-7737, we agreed that the current {{FSDataOutputStream}} interface is not sufficient, and somewhat awkward, if we want to use it in BucketingSink.

The issue is that, depending on the underlying FileSystem, "persisting" a snapshot sometimes requires calling {{hflush()}} (for regular HDFS) and sometimes {{hsync()}} (in S3-like environments). As [~StephanEwen] proposed, it would probably be best to let the {{FileSystem}} decide which one should be called to "persist" the data. But in that case, naming the methods in {{FSDataOutputStream}} "flush" and "sync" is strange and misleading. Maybe we should add a new method, {{FSDataOutputStream::persist()}} (with a default implementation calling {{sync()}}), which would be called by BucketingSink (and state backends?), and the concrete implementation of the {{FileSystem}} could decide how to interpret the {{FSDataOutputStream::persist()}} calls. At the same time, I do not see a valid use for {{FSDataOutputStream::flush()}}, so maybe we should drop or deprecate it?
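
To make the proposal concrete, here is a rough sketch of how a {{persist()}} hook might look. Everything in it is hypothetical: {{SketchOutputStream}}, {{RecordingStream}}, {{HdfsLikeStream}} and {{S3LikeStream}} are stand-ins invented for illustration, not Flink or Hadoop classes, and the streams only record which call they would make instead of touching a real file system.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for FSDataOutputStream; NOT the real Flink class. */
abstract class SketchOutputStream {
    /** Pushes buffered bytes onward, without any durability guarantee. */
    abstract void flush() throws IOException;

    /** Forces the written data to durable storage. */
    abstract void sync() throws IOException;

    /** Proposed hook: make the data safe for a snapshot. Defaults to sync(). */
    void persist() throws IOException {
        sync();
    }
}

/** Records calls so the behaviour can be inspected without a file system. */
class RecordingStream extends SketchOutputStream {
    final List<String> calls = new ArrayList<>();

    @Override
    void flush() {
        calls.add("flush");
    }

    @Override
    void sync() {
        calls.add("sync");
    }
}

/** Hypothetical HDFS-like stream: persist() maps to an hflush-style call. */
class HdfsLikeStream extends RecordingStream {
    @Override
    void persist() {
        calls.add("hflush");
    }
}

/** Hypothetical S3-like stream: persist() maps to an hsync-style call. */
class S3LikeStream extends RecordingStream {
    @Override
    void persist() {
        calls.add("hsync");
    }
}
```

With something like this, the sink would only ever call {{persist()}} when taking a snapshot, and the choice between the two Hadoop calls would stay inside the file system specific stream.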



> Make Bucketing Sink independent of Hadoop's FileSystem
> ------------------------------------------------------
>
>                 Key: FLINK-5789
>                 URL: https://issues.apache.org/jira/browse/FLINK-5789
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming Connectors
>    Affects Versions: 1.2.0, 1.1.4
>            Reporter: Stephan Ewen
>            Assignee: Gary Yao
>             Fix For: 1.4.0
>
>
> The {{BucketingSink}} is hard-wired to Hadoop's FileSystem, bypassing Flink's file system abstraction.
> This causes several issues:
>   - The bucketing sink will behave differently than other file sinks with respect to configuration
>   - Directly supported file systems (not through Hadoop) like the MapR File System do not work in the same way with the BucketingSink as other file systems
>   - The previous point is all the more problematic in the effort to make Hadoop an optional dependency and to run within other stacks (Mesos, Kubernetes, AWS, GCE, Azure) with ideally no Hadoop dependency.
> We should port the {{BucketingSink}} to use Flink's FileSystem classes.
> To support the *truncate* functionality that is needed for the exactly-once semantics of the Bucketing Sink, we should extend Flink's FileSystem abstraction to have the methods
>   - {{boolean supportsTruncate()}}
>   - {{void truncate(Path, long)}}
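
As an illustration of the two methods listed above, the following is a minimal sketch, not Flink's actual API. {{TruncatingFileSystem}}, {{LocalTruncatingFileSystem}} and {{RestoreSketch}} are hypothetical names, {{java.nio.file.Path}} stands in for Flink's own {{Path}} class, and the local implementation relies on the standard {{FileChannel.truncate(long)}} call.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical slice of a file system abstraction: only the two proposed methods. */
interface TruncatingFileSystem {
    boolean supportsTruncate();

    void truncate(Path file, long length) throws IOException;
}

/** Local-disk implementation built on java.nio, for illustration only. */
class LocalTruncatingFileSystem implements TruncatingFileSystem {
    @Override
    public boolean supportsTruncate() {
        return true;
    }

    @Override
    public void truncate(Path file, long length) throws IOException {
        // Open for writing and cut the file back to the requested length.
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            channel.truncate(length);
        }
    }
}

class RestoreSketch {
    /** Cuts a part file back to the length recorded in the last checkpoint. */
    static void restore(TruncatingFileSystem fs, Path file, long validLength)
            throws IOException {
        if (!fs.supportsTruncate()) {
            // A real sink would need some fallback here; this sketch just fails.
            throw new UnsupportedOperationException("truncate is not supported");
        }
        fs.truncate(file, validLength);
    }
}
```

On restore, a sink could call {{RestoreSketch.restore(fs, partFile, validLength)}} to discard bytes written after the last successful checkpoint, which is the role truncate plays for exactly-once output.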



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)