Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/02/28 21:46:00 UTC

[jira] [Commented] (FLINK-8814) Control over the extension of part files created by BucketingSink

    [ https://issues.apache.org/jira/browse/FLINK-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381072#comment-16381072 ] 

ASF GitHub Bot commented on FLINK-8814:
---------------------------------------

GitHub user jelmerk opened a pull request:

    https://github.com/apache/flink/pull/5603

    [FLINK-8814] Control over the extension of part files created by BucketingSink

    ## What is the purpose of the change
    
    Popular tools such as Hue and the Avro connector for Spark require files stored on HDFS to have the .avro extension. This patch makes it possible to configure a part file suffix.
    
    ## Brief change log
    
    - adds support for partSuffix in BucketingSink
    
    ## Verifying this change
    
    The basic functionality of BucketingSink is verified by BucketingSinkTest. The structure of that test makes it awkward to test this change in isolation.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): no
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
      - The serializers: no
      - The runtime per-record code paths (performance sensitive): no
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
      - The S3 file system connector: no
    
    ## Documentation
    
      - Does this pull request introduce a new feature? yes
      - If yes, how is the feature documented? JavaDocs


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jelmerk/flink FLINK_8814

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5603
    
----
commit 6487f0fb03870a74b948105ec685462e7b00cbc2
Author: Jelmer Kuperus <jk...@...>
Date:   2018-02-28T20:34:08Z

    [FLINK-8814] [file system sinks] Control over the extension of part files created by BucketingSink.

----


> Control over the extension of part files created by BucketingSink
> -----------------------------------------------------------------
>
>                 Key: FLINK-8814
>                 URL: https://issues.apache.org/jira/browse/FLINK-8814
>             Project: Flink
>          Issue Type: Improvement
>          Components: Streaming Connectors
>    Affects Versions: 1.4.0
>            Reporter: Jelmer Kuperus
>            Priority: Major
>
> BucketingSink creates files with the following pattern
> {noformat}
> partPrefix + "-" + subtaskIndex + "-" + bucketState.partCounter{noformat}
> When using checkpointing you have no control over the extension of the final files generated. This is inconvenient when, for instance, you are writing files in the Avro format, because
>  # [Hue|http://gethue.com/] will not be able to render the files as Avro. See this [file|https://github.com/cloudera/hue/blob/master/apps/filebrowser/src/filebrowser/views.py#L730]
>  # [Spark avro|https://github.com/databricks/spark-avro/] will not be able to read the files unless you set a special property. See [this ticket|https://github.com/databricks/spark-avro/issues/203]
> It would be good if we had the ability to customize the extension of the files created.
>  
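
To make the naming pattern quoted above concrete, here is a minimal sketch of how a part file name could be assembled with an optional suffix. It is not the code from the pull request: the class `PartFileNames` and the method `partFileName` are hypothetical names used only for illustration, and the sketch assumes the suffix is simply appended to the end of the existing pattern when one is configured.

```java
// Illustrative sketch only; not taken from BucketingSink or from the patch.
public final class PartFileNames {

    private PartFileNames() {
    }

    /**
     * Builds a name following the pattern from the issue description:
     * partPrefix + "-" + subtaskIndex + "-" + partCounter,
     * then appends partSuffix if one is given (assumption: plain append).
     */
    public static String partFileName(
            String partPrefix, int subtaskIndex, long partCounter, String partSuffix) {
        String name = partPrefix + "-" + subtaskIndex + "-" + partCounter;
        // A null suffix keeps the original behaviour described in the issue.
        return partSuffix == null ? name : name + partSuffix;
    }

    public static void main(String[] args) {
        // Without a suffix the name has no extension.
        System.out.println(partFileName("part", 0, 3, null));
        // With a suffix, tools that key on the extension can recognise the file.
        System.out.println(partFileName("part", 0, 3, ".avro"));
    }
}
```

In this sketch the caller supplies the leading dot as part of the suffix, so a suffix is not forced to be a file extension.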



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)