Posted to issues@flink.apache.org by "Grzegorz Kołakowski (Jira)" <ji...@apache.org> on 2020/01/15 14:37:00 UTC

[jira] [Commented] (FLINK-11395) Support for Avro StreamingFileSink

    [ https://issues.apache.org/jira/browse/FLINK-11395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016037#comment-17016037 ] 

Grzegorz Kołakowski commented on FLINK-11395:
---------------------------------------------

In fact, my previous comment does not make much sense.

In addition to the metadata kept in the header, the file consists of one or more data blocks, which are separated by 16-byte sync markers ([docs|https://avro.apache.org/docs/1.9.1/spec.html#Object+Container+Files]). So what about modifying the existing {{BulkWriter.Factory}} to have two create methods, e.g. createForNewFile and createForResumedFile? Alternatively, let's introduce a new interface designed for appendable block formats like Avro. This way, we can easily wrap an {{org.apache.avro.file.DataFileWriter}} with the interface.

The difficult part might be resuming a file. To this end, the {{DataFileWriter}} has to read the file header (via the {{org.apache.avro.file.SeekableInput}} interface) in order to load the schema, sync marker, codec, etc. I suspect this may require further changes in interfaces such as {{ResumeRecoverable}}.
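To make the idea concrete, here is a minimal, self-contained sketch of a factory with two create paths. The names createForNewFile and createForResumedFile are only the suggestion above, not an existing Flink API, and the toy byte layout (header, then blocks each followed by a 16-byte sync marker) merely stands in for Avro's real container format, which {{org.apache.avro.file.DataFileWriter}} would handle behind such an interface:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical sketch: a writer factory with separate create paths for new
// vs. resumed files. The toy "format" here is header + (block, sync marker)*,
// mirroring the structure of Avro object container files.
public class AppendableFormatSketch {

    interface BlockWriter {
        void writeBlock(byte[] data) throws IOException;
    }

    interface BlockWriterFactory {
        // Fresh file: write the header (which carries the sync marker) first.
        BlockWriter createForNewFile(ByteArrayOutputStream out) throws IOException;

        // Resumed file: re-read the existing header to recover the sync
        // marker (as DataFileWriter does via SeekableInput), then append
        // new blocks terminated by the same marker.
        BlockWriter createForResumedFile(ByteArrayOutputStream out, byte[] existing) throws IOException;
    }

    static final int SYNC_SIZE = 16; // Avro also uses 16-byte sync markers

    static class ToyFactory implements BlockWriterFactory {
        @Override
        public BlockWriter createForNewFile(ByteArrayOutputStream out) throws IOException {
            byte[] sync = new byte[SYNC_SIZE];
            Arrays.fill(sync, (byte) 0x5A); // fixed marker for the demo only
            out.write(sync);                // toy "header" = just the marker
            return data -> { out.write(data); out.write(sync); };
        }

        @Override
        public BlockWriter createForResumedFile(ByteArrayOutputStream out, byte[] existing) throws IOException {
            // Recover the sync marker from the existing file's header.
            byte[] sync = Arrays.copyOfRange(existing, 0, SYNC_SIZE);
            out.write(existing);            // keep previously written bytes
            return data -> { out.write(data); out.write(sync); };
        }
    }

    // Writes one block, "resumes" the file, writes a second block,
    // and returns the total size of the resumed file.
    static int demo() throws IOException {
        ToyFactory factory = new ToyFactory();

        ByteArrayOutputStream first = new ByteArrayOutputStream();
        factory.createForNewFile(first).writeBlock("block-1".getBytes());

        ByteArrayOutputStream resumed = new ByteArrayOutputStream();
        factory.createForResumedFile(resumed, first.toByteArray())
               .writeBlock("block-2".getBytes());

        // Expected layout: header(16) + block(7) + sync(16) + block(7) + sync(16)
        return resumed.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

The point of the two methods is that only the resumed path needs access to the already-written bytes; a single-method factory has no way to distinguish the two cases.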

I've noticed the fix version has been updated to 1.11.0. May I ask which approach will be implemented? Will it be possible to control when a file is closed, as with the row-based writer, or will the file be closed on checkpoint, as with the bulk writer?

> Support for Avro StreamingFileSink
> ----------------------------------
>
>                 Key: FLINK-11395
>                 URL: https://issues.apache.org/jira/browse/FLINK-11395
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>            Reporter: Elango Ganesan
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
>
> The current implementation of StreamingFileSink supports RowFormat for text and JSON files, and BulkWriter has support for Parquet files. Out of the box, RowFormat does not seem to be the right fit for writing Avro files, as Avro files need to persist metadata when opening a new file. We are thinking of implementing Avro writers similar to how Parquet is implemented. I am happy to submit a PR if this approach sounds good.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)