Posted to user@flink.apache.org by Edward Rojas <ed...@gmail.com> on 2018/11/16 10:31:31 UTC

BucketingSink vs StreamingFileSink

Hello,
We are currently using Flink 1.5 and we use the BucketingSink to save the
result of job processing to HDFS.
The data is in JSON format and we store one object per line in the resulting
files. 

We are planning to upgrade to Flink 1.6 and we see that there is this new
StreamingFileSink. From the description it looks very similar to
BucketingSink when using a row-encoded output format. My question is:
should we consider moving to StreamingFileSink?

I would like to better understand the suggested use cases for each of the
two options.

We are also considering additionally outputting the data in Parquet format
for data scientists (to be stored in HDFS as well). For this I see some
utilities for working with StreamingFileSink, so I guess that option is
recommended for this case?
Is it possible to use the Parquet writers even when the schema of the data
may evolve?

Thanks in advance for your help.
(Sorry if I put too many questions in the same message)



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: BucketingSink vs StreamingFileSink

Posted by Edward Alexander Rojas Clavijo <ed...@gmail.com>.
Thank you very much for the information Andrey.

I'll try the migration of what we have now on my side and add the sink
with Parquet, and I'll get back to you if I have more questions :)

Edward

On Fri, 16 Nov 2018 at 19:54, Andrey Zagrebin (<andrey@data-artisans.com>)
wrote:

> Hi,
>
> StreamingFileSink is supposed to subsume BucketingSink which will be
> deprecated.
>
> StreamingFileSink fixes some issues of BucketingSink, especially with
> AWS S3, and adds more flexibility in defining the rolling policy.
>
> StreamingFileSink does not support older Hadoop versions at the moment,
> but there are ideas on how to resolve this.
>
> You can have a look at how to use StreamingFileSink with Parquet here [1].
>
> I also cc’ed Kostas, he might add more to this topic.
>
> Best,
> Andrey
>
> [1]
> https://github.com/apache/flink/blob/0b4947b6142f813d2f1e0e662d0fefdecca0e382/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java
>
> > On 16 Nov 2018, at 11:31, Edward Rojas <ed...@gmail.com> wrote:
> >
> > Hello,
> > We are currently using Flink 1.5 and we use the BucketingSink to save the
> > result of job processing to HDFS.
> > The data is in JSON format and we store one object per line in the
> resulting
> > files.
> >
> > We are planning to upgrade to Flink 1.6 and we see that there is this new
> > StreamingFileSink,  from the description it looks very similar to
> > BucketingSink when using Row-encoded Output Format, my question is,
> should
> > we consider to move to StreamingFileSink?
> >
> > I would like to better understand what are the suggested use cases for
> each
> > of the two options now (?)
> >
> > We are also considering to additionally output the data in Parquet format
> > for data scientists (to be stored in HDFS as well), for this I see some
> > utils to work with StreamingFileSink, so I guess for this case it's
> > recommended to use that option(?).
> > Is it possible to use the Parquet writers even when the schema of the
> data
> > may evolve ?
> >
> > Thanks in advance for your help.
> > (Sorry if I put too many questions in the same message)
> >
> >
> >
> > --
> > Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
>

-- 
*Edward Alexander Rojas Clavijo*



*Software Engineer*
*Hybrid Cloud*
*IBM France*

Re: BucketingSink vs StreamingFileSink

Posted by Andrey Zagrebin <an...@data-artisans.com>.
Hi,

StreamingFileSink is supposed to subsume BucketingSink which will be deprecated.

StreamingFileSink fixes some issues of BucketingSink, especially with AWS S3,
and adds more flexibility in defining the rolling policy.

StreamingFileSink does not support older Hadoop versions at the moment,
but there are ideas on how to resolve this.
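
For the row-encoded, one-JSON-object-per-line case described in the
question, the migration could look roughly like this (a sketch against the
Flink 1.6 API; the output path and the jsonStream variable are placeholders,
not from the thread):

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

// jsonStream is a DataStream<String> where each element is one JSON object.
StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("hdfs:///output/json"),
                  new SimpleStringEncoder<String>("UTF-8"))
    // Example rolling policy: roll a part file every 15 minutes or at 128 MB.
    .withRollingPolicy(DefaultRollingPolicy.create()
        .withRolloverInterval(15 * 60 * 1000L)
        .withMaxPartSize(128L * 1024 * 1024)
        .build())
    .build();

jsonStream.addSink(sink);
```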

You can have a look at how to use StreamingFileSink with Parquet here [1].

I also cc’ed Kostas, he might add more to this topic.

Best,
Andrey

[1] https://github.com/apache/flink/blob/0b4947b6142f813d2f1e0e662d0fefdecca0e382/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java
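
Following the test linked as [1], the Parquet variant goes through the
bulk-encoded builder with an Avro-based writer factory. A rough sketch
(MyRecord is a hypothetical POJO standing in for the actual record type;
the path is a placeholder):

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

// recordStream is a DataStream<MyRecord>; the Avro schema is derived
// reflectively from the MyRecord class.
StreamingFileSink<MyRecord> parquetSink = StreamingFileSink
    .forBulkFormat(new Path("hdfs:///output/parquet"),
                   ParquetAvroWriters.forReflectRecord(MyRecord.class))
    .build();

recordStream.addSink(parquetSink);
```

On the schema-evolution question: each part file is written with the schema
in effect at write time, so readers would typically rely on Avro's
schema-resolution rules (e.g. adding optional fields with defaults) rather
than on the sink itself.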
