Posted to user@flink.apache.org by Kailash Dayanand <kd...@lyft.com> on 2019/05/10 16:27:53 UTC

Re: Use case for StreamingFileSink: Different parquet writers within the Sink

Hello,

I was able to solve this by creating a data model in which all the
incoming events are wrapped in a message envelope, and writing a sink
for a DataStream of these message envelopes. I also ended up creating
the Parquet writers not when the BulkWriter is constructed, but lazily
inside its addElement method. Adding this information for posterity in
case someone needs to handle the same use case.
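
Roughly, the writer ended up looking something like the sketch below
(simplified, not the exact code; MessageEnvelope with getSchema() and
getRecord() stands in for our internal envelope type, and the factory
class is ParquetAvroWriters here, which newer Flink releases expose as
AvroParquetWriters):

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;

public class EnvelopeParquetBulkWriter implements BulkWriter<MessageEnvelope> {

    private final FSDataOutputStream out;
    private BulkWriter<GenericRecord> delegate; // created lazily on the first element

    public EnvelopeParquetBulkWriter(FSDataOutputStream out) {
        this.out = out;
    }

    @Override
    public void addElement(MessageEnvelope envelope) throws IOException {
        if (delegate == null) {
            // Build the Parquet writer from the schema carried in the envelope,
            // instead of a schema fixed when the sink is constructed.
            Schema schema = envelope.getSchema();
            delegate = ParquetAvroWriters.forGenericRecord(schema).create(out);
        }
        delegate.addElement(envelope.getRecord());
    }

    @Override
    public void flush() throws IOException {
        if (delegate != null) {
            delegate.flush();
        }
    }

    @Override
    public void finish() throws IOException {
        if (delegate != null) {
            delegate.finish();
        }
    }
}

The matching factory is then just something like
out -> new EnvelopeParquetBulkWriter(out), handed to
StreamingFileSink.forBulkFormat(...), with the bucket assigner deciding
which file each envelope ends up in.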

Thanks
Kailash

On Sun, Apr 28, 2019 at 3:57 PM Kailash Dayanand <kd...@lyft.com> wrote:

> We have the following use case: We are reading a stream of events which we
> want to write to different parquet files based on data within the element
> <IN>. The end goal is to register these parquet files in hive to query. I
> was exploring the option of using StreamingFileSink for this use case but
> found a few things which I could not customize.
>
> It looks like StreamingFileSink takes a single schema / provides a
> ParquetWriter for a specific schema. Since the elements need different
> Avro schemas depending on their data, I could not use the sink as-is
> (AvroParquetWriters requires specifying the same Schema for the
> parquetBuilder). Looking a bit deeper, I found that there is a
> WriterFactory here: https://tinyurl.com/y68drj35 . This could be
> extended to create a BulkPartWriter based on the BucketID, something
> like BulkWriter.Factory<IN, BucketID> writerFactory. That way a unique
> ParquetWriter could be created for each bucket. Is there any other
> option to do this?
>
> I also considered using a single common schema covering all the
> different schemas, but I have not fully explored that option.
>
> Thanks
> Kailash
>
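
For context, the stock single-schema setup referred to in the quoted
message looks roughly like this (sketch only; the output path is
illustrative, and the factory class may be named AvroParquetWriters
in newer Flink releases):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class SingleSchemaParquetSink {

    // 'stream' and 'schema' are assumed to come from the surrounding job.
    static void attachSink(DataStream<GenericRecord> stream, Schema schema) {
        StreamingFileSink<GenericRecord> sink =
            StreamingFileSink
                .forBulkFormat(
                    new Path("hdfs:///tmp/parquet-output"),      // illustrative output path
                    ParquetAvroWriters.forGenericRecord(schema)) // one schema fixed at build time
                .build();

        stream.addSink(sink);
    }
}

With the envelope approach described above, the same forBulkFormat call
simply takes the envelope-aware writer factory instead of a factory tied
to one schema.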