Posted to user@flink.apache.org by Kamil ty <ka...@gmail.com> on 2021/12/02 23:11:50 UTC

Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

Hello,

I'm wondering whether it is possible to create a Parquet streaming file
sink in PyFlink (in the Table API) or in Java Flink (in the DataStream API).

To give an example of the expected behaviour: each element of the stream
will contain a JSON string. I want to save this stream to Parquet files
without having to explicitly define the schema/types of the messages (and
using a single sink).

If this is possible (perhaps in Java Flink using a custom
ParquetBulkWriterFactory, etc.), any direction for the implementation would
be appreciated.

Best regards
Kamil

Re: Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

Posted by Georg Heiler <ge...@gmail.com>.
Hi,

the schema of the "after" part depends on the table, i.e. it holds different
columns for each table. So: do you receive Debezium changelog statements for
more than one table? I.e., is the schema in the "after" part different for
each?
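
For reference, a Debezium changelog message is an envelope along these
lines (the values here are made up; the shape of "after" mirrors the
captured table):

{
  "before": null,
  "after": {"id": 42, "name": "example"},
  "source": {"db": "inventory", "table": "customers"},
  "op": "c",
  "ts_ms": 1638489600000
}

If "after" differs per table, a single fixed-schema Parquet writer cannot
cover all of them.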

Best,
Georg

On Fri, Dec 3, 2021 at 08:35, Kamil ty <ka...@gmail.com> wrote:

> Yes, the general JSON schema should follow the Debezium JSON schema. The
> fields that need to be saved to the Parquet file are in the "after" key.
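
For what it's worth, pulling the "after" payload out before any schema
handling could look roughly like this (a Jackson-based sketch, untested;
AfterExtractor is just an illustrative name):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class AfterExtractor extends RichFlatMapFunction<String, String> {

    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        // Created per task instance, so it is never serialized with the job.
        mapper = new ObjectMapper();
    }

    @Override
    public void flatMap(String changelogJson, Collector<String> out)
            throws Exception {
        JsonNode after = mapper.readTree(changelogJson).get("after");
        // Debezium delete events carry "after": null -- skip those here.
        if (after != null && !after.isNull()) {
            out.collect(after.toString());
        }
    }
}

Usage would be something like rawStream.flatMap(new AfterExtractor()),
where rawStream is the incoming changelog stream. That still leaves the
per-table schema question open, of course.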

Re: Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

Posted by Georg Heiler <ge...@gmail.com>.
Do the JSONs have the same schema overall? Or is each potentially
structured differently?

Best,
Georg
