Posted to user@flink.apache.org by James Sandys-Lumsdaine <ja...@hotmail.com> on 2021/08/10 14:58:20 UTC

Questions on reading JDBC data with Flink Streaming API

Hello,



I'm starting a new Flink application to allow my company to perform lots of reporting. We have an existing legacy system with most of the data we need held in SQL Server databases. We will need to consume data from these databases initially, before starting to consume more data from newly deployed Kafka streams.



I've spent a lot of time reading the Flink book and web pages, but I have some simple questions and assumptions I hope you can help with so I can make progress.



Firstly, I want to use the DataStream API so we can consume both historic data and real-time data. I do not think I want to use the DataSet API, and I don't see the point in using the SQL/Table APIs either, as I would prefer to write my functions in Java classes. I need to maintain my own state, and it seems DataStream keyed functions are the way to go - something like the sketch below.
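
A minimal sketch of the kind of keyed-state function I mean, assuming a hypothetical Trade POJO (the class and field names are mine, purely for illustration):

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class RunningTotals {

        // Hypothetical event type, for illustration only.
        public static class Trade {
            public String account;
            public double notional;
        }

        // Keeps one running total per account; Flink scopes the ValueState
        // to the current key automatically.
        public static class RunningTotalFunction
                extends KeyedProcessFunction<String, Trade, Double> {

            private transient ValueState<Double> total;

            @Override
            public void open(Configuration parameters) {
                total = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("total", Types.DOUBLE));
            }

            @Override
            public void processElement(Trade trade, Context ctx, Collector<Double> out)
                    throws Exception {
                Double current = total.value();
                double updated = (current == null ? 0.0 : current) + trade.notional;
                total.update(updated);
                out.collect(updated);
            }
        }
    }

which I would wire up with something like trades.keyBy(t -> t.account).process(new RunningTotalFunction()).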



Now that I am trying to actually write code against our production databases, I need to be able to read in "streams" of data with SQL queries. There does not appear to be a JDBC source connector for the DataStream API, so I think I have to make the JDBC call myself and then possibly create a source using env.fromElements(). Obviously this is a "bounded" data set, but how else am I meant to get historic data loaded in? In the future I want to include a Kafka stream as well, which will only hold a few weeks' worth of data, so I imagine I will sometimes need to merge data from a SQL Server/Snowflake database with a live Kafka stream. What is the best practice for this? I don't see any examples discussing it.
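
The closest thing I have found is JdbcInputFormat from flink-connector-jdbc, which at least covers the bounded historic read. A sketch of what I mean (the driver, URL, credentials, query and column types below are all placeholders):

    import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
    import org.apache.flink.api.java.typeutils.RowTypeInfo;
    import org.apache.flink.connector.jdbc.JdbcInputFormat;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.types.Row;

    public class JdbcBoundedRead {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Bounded source: runs the query once, emits each row, then finishes.
            JdbcInputFormat jdbcInput = JdbcInputFormat.buildJdbcInputFormat()
                    .setDrivername("com.microsoft.sqlserver.jdbc.SQLServerDriver")
                    .setDBUrl("jdbc:sqlserver://myhost:1433;databaseName=reporting") // placeholder
                    .setUsername("user")     // placeholder
                    .setPassword("password") // placeholder
                    .setQuery("SELECT id, name FROM positions")                      // placeholder
                    .setRowTypeInfo(new RowTypeInfo(
                            BasicTypeInfo.INT_TYPE_INFO,
                            BasicTypeInfo.STRING_TYPE_INFO))
                    .finish();

            DataStream<Row> historic = env.createInput(jdbcInput);
            historic.print();
            env.execute("jdbc-bounded-read");
        }
    }

I am guessing that once the Kafka source exists I would map both streams to a common type and merge them with historic.union(liveStream), but I'd like to know if that is the right pattern.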



With retrieving data from a JDBC source, I have also seen some examples using a StreamTableEnvironment - am I meant to use that somehow instead, to query data from a JDBC connection into my DataStream functions? Again, I want to write my functions in Java, not in Flink SQL. Is it best practice to use a StreamTableEnvironment to query JDBC data if I'm only using the DataStream API? The sketch below is the kind of thing I have seen.
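
For completeness, this is the pattern I have seen, with the Table API used only to define the JDBC source and then bridged straight back to a DataStream (all DDL values are placeholders, and I would need to check our database's JDBC dialect is actually supported by the connector):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class JdbcViaTableApi {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // Define the JDBC source once as DDL; everything after this
            // stays in the DataStream API.
            tEnv.executeSql(
                    "CREATE TABLE positions (" +
                    "  id INT," +
                    "  name STRING" +
                    ") WITH (" +
                    "  'connector' = 'jdbc'," +
                    "  'url' = 'jdbc:postgresql://myhost:5432/reporting'," + // placeholder
                    "  'table-name' = 'positions'," +
                    "  'username' = 'user'," +
                    "  'password' = 'password'" +
                    ")");

            // Bridge back to a DataStream so the functions can stay plain Java.
            DataStream<Row> rows = tEnv.toDataStream(tEnv.from("positions"));
            rows.print();
            env.execute("jdbc-via-table-api");
        }
    }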



Thanks in advance - I'm sure I will have plenty more high-level questions like this.


Re: Questions on reading JDBC data with Flink Streaming API

Posted by Fuyao Li <fu...@oracle.com>.
Hello James,

To stream real-time data out of the database, you need to spin up a CDC engine, for example Debezium [1]. The CDC engine streams the changed data out to Kafka (for example), and you can then consume the messages from Kafka using a FlinkKafkaConsumer.
For historic data, it can be treated as bounded data stream processing using the JDBC connector [2].
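
A minimal sketch of the consuming side, assuming a placeholder broker address and a placeholder topic written by Debezium; parsing the Debezium JSON change events is left to a downstream map function:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class CdcKafkaRead {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
            props.setProperty("group.id", "reporting");               // placeholder

            // Each record is one Debezium change event, delivered as a JSON string.
            FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
                    "server1.dbo.positions", // placeholder topic name
                    new SimpleStringSchema(),
                    props);
            consumer.setStartFromEarliest(); // replay whatever Kafka still retains

            DataStream<String> changes = env.addSource(consumer);
            changes.print();
            env.execute("cdc-kafka-read");
        }
    }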

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/debezium/
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/jdbc/

Thanks,
Fuyao

