Posted to user@flink.apache.org by Stefano Lissa <sa...@gmail.com> on 2019/06/15 16:27:38 UTC

Best practice to process DB stored log (is Flink the right choice?)

Hi,
surely a really newbie question, but Flink sounds like the best choice.
I have a log continuously added to a database table where a machine status
is stored with a timestamp (the status is 0, 1, 2).

What I need is to process this status and produce another stream where each
sequence of a status X is aggregated into a new record containing the
timestamp of the first status X found in the input stream and the time delta
until a status different from X is seen.

Is a DataStream connector to a database table available? I've tried to find
something in the documentation, but I'm not sure I searched in the right
place.

Is Flink an optimal option for that rather simple processing?

Thank you, Stefano.

Re: Best practice to process DB stored log (is Flink the right choice?)

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi,

Those are good questions.

> Is a DataStream connector to a database table available?

What table, what database system do you mean? You can check the list of existing connectors provided by Flink in the documentation. About reading from a relational DB (e.g. using JDBC), you can read a little bit here: https://stackoverflow.com/questions/48162464/how-to-read-data-from-relational-database-in-apache-flink-streaming
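For illustration only, here is a minimal Python sketch of the polling pattern such a JDBC-style source implements: remember the last timestamp you saw and repeatedly select anything newer. The table name (machine_log) and column names (ts, status) are my assumptions, not from the original post, and an in-memory SQLite database stands in for the real one; an actual Flink job would use a connector instead.

```python
import sqlite3

def fetch_new_rows(conn, last_ts):
    """Return rows appended since last_ts, ordered by timestamp."""
    cur = conn.execute(
        "SELECT ts, status FROM machine_log WHERE ts > ? ORDER BY ts",
        (last_ts,),
    )
    return cur.fetchall()

# Demo: an in-memory table standing in for the machine-status log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machine_log (ts INTEGER, status INTEGER)")
conn.executemany(
    "INSERT INTO machine_log VALUES (?, ?)",
    [(1, 0), (2, 0), (3, 1)],
)
rows = fetch_new_rows(conn, last_ts=1)
print(rows)  # -> [(2, 0), (3, 1)]
```

A real source would run this in a loop, advancing last_ts after each batch; the connectors linked above handle that (plus failure recovery) for you.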

> Is Flink an optimal option for that rather simple processing?

It depends on many things. What you have described could easily be done by some trivial Python script; however, you would have to answer a couple of questions for yourself:

- what is the scale at which you would like to operate? Would your computation need to be distributed across multiple machines in the foreseeable future?
- do you care about reliability? What should happen in case of failures? Do you need High Availability?
- could you have more use cases/requirements in the future?
- do you care about at-least-once or exactly-once processing guarantees?
- do you care if you lose your computation state in case of failure?
- how do you want to deploy your job? (Flink provides out-of-the-box integration with many systems like Mesos, YARN, etc.)
- will you need to integrate with some other external systems for which Flink has built-in support (like the S3 file system, Kafka, Kinesis, …)?
- do you care about monitoring your job? (Flink has built-in metrics)
- …
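As a rough illustration of the "trivial script" route, here is a sketch of the aggregation described in the question: collapse each run of identical statuses into a record with the status, the timestamp of its first occurrence, and the delta until a different status appears. The (status, first_ts, delta) record shape is an assumption on my part.

```python
def aggregate_runs(events):
    """events: iterable of (timestamp, status) ordered by timestamp.
    Yields (status, first_ts, delta) for each completed run, where delta
    runs until the first row with a different status."""
    run_status = run_start = None
    for ts, status in events:
        if status != run_status:
            if run_status is not None:
                yield (run_status, run_start, ts - run_start)
            run_status, run_start = status, ts
    # The final run is still open; a real streaming job would hold it
    # until the next status change (or a timeout) arrives.

log = [(0, 1), (5, 1), (9, 2), (12, 2), (20, 0)]
runs = list(aggregate_runs(log))
print(runs)  # -> [(1, 0, 9), (2, 9, 11)]
```

In Flink terms this maps naturally onto a keyed stream with a small piece of state holding the current run; the questions above decide whether that machinery is worth it.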

If you do not care about those things and you only need to process a small number of records per second, then Flink might be overkill. If not, or if you are not sure, then I would encourage you to read about and research the above-mentioned things to make up your mind.

Piotrek
