Posted to user@flink.apache.org by Declan Harrison <de...@optiva.com> on 2021/09/30 07:47:53 UTC

In flight records on Flink : Newbie question

Hi Guys

I've just recently started using Apache Flink to evaluate its suitability
for a project I'm working on.

First impressions are that the project is great, well documented and has
lots of examples and guidance showcasing the multitude of things that it
can do. It can be challenging to know where to start at times, as there are
many ways to achieve the same result.

So my pipeline is similar to an ETL: I have a continuous DataStream source
of Java records modelled as POJOs, and I transform each POJO into a single
JSON record before writing it to a streaming sink.  This all works as
expected.

However, my question is in 2 parts and I hope you can help. Apologies in
advance if this question highlights my lack of experience.

   - Can I get access to the in-flight records that are currently being
   processed within Flink? By in-flight I mean the records that are currently
   being processed but haven't yet been written to the sink.
   - Is the number of in-flight records deterministic? How many records does
   Flink process per subtask/thread? For example, might it be 1 record at a
   time per subtask?

Thanks
Declan

Re: In flight records on Flink : Newbie question

Posted by Declan Harrison <de...@optiva.com>.
Many thanks, Fabian, for your prompt replies. Much appreciated.

Thanks
Declan

On Wed, Oct 6, 2021 at 8:38 AM Fabian Paul <fa...@ververica.com> wrote:

> Hi Declan,
>
> As far as I know the FileSink does not buffer records but writes the
> records to temporary files which are bucketed later. For the Elasticsearch
> sink you are right: it buffers the records before flushing them to
> Elasticsearch, but you can control the flushing behaviour based on a given
> interval or the buffer size (records, megabytes).
>
> Best,
> Fabian

Re: In flight records on Flink : Newbie question

Posted by Fabian Paul <fa...@ververica.com>.
Hi Declan,

As far as I know the FileSink does not buffer records but writes the records to temporary files which are bucketed later. For the Elasticsearch sink
you are right: it buffers the records before flushing them to Elasticsearch, but you can control the flushing behaviour based on a given interval
or the buffer size (records, megabytes).
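
To make the flushing behaviour described above concrete, here is a minimal plain-Java sketch of a writer that buffers records and flushes once a record count, a byte size, or a time interval is exceeded. It is an illustration only, not the Elasticsearch connector's code: the class BufferingWriter, its constructor parameters and inFlightCount() are hypothetical names, and the interval is checked only when a record arrives, whereas a real sink would typically also flush from a timer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical illustration; not a Flink or Elasticsearch connector class. */
final class BufferingWriter {
    private final int maxRecords;            // flush when this many records are buffered
    private final long maxBytes;             // flush when the buffered payload reaches this size
    private final long flushIntervalMillis;  // flush when the last flush is at least this old
    private final LongSupplier clock;        // injected so the timing can be tested
    private final Consumer<List<String>> flushTarget;

    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private long lastFlushMillis;

    BufferingWriter(int maxRecords, long maxBytes, long flushIntervalMillis,
                    LongSupplier clock, Consumer<List<String>> flushTarget) {
        this.maxRecords = maxRecords;
        this.maxBytes = maxBytes;
        this.flushIntervalMillis = flushIntervalMillis;
        this.clock = clock;
        this.flushTarget = flushTarget;
        this.lastFlushMillis = clock.getAsLong();
    }

    /** Buffers one record and flushes if any of the three thresholds is reached. */
    void write(String jsonRecord) {
        buffer.add(jsonRecord);
        bufferedBytes += jsonRecord.getBytes(StandardCharsets.UTF_8).length;

        boolean tooMany = buffer.size() >= maxRecords;
        boolean tooBig = bufferedBytes >= maxBytes;
        boolean tooOld = clock.getAsLong() - lastFlushMillis >= flushIntervalMillis;
        if (tooMany || tooBig || tooOld) {
            flush();
        }
    }

    /** Hands the buffered records to the target as one batch and resets the buffer. */
    void flush() {
        if (!buffer.isEmpty()) {
            flushTarget.accept(new ArrayList<>(buffer));
            buffer.clear();
            bufferedBytes = 0;
        }
        lastFlushMillis = clock.getAsLong();
    }

    /** Number of records accepted by the writer but not yet flushed. */
    int inFlightCount() {
        return buffer.size();
    }
}
```

With thresholds like these, the number of records held in the sink is bounded by configuration but not fixed: it varies between zero and the configured maximum depending on arrival rate. In the Elasticsearch connector the corresponding thresholds are, to my knowledge, set on the sink builder via setBulkFlushMaxActions, setBulkFlushMaxSizeMb and setBulkFlushInterval; please check the connector documentation for your Flink version.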

Best,
Fabian

Re: In flight records on Flink : Newbie question

Posted by Declan Harrison <de...@optiva.com>.
Hi Fabian

I am currently using the streaming file sink to local disk, though
potentially this sink could change to be Elasticsearch.

Declan

On Mon, Oct 4, 2021 at 1:16 PM Fabian Paul <fa...@ververica.com> wrote:

> Hi Declan,
>
> I forgot to ask which sink you are using. I do not think it is generally
> true that all sinks buffer records and only send them periodically. It
> depends a lot on the connector and on what capabilities the external
> system you are writing to offers.
>
> The amount of buffered data in the sink is driven solely by the
> implementation of the sink.
>
> Best,
> Fabian

Re: In flight records on Flink : Newbie question

Posted by Fabian Paul <fa...@ververica.com>.
Hi Declan,

I forgot to ask which sink you are using. I do not think it is generally true that all sinks buffer records and only send them periodically. It
depends a lot on the connector and on what capabilities the external system you are writing to offers.

The amount of buffered data in the sink is driven solely by the implementation of the sink.

Best,
Fabian

Re: In flight records on Flink : Newbie question

Posted by Declan Harrison <de...@optiva.com>.
Hi Fabian

It is primarily a case of understanding how many records are likely to be
buffered by the sink, still awaiting processing. We are streaming event
records to a sink for downstream processing in as close to real time as
possible, but wondered how many might be buffered by Flink and whether that
is deterministic or controllable (via configuration).

Thanks for mentioning metrics. I will take a look at the metrics via REST
and maybe use the Prometheus reporter as well.

Thanks
Declan


On Fri, Oct 1, 2021 at 3:00 PM Fabian Paul <fa...@ververica.com> wrote:

> Hi Declan,
>
> Thanks for reaching out, we always welcome new users to the Apache Flink
> community :)
>
> Your first question is a bit tricky. I am still trying to understand the
> motivation behind it. In general there is no generic way to access the
> records which an operator is currently processing.
> Are you referring to records which are buffered in the sink and not yet
> sent, or are you referring to all records which are currently being
> processed by the entire pipeline?
>
> Flink provides by default a set of metrics for each operator [1]. You can
> collect them either via the REST API or via a configured reporter (JMX,
> Prometheus). The simplest way to look at the metrics is to open the web UI
> of Flink. It shows you an overview of the running pipeline and the
> metrics for all tasks.
>
> Overall, it seems you are trying to investigate some kind of performance
> problem; please correct me if I am wrong. You can also ask more detailed
> questions directly if you are seeing unexpected behaviour.
>
> Best,
> Fabian
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#io

Re: In flight records on Flink : Newbie question

Posted by Fabian Paul <fa...@ververica.com>.
Hi Declan,

Thanks for reaching out, we always welcome new users to the Apache Flink community :)

Your first question is a bit tricky. I am still trying to understand the motivation behind it. In general there is no generic way to access the records which an operator is currently processing.
Are you referring to records which are buffered in the sink and not yet sent, or are you referring to all records which are currently being processed by the
entire pipeline?

Flink provides by default a set of metrics for each operator [1]. You can collect them either via the REST API or via a configured reporter (JMX,
Prometheus). The simplest way to look at the metrics is to open the web UI of Flink. It shows you an overview of the running pipeline and the
metrics for all tasks.
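
As a rough sketch of the REST route, the following plain-Java helper builds a request for the record counters of one job vertex and fetches the JSON response. Assumptions: the JobManager REST endpoint is reachable at a base URL such as http://localhost:8081, the job and vertex IDs are placeholders you would look up first (for example via /jobs and /jobs/<jobid>), Java 11 or later is available, and the class name FlinkMetricsQuery is made up for this example. numRecordsIn and numRecordsOut are among the IO metrics listed in [1].

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for querying Flink's REST API; not part of Flink itself. */
final class FlinkMetricsQuery {

    /** Builds the URI for metrics aggregated over the subtasks of one job vertex. */
    static URI subtaskMetricsUri(String baseUrl, String jobId, String vertexId,
                                 String... metricNames) {
        String path = String.format("%s/jobs/%s/vertices/%s/subtasks/metrics",
                baseUrl, jobId, vertexId);
        // Without a "get" parameter the endpoint lists the available metric names.
        String query = metricNames.length == 0
                ? ""
                : "?get=" + String.join(",", metricNames);
        return URI.create(path + query);
    }

    /** Performs a blocking GET and returns the raw JSON body. */
    static String fetch(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder IDs: replace with values from your own running job.
        URI uri = subtaskMetricsUri("http://localhost:8081", "<jobid>", "<vertexid>",
                "numRecordsIn", "numRecordsOut");
        System.out.println(uri);
        // System.out.println(fetch(uri)); // needs a running cluster and real IDs
    }
}
```

Comparing such counters between the source and the sink gives only an approximate picture of how many records are somewhere inside the pipeline; it does not expose the records themselves.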

Overall, it seems you are trying to investigate some kind of performance problem; please correct me if I am wrong. You can also ask more
detailed questions directly if you are seeing unexpected behaviour.

Best,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#io