Posted to user@flink.apache.org by Francesco Ventura <fr...@campus.tu-berlin.de> on 2020/05/23 09:31:24 UTC

Collecting operators real output cardinalities as json files

Hi everybody, 

I would like to collect the statistics and the real output cardinalities of many job executions as JSON files. I know that a REST interface exists, but I was looking for something simpler. In practice, I would like to take the information shown in the WebUI at runtime for a job and store it as a file. I am using env.getExecutionPlan() to get the execution plan of a job with the estimated cardinalities for each operator. However, it includes only estimated cardinalities, and it can be used only before calling env.execute().

Is there a similar way to extract the real output cardinalities of each pipeline after the execution?
Is there a place where the Flink cluster stores the history of the information about executed jobs?
Is developing a REST client to extract such information the only possible way?

I would also like to avoid adding counters to the job source code: I am measuring the runtime of the execution, so I should avoid anything that can interfere with it.

Maybe it is a trivial problem, but I had a quick look around and could not find a solution.

Thank you very much,

Francesco

Re: Collecting operators real output cardinalities as json files

Posted by Francesco Ventura <fr...@campus.tu-berlin.de>.
Thank you very much for your explanation.
I will keep it in mind.

Best,

Francesco



Re: Collecting operators real output cardinalities as json files

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi Francesco,

As long as you do not set the update interval of the metric reporter to a very low value, there should be no visible performance degradation.

It may be worth keeping in mind that if your jobs are bounded (they work on bounded input and finish at some point in time), the last metric value reported before the job completes might not reflect the end state of the job. This limitation may not apply if you use the REST API, as the Job Manager might retain the values you are looking for.

Piotrek
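
Editor's sketch of the REST route mentioned above, not part of the original reply. It assumes the Job Manager REST endpoint is reachable at a base URL such as http://localhost:8081 (an assumption) and that the job details document returned by GET /jobs/<jobid> carries per-vertex "read-records" and "write-records" counters, which is the layout of the Flink releases current at the time of this thread. The function names and the output field names are made up for illustration. Nothing is added to the job source code; the script only reads what the Job Manager already exposes.

```python
import json
import urllib.request


def fetch_job_details(base_url, job_id):
    """GET <base_url>/jobs/<job_id> and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def extract_cardinalities(job):
    """Reduce a job details document to run time and per-vertex record counts.

    Missing fields come back as None, so the function tolerates layout
    differences between Flink versions instead of raising.
    """
    vertices = []
    for vertex in job.get("vertices", []):
        metrics = vertex.get("metrics", {})
        vertices.append({
            "id": vertex.get("id"),
            "name": vertex.get("name"),
            "parallelism": vertex.get("parallelism"),
            "duration_ms": vertex.get("duration"),
            "records_in": metrics.get("read-records"),
            "records_out": metrics.get("write-records"),
        })
    return {
        "job_id": job.get("jid"),
        "name": job.get("name"),
        "state": job.get("state"),
        "duration_ms": job.get("duration"),
        "vertices": vertices,
    }


def save_cardinalities(base_url, job_id, path):
    """Fetch one job, summarize it, and write the summary to a JSON file."""
    summary = extract_cardinalities(fetch_job_details(base_url, job_id))
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2)
    return summary
```

A call such as save_cardinalities("http://localhost:8081", "<jobid>", "job.json") would write one file per job. Whether the counters of a finished job are still available depends on how long the cluster keeps completed jobs, so it is worth checking this against your own setup.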



Re: Collecting operators real output cardinalities as json files

Posted by Francesco Ventura <fr...@campus.tu-berlin.de>.
Hi Piotrek,

Thank you for your reply and for your suggestions. I have just one more question.
Will using a metrics reporter and custom metrics affect the performance of the running jobs in terms of execution time? Since I need the exact netRunTime of each job, would using the REST API to get the other information be more reliable?

Thank you. Best,

Francesco



Re: Collecting operators real output cardinalities as json files

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi Francesco,

Have you taken a look at the metrics [1], and the IO metrics [2] in particular? You can use one of the pre-existing metric reporters [3] or implement a custom one. You could export the metrics to a third-party system and get JSON from there, or export them to JSON directly via a custom metric reporter.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html
[2] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#io
[3] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#reporter
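
Editor's sketch, not part of the original reply: the IO metrics referenced in [2], such as numRecordsOut, can also be read as JSON without writing a reporter, by polling the metrics endpoint of a job vertex while the job is running. The sketch assumes that GET /jobs/<jobid>/vertices/<vertexid>/metrics?get=<ids> returns a list of {"id", "value"} entries and that task-level ids have the form "<subtask index>.<metric name>", for example "0.numRecordsOut". Both the id format and the helper names are assumptions to verify against your Flink version.

```python
import json
import urllib.parse
import urllib.request


def fetch_vertex_metrics(base_url, job_id, vertex_id, metric_ids):
    """Request the given metric ids for one job vertex and return the parsed list."""
    query = urllib.parse.urlencode({"get": ",".join(metric_ids)})
    url = f"{base_url}/jobs/{job_id}/vertices/{vertex_id}/metrics?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def sum_task_metric(entries, metric_name):
    """Sum a task-level metric over all subtasks of one vertex.

    Only ids of the assumed form "<subtask index>.<metric name>" are counted.
    Longer ids (for example per-operator metrics inside a chained task) are
    skipped so that the same records are not counted twice.
    """
    total = 0
    for entry in entries:
        parts = entry.get("id", "").split(".")
        if len(parts) == 2 and parts[0].isdigit() and parts[1] == metric_name:
            # Values are returned as strings; go through float in case a
            # numeric value is rendered with a decimal point.
            total += int(float(entry["value"]))
    return total
```

Polling from outside the job keeps the job code free of extra counters. Because metrics of a finished job may no longer be served, this variant is mainly useful while the job is still running.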
