Posted to user@flink.apache.org by Tony Wei <to...@gmail.com> on 2017/09/22 08:33:36 UTC

Get EOF from PrometheusReporter in JM

Hi,

I built the Prometheus reporter package from this PR,
https://github.com/apache/flink/pull/4586, and used it on Flink 1.3.2 to
record all the default metrics and those from `FlinkKafkaConsumer`.

At first, everything was fine. I could get the TM metrics from Prometheus,
matching what I saw in the Flink Web UI.
However, when I turned to the JM, Prometheus gave me this error: Get
http://localhost:9249/metrics: EOF.
I checked the JM log and found nothing in it. There was no error message,
and port 9249 was still open.

To figure out what happened, I created another cluster and found that
Prometheus could connect to the Flink cluster as long as no job was running.
After the JM triggered or completed the first checkpoint, Prometheus started
getting ERR_EMPTY_RESPONSE from the JM, but not from the TM. There was still
no error in the log file, and port 9249 was still open.

Where does the error come from: Flink or the Prometheus reporter?
Or is it incorrect to use the Prometheus reporter on Flink 1.3.2? Thank you.

Best Regards,
Tony Wei
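
The symptom reported above (the port accepts connections, but the scrape ends in EOF or an empty response) can be checked without Prometheus by issuing a single GET against the reporter endpoint. The following is a minimal sketch, assuming the reporter listens on http://localhost:9249/metrics as in the message above. The class name `MetricsProbe` and the method `probe` are hypothetical names chosen for illustration, and the exact exception reported for an empty response depends on the JVM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

class MetricsProbe {

    /** Issues one GET and returns a short description of the outcome. */
    static String probe(String endpoint) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);

            // Reading the status line is where a connection that was closed
            // without a response would be expected to fail with an IOException.
            int status = conn.getResponseCode();

            // Count the body bytes so an empty but successful reply is visible too.
            InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
            long bytes = 0;
            if (body != null) {
                try (InputStream in = body) {
                    byte[] buf = new byte[4096];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes += n;
                    }
                }
            }
            return "HTTP " + status + ", " + bytes + " bytes";
        } catch (IOException | IllegalArgumentException e) {
            // Covers refused connections, premature EOF, and malformed endpoints.
            return "FAILED: " + e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed default: the endpoint quoted in the message above.
        String endpoint = args.length > 0 ? args[0] : "http://localhost:9249/metrics";
        System.out.println(endpoint + " -> " + probe(endpoint));
    }
}
```

Running the probe against the JM and a TM before and after the first checkpoint would show whether only the JM endpoint stops answering, independently of the Prometheus scraper.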

Re: Get EOF from PrometheusReporter in JM

Posted by Tony Wei <to...@gmail.com>.
Hi Max,

Good to know. Thanks very much.

Best Regards,
Tony Wei

2017-10-24 13:52 GMT+08:00 Maximilian Bode <ma...@tngtech.com>:

> Hi Tony,
>
> thanks for troubleshooting this. I have added a commit to
> https://github.com/apache/flink/pull/4586 that should enable you to use
> the reporter with 1.3.2 as well.
>
> Best regards,
> Max

Re: Get EOF from PrometheusReporter in JM

Posted by Maximilian Bode <ma...@tngtech.com>.
Hi Tony,

thanks for troubleshooting this. I have added a commit to
https://github.com/apache/flink/pull/4586 that should enable you to use
the reporter with 1.3.2 as well.

Best regards,
Max

> Tony Wei <ma...@gmail.com>
> 23. September 2017 um 13:11
> Hi Chesnay,
>
> I built another Flink cluster using version 1.4, set the log level to
> DEBUG, and found that the root cause might be this
> exception: *java.lang.NullPointerException: Value returned by gauge
> lastCheckpointExternalPath was null*.
>
> I updated `CheckpointStatsTracker` to ignore the external path when it is
> null, and the exception did not happen again. The Prometheus reporter
> now works as well.
>
> I have created a Jira issue for it:
> https://issues.apache.org/jira/browse/FLINK-7675, and I will submit the
> PR once it passes Travis CI on my repository.
>
> Best Regards,
> Tony Wei

Re: Get EOF from PrometheusReporter in JM

Posted by Tony Wei <to...@gmail.com>.
Hi Chesnay,

I built another Flink cluster using version 1.4, set the log level to
DEBUG, and found that the root cause might be this exception:
*java.lang.NullPointerException: Value returned by gauge
lastCheckpointExternalPath was null*.

I updated `CheckpointStatsTracker` to ignore the external path when it is
null, and the exception did not happen again. The Prometheus reporter now
works as well.

I have created a Jira issue for it:
https://issues.apache.org/jira/browse/FLINK-7675, and I will submit the PR
once it passes Travis CI on my repository.

Best Regards,
Tony Wei
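
The failure mode described above, a gauge that returns null and breaks the consumer that reads it, can be illustrated with a small wrapper that substitutes a fallback value. This is only a sketch of the general idea and not the actual change made for FLINK-7675. `SimpleGauge` is a local stand-in rather than Flink's gauge interface, and `NullSafeGauge` and the fallback string "n/a" are hypothetical.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Local stand-in for a metric gauge; not Flink's actual interface. */
interface SimpleGauge<T> {
    T getValue();
}

/** Wraps a value source and never hands null to the caller. */
class NullSafeGauge<T> implements SimpleGauge<T> {
    private final Supplier<T> source;
    private final T fallback;

    NullSafeGauge(Supplier<T> source, T fallback) {
        this.source = Objects.requireNonNull(source);
        this.fallback = Objects.requireNonNull(fallback);
    }

    @Override
    public T getValue() {
        T value = source.get();
        // Report the fallback until the real value becomes available.
        return value != null ? value : fallback;
    }
}

class NullSafeGaugeDemo {
    public static void main(String[] args) {
        // Simulates an external checkpoint path that is unset at first.
        String[] externalPath = {null};
        SimpleGauge<String> gauge = new NullSafeGauge<>(() -> externalPath[0], "n/a");

        System.out.println(gauge.getValue()); // prints "n/a"

        externalPath[0] = "hdfs:///checkpoints/chk-1"; // hypothetical path
        System.out.println(gauge.getValue()); // prints "hdfs:///checkpoints/chk-1"
    }
}
```

An alternative design is to skip registering the gauge, or skip the value at report time, when no external path exists. Which approach fits depends on how the reporter handles missing values.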



2017-09-22 22:20 GMT+08:00 Tony Wei <to...@gmail.com>:

> Hi Chesnay,
>
> I haven't tried it on 1.4, so I don't know whether this also occurs there.
> As for my logging settings, the level was already set to INFO, but there
> was no error or warning in the log file either.
>
> Best Regards,
> Tony Wei

Re: Get EOF from PrometheusReporter in JM

Posted by Tony Wei <to...@gmail.com>.
Hi Chesnay,

I haven't tried it on 1.4, so I don't know whether this also occurs there.
As for my logging settings, the level was already set to INFO, but there
was no error or warning in the log file either.

Best Regards,
Tony Wei

2017-09-22 22:07 GMT+08:00 Chesnay Schepler <ch...@apache.org>:

> The Prometheus reporter should work with 1.3.2.
>
> Does this also occur with the reporter that currently exists in 1.4? (to
> rule out new bugs from the PR).
>
> To investigate this further, please set the logging level to WARN and try
> again, as all errors in the metric system are logged on that level.

Re: Get EOF from PrometheusReporter in JM

Posted by Chesnay Schepler <ch...@apache.org>.
The Prometheus reporter should work with 1.3.2.

Does this also occur with the reporter that currently exists in 1.4? (to 
rule out new bugs from the PR).

To investigate this further, please set the logging level to WARN and 
try again, as all errors in the metric system are logged on that level.

On 22.09.2017 10:33, Tony Wei wrote:
> Hi,
>
> I built the Prometheus reporter package from this PR,
> https://github.com/apache/flink/pull/4586, and used it on Flink 1.3.2
> to record all the default metrics and those from `FlinkKafkaConsumer`.
>
> At first, everything was fine. I could get the TM metrics from
> Prometheus, matching what I saw in the Flink Web UI.
> However, when I turned to the JM, Prometheus gave me this error:
> Get http://localhost:9249/metrics: EOF.
> I checked the JM log and found nothing in it. There was no error
> message, and port 9249 was still open.
>
> To figure out what happened, I created another cluster and found that
> Prometheus could connect to the Flink cluster as long as no job was
> running. After the JM triggered or completed the first checkpoint,
> Prometheus started getting ERR_EMPTY_RESPONSE from the JM, but not from
> the TM. There was still no error in the log file, and port 9249 was
> still open.
>
> Where does the error come from: Flink or the Prometheus reporter?
> Or is it incorrect to use the Prometheus reporter on Flink 1.3.2? Thank you.
>
> Best Regards,
> Tony Wei