You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by David Jost <da...@uniberg.com> on 2022/09/05 14:58:31 UTC

Slow Tests in Flink 1.15

Hi,

we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2, when we noticed, that all our job tests, using a MiniClusterWithClientResource, are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have not found mentions in that regard in the changelog or documentation. The slowdown is rather extreme I hope to find a solution to this. I saw it mentioned once in the mailing list, but there was no (public) outcome to it.

I would appreciate any help on this. Thank you in advance.

Best
 David

Re: Slow Tests in Flink 1.15

Posted by David Jost <da...@uniberg.com>.

Oh my. Thank you so much, Maciek! I added the proposed snippet from the page you linked[0] to all pipeline tests and they are all performing as before. Perfect.

Interestingly enough, I read about it in the changelog, but seem to have ignored it. Notably though, I would expect the following to mean, that this only happens with two-phase commit source functions: 'operators using the two-phase commit, the tasks would wait for the final checkpoint completed'[1], which is not the case with our custom SinkFunctions.

However I am happy, that it solves our problem. I hope it also helps Alexey.

Thank you again and all the best
  David


[0]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#checkpointing-with-parts-of-the-graph-finished

    Configuration config = new Configuration();
    config.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, false);
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);

[1]: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#waiting-for-the-final-checkpoint-before-task-exit

> On 9. Sep 2022, at 13:33, Maciek Próchniak <mp...@touk.pl> wrote:
> 
> Hi,
> 
> we also had similar problems in Nussknacker recently (tests on fake sources), my colleague found out it's due to ENABLE_CHECKPOINTS_AFTER_FINISH flag (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#waiting-for-the-final-checkpoint-before-task-exit) is set by default to true in 1.15. After the fake source ends, the job waits for next checkpoint to be triggered before finishing. In the end we reduced checkpoint interval in some places and disabled checkpoints altogether in some other tests
> 
> maciek
> 
> On 09.09.2022 10:44, David Jost wrote:
>> Hey, sorry for not coming back to this earlier, but I was hoping to better isolate the problem for analysis.
>> 
>> Maybe for comparison to Alexey's case: We have four different pipelines at ours, which are all built similarly. Though we use Kafka in the actual jobs, the tests are using fake sources and sinks. Only the tests which use the MiniClusterWithClientResource are affected (we have one badly written test, which runs the pipeline without MiniCluster and it is not affected).
>> 
>> I analysed it a bit and identified, that the job would hang for about 30s after the last event has been pushed out the sink. Summing up these 30s for each (job) test results in the additional time used by all tests in the end.
>> So I would assume, that there is some kind of wind-down timer or so, which holds up everything.
>> 
>> I hope, this is helpful somehow. I would love to find the source of this issue. I was hoping to isolate the issue in an MWE, but have been unsuccessful for now.
>> 
>> NB: I tested the tests with both, MiniClusterWithClientResource (with adjustments for JUnit 5) and MiniClusterExtension, but there was no noticeable difference.
>> 
>>> On 7. Sep 2022, at 14:41, Alexey Trenikhun <ye...@msn.com> wrote:
>>> 
>>> The class contains single test method, which runs single job (the job is quite complex), then waits for job being running after that waits for data being populated in output topic, and this doesn't happen during 5 minutes (test timeout). Tried under debugger, set breakpoint in Kafka record deserializer it is hit but very slow, roughly 3 records per 5 minute (the topic was pre-populated)
>>> 
>>> No table/sql API, only stream API
>>> From: Chesnay Schepler <ch...@apache.org>
>>> Sent: Wednesday, September 7, 2022 5:20 AM
>>> To: Alexey Trenikhun <ye...@msn.com>; David Jost <da...@uniberg.com>; Matthias Pohl <ma...@aiven.io>
>>> Cc: user@flink.apache.org <us...@flink.apache.org>
>>> Subject: Re: Slow Tests in Flink 1.15
>>>  The test that gotten slow; how many test cases does it actually contain / how many jobs does it actually run?
>>> Are these tests using the table/sql API?
>>> 
>>> On 07/09/2022 14:15, Alexey Trenikhun wrote:
>>>> We are also observing extreme slow down (5+ minutes vs 15 seconds) in 1 of 2 integration tests . Both tests use Kafka. The slow test uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this test tests complete job, which consumes and produces Kafka messages. Not affected test extends org.apache.flink.test.util.AbstractTestBase which uses MiniClusterWithClientResource, this test is simpler and only produce Kafka messages.
>>>> 
>>>> Thanks,
>>>> Alexey
>>>> From: Matthias Pohl via user <us...@flink.apache.org>
>>>> Sent: Tuesday, September 6, 2022 6:36 AM
>>>> To: David Jost <da...@uniberg.com>
>>>> Cc: user@flink.apache.org <us...@flink.apache.org>
>>>> Subject: Re: Slow Tests in Flink 1.15
>>>>  Hi David,
>>>> I guess, you're referring to [1]. But as Chesnay already pointed out in the previous thread: It would be helpful to get more insights into what exactly your tests are executing (logs, code, ...). That would help identifying the cause.
>>>>> Can you give us a more complete stacktrace so we can see what call in
>>>>> Flink is waiting for something?
>>>>> 
>>>>> Does this happen to all of your tests?
>>>>> Can you provide us with an example that we can try ourselves? If not,
>>>>> can you describe the test structure (e.g., is it using a
>>>>> MiniClusterResource).
>>>> Matthias
>>>> 
>>>> [1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
>>>> 
>>>> On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com> wrote:
>>>> Hi,
>>>> 
>>>> we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2, when we noticed, that all our job tests, using a MiniClusterWithClientResource, are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have not found mentions in that regard in the changelog or documentation. The slowdown is rather extreme I hope to find a solution to this. I saw it mentioned once in the mailing list, but there was no (public) outcome to it.
>>>> 
>>>> I would appreciate any help on this. Thank you in advance.
>>>> 
>>>> Best
>>>>  David

Re: Slow Tests in Flink 1.15

Posted by Maciek Próchniak <mp...@touk.pl>.

Hi,

we also had similar problems in Nussknacker recently (tests on fake 
sources), my colleague found out it's due to 
ENABLE_CHECKPOINTS_AFTER_FINISH flag 
(https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#waiting-for-the-final-checkpoint-before-task-exit) 
is set by default to true in 1.15. After the fake source ends, the job 
waits for next checkpoint to be triggered before finishing. In the end 
we reduced checkpoint interval in some places and disabled checkpoints 
altogether in some other tests

maciek

On 09.09.2022 10:44, David Jost wrote:
> Hey, sorry for not coming back to this earlier, but I was hoping to better isolate the problem for analysis.
>
> Maybe for comparison to Alexey's case: We have four different pipelines at ours, which are all built similarly. Though we use Kafka in the actual jobs, the tests are using fake sources and sinks. Only the tests which use the MiniClusterWithClientResource are affected (we have one badly written test, which runs the pipeline without MiniCluster and it is not affected).
>
> I analysed it a bit and identified, that the job would hang for about 30s after the last event has been pushed out the sink. Summing up these 30s for each (job) test results in the additional time used by all tests in the end.
> So I would assume, that there is some kind of wind-down timer or so, which holds up everything.
>
> I hope, this is helpful somehow. I would love to find the source of this issue. I was hoping to isolate the issue in an MWE, but have been unsuccessful for now.
>
> NB: I tested the tests with both, MiniClusterWithClientResource (with adjustments for JUnit 5) and MiniClusterExtension, but there was no noticeable difference.
>
>> On 7. Sep 2022, at 14:41, Alexey Trenikhun <ye...@msn.com> wrote:
>>
>> The class contains single test method, which runs single job (the job is quite complex), then waits for job being running after that waits for data being populated in output topic, and this doesn't happen during 5 minutes (test timeout). Tried under debugger, set breakpoint in Kafka record deserializer it is hit but very slow, roughly 3 records per 5 minute (the topic was pre-populated)
>>
>> No table/sql API, only stream API
>> From: Chesnay Schepler <ch...@apache.org>
>> Sent: Wednesday, September 7, 2022 5:20 AM
>> To: Alexey Trenikhun <ye...@msn.com>; David Jost <da...@uniberg.com>; Matthias Pohl <ma...@aiven.io>
>> Cc: user@flink.apache.org <us...@flink.apache.org>
>> Subject: Re: Slow Tests in Flink 1.15
>>   
>> The test that gotten slow; how many test cases does it actually contain / how many jobs does it actually run?
>> Are these tests using the table/sql API?
>>
>> On 07/09/2022 14:15, Alexey Trenikhun wrote:
>>> We are also observing extreme slow down (5+ minutes vs 15 seconds) in 1 of 2 integration tests . Both tests use Kafka. The slow test uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this test tests complete job, which consumes and produces Kafka messages. Not affected test extends org.apache.flink.test.util.AbstractTestBase which uses MiniClusterWithClientResource, this test is simpler and only produce Kafka messages.
>>>
>>> Thanks,
>>> Alexey
>>> From: Matthias Pohl via user <us...@flink.apache.org>
>>> Sent: Tuesday, September 6, 2022 6:36 AM
>>> To: David Jost <da...@uniberg.com>
>>> Cc: user@flink.apache.org <us...@flink.apache.org>
>>> Subject: Re: Slow Tests in Flink 1.15
>>>   
>>> Hi David,
>>> I guess, you're referring to [1]. But as Chesnay already pointed out in the previous thread: It would be helpful to get more insights into what exactly your tests are executing (logs, code, ...). That would help identifying the cause.
>>>> Can you give us a more complete stacktrace so we can see what call in
>>>> Flink is waiting for something?
>>>>
>>>> Does this happen to all of your tests?
>>>> Can you provide us with an example that we can try ourselves? If not,
>>>> can you describe the test structure (e.g., is it using a
>>>> MiniClusterResource).
>>> Matthias
>>>
>>> [1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
>>>
>>> On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com> wrote:
>>> Hi,
>>>
>>> we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2, when we noticed, that all our job tests, using a MiniClusterWithClientResource, are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have not found mentions in that regard in the changelog or documentation. The slowdown is rather extreme I hope to find a solution to this. I saw it mentioned once in the mailing list, but there was no (public) outcome to it.
>>>
>>> I would appreciate any help on this. Thank you in advance.
>>>
>>> Best
>>>   David

Re: Slow Tests in Flink 1.15

Posted by David Jost <da...@uniberg.com>.

Hey, sorry for not coming back to this earlier, but I was hoping to better isolate the problem for analysis.

Maybe for comparison to Alexey's case: We have four different pipelines at ours, which are all built similarly. Though we use Kafka in the actual jobs, the tests are using fake sources and sinks. Only the tests which use the MiniClusterWithClientResource are affected (we have one badly written test, which runs the pipeline without MiniCluster and it is not affected).

I analysed it a bit and identified, that the job would hang for about 30s after the last event has been pushed out the sink. Summing up these 30s for each (job) test results in the additional time used by all tests in the end.
So I would assume, that there is some kind of wind-down timer or so, which holds up everything.

I hope, this is helpful somehow. I would love to find the source of this issue. I was hoping to isolate the issue in an MWE, but have been unsuccessful for now.

NB: I tested the tests with both, MiniClusterWithClientResource (with adjustments for JUnit 5) and MiniClusterExtension, but there was no noticeable difference.

> On 7. Sep 2022, at 14:41, Alexey Trenikhun <ye...@msn.com> wrote:
> 
> The class contains single test method, which runs single job (the job is quite complex), then waits for job being running after that waits for data being populated in output topic, and this doesn't happen during 5 minutes (test timeout). Tried under debugger, set breakpoint in Kafka record deserializer it is hit but very slow, roughly 3 records per 5 minute (the topic was pre-populated)
> 
> No table/sql API, only stream API
> From: Chesnay Schepler <ch...@apache.org>
> Sent: Wednesday, September 7, 2022 5:20 AM
> To: Alexey Trenikhun <ye...@msn.com>; David Jost <da...@uniberg.com>; Matthias Pohl <ma...@aiven.io>
> Cc: user@flink.apache.org <us...@flink.apache.org>
> Subject: Re: Slow Tests in Flink 1.15
>  
> The test that gotten slow; how many test cases does it actually contain / how many jobs does it actually run?
> Are these tests using the table/sql API?
> 
> On 07/09/2022 14:15, Alexey Trenikhun wrote:
>> We are also observing extreme slow down (5+ minutes vs 15 seconds) in 1 of 2 integration tests . Both tests use Kafka. The slow test uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this test tests complete job, which consumes and produces Kafka messages. Not affected test extends org.apache.flink.test.util.AbstractTestBase which uses MiniClusterWithClientResource, this test is simpler and only produce Kafka messages. 
>> 
>> Thanks,
>> Alexey
>> From: Matthias Pohl via user <us...@flink.apache.org>
>> Sent: Tuesday, September 6, 2022 6:36 AM
>> To: David Jost <da...@uniberg.com>
>> Cc: user@flink.apache.org <us...@flink.apache.org>
>> Subject: Re: Slow Tests in Flink 1.15
>>  
>> Hi David,
>> I guess, you're referring to [1]. But as Chesnay already pointed out in the previous thread: It would be helpful to get more insights into what exactly your tests are executing (logs, code, ...). That would help identifying the cause.
>> > Can you give us a more complete stacktrace so we can see what call in
>> > Flink is waiting for something?
>> > 
>> > Does this happen to all of your tests?
>> > Can you provide us with an example that we can try ourselves? If not,
>> > can you describe the test structure (e.g., is it using a
>> > MiniClusterResource).
>> 
>> Matthias
>> 
>> [1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
>> 
>> On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com> wrote:
>> Hi,
>> 
>> we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2, when we noticed, that all our job tests, using a MiniClusterWithClientResource, are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have not found mentions in that regard in the changelog or documentation. The slowdown is rather extreme I hope to find a solution to this. I saw it mentioned once in the mailing list, but there was no (public) outcome to it.
>> 
>> I would appreciate any help on this. Thank you in advance.
>> 
>> Best
>>  David

Re: Slow Tests in Flink 1.15

Posted by Alexey Trenikhun <ye...@msn.com>.

The class contains single test method, which runs single job (the job is quite complex), then waits for job being running after that waits for data being populated in output topic, and this doesn't happen during 5 minutes (test timeout). Tried under debugger, set breakpoint in Kafka record deserializer it is hit but very slow, roughly 3 records per 5 minute (the topic was pre-populated)

No table/sql API, only stream API
________________________________
From: Chesnay Schepler <ch...@apache.org>
Sent: Wednesday, September 7, 2022 5:20 AM
To: Alexey Trenikhun <ye...@msn.com>; David Jost <da...@uniberg.com>; Matthias Pohl <ma...@aiven.io>
Cc: user@flink.apache.org <us...@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15

The test that gotten slow; how many test cases does it actually contain / how many jobs does it actually run?
Are these tests using the table/sql API?

On 07/09/2022 14:15, Alexey Trenikhun wrote:
We are also observing extreme slow down (5+ minutes vs 15 seconds) in 1 of 2 integration tests . Both tests use Kafka. The slow test uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this test tests complete job, which consumes and produces Kafka messages. Not affected test extends org.apache.flink.test.util.AbstractTestBase which uses MiniClusterWithClientResource, this test is simpler and only produce Kafka messages.

Thanks,
Alexey
________________________________
From: Matthias Pohl via user <us...@flink.apache.org>
Sent: Tuesday, September 6, 2022 6:36 AM
To: David Jost <da...@uniberg.com>
Cc: user@flink.apache.org<ma...@flink.apache.org> <us...@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15

Hi David,
I guess, you're referring to [1]. But as Chesnay already pointed out in the previous thread: It would be helpful to get more insights into what exactly your tests are executing (logs, code, ...). That would help identifying the cause.
> Can you give us a more complete stacktrace so we can see what call in
> Flink is waiting for something?
>
> Does this happen to all of your tests?
> Can you provide us with an example that we can try ourselves? If not,
> can you describe the test structure (e.g., is it using a
> MiniClusterResource).

Matthias

[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk

On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com>> wrote:
Hi,

we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2, when we noticed, that all our job tests, using a MiniClusterWithClientResource, are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have not found mentions in that regard in the changelog or documentation. The slowdown is rather extreme I hope to find a solution to this. I saw it mentioned once in the mailing list, but there was no (public) outcome to it.

I would appreciate any help on this. Thank you in advance.

Best
 David

Re: Slow Tests in Flink 1.15

Posted by Chesnay Schepler <ch...@apache.org>.

The test that gotten slow; how many test cases does it actually contain 
/ how many jobs does it actually run?
Are these tests using the table/sql API?

On 07/09/2022 14:15, Alexey Trenikhun wrote:
> We are also observing extreme slow down (5+ minutes vs 15 seconds) in 
> 1 of 2 integration tests . Both tests use Kafka. The slow test 
> uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this 
> test tests complete job, which consumes and produces Kafka messages. 
> Not affected test extends org.apache.flink.test.util.AbstractTestBase 
> which uses MiniClusterWithClientResource, this test is simpler 
> and only produce Kafka messages.
>
> Thanks,
> Alexey
> ------------------------------------------------------------------------
> *From:* Matthias Pohl via user <us...@flink.apache.org>
> *Sent:* Tuesday, September 6, 2022 6:36 AM
> *To:* David Jost <da...@uniberg.com>
> *Cc:* user@flink.apache.org <us...@flink.apache.org>
> *Subject:* Re: Slow Tests in Flink 1.15
> Hi David,
> I guess, you're referring to [1]. But as Chesnay already pointed out 
> in the previous thread: It would be helpful to get more insights into 
> what exactly your tests are executing (logs, code, ...). That would 
> help identifying the cause.
> > Can you give us a more complete stacktrace so we can see what call in
> > Flink is waiting for something?
> >
> > Does this happen to all of your tests?
> > Can you provide us with an example that we can try ourselves? If not,
> > can you describe the test structure (e.g., is it using a
> > MiniClusterResource).
>
> Matthias
>
> [1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
>
> On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com> wrote:
>
>     Hi,
>
>     we were going to upgrade our application from Flink 1.14.4 to
>     Flink 1.15.2, when we noticed, that all our job tests, using a
>     MiniClusterWithClientResource, are multiple times slower in 1.15
>     than before in 1.14. I, unfortunately, have not found mentions in
>     that regard in the changelog or documentation. The slowdown is
>     rather extreme I hope to find a solution to this. I saw it
>     mentioned once in the mailing list, but there was no (public)
>     outcome to it.
>
>     I would appreciate any help on this. Thank you in advance.
>
>     Best
>      David
>

Re: Slow Tests in Flink 1.15

Posted by Alexey Trenikhun <ye...@msn.com>.

We are also observing extreme slow down (5+ minutes vs 15 seconds) in 1 of 2 integration tests . Both tests use Kafka. The slow test uses org.apache.flink.runtime.minicluster.TestingMiniCluster, this test tests complete job, which consumes and produces Kafka messages. Not affected test extends org.apache.flink.test.util.AbstractTestBase which uses MiniClusterWithClientResource, this test is simpler and only produce Kafka messages.

Thanks,
Alexey
________________________________
From: Matthias Pohl via user <us...@flink.apache.org>
Sent: Tuesday, September 6, 2022 6:36 AM
To: David Jost <da...@uniberg.com>
Cc: user@flink.apache.org <us...@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15

Hi David,
I guess, you're referring to [1]. But as Chesnay already pointed out in the previous thread: It would be helpful to get more insights into what exactly your tests are executing (logs, code, ...). That would help identifying the cause.
> Can you give us a more complete stacktrace so we can see what call in
> Flink is waiting for something?
>
> Does this happen to all of your tests?
> Can you provide us with an example that we can try ourselves? If not,
> can you describe the test structure (e.g., is it using a
> MiniClusterResource).

Matthias

[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk

On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com>> wrote:
Hi,

we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2, when we noticed, that all our job tests, using a MiniClusterWithClientResource, are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have not found mentions in that regard in the changelog or documentation. The slowdown is rather extreme I hope to find a solution to this. I saw it mentioned once in the mailing list, but there was no (public) outcome to it.

I would appreciate any help on this. Thank you in advance.

Best
 David

Re: Slow Tests in Flink 1.15

Posted by Matthias Pohl via user <us...@flink.apache.org>.

Hi David,
I guess, you're referring to [1]. But as Chesnay already pointed out in the
previous thread: It would be helpful to get more insights into what exactly
your tests are executing (logs, code, ...). That would help identifying the
cause.
> Can you give us a more complete stacktrace so we can see what call in
> Flink is waiting for something?
>
> Does this happen to all of your tests?
> Can you provide us with an example that we can try ourselves? If not,
> can you describe the test structure (e.g., is it using a
> MiniClusterResource).

Matthias

[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk

On Mon, Sep 5, 2022 at 4:59 PM David Jost <da...@uniberg.com> wrote:

> Hi,
>
> we were going to upgrade our application from Flink 1.14.4 to Flink
> 1.15.2, when we noticed, that all our job tests, using a
> MiniClusterWithClientResource, are multiple times slower in 1.15 than
> before in 1.14. I, unfortunately, have not found mentions in that regard in
> the changelog or documentation. The slowdown is rather extreme I hope to
> find a solution to this. I saw it mentioned once in the mailing list, but
> there was no (public) outcome to it.
>
> I would appreciate any help on this. Thank you in advance.
>
> Best
>  David