Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2022/05/01 02:04:14 UTC

Re: Task manager shutting down.

Hi Martijn, is there anything I need to check for?

On Tue, Apr 26, 2022 at 9:50 PM John Smith <ja...@gmail.com> wrote:

> Yeah, it's based on the Flink JDBC output format...
>
>
> On Tue, Apr 26, 2022 at 10:05 AM Martijn Visser <ma...@apache.org>
> wrote:
>
>> Hi John,
>>
>> Have you built your own JDBC MSSQL source or sink or perhaps a CDC
>> driver? Because I'm not aware of a Flink Microsoft SQL Server JDBC driver.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>> https://github.com/MartijnVisser
>>
>>
>> On Tue, 26 Apr 2022 at 16:01, John Smith <ja...@gmail.com> wrote:
>>
>>> Hi, I'm running Flink 1.14.4.
>>>
>>> Logs included:
>>> https://www.dropbox.com/s/8zjndt5rzd9o80f/flink-flink-taskexecutor-138-task-0002.log?dl=0
>>>
>>> 1- My task managers shut down with: Terminating TaskManagerRunner with
>>> exit code 1.
>>> 2- It seems to happen at the same time every day, which leads me to
>>> believe it's our database indexing (see below for my reasoning).
>>> 3- Most of our jobs are ETL from Kafka to SQL Server.
>>> 4- We see the following exceptions in the logs:
>>>       - Task 'Sink: jdbc (1/1)#10' did not react to cancelling signal -
>>> interrupting; it is stuck for 30 seconds in method:
>>> ... com.microsoft.sqlserver.jdbc.TDSChannel ...
>>>       - Sink: jdbc (1/1)#9 (3aaf6d8a45df6c43198bc8297b42354c) switched
>>> from RUNNING to FAILED with failure cause:
>>> org.apache.flink.util.FlinkException: Disconnect from JobManager
>>> responsible for ...
>>> 5- Also seeing this: Failed to close consumer network client with type
>>> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient
>>> java.lang.NoClassDefFoundError:
>>> org/apache/kafka/common/network/Selector$CloseMode
>>>
>>> My guess is that the indexing blocks the job, the task manager cannot
>>> cleanly remove the job, and after a while it decides to shut down
>>> completely.
>>>
>>> Is there a way to pause the stream and restart it later, given that this
>>> always happens at the same wall-clock time? Or maybe a way to let the
>>> JDBC sink shut down cleanly with a timeout?
>>>
>>>
>>>

Re: Task manager shutting down.

Posted by John Smith <ja...@gmail.com>.

Actually, what's happening is that there's a nightly indexing job. While it
runs, our insert takes longer than the configured checkpoint threshold.
JDBC will happily keep waiting for a response from the DB until the insert
is done. So the checkpoint threshold is reached and the job tries to shut
down and restart, but it is blocked on the JDBC driver, which causes all
kinds of crazy exceptions, as you can see in the logs.

So a stopgap solution was to call setQueryTimeout with a value a bit shorter
than the checkpoint threshold. This allows the job to fail "gracefully" and
keep restarting until the indexing is done.

1- We can review the indexing policy and whether it really needs to run
nightly. That just means that instead of failing every night, the job will
fail only when the indexing happens.
2- The other is to figure out a way to pause the job, maybe through cron
and savepoints. But that seems overengineered.

On Wed, May 4, 2022 at 1:40 PM Martijn Visser <ma...@apache.org>
wrote:

> Hi John,
>
> In an ideal scenario you would be able to leverage Flink's backpressure
> mechanism. That would effectively slow down the processing until the reason
> for backpressure has been resolved. However, given that indexing happens
> after you've written your result to the sink, from a Flink perspective,
> the action is completed. Perhaps someone else has a different idea on how
> to achieve
> this.
>
> Best regards,
>
> Martijn

Re: Task manager shutting down.

Posted by Martijn Visser <ma...@apache.org>.

Hi John,

In an ideal scenario you would be able to leverage Flink's backpressure
mechanism. That would effectively slow down the processing until the reason
for backpressure has been resolved. However, given that indexing happens
after you've written your result to the sink, from a Flink perspective, the
action is completed. Perhaps someone else has a different idea on how to
achieve
this.

Best regards,

Martijn

On Wed, 4 May 2022 at 19:31, John Smith <ja...@gmail.com> wrote:

> So I know specifically, it's the indexing and I put setQueryTimeout. So
> the job fails. And goes into retry. That's fine.
>
> But just wondering is there a way to pause the stream at a specified
> time/checkpoint and then resume after a specified time?

Re: Task manager shutting down.

Posted by John Smith <ja...@gmail.com>.

So I know for certain that it's the indexing, and I've added
setQueryTimeout. The job now fails and goes into retry, which is fine.

But I'm just wondering: is there a way to pause the stream at a specified
time or checkpoint and then resume it after a specified time?
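
One way to reason about a time-based pause is a small wall-clock check that a
custom sink could consult before each write. The sketch below is only that
check: the class MaintenanceWindow and the example times are hypothetical, and
it is not a Flink API. Note that holding back writes for longer than the
checkpoint timeout would probably fail checkpoints just as the slow insert
does, so the checkpoint settings would have to allow for it.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical daily maintenance window, e.g. the nightly indexing job. */
final class MaintenanceWindow {
    private final LocalTime start; // inclusive
    private final LocalTime end;   // exclusive

    MaintenanceWindow(LocalTime start, LocalTime end) {
        this.start = start;
        this.end = end;
    }

    /** True if `now` falls inside the window; handles windows that cross midnight. */
    boolean contains(LocalTime now) {
        if (start.isBefore(end)) {
            return !now.isBefore(start) && now.isBefore(end);
        }
        // The window wraps past midnight, e.g. 23:30 to 00:30.
        return !now.isBefore(start) || now.isBefore(end);
    }

    /** Time left until the window closes at `now`; zero when outside the window. */
    Duration remaining(LocalTime now) {
        if (!contains(now)) {
            return Duration.ZERO;
        }
        Duration untilEnd = Duration.between(now, end);
        // A negative value means the end time is on the next calendar day.
        return untilEnd.isNegative() ? untilEnd.plusDays(1) : untilEnd;
    }
}
```

For an assumed window of 23:30 to 00:30, remaining(LocalTime.of(23, 45))
yields 45 minutes, and outside the window it yields Duration.ZERO, so a writer
would proceed immediately.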

On Wed, May 4, 2022 at 10:23 AM Martijn Visser <ma...@apache.org>
wrote:

> Hi John,
>
> It is generic, but each database has its own dialect implementation
> because they all have their differences unfortunately :)
>
> I wish I knew how I could help you out here. Perhaps some of the JDBC
> maintainers could chip in.
>
> Best regards,
>
> Martijn

Re: Task manager shutting down.

Posted by Martijn Visser <ma...@apache.org>.

Hi John,

It is generic, but each database has its own dialect implementation because
they all have their differences unfortunately :)

I wish I knew how I could help you out here. Perhaps some of the JDBC
maintainers could chip in.

Best regards,

Martijn

On Sun, 1 May 2022 at 04:06, John Smith <ja...@gmail.com> wrote:

> Plus in a way isn't the flink-jdbc connector kinda generic? At least the
> older one didn't seem to be server specific.

Re: Task manager shutting down.

Posted by John Smith <ja...@gmail.com>.

Plus, in a way, isn't the flink-jdbc connector kind of generic? At least the
older one didn't seem to be server-specific.

On Sat, Apr 30, 2022 at 10:04 PM John Smith <ja...@gmail.com> wrote:

> Hi Martijn, is there anything I need to check for?