Posted to user@flink.apache.org by "DONG, Weike" <ky...@connect.hku.hk> on 2020/10/12 08:14:05 UTC

TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Hi community,

Recently we have noticed a strange behavior of Flink jobs in Kubernetes
per-job mode: when the parallelism increases, the time it takes for the
TaskManagers to register with the JobManager becomes abnormally long (for a
job with a parallelism of 50, it could take 60 ~ 120 seconds or even longer
for a registration attempt), and usually more than 10 attempts are needed
to finish the registration.

Because of this, we could not submit a job requiring more than 20 slots
with the default configuration, as the TaskManager logs show:


> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
> attempt 9 timed out after 25600 ms

> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The
> slot 60d5277e138a94fb73fc6691557001e0 has timed out.

> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
> 493cd86e389ccc8f2887e1222903b5ce).
> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed
> out.


To cope with this issue, we had to change the following configuration
parameters:

>
> # Prevent "Could not allocate the required slot within slot request
> timeout. Please make sure that the cluster has enough resources. Stopping
> the JobMaster for job"
> slot.request.timeout: 500000

> # Increase max timeout in a single attempt
> cluster.registration.max-timeout: 300000
> # Prevent "free slot (TaskSlot)"
> akka.ask.timeout: 10 min
> # Prevent "Heartbeat of TaskManager timed out."
> heartbeat.timeout: 500000


However, we acknowledge that this is only a temporary, dirty fix, which is
not what we want. During TaskManager registration with the JobManager, lots
of warning messages show up in the logs:

> No hostname could be resolved for the IP address 9.166.0.118, using IP
> address as host name. Local input split assignment (such as for HDFS files)
> may be impacted.


Initially we thought this was probably the cause (a reverse DNS lookup
might take a long time); however, we later found that the reverse lookup
took less than 1 ms, so this may not be the reason.
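
For reference, such a spot check of the lookup latency can be done with a
few lines of standalone Java like the sketch below (this is not the code
inside Flink, and the IP address is just an example):

import java.net.InetAddress;

public class ReverseLookupProbe {
    public static void main(String[] args) throws Exception {
        // Example pod IP; pass a real TaskManager pod IP as the first argument.
        String ip = args.length > 0 ? args[0] : "9.166.0.118";
        // Parsing an IP literal does not trigger any DNS query.
        InetAddress address = InetAddress.getByName(ip);

        long start = System.nanoTime();
        // Triggers the reverse (PTR) lookup; falls back to the textual IP
        // when no hostname can be resolved.
        String canonical = address.getCanonicalHostName();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(canonical + " resolved in " + elapsedMs + " ms");
    }
}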

Also, we have checked the GC logs of both the TaskManagers and the
JobManager, and they look perfectly normal, without any signs of long
pauses. The heartbeats are also processed normally according to the logs.

Moreover, the TaskManagers register quickly with the ResourceManager but
extremely slowly with the JobManager, so this is not caused by a slow
network connection.

So we wonder: what could be the cause of the slow registration between the
JobManager and the TaskManager(s)? There are no other warning or error
messages in the logs (DEBUG level) besides the "No hostname could be
resolved" messages, which is quite weird.

Thanks for reading, and we hope to get some insights into this issue : )

Sincerely,
Weike

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Till Rohrmann <tr...@apache.org>.
Done, you are assigned now Weike.

Cheers,
Till

On Fri, Oct 16, 2020 at 1:33 PM DONG, Weike <ky...@connect.hku.hk> wrote:

> Hi Till,
>
> Thank you for the kind reminder, and I have created a JIRA ticket for this
> issue https://issues.apache.org/jira/browse/FLINK-19677
>
> Could you please assign it to me? I will try to submit a PR this weekend
> to fix this : )
>
> Sincerely,
> Weike
>
> On Fri, Oct 16, 2020 at 5:54 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Great, thanks a lot Weike. I think the first step would be to open a JIRA
>> issue, get assigned and then start on fixing it and opening a PR.
>>
>> Cheers,
>> Till
>>
>> On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <ky...@connect.hku.hk>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks for all the replies, and I agree with Yang, as we have found that
>>> for a pod without a service (like TaskManager pod), the reverse DNS lookup
>>> would always fail, so this lookup is not necessary for the Kubernetes
>>> environment.
>>>
>>> I am glad to help fix this issue to make Flink better : )
>>>
>>> Best,
>>> Weike
>>>
>>> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Hi Weike,
>>>>
>>>> thanks for getting back to us with your findings. Looking at the
>>>> `TaskManagerLocation`, we are actually calling
>>>> `InetAddress.getCanonicalHostName` twice for every creation of a
>>>> `TaskManagerLocation` instance. This does not look right.
>>>>
>>>> I think it should be fine to make the look up configurable. Moreover,
>>>> one could think about only doing a lazy look up if the canonical hostname
>>>> is really needed (as far as I can see it is only really needed input split
>>>> assignments and for the LocationPreferenceSlotSelectionStrategy to
>>>> calculate how many TMs run on the same machine).
>>>>
>>>> Do you want to fix this issue?
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <ky...@connect.hku.hk>
>>>> wrote:
>>>>
>>>>> Hi Till and community,
>>>>>
>>>>> By the way, initially I resolved the IPs several times but results
>>>>> returned rather quickly (less than 1ms, possibly due to DNS cache on the
>>>>> server), so I thought it might not be the DNS issue.
>>>>>
>>>>> However, after debugging and logging, it is found that the lookup time
>>>>> exhibited high variance, i. e. normally it completes fast but occasionally
>>>>> some slow results would block the thread. So an unstable DNS server might
>>>>> have a great impact on the performance of Flink job startup.
>>>>>
>>>>> Best,
>>>>> Weike
>>>>>
>>>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk>
>>>>> wrote:
>>>>>
>>>>>> Hi Till and community,
>>>>>>
>>>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration makes
>>>>>> this issue alleviated but not disappeared.
>>>>>>
>>>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>>>>> found the culprit is
>>>>>>
>>>>>> inetAddress.getCanonicalHostName()
>>>>>>
>>>>>> in
>>>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>>>>> and
>>>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>>>>> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
>>>>>> are severely blocked by that.
>>>>>>
>>>>>> By commenting out the two methods, this issue seems to be solved
>>>>>> immediately, so I wonder if Flink could provide a configuration parameter
>>>>>> to turn off the DNS reverse lookup process, as it seems that Flink jobs
>>>>>> could run happily without it.
>>>>>>
>>>>>> Sincerely,
>>>>>> Weike
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Weike,
>>>>>>>
>>>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>>>>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>>>>>>> component.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Weike,
>>>>>>>>
>>>>>>>> thanks for posting the logs. I will take a look at them. My
>>>>>>>> suspicion would be that there is some operation blocking the JobMaster's
>>>>>>>> main thread which causes the registrations from the TMs to time out. Maybe
>>>>>>>> the logs allow me to validate/falsify this suspicion.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <
>>>>>>>> kyledong@connect.hku.hk> wrote:
>>>>>>>>
>>>>>>>>> Hi community,
>>>>>>>>>
>>>>>>>>> I have uploaded the log files of JobManager and TaskManager-1-1
>>>>>>>>> (one of the 50 TaskManagers) with DEBUG log level and default Flink
>>>>>>>>> configuration, and it clearly shows that TaskManager failed to register
>>>>>>>>> with JobManager after 10 attempts.
>>>>>>>>>
>>>>>>>>> Here is the link:
>>>>>>>>>
>>>>>>>>> JobManager:
>>>>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>>>
>>>>>>>>> TaskManager-1-1:
>>>>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks : )
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Weike
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <
>>>>>>>>> kyledong@connect.hku.hk> wrote:
>>>>>>>>>
>>>>>>>>>> Hi community,
>>>>>>>>>>
>>>>>>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>>>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>>>>>>> for the TaskManagers to register with *JobManager *becomes
>>>>>>>>>> abnormally long (for a task with parallelism of 50, it could take 60 ~ 120
>>>>>>>>>> seconds or even longer for the registration attempt), and usually more than
>>>>>>>>>> 10 attempts are needed to finish this registration.
>>>>>>>>>>
>>>>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>>>>
>>>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0
>>>>>>>>>>> has timed out.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In order to cope with this issue, we have to change the below
>>>>>>>>>> configuration parameters:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>>>>> request timeout. Please make sure that the cluster has enough resources.
>>>>>>>>>>> Stopping the JobMaster for job"
>>>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>>>
>>>>>>>>>> # Increase max timeout in a single attempt
>>>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>>>>> which is not what we want. It could be seen that during TaskManager
>>>>>>>>>> registration to JobManager, lots of warning messages come out in logs:
>>>>>>>>>>
>>>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118,
>>>>>>>>>>> using IP address as host name. Local input split assignment (such as for
>>>>>>>>>>> HDFS files) may be impacted.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Initially we thought this was probably the cause (reverse lookup
>>>>>>>>>> of DNS might take up a long time), however we later found that the reverse
>>>>>>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>>>>>>
>>>>>>>>>> Also, we have checked the GC log of both TaskManagers and
>>>>>>>>>> JobManager, and they seem to be perfectly normal, without any signs of
>>>>>>>>>> pauses. And the heartbeats are processed as normal according to the logs.
>>>>>>>>>>
>>>>>>>>>> Moreover, TaskManagers register quickly with ResourceManager, but
>>>>>>>>>> then extra slow with TaskManager, so this is not because of a slow network
>>>>>>>>>> connection.
>>>>>>>>>>
>>>>>>>>>> Here we wonder what could be the cause for the slow registration
>>>>>>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>>>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>>>>>>> messages, which is quite weird.
>>>>>>>>>>
>>>>>>>>>> Thanks for the reading, and hope to get some insights into this
>>>>>>>>>> issues : )
>>>>>>>>>>
>>>>>>>>>> Sincerely,
>>>>>>>>>> Weike
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by "DONG, Weike" <ky...@connect.hku.hk>.
Hi Till,

Thank you for the kind reminder, and I have created a JIRA ticket for this
issue https://issues.apache.org/jira/browse/FLINK-19677

Could you please assign it to me? I will try to submit a PR this weekend to
fix this : )

Sincerely,
Weike

On Fri, Oct 16, 2020 at 5:54 PM Till Rohrmann <tr...@apache.org> wrote:

> Great, thanks a lot Weike. I think the first step would be to open a JIRA
> issue, get assigned and then start on fixing it and opening a PR.
>
> Cheers,
> Till
>
> On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <ky...@connect.hku.hk>
> wrote:
>
>> Hi all,
>>
>> Thanks for all the replies, and I agree with Yang, as we have found that
>> for a pod without a service (like TaskManager pod), the reverse DNS lookup
>> would always fail, so this lookup is not necessary for the Kubernetes
>> environment.
>>
>> I am glad to help fix this issue to make Flink better : )
>>
>> Best,
>> Weike
>>
>> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Weike,
>>>
>>> thanks for getting back to us with your findings. Looking at the
>>> `TaskManagerLocation`, we are actually calling
>>> `InetAddress.getCanonicalHostName` twice for every creation of a
>>> `TaskManagerLocation` instance. This does not look right.
>>>
>>> I think it should be fine to make the look up configurable. Moreover,
>>> one could think about only doing a lazy look up if the canonical hostname
>>> is really needed (as far as I can see it is only really needed input split
>>> assignments and for the LocationPreferenceSlotSelectionStrategy to
>>> calculate how many TMs run on the same machine).
>>>
>>> Do you want to fix this issue?
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <ky...@connect.hku.hk>
>>> wrote:
>>>
>>>> Hi Till and community,
>>>>
>>>> By the way, initially I resolved the IPs several times but results
>>>> returned rather quickly (less than 1ms, possibly due to DNS cache on the
>>>> server), so I thought it might not be the DNS issue.
>>>>
>>>> However, after debugging and logging, it is found that the lookup time
>>>> exhibited high variance, i. e. normally it completes fast but occasionally
>>>> some slow results would block the thread. So an unstable DNS server might
>>>> have a great impact on the performance of Flink job startup.
>>>>
>>>> Best,
>>>> Weike
>>>>
>>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk>
>>>> wrote:
>>>>
>>>>> Hi Till and community,
>>>>>
>>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration makes this
>>>>> issue alleviated but not disappeared.
>>>>>
>>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>>>> found the culprit is
>>>>>
>>>>> inetAddress.getCanonicalHostName()
>>>>>
>>>>> in
>>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>>>> and
>>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>>>> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
>>>>> are severely blocked by that.
>>>>>
>>>>> By commenting out the two methods, this issue seems to be solved
>>>>> immediately, so I wonder if Flink could provide a configuration parameter
>>>>> to turn off the DNS reverse lookup process, as it seems that Flink jobs
>>>>> could run happily without it.
>>>>>
>>>>> Sincerely,
>>>>> Weike
>>>>>
>>>>>
>>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Weike,
>>>>>>
>>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>>>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>>>>>> component.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Weike,
>>>>>>>
>>>>>>> thanks for posting the logs. I will take a look at them. My
>>>>>>> suspicion would be that there is some operation blocking the JobMaster's
>>>>>>> main thread which causes the registrations from the TMs to time out. Maybe
>>>>>>> the logs allow me to validate/falsify this suspicion.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <
>>>>>>> kyledong@connect.hku.hk> wrote:
>>>>>>>
>>>>>>>> Hi community,
>>>>>>>>
>>>>>>>> I have uploaded the log files of JobManager and TaskManager-1-1
>>>>>>>> (one of the 50 TaskManagers) with DEBUG log level and default Flink
>>>>>>>> configuration, and it clearly shows that TaskManager failed to register
>>>>>>>> with JobManager after 10 attempts.
>>>>>>>>
>>>>>>>> Here is the link:
>>>>>>>>
>>>>>>>> JobManager:
>>>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>>
>>>>>>>> TaskManager-1-1:
>>>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>>
>>>>>>>> Thanks : )
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Weike
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <
>>>>>>>> kyledong@connect.hku.hk> wrote:
>>>>>>>>
>>>>>>>>> Hi community,
>>>>>>>>>
>>>>>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>>>>>> for the TaskManagers to register with *JobManager *becomes
>>>>>>>>> abnormally long (for a task with parallelism of 50, it could take 60 ~ 120
>>>>>>>>> seconds or even longer for the registration attempt), and usually more than
>>>>>>>>> 10 attempts are needed to finish this registration.
>>>>>>>>>
>>>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>>>
>>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0
>>>>>>>>>> has timed out.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In order to cope with this issue, we have to change the below
>>>>>>>>> configuration parameters:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>>>> request timeout. Please make sure that the cluster has enough resources.
>>>>>>>>>> Stopping the JobMaster for job"
>>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>>
>>>>>>>>> # Increase max timeout in a single attempt
>>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>>>> which is not what we want. It could be seen that during TaskManager
>>>>>>>>> registration to JobManager, lots of warning messages come out in logs:
>>>>>>>>>
>>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118,
>>>>>>>>>> using IP address as host name. Local input split assignment (such as for
>>>>>>>>>> HDFS files) may be impacted.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Initially we thought this was probably the cause (reverse lookup
>>>>>>>>> of DNS might take up a long time), however we later found that the reverse
>>>>>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>>>>>
>>>>>>>>> Also, we have checked the GC log of both TaskManagers and
>>>>>>>>> JobManager, and they seem to be perfectly normal, without any signs of
>>>>>>>>> pauses. And the heartbeats are processed as normal according to the logs.
>>>>>>>>>
>>>>>>>>> Moreover, TaskManagers register quickly with ResourceManager, but
>>>>>>>>> then extra slow with TaskManager, so this is not because of a slow network
>>>>>>>>> connection.
>>>>>>>>>
>>>>>>>>> Here we wonder what could be the cause for the slow registration
>>>>>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>>>>>> messages, which is quite weird.
>>>>>>>>>
>>>>>>>>> Thanks for the reading, and hope to get some insights into this
>>>>>>>>> issues : )
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>> Weike
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Till Rohrmann <tr...@apache.org>.
Great, thanks a lot Weike. I think the first step would be to open a JIRA
issue, get assigned and then start on fixing it and opening a PR.

Cheers,
Till

On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike <ky...@connect.hku.hk>
wrote:

> Hi all,
>
> Thanks for all the replies, and I agree with Yang, as we have found that
> for a pod without a service (like TaskManager pod), the reverse DNS lookup
> would always fail, so this lookup is not necessary for the Kubernetes
> environment.
>
> I am glad to help fix this issue to make Flink better : )
>
> Best,
> Weike
>
> On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Weike,
>>
>> thanks for getting back to us with your findings. Looking at the
>> `TaskManagerLocation`, we are actually calling
>> `InetAddress.getCanonicalHostName` twice for every creation of a
>> `TaskManagerLocation` instance. This does not look right.
>>
>> I think it should be fine to make the look up configurable. Moreover, one
>> could think about only doing a lazy look up if the canonical hostname is
>> really needed (as far as I can see it is only really needed input split
>> assignments and for the LocationPreferenceSlotSelectionStrategy to
>> calculate how many TMs run on the same machine).
>>
>> Do you want to fix this issue?
>>
>> Cheers,
>> Till
>>
>> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <ky...@connect.hku.hk>
>> wrote:
>>
>>> Hi Till and community,
>>>
>>> By the way, initially I resolved the IPs several times but results
>>> returned rather quickly (less than 1ms, possibly due to DNS cache on the
>>> server), so I thought it might not be the DNS issue.
>>>
>>> However, after debugging and logging, it is found that the lookup time
>>> exhibited high variance, i. e. normally it completes fast but occasionally
>>> some slow results would block the thread. So an unstable DNS server might
>>> have a great impact on the performance of Flink job startup.
>>>
>>> Best,
>>> Weike
>>>
>>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk>
>>> wrote:
>>>
>>>> Hi Till and community,
>>>>
>>>> Increasing `kubernetes.jobmanager.cpu` in the configuration makes this
>>>> issue alleviated but not disappeared.
>>>>
>>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>>> found the culprit is
>>>>
>>>> inetAddress.getCanonicalHostName()
>>>>
>>>> in
>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>>> and
>>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>>> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
>>>> are severely blocked by that.
>>>>
>>>> By commenting out the two methods, this issue seems to be solved
>>>> immediately, so I wonder if Flink could provide a configuration parameter
>>>> to turn off the DNS reverse lookup process, as it seems that Flink jobs
>>>> could run happily without it.
>>>>
>>>> Sincerely,
>>>> Weike
>>>>
>>>>
>>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Weike,
>>>>>
>>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>>>>> component.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Weike,
>>>>>>
>>>>>> thanks for posting the logs. I will take a look at them. My suspicion
>>>>>> would be that there is some operation blocking the JobMaster's main thread
>>>>>> which causes the registrations from the TMs to time out. Maybe the logs
>>>>>> allow me to validate/falsify this suspicion.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi community,
>>>>>>>
>>>>>>> I have uploaded the log files of JobManager and TaskManager-1-1 (one
>>>>>>> of the 50 TaskManagers) with DEBUG log level and default Flink
>>>>>>> configuration, and it clearly shows that TaskManager failed to register
>>>>>>> with JobManager after 10 attempts.
>>>>>>>
>>>>>>> Here is the link:
>>>>>>>
>>>>>>> JobManager:
>>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>>
>>>>>>> TaskManager-1-1:
>>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>>
>>>>>>> Thanks : )
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Weike
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi community,
>>>>>>>>
>>>>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>>>>> for the TaskManagers to register with *JobManager *becomes
>>>>>>>> abnormally long (for a task with parallelism of 50, it could take 60 ~ 120
>>>>>>>> seconds or even longer for the registration attempt), and usually more than
>>>>>>>> 10 attempts are needed to finish this registration.
>>>>>>>>
>>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>>
>>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>>
>>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>>>>>> timed out.
>>>>>>>>
>>>>>>>>
>>>>>>>> In order to cope with this issue, we have to change the below
>>>>>>>> configuration parameters:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>>> request timeout. Please make sure that the cluster has enough resources.
>>>>>>>>> Stopping the JobMaster for job"
>>>>>>>>> slot.request.timeout: 500000
>>>>>>>>
>>>>>>>> # Increase max timeout in a single attempt
>>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>>> heartbeat.timeout: 500000
>>>>>>>>
>>>>>>>>
>>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>>> which is not what we want. It could be seen that during TaskManager
>>>>>>>> registration to JobManager, lots of warning messages come out in logs:
>>>>>>>>
>>>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using
>>>>>>>>> IP address as host name. Local input split assignment (such as for HDFS
>>>>>>>>> files) may be impacted.
>>>>>>>>
>>>>>>>>
>>>>>>>> Initially we thought this was probably the cause (reverse lookup of
>>>>>>>> DNS might take up a long time), however we later found that the reverse
>>>>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>>>>
>>>>>>>> Also, we have checked the GC log of both TaskManagers and
>>>>>>>> JobManager, and they seem to be perfectly normal, without any signs of
>>>>>>>> pauses. And the heartbeats are processed as normal according to the logs.
>>>>>>>>
>>>>>>>> Moreover, TaskManagers register quickly with ResourceManager, but
>>>>>>>> then extra slow with TaskManager, so this is not because of a slow network
>>>>>>>> connection.
>>>>>>>>
>>>>>>>> Here we wonder what could be the cause for the slow registration
>>>>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>>>>> messages, which is quite weird.
>>>>>>>>
>>>>>>>> Thanks for the reading, and hope to get some insights into this
>>>>>>>> issues : )
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Weike
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by "DONG, Weike" <ky...@connect.hku.hk>.
Hi all,

Thanks for all the replies. I agree with Yang: we have found that for a
pod without a Service (like a TaskManager pod), the reverse DNS lookup
always fails, so this lookup is not necessary in the Kubernetes
environment.

I am glad to help fix this issue to make Flink better : )

Best,
Weike

On Thu, Oct 15, 2020 at 7:57 PM Till Rohrmann <tr...@apache.org> wrote:

> Hi Weike,
>
> thanks for getting back to us with your findings. Looking at the
> `TaskManagerLocation`, we are actually calling
> `InetAddress.getCanonicalHostName` twice for every creation of a
> `TaskManagerLocation` instance. This does not look right.
>
> I think it should be fine to make the look up configurable. Moreover, one
> could think about only doing a lazy look up if the canonical hostname is
> really needed (as far as I can see it is only really needed input split
> assignments and for the LocationPreferenceSlotSelectionStrategy to
> calculate how many TMs run on the same machine).
>
> Do you want to fix this issue?
>
> Cheers,
> Till
>
> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <ky...@connect.hku.hk>
> wrote:
>
>> Hi Till and community,
>>
>> By the way, initially I resolved the IPs several times but results
>> returned rather quickly (less than 1ms, possibly due to DNS cache on the
>> server), so I thought it might not be the DNS issue.
>>
>> However, after debugging and logging, it is found that the lookup time
>> exhibited high variance, i. e. normally it completes fast but occasionally
>> some slow results would block the thread. So an unstable DNS server might
>> have a great impact on the performance of Flink job startup.
>>
>> Best,
>> Weike
>>
>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk>
>> wrote:
>>
>>> Hi Till and community,
>>>
>>> Increasing `kubernetes.jobmanager.cpu` in the configuration makes this
>>> issue alleviated but not disappeared.
>>>
>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>> found the culprit is
>>>
>>> inetAddress.getCanonicalHostName()
>>>
>>> in
>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>> and
>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
>>> are severely blocked by that.
>>>
>>> By commenting out the two methods, this issue seems to be solved
>>> immediately, so I wonder if Flink could provide a configuration parameter
>>> to turn off the DNS reverse lookup process, as it seems that Flink jobs
>>> could run happily without it.
>>>
>>> Sincerely,
>>> Weike
>>>
>>>
>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Hi Weike,
>>>>
>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>>>> component.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Weike,
>>>>>
>>>>> thanks for posting the logs. I will take a look at them. My suspicion
>>>>> would be that there is some operation blocking the JobMaster's main thread
>>>>> which causes the registrations from the TMs to time out. Maybe the logs
>>>>> allow me to validate/falsify this suspicion.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
>>>>> wrote:
>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> I have uploaded the log files of JobManager and TaskManager-1-1 (one
>>>>>> of the 50 TaskManagers) with DEBUG log level and default Flink
>>>>>> configuration, and it clearly shows that TaskManager failed to register
>>>>>> with JobManager after 10 attempts.
>>>>>>
>>>>>> Here is the link:
>>>>>>
>>>>>> JobManager:
>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>
>>>>>> TaskManager-1-1:
>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>
>>>>>> Thanks : )
>>>>>>
>>>>>> Best regards,
>>>>>> Weike
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi community,
>>>>>>>
>>>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>>>> for the TaskManagers to register with *JobManager *becomes
>>>>>>> abnormally long (for a task with parallelism of 50, it could take 60 ~ 120
>>>>>>> seconds or even longer for the registration attempt), and usually more than
>>>>>>> 10 attempts are needed to finish this registration.
>>>>>>>
>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>>
>>>>>>>
>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>
>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>
>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>>>>> timed out.
>>>>>>>
>>>>>>>
>>>>>>> In order to cope with this issue, we have to change the below
>>>>>>> configuration parameters:
>>>>>>>
>>>>>>>>
>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>> request timeout. Please make sure that the cluster has enough resources.
>>>>>>>> Stopping the JobMaster for job"
>>>>>>>> slot.request.timeout: 500000
>>>>>>>
>>>>>>> # Increase max timeout in a single attempt
>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>> heartbeat.timeout: 500000
>>>>>>>
>>>>>>>
>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>> which is not what we want. It could be seen that during TaskManager
>>>>>>> registration to JobManager, lots of warning messages come out in logs:
>>>>>>>
>>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using
>>>>>>>> IP address as host name. Local input split assignment (such as for HDFS
>>>>>>>> files) may be impacted.
>>>>>>>
>>>>>>>
>>>>>>> Initially we thought this was probably the cause (reverse lookup of
>>>>>>> DNS might take up a long time), however we later found that the reverse
>>>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>>>
>>>>>>> Also, we have checked the GC log of both TaskManagers and
>>>>>>> JobManager, and they seem to be perfectly normal, without any signs of
>>>>>>> pauses. And the heartbeats are processed as normal according to the logs.
>>>>>>>
>>>>>>> Moreover, TaskManagers register quickly with ResourceManager, but
>>>>>>> then extra slow with TaskManager, so this is not because of a slow network
>>>>>>> connection.
>>>>>>>
>>>>>>> Here we wonder what could be the cause for the slow registration
>>>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>>>> messages, which is quite weird.
>>>>>>>
>>>>>>> Thanks for the reading, and hope to get some insights into this
>>>>>>> issues : )
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Weike
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Yang Wang <da...@gmail.com>.
I am afraid the InetAddress cache cannot take effect, because Kubernetes
only creates A and SRV records for Services; it does not generate A records
for pods as you might expect. Refer to [1][2] for more information. So the
reverse DNS lookup will always fail. IIRC, the default timeout is 5s, which
could explain the delay of "getHostName" or "getFqdnHostName".

I agree that we should add a config option to disable the DNS reverse
lookup.


[1].
https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#coredns-configmap-options
[2].
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-aaaa-records-1


Best,
Yang

Chesnay Schepler <ch...@apache.org> 于2020年10月15日周四 下午8:41写道:

> The InetAddress caches the result of getCanonicalHostName(), so it is not
> a problem to call it twice.
>
> On 10/15/2020 1:57 PM, Till Rohrmann wrote:
>
> Hi Weike,
>
> thanks for getting back to us with your findings. Looking at the
> `TaskManagerLocation`, we are actually calling
> `InetAddress.getCanonicalHostName` twice for every creation of a
> `TaskManagerLocation` instance. This does not look right.
>
> I think it should be fine to make the look up configurable. Moreover, one
> could think about only doing a lazy look up if the canonical hostname is
> really needed (as far as I can see it is only really needed input split
> assignments and for the LocationPreferenceSlotSelectionStrategy to
> calculate how many TMs run on the same machine).
>
> Do you want to fix this issue?
>
> Cheers,
> Till
>
> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <ky...@connect.hku.hk>
> wrote:
>
>> Hi Till and community,
>>
>> By the way, initially I resolved the IPs several times but results
>> returned rather quickly (less than 1ms, possibly due to DNS cache on the
>> server), so I thought it might not be the DNS issue.
>>
>> However, after debugging and logging, it is found that the lookup time
>> exhibited high variance, i. e. normally it completes fast but occasionally
>> some slow results would block the thread. So an unstable DNS server might
>> have a great impact on the performance of Flink job startup.
>>
>> Best,
>> Weike
>>
>> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk>
>> wrote:
>>
>>> Hi Till and community,
>>>
>>> Increasing `kubernetes.jobmanager.cpu` in the configuration makes this
>>> issue alleviated but not disappeared.
>>>
>>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>>> found the culprit is
>>>
>>> inetAddress.getCanonicalHostName()
>>>
>>> in
>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>>> and
>>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>>> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
>>> are severely blocked by that.
>>>
>>> By commenting out the two methods, this issue seems to be solved
>>> immediately, so I wonder if Flink could provide a configuration parameter
>>> to turn off the DNS reverse lookup process, as it seems that Flink jobs
>>> could run happily without it.
>>>
>>> Sincerely,
>>> Weike
>>>
>>>
>>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Hi Weike,
>>>>
>>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>>>> component.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Weike,
>>>>>
>>>>> thanks for posting the logs. I will take a look at them. My suspicion
>>>>> would be that there is some operation blocking the JobMaster's main thread
>>>>> which causes the registrations from the TMs to time out. Maybe the logs
>>>>> allow me to validate/falsify this suspicion.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
>>>>> wrote:
>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> I have uploaded the log files of JobManager and TaskManager-1-1 (one
>>>>>> of the 50 TaskManagers) with DEBUG log level and default Flink
>>>>>> configuration, and it clearly shows that TaskManager failed to register
>>>>>> with JobManager after 10 attempts.
>>>>>>
>>>>>> Here is the link:
>>>>>>
>>>>>> JobManager:
>>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>>
>>>>>> TaskManager-1-1:
>>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>>
>>>>>> Thanks : )
>>>>>>
>>>>>> Best regards,
>>>>>> Weike
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi community,
>>>>>>>
>>>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>>>> for the TaskManagers to register with *JobManager *becomes
>>>>>>> abnormally long (for a task with parallelism of 50, it could take 60 ~ 120
>>>>>>> seconds or even longer for the registration attempt), and usually more than
>>>>>>> 10 attempts are needed to finish this registration.
>>>>>>>
>>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>>
>>>>>>>
>>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>>
>>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>>
>>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>>>>> timed out.
>>>>>>>
>>>>>>>
>>>>>>> In order to cope with this issue, we have to change the below
>>>>>>> configuration parameters:
>>>>>>>
>>>>>>>>
>>>>>>>> # Prevent "Could not allocate the required slot within slot
>>>>>>>> request timeout. Please make sure that the cluster has enough resources.
>>>>>>>> Stopping the JobMaster for job"
>>>>>>>> slot.request.timeout: 500000
>>>>>>>
>>>>>>> # Increase max timeout in a single attempt
>>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>>> akka.ask.timeout: 10 min
>>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>>> heartbeat.timeout: 500000
>>>>>>>
>>>>>>>
>>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>>> which is not what we want. It could be seen that during TaskManager
>>>>>>> registration to JobManager, lots of warning messages come out in logs:
>>>>>>>
>>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using
>>>>>>>> IP address as host name. Local input split assignment (such as for HDFS
>>>>>>>> files) may be impacted.
>>>>>>>
>>>>>>>
>>>>>>> Initially we thought this was probably the cause (reverse lookup of
>>>>>>> DNS might take up a long time), however we later found that the reverse
>>>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>>>
>>>>>>> Also, we have checked the GC log of both TaskManagers and
>>>>>>> JobManager, and they seem to be perfectly normal, without any signs of
>>>>>>> pauses. And the heartbeats are processed as normal according to the logs.
>>>>>>>
>>>>>>> Moreover, TaskManagers register quickly with ResourceManager, but
>>>>>>> then extra slow with TaskManager, so this is not because of a slow network
>>>>>>> connection.
>>>>>>>
>>>>>>> Here we wonder what could be the cause for the slow registration
>>>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>>>> messages, which is quite weird.
>>>>>>>
>>>>>>> Thanks for the reading, and hope to get some insights into this
>>>>>>> issues : )
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Weike
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Chesnay Schepler <ch...@apache.org>.
The InetAddress instance caches the result of getCanonicalHostName(), so
it is not a problem to call it twice.
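
A quick way to check this behavior on a given JVM is to time two
consecutive calls on the same InetAddress instance; on the JDK versions
commonly used with Flink 1.11 the second call returns the cached value
almost instantly. A minimal sketch (the IP address is just an example):

import java.net.InetAddress;

public class CanonicalHostNameCacheCheck {
    public static void main(String[] args) throws Exception {
        // Example IP; pass a real TaskManager pod IP as the first argument.
        InetAddress address = InetAddress.getByName(
                args.length > 0 ? args[0] : "9.166.0.118");

        long t1 = System.nanoTime();
        String first = address.getCanonicalHostName();   // may hit DNS
        long firstMs = (System.nanoTime() - t1) / 1_000_000;

        long t2 = System.nanoTime();
        String second = address.getCanonicalHostName();  // expected to come from the instance cache
        long secondMs = (System.nanoTime() - t2) / 1_000_000;

        System.out.println("1st call: " + first + " (" + firstMs + " ms)");
        System.out.println("2nd call: " + second + " (" + secondMs + " ms)");
    }
}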

On 10/15/2020 1:57 PM, Till Rohrmann wrote:
> Hi Weike,
>
> thanks for getting back to us with your findings. Looking at the 
> `TaskManagerLocation`, we are actually calling 
> `InetAddress.getCanonicalHostName` twice for every creation of a 
> `TaskManagerLocation` instance. This does not look right.
>
> I think it should be fine to make the look up configurable. Moreover, 
> one could think about only doing a lazy look up if the canonical 
> hostname is really needed (as far as I can see it is only really 
> needed input split assignments and for 
> the LocationPreferenceSlotSelectionStrategy to calculate how many TMs 
> run on the same machine).
>
> Do you want to fix this issue?
>
> Cheers,
> Till
>
> On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <kyledong@connect.hku.hk 
> <ma...@connect.hku.hk>> wrote:
>
>     Hi Till and community,
>
>     By the way, initially I resolved the IPs several times but results
>     returned rather quickly (less than 1ms, possibly due to DNS cache
>     on the server), so I thought it might not be the DNS issue.
>
>     However, after debugging and logging, it is found that the lookup
>     time exhibited high variance, i. e. normally it completes fast but
>     occasionally some slow results would block the thread. So an
>     unstable DNS server might have a great impact on the performance
>     of Flink job startup.
>
>     Best,
>     Weike
>
>     On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike
>     <kyledong@connect.hku.hk <ma...@connect.hku.hk>> wrote:
>
>         Hi Till and community,
>
>         Increasing `kubernetes.jobmanager.cpu` in the configuration
>         makes this issue alleviated but not disappeared.
>
>         After adding DEBUG logs to the internals of /flink-runtime/,
>         we have found the culprit is
>
>         inetAddress.getCanonicalHostName()
>
>         in
>         /org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName/
>         and
>         /org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName/,
>         which could take ~ 6 seconds to complete, thus Akka
>         dispatcher(s) are severely blocked by that.
>
>         By commenting out the two methods, this issue seems to be
>         solved immediately, so I wonder if Flink could provide a
>         configuration parameter to turn off the DNS reverse lookup
>         process, as it seems that Flink jobs could run happily without it.
>
>         Sincerely,
>         Weike
>
>
>         On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann
>         <trohrmann@apache.org <ma...@apache.org>> wrote:
>
>             Hi Weike,
>
>             could you try setting kubernetes.jobmanager.cpu: 4 in your
>             flink-conf.yaml? I fear that a single CPU is too low for
>             the JobManager component.
>
>             Cheers,
>             Till
>
>             On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann
>             <trohrmann@apache.org <ma...@apache.org>> wrote:
>
>                 Hi Weike,
>
>                 thanks for posting the logs. I will take a look at
>                 them. My suspicion would be that there is some
>                 operation blocking the JobMaster's main thread which
>                 causes the registrations from the TMs to time out.
>                 Maybe the logs allow me to validate/falsify this
>                 suspicion.
>
>                 Cheers,
>                 Till
>
>                 On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike
>                 <kyledong@connect.hku.hk
>                 <ma...@connect.hku.hk>> wrote:
>
>                     Hi community,
>
>                     I have uploaded the log files of JobManager and
>                     TaskManager-1-1 (one of the 50 TaskManagers) with
>                     DEBUG log level and default Flink configuration,
>                     and it clearly shows that TaskManager failed to
>                     register with JobManager after 10 attempts.
>
>                     Here is the link:
>
>                     JobManager:
>                     https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>
>                     TaskManager-1-1:
>                     https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>
>
>                     Thanks : )
>
>                     Best regards,
>                     Weike
>
>
>                     On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike
>                     <kyledong@connect.hku.hk
>                     <ma...@connect.hku.hk>> wrote:
>
>                         Hi community,
>
>                         Recently we have noticed a strange behavior
>                         for Flink jobs on Kubernetes per-job mode:
>                         when the parallelism increases, the time it
>                         takes for the TaskManagers to register with
>                         *JobManager *becomes abnormally long (for a
>                         task with parallelism of 50, it could take 60
>                         ~ 120 seconds or even longer for the
>                         registration attempt), and usually more than
>                         10 attempts are needed to finish this
>                         registration.
>
>                         Because of this, we could not submit a job
>                         requiring more than 20 slots with the default
>                         configuration, as the TaskManager would say:
>
>                             Registration at JobManager
>                             (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>                             attempt 9 timed out after 25600 ms
>
>                             Free slot with allocation id
>                             60d5277e138a94fb73fc6691557001e0 because:
>                             The slot 60d5277e138a94fb73fc6691557001e0
>                             has timed out. 
>
>                             Free slot TaskSlot(index:0,
>                             state:ALLOCATED, resource profile:
>                             ResourceProfile{cpuCores=1.0000000000000000,
>                             taskHeapMemory=1.425gb (1530082070 bytes),
>                             taskOffHeapMemory=0 bytes,
>                             managedMemory=1.340gb (1438814063 bytes),
>                             networkMemory=343.040mb (359703515
>                             bytes)}, allocationId:
>                             60d5277e138a94fb73fc6691557001e0, jobId:
>                             493cd86e389ccc8f2887e1222903b5ce).
>                             java.lang.Exception: The slot
>                             60d5277e138a94fb73fc6691557001e0 has timed
>                             out. 
>
>
>                         In order to cope with this issue, we have to
>                         change the below configuration parameters:
>
>
>                             # Prevent "Could not allocate the required
>                             slot within slot request timeout. Please
>                             make sure that the cluster has enough
>                             resources. Stopping the JobMaster for job"
>                             slot.request.timeout: 500000 
>
>                             # Increase max timeout in a single attempt
>                             cluster.registration.max-timeout: 300000
>                             # Prevent "free slot (TaskSlot)"
>                             akka.ask.timeout: 10 min
>                             # Prevent "Heartbeat of TaskManager timed
>                             out."
>                             heartbeat.timeout: 500000
>
>
>                         However, we acknowledge that this is only a
>                         temporary dirty fix, which is not what we
>                         want. It could be seen that during TaskManager
>                         registration to JobManager, lots of warning
>                         messages come out in logs:
>
>                             No hostname could be resolved for the IP
>                             address 9.166.0.118, using IP address as
>                             host name. Local input split assignment
>                             (such as for HDFS files) may be impacted.
>
>
>                         Initially we thought this was probably the
>                         cause (reverse lookup of DNS might take up a
>                         long time), however we later found that the
>                         reverse lookup only took less than 1ms, so
>                         maybe not because of this.
>
>                         Also, we have checked the GC log of both
>                         TaskManagers and JobManager, and they seem to
>                         be perfectly normal, without any signs of
>                         pauses. And the heartbeats are processed as
>                         normal according to the logs.
>
>                         Moreover, TaskManagers register quickly with
>                         ResourceManager, but then extra slow with
>                         TaskManager, so this is not because of a slow
>                         network connection.
>
>                         Here we wonder what could be the cause for the
>                         slow registration between JobManager and
>                         TaskManager(s)? No other warning or error
>                         messages in the log (DEBUG level) other than
>                         the "No hostname could be resolved" messages,
>                         which is quite weird.
>
>                         Thanks for the reading, and hope to get some
>                         insights into this issues : )
>
>                         Sincerely,
>                         Weike
>


Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Till Rohrmann <tr...@apache.org>.
Hi Weike,

thanks for getting back to us with your findings. Looking at the
`TaskManagerLocation`, we are actually calling
`InetAddress.getCanonicalHostName` twice for every creation of a
`TaskManagerLocation` instance. This does not look right.

I think it should be fine to make the lookup configurable. Moreover, one
could think about only doing a lazy lookup if the canonical hostname is
really needed (as far as I can see it is only really needed for input split
assignments and for the LocationPreferenceSlotSelectionStrategy to
calculate how many TMs run on the same machine).
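
A minimal sketch (not from the Flink code base; the class name and the
boolean switch are hypothetical) of what such a configurable, lazy lookup
could look like:

    import java.net.InetAddress;

    /** Sketch of a TaskManagerLocation-like holder that defers the reverse DNS lookup. */
    public final class LazyTaskManagerHostName {

        private final InetAddress inetAddress;
        private final boolean reverseLookupEnabled; // hypothetical config switch
        private volatile String hostName;           // cached after the first lookup

        public LazyTaskManagerHostName(InetAddress inetAddress, boolean reverseLookupEnabled) {
            this.inetAddress = inetAddress;
            this.reverseLookupEnabled = reverseLookupEnabled;
        }

        /** Performs the (potentially slow) reverse lookup only when first asked for. */
        public String getHostName() {
            if (hostName == null) {
                hostName = reverseLookupEnabled
                        ? inetAddress.getCanonicalHostName()
                        : inetAddress.getHostAddress(); // fall back to the plain IP
            }
            return hostName;
        }
    }

With a cached, lazy lookup the registration path never pays for the reverse
DNS call; only a code path that really needs the FQDN (such as input split
assignment) would trigger it.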

Do you want to fix this issue?

Cheers,
Till

On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike <ky...@connect.hku.hk>
wrote:

> Hi Till and community,
>
> By the way, initially I resolved the IPs several times but results
> returned rather quickly (less than 1ms, possibly due to DNS cache on the
> server), so I thought it might not be the DNS issue.
>
> However, after debugging and logging, we found that the lookup time
> exhibits high variance, i.e. it normally completes fast, but an occasional
> slow lookup blocks the thread. So an unstable DNS server can have a great
> impact on the performance of Flink job startup.
>
> Best,
> Weike
>
> On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk>
> wrote:
>
>> Hi Till and community,
>>
>> Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates
>> this issue but does not make it disappear.
>>
>> After adding DEBUG logs to the internals of *flink-runtime*, we have
>> found the culprit is
>>
>> inetAddress.getCanonicalHostName()
>>
>> in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
>> and
>> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
>> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
>> are severely blocked by that.
>>
>> By commenting out the two methods, this issue seems to be solved
>> immediately, so I wonder if Flink could provide a configuration parameter
>> to turn off the DNS reverse lookup process, as it seems that Flink jobs
>> could run happily without it.
>>
>> Sincerely,
>> Weike
>>
>>
>> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Weike,
>>>
>>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>>> component.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Hi Weike,
>>>>
>>>> thanks for posting the logs. I will take a look at them. My suspicion
>>>> would be that there is some operation blocking the JobMaster's main thread
>>>> which causes the registrations from the TMs to time out. Maybe the logs
>>>> allow me to validate/falsify this suspicion.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
>>>> wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> I have uploaded the log files of JobManager and TaskManager-1-1 (one
>>>>> of the 50 TaskManagers) with DEBUG log level and default Flink
>>>>> configuration, and it clearly shows that TaskManager failed to register
>>>>> with JobManager after 10 attempts.
>>>>>
>>>>> Here is the link:
>>>>>
>>>>> JobManager:
>>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>>
>>>>> TaskManager-1-1:
>>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>>
>>>>> Thanks : )
>>>>>
>>>>> Best regards,
>>>>> Weike
>>>>>
>>>>>
>>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>>>>> wrote:
>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>>> for the TaskManagers to register with *JobManager *becomes
>>>>>> abnormally long (for a task with parallelism of 50, it could take 60 ~ 120
>>>>>> seconds or even longer for the registration attempt), and usually more than
>>>>>> 10 attempts are needed to finish this registration.
>>>>>>
>>>>>> Because of this, we could not submit a job requiring more than 20
>>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>>
>>>>>>
>>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>>> attempt 9 timed out after 25600 ms
>>>>>>
>>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0
>>>>>>> because: The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>>
>>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>>>> timed out.
>>>>>>
>>>>>>
>>>>>> In order to cope with this issue, we have to change the below
>>>>>> configuration parameters:
>>>>>>
>>>>>>>
>>>>>>> # Prevent "Could not allocate the required slot within slot request
>>>>>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>>>>>> the JobMaster for job"
>>>>>>> slot.request.timeout: 500000
>>>>>>
>>>>>> # Increase max timeout in a single attempt
>>>>>>> cluster.registration.max-timeout: 300000
>>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>>> akka.ask.timeout: 10 min
>>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>>> heartbeat.timeout: 500000
>>>>>>
>>>>>>
>>>>>> However, we acknowledge that this is only a temporary dirty fix,
>>>>>> which is not what we want. It could be seen that during TaskManager
>>>>>> registration to JobManager, lots of warning messages come out in logs:
>>>>>>
>>>>>> No hostname could be resolved for the IP address 9.166.0.118, using
>>>>>>> IP address as host name. Local input split assignment (such as for HDFS
>>>>>>> files) may be impacted.
>>>>>>
>>>>>>
>>>>>> Initially we thought this was probably the cause (reverse lookup of
>>>>>> DNS might take up a long time), however we later found that the reverse
>>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>>
>>>>>> Also, we have checked the GC log of both TaskManagers and JobManager,
>>>>>> and they seem to be perfectly normal, without any signs of pauses. And the
>>>>>> heartbeats are processed as normal according to the logs.
>>>>>>
>>>>>> Moreover, TaskManagers register quickly with ResourceManager, but
>>>>>> then extra slow with TaskManager, so this is not because of a slow network
>>>>>> connection.
>>>>>>
>>>>>> Here we wonder what could be the cause for the slow registration
>>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>>> messages, which is quite weird.
>>>>>>
>>>>>> Thanks for the reading, and hope to get some insights into this
>>>>>> issues : )
>>>>>>
>>>>>> Sincerely,
>>>>>> Weike
>>>>>>
>>>>>>
>>>>>>
>>>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by "DONG, Weike" <ky...@connect.hku.hk>.
Hi Till and community,

By the way, initially I resolved the IPs several times but results returned
rather quickly (less than 1ms, possibly due to DNS cache on the server), so
I thought it might not be the DNS issue.

However, after debugging and logging, we found that the lookup time
exhibits high variance, i.e. it normally completes fast, but an occasional
slow lookup blocks the thread. So an unstable DNS server can have a great
impact on the performance of Flink job startup.
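
To make the variance visible, a small stand-alone probe along these lines
(a sketch; the IP address is just a placeholder for one of the TaskManager
pod IPs) can be run from inside the cluster:

    import java.net.InetAddress;

    public class ReverseLookupVariance {
        public static void main(String[] args) throws Exception {
            // Placeholder address; substitute a TaskManager pod IP from your cluster.
            InetAddress address = InetAddress.getByName("9.166.0.118");
            for (int i = 0; i < 20; i++) {
                long start = System.nanoTime();
                String name = address.getCanonicalHostName(); // reverse DNS lookup
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("attempt " + i + ": " + name + " in " + elapsedMs + " ms");
            }
        }
    }

Note that resolver-level caching (e.g. by the node or a local DNS cache)
may smooth out later attempts, so the first few iterations are usually the
most telling.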

Best,
Weike

On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike <ky...@connect.hku.hk> wrote:

> Hi Till and community,
>
> Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this
> issue but does not make it disappear.
>
> After adding DEBUG logs to the internals of *flink-runtime*, we have
> found the culprit is
>
> inetAddress.getCanonicalHostName()
>
> in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
> and
> *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
> which could take ~ 6 seconds to complete, thus Akka dispatcher(s)
> are severely blocked by that.
>
> By commenting out the two methods, this issue seems to be solved
> immediately, so I wonder if Flink could provide a configuration parameter
> to turn off the DNS reverse lookup process, as it seems that Flink jobs
> could run happily without it.
>
> Sincerely,
> Weike
>
>
> On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Weike,
>>
>> could you try setting kubernetes.jobmanager.cpu: 4 in your
>> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
>> component.
>>
>> Cheers,
>> Till
>>
>> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Weike,
>>>
>>> thanks for posting the logs. I will take a look at them. My suspicion
>>> would be that there is some operation blocking the JobMaster's main thread
>>> which causes the registrations from the TMs to time out. Maybe the logs
>>> allow me to validate/falsify this suspicion.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> I have uploaded the log files of JobManager and TaskManager-1-1 (one of
>>>> the 50 TaskManagers) with DEBUG log level and default Flink configuration,
>>>> and it clearly shows that TaskManager failed to register with JobManager
>>>> after 10 attempts.
>>>>
>>>> Here is the link:
>>>>
>>>> JobManager:
>>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>>
>>>> TaskManager-1-1:
>>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>>
>>>> Thanks : )
>>>>
>>>> Best regards,
>>>> Weike
>>>>
>>>>
>>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>>>> wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>>> for the TaskManagers to register with *JobManager *becomes abnormally
>>>>> long (for a task with parallelism of 50, it could take 60 ~ 120 seconds or
>>>>> even longer for the registration attempt), and usually more than 10
>>>>> attempts are needed to finish this registration.
>>>>>
>>>>> Because of this, we could not submit a job requiring more than 20
>>>>> slots with the default configuration, as the TaskManager would say:
>>>>>
>>>>>
>>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>>> attempt 9 timed out after 25600 ms
>>>>>
>>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>>
>>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>>> timed out.
>>>>>
>>>>>
>>>>> In order to cope with this issue, we have to change the below
>>>>> configuration parameters:
>>>>>
>>>>>>
>>>>>> # Prevent "Could not allocate the required slot within slot request
>>>>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>>>>> the JobMaster for job"
>>>>>> slot.request.timeout: 500000
>>>>>
>>>>> # Increase max timeout in a single attempt
>>>>>> cluster.registration.max-timeout: 300000
>>>>>> # Prevent "free slot (TaskSlot)"
>>>>>> akka.ask.timeout: 10 min
>>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>>> heartbeat.timeout: 500000
>>>>>
>>>>>
>>>>> However, we acknowledge that this is only a temporary dirty fix, which
>>>>> is not what we want. It could be seen that during TaskManager registration
>>>>> to JobManager, lots of warning messages come out in logs:
>>>>>
>>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>>>>> address as host name. Local input split assignment (such as for HDFS files)
>>>>>> may be impacted.
>>>>>
>>>>>
>>>>> Initially we thought this was probably the cause (reverse lookup of
>>>>> DNS might take up a long time), however we later found that the reverse
>>>>> lookup only took less than 1ms, so maybe not because of this.
>>>>>
>>>>> Also, we have checked the GC log of both TaskManagers and JobManager,
>>>>> and they seem to be perfectly normal, without any signs of pauses. And the
>>>>> heartbeats are processed as normal according to the logs.
>>>>>
>>>>> Moreover, TaskManagers register quickly with ResourceManager, but then
>>>>> extra slow with TaskManager, so this is not because of a slow network
>>>>> connection.
>>>>>
>>>>> Here we wonder what could be the cause for the slow registration
>>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>>> messages, which is quite weird.
>>>>>
>>>>> Thanks for the reading, and hope to get some insights into this issues
>>>>> : )
>>>>>
>>>>> Sincerely,
>>>>> Weike
>>>>>
>>>>>
>>>>>
>>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by "DONG, Weike" <ky...@connect.hku.hk>.
Hi Till and community,

Increasing `kubernetes.jobmanager.cpu` in the configuration alleviates this
issue but does not make it disappear.

After adding DEBUG logs to the internals of *flink-runtime*, we have found
that the culprit is

inetAddress.getCanonicalHostName()

in *org.apache.flink.runtime.taskmanager.TaskManagerLocation#getHostName*
and
*org.apache.flink.runtime.taskmanager.TaskManagerLocation#getFqdnHostName*,
which can take ~6 seconds to complete, so the Akka dispatcher(s) are
severely blocked by it.
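
The effect on the dispatcher can be illustrated without Flink at all: on a
shared single-threaded executor (a toy stand-in for one Akka dispatcher
thread, not Flink's actual setup), one slow blocking call delays every
message queued behind it:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class BlockedDispatcherDemo {
        public static void main(String[] args) throws Exception {
            ExecutorService dispatcher = Executors.newSingleThreadExecutor();
            long start = System.nanoTime();

            // Stand-in for a TaskManagerLocation creation blocking ~6 s on a reverse lookup.
            dispatcher.submit(() -> sleepQuietly(6000));

            // A registration message queued behind it cannot be handled until the lookup returns.
            dispatcher.submit(() -> System.out.println(
                    "registration handled after " + (System.nanoTime() - start) / 1_000_000 + " ms"));

            dispatcher.shutdown();
            dispatcher.awaitTermination(10, TimeUnit.SECONDS);
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

With several slow lookups queued one after another, the accumulated delay
easily exceeds the default registration timeout.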

After commenting out the two methods, the issue disappears immediately, so
I wonder whether Flink could provide a configuration parameter to turn off
the reverse DNS lookup, as it seems that Flink jobs can run happily
without it.

Sincerely,
Weike


On Tue, Oct 13, 2020 at 6:52 PM Till Rohrmann <tr...@apache.org> wrote:

> Hi Weike,
>
> could you try setting kubernetes.jobmanager.cpu: 4 in your
> flink-conf.yaml? I fear that a single CPU is too low for the JobManager
> component.
>
> Cheers,
> Till
>
> On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Weike,
>>
>> thanks for posting the logs. I will take a look at them. My suspicion
>> would be that there is some operation blocking the JobMaster's main thread
>> which causes the registrations from the TMs to time out. Maybe the logs
>> allow me to validate/falsify this suspicion.
>>
>> Cheers,
>> Till
>>
>> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
>> wrote:
>>
>>> Hi community,
>>>
>>> I have uploaded the log files of JobManager and TaskManager-1-1 (one of
>>> the 50 TaskManagers) with DEBUG log level and default Flink configuration,
>>> and it clearly shows that TaskManager failed to register with JobManager
>>> after 10 attempts.
>>>
>>> Here is the link:
>>>
>>> JobManager:
>>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>>
>>> TaskManager-1-1:
>>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>>
>>> Thanks : )
>>>
>>> Best regards,
>>> Weike
>>>
>>>
>>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> Recently we have noticed a strange behavior for Flink jobs on
>>>> Kubernetes per-job mode: when the parallelism increases, the time it takes
>>>> for the TaskManagers to register with *JobManager *becomes abnormally
>>>> long (for a task with parallelism of 50, it could take 60 ~ 120 seconds or
>>>> even longer for the registration attempt), and usually more than 10
>>>> attempts are needed to finish this registration.
>>>>
>>>> Because of this, we could not submit a job requiring more than 20 slots
>>>> with the default configuration, as the TaskManager would say:
>>>>
>>>>
>>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>>> attempt 9 timed out after 25600 ms
>>>>
>>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>>
>>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>>> timed out.
>>>>
>>>>
>>>> In order to cope with this issue, we have to change the below
>>>> configuration parameters:
>>>>
>>>>>
>>>>> # Prevent "Could not allocate the required slot within slot request
>>>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>>>> the JobMaster for job"
>>>>> slot.request.timeout: 500000
>>>>
>>>> # Increase max timeout in a single attempt
>>>>> cluster.registration.max-timeout: 300000
>>>>> # Prevent "free slot (TaskSlot)"
>>>>> akka.ask.timeout: 10 min
>>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>>> heartbeat.timeout: 500000
>>>>
>>>>
>>>> However, we acknowledge that this is only a temporary dirty fix, which
>>>> is not what we want. It could be seen that during TaskManager registration
>>>> to JobManager, lots of warning messages come out in logs:
>>>>
>>>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>>>> address as host name. Local input split assignment (such as for HDFS files)
>>>>> may be impacted.
>>>>
>>>>
>>>> Initially we thought this was probably the cause (reverse lookup of DNS
>>>> might take up a long time), however we later found that the reverse lookup
>>>> only took less than 1ms, so maybe not because of this.
>>>>
>>>> Also, we have checked the GC log of both TaskManagers and JobManager,
>>>> and they seem to be perfectly normal, without any signs of pauses. And the
>>>> heartbeats are processed as normal according to the logs.
>>>>
>>>> Moreover, TaskManagers register quickly with ResourceManager, but then
>>>> extra slow with TaskManager, so this is not because of a slow network
>>>> connection.
>>>>
>>>> Here we wonder what could be the cause for the slow registration
>>>> between JobManager and TaskManager(s)? No other warning or error messages
>>>> in the log (DEBUG level) other than the "No hostname could be resolved"
>>>> messages, which is quite weird.
>>>>
>>>> Thanks for the reading, and hope to get some insights into this issues
>>>> : )
>>>>
>>>> Sincerely,
>>>> Weike
>>>>
>>>>
>>>>
>>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Till Rohrmann <tr...@apache.org>.
Hi Weike,

could you try setting kubernetes.jobmanager.cpu: 4 in your flink-conf.yaml?
I fear that a single CPU is too low for the JobManager component.

Cheers,
Till

On Tue, Oct 13, 2020 at 11:33 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Weike,
>
> thanks for posting the logs. I will take a look at them. My suspicion
> would be that there is some operation blocking the JobMaster's main thread
> which causes the registrations from the TMs to time out. Maybe the logs
> allow me to validate/falsify this suspicion.
>
> Cheers,
> Till
>
> On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
> wrote:
>
>> Hi community,
>>
>> I have uploaded the log files of JobManager and TaskManager-1-1 (one of
>> the 50 TaskManagers) with DEBUG log level and default Flink configuration,
>> and it clearly shows that TaskManager failed to register with JobManager
>> after 10 attempts.
>>
>> Here is the link:
>>
>> JobManager:
>> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>>
>> TaskManager-1-1:
>> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>>
>> Thanks : )
>>
>> Best regards,
>> Weike
>>
>>
>> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
>> wrote:
>>
>>> Hi community,
>>>
>>> Recently we have noticed a strange behavior for Flink jobs on Kubernetes
>>> per-job mode: when the parallelism increases, the time it takes for the
>>> TaskManagers to register with *JobManager *becomes abnormally long (for
>>> a task with parallelism of 50, it could take 60 ~ 120 seconds or even
>>> longer for the registration attempt), and usually more than 10 attempts are
>>> needed to finish this registration.
>>>
>>> Because of this, we could not submit a job requiring more than 20 slots
>>> with the default configuration, as the TaskManager would say:
>>>
>>>
>>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>>> attempt 9 timed out after 25600 ms
>>>
>>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>>
>>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>>> 493cd86e389ccc8f2887e1222903b5ce).
>>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has
>>>> timed out.
>>>
>>>
>>> In order to cope with this issue, we have to change the below
>>> configuration parameters:
>>>
>>>>
>>>> # Prevent "Could not allocate the required slot within slot request
>>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>>> the JobMaster for job"
>>>> slot.request.timeout: 500000
>>>
>>> # Increase max timeout in a single attempt
>>>> cluster.registration.max-timeout: 300000
>>>> # Prevent "free slot (TaskSlot)"
>>>> akka.ask.timeout: 10 min
>>>> # Prevent "Heartbeat of TaskManager timed out."
>>>> heartbeat.timeout: 500000
>>>
>>>
>>> However, we acknowledge that this is only a temporary dirty fix, which
>>> is not what we want. It could be seen that during TaskManager registration
>>> to JobManager, lots of warning messages come out in logs:
>>>
>>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>>> address as host name. Local input split assignment (such as for HDFS files)
>>>> may be impacted.
>>>
>>>
>>> Initially we thought this was probably the cause (reverse lookup of DNS
>>> might take up a long time), however we later found that the reverse lookup
>>> only took less than 1ms, so maybe not because of this.
>>>
>>> Also, we have checked the GC log of both TaskManagers and JobManager,
>>> and they seem to be perfectly normal, without any signs of pauses. And the
>>> heartbeats are processed as normal according to the logs.
>>>
>>> Moreover, TaskManagers register quickly with ResourceManager, but then
>>> extra slow with TaskManager, so this is not because of a slow network
>>> connection.
>>>
>>> Here we wonder what could be the cause for the slow registration between
>>> JobManager and TaskManager(s)? No other warning or error messages in the
>>> log (DEBUG level) other than the "No hostname could be resolved" messages,
>>> which is quite weird.
>>>
>>> Thanks for the reading, and hope to get some insights into this issues :
>>> )
>>>
>>> Sincerely,
>>> Weike
>>>
>>>
>>>
>>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by Till Rohrmann <tr...@apache.org>.
Hi Weike,

thanks for posting the logs. I will take a look at them. My suspicion would
be that there is some operation blocking the JobMaster's main thread which
causes the registrations from the TMs to time out. Maybe the logs allow me
to validate/falsify this suspicion.
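
One way to check that suspicion on a running JobManager is to look at what
the dispatcher threads are doing, e.g. with a sketch like the following run
inside (or attached to) the JobManager JVM; the thread-name filter is an
assumption about the default Akka dispatcher naming:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DispatcherStackProbe {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                // Assumed naming pattern for the Akka dispatcher threads.
                if (info.getThreadName().contains("akka.actor.default-dispatcher")) {
                    System.out.println(info.getThreadName() + " -> " + info.getThreadState());
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }
    }

A plain jstack dump of the JobManager pod gives the same picture; a
dispatcher thread sitting in a reverse DNS lookup would confirm this
suspicion.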

Cheers,
Till

On Mon, Oct 12, 2020 at 10:43 AM DONG, Weike <ky...@connect.hku.hk>
wrote:

> Hi community,
>
> I have uploaded the log files of JobManager and TaskManager-1-1 (one of
> the 50 TaskManagers) with DEBUG log level and default Flink configuration,
> and it clearly shows that TaskManager failed to register with JobManager
> after 10 attempts.
>
> Here is the link:
>
> JobManager:
> https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce
>
> TaskManager-1-1:
> https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe
>
> Thanks : )
>
> Best regards,
> Weike
>
>
> On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk>
> wrote:
>
>> Hi community,
>>
>> Recently we have noticed a strange behavior for Flink jobs on Kubernetes
>> per-job mode: when the parallelism increases, the time it takes for the
>> TaskManagers to register with *JobManager *becomes abnormally long (for
>> a task with parallelism of 50, it could take 60 ~ 120 seconds or even
>> longer for the registration attempt), and usually more than 10 attempts are
>> needed to finish this registration.
>>
>> Because of this, we could not submit a job requiring more than 20 slots
>> with the default configuration, as the TaskManager would say:
>>
>>
>>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>>> attempt 9 timed out after 25600 ms
>>
>> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because:
>>> The slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>>
>> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>>> 493cd86e389ccc8f2887e1222903b5ce).
>>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed
>>> out.
>>
>>
>> In order to cope with this issue, we have to change the below
>> configuration parameters:
>>
>>>
>>> # Prevent "Could not allocate the required slot within slot request
>>> timeout. Please make sure that the cluster has enough resources. Stopping
>>> the JobMaster for job"
>>> slot.request.timeout: 500000
>>
>> # Increase max timeout in a single attempt
>>> cluster.registration.max-timeout: 300000
>>> # Prevent "free slot (TaskSlot)"
>>> akka.ask.timeout: 10 min
>>> # Prevent "Heartbeat of TaskManager timed out."
>>> heartbeat.timeout: 500000
>>
>>
>> However, we acknowledge that this is only a temporary dirty fix, which is
>> not what we want. It could be seen that during TaskManager registration to
>> JobManager, lots of warning messages come out in logs:
>>
>> No hostname could be resolved for the IP address 9.166.0.118, using IP
>>> address as host name. Local input split assignment (such as for HDFS files)
>>> may be impacted.
>>
>>
>> Initially we thought this was probably the cause (reverse lookup of DNS
>> might take up a long time), however we later found that the reverse lookup
>> only took less than 1ms, so maybe not because of this.
>>
>> Also, we have checked the GC log of both TaskManagers and JobManager, and
>> they seem to be perfectly normal, without any signs of pauses. And the
>> heartbeats are processed as normal according to the logs.
>>
>> Moreover, TaskManagers register quickly with ResourceManager, but then
>> extra slow with TaskManager, so this is not because of a slow network
>> connection.
>>
>> Here we wonder what could be the cause for the slow registration between
>> JobManager and TaskManager(s)? No other warning or error messages in the
>> log (DEBUG level) other than the "No hostname could be resolved" messages,
>> which is quite weird.
>>
>> Thanks for the reading, and hope to get some insights into this issues : )
>>
>> Sincerely,
>> Weike
>>
>>
>>
>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

Posted by "DONG, Weike" <ky...@connect.hku.hk>.
Hi community,

I have uploaded the log files of the JobManager and of TaskManager-1-1 (one
of the 50 TaskManagers), taken with DEBUG log level and the default Flink
configuration; they clearly show that the TaskManager failed to register
with the JobManager after 10 attempts.

Here are the links:

JobManager:
https://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce

TaskManager-1-1:
https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe

Thanks : )

Best regards,
Weike


On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike <ky...@connect.hku.hk> wrote:

> Hi community,
>
> Recently we have noticed a strange behavior for Flink jobs on Kubernetes
> per-job mode: when the parallelism increases, the time it takes for the
> TaskManagers to register with *JobManager *becomes abnormally long (for a
> task with parallelism of 50, it could take 60 ~ 120 seconds or even longer
> for the registration attempt), and usually more than 10 attempts are needed
> to finish this registration.
>
> Because of this, we could not submit a job requiring more than 20 slots
> with the default configuration, as the TaskManager would say:
>
>
>> Registration at JobManager (akka.tcp://flink@myjob-201076.default:6123/user/rpc/jobmanager_2)
>> attempt 9 timed out after 25600 ms
>
> Free slot with allocation id 60d5277e138a94fb73fc6691557001e0 because: The
>> slot 60d5277e138a94fb73fc6691557001e0 has timed out.
>
> Free slot TaskSlot(index:0, state:ALLOCATED, resource profile:
>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.425gb
>> (1530082070 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.340gb
>> (1438814063 bytes), networkMemory=343.040mb (359703515 bytes)},
>> allocationId: 60d5277e138a94fb73fc6691557001e0, jobId:
>> 493cd86e389ccc8f2887e1222903b5ce).
>> java.lang.Exception: The slot 60d5277e138a94fb73fc6691557001e0 has timed
>> out.
>
>
> In order to cope with this issue, we have to change the below
> configuration parameters:
>
>>
>> # Prevent "Could not allocate the required slot within slot request
>> timeout. Please make sure that the cluster has enough resources. Stopping
>> the JobMaster for job"
>> slot.request.timeout: 500000
>
> # Increase max timeout in a single attempt
>> cluster.registration.max-timeout: 300000
>> # Prevent "free slot (TaskSlot)"
>> akka.ask.timeout: 10 min
>> # Prevent "Heartbeat of TaskManager timed out."
>> heartbeat.timeout: 500000
>
>
> However, we acknowledge that this is only a temporary dirty fix, which is
> not what we want. It could be seen that during TaskManager registration to
> JobManager, lots of warning messages come out in logs:
>
> No hostname could be resolved for the IP address 9.166.0.118, using IP
>> address as host name. Local input split assignment (such as for HDFS files)
>> may be impacted.
>
>
> Initially we thought this was probably the cause (reverse lookup of DNS
> might take up a long time), however we later found that the reverse lookup
> only took less than 1ms, so maybe not because of this.
>
> Also, we have checked the GC log of both TaskManagers and JobManager, and
> they seem to be perfectly normal, without any signs of pauses. And the
> heartbeats are processed as normal according to the logs.
>
> Moreover, TaskManagers register quickly with ResourceManager, but then
> extra slow with TaskManager, so this is not because of a slow network
> connection.
>
> Here we wonder what could be the cause for the slow registration between
> JobManager and TaskManager(s)? No other warning or error messages in the
> log (DEBUG level) other than the "No hostname could be resolved" messages,
> which is quite weird.
>
> Thanks for the reading, and hope to get some insights into this issues : )
>
> Sincerely,
> Weike
>
>
>