Posted to issues@flink.apache.org by "Yun Gao (Jira)" <ji...@apache.org> on 2022/01/12 14:16:00 UTC

[jira] [Comment Edited] (FLINK-25307) Resuming Savepoint (hashmap, async, no parallelism change) end-to-end test timeout on azure

    [ https://issues.apache.org/jira/browse/FLINK-25307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474559#comment-17474559 ] 

Yun Gao edited comment on FLINK-25307 at 1/12/22, 2:15 PM:
-----------------------------------------------------------

Compared the successful log with the failed log:
Successful one: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=29235&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=133]

Failed one: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=29231&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=070ff179-953e-5bda-71fa-d6599415701c&l=687]

 
 # In the successful run, curl returns exit code 7 before the JM has started, which means it failed to connect to the host; this is expected. In the failed run, curl fails after ~2 minutes with exit code 28, which means the operation timed out, even though the JM had in fact already started by then.
 # More importantly, in the successful run curl always tries to connect to the same IP address, while in the failed run the IP address keeps changing.
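
To make the two exit codes above easier to tell apart when reading logs, here is a minimal Python sketch that runs curl with a hard time limit and maps its exit code to a short description. It is an illustration only, not the logic in the test scripts: the helper names (`describe_curl_exit`, `probe`) and the URL (`http://127.0.0.1:8081/overview`, i.e. Flink's default REST port on the loopback address) are assumptions.

```python
import subprocess

# Meanings of a few curl exit codes, as documented in the curl man page.
CURL_EXIT_MEANINGS = {
    0: "success",
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
}


def describe_curl_exit(code: int) -> str:
    """Translate a curl exit code into a short human-readable description."""
    return CURL_EXIT_MEANINGS.get(code, f"other curl error ({code})")


def probe(url: str, max_time_s: int = 5) -> int:
    """Run curl once against url and return its exit code.

    --max-time bounds the whole transfer, so a hanging connection shows up
    as exit code 28 after max_time_s seconds instead of after minutes.
    """
    result = subprocess.run(
        ["curl", "--silent", "--output", "/dev/null",
         "--max-time", str(max_time_s), url],
        check=False,
    )
    return result.returncode


if __name__ == "__main__":
    # Hypothetical endpoint: adjust host and port to the cluster under test.
    url = "http://127.0.0.1:8081/overview"
    code = probe(url)
    print(f"curl exit code {code}: {describe_curl_exit(code)}")
```

With a short `--max-time`, a "connection refused" (7) and a hang (28) can be told apart within seconds rather than minutes.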

Thus it looks to me like the problem may be with DNS resolution. Perhaps we could try using `localhost` or `127.0.0.1` instead of the hostname in this case. I'll give it a try.
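
One way to check the DNS suspicion on a build agent is to resolve the hostname several times and see whether the set of returned addresses stays stable. The following Python sketch does that with the system resolver; the function names and the number of attempts are assumptions for illustration, and it is not part of the existing test scripts.

```python
import socket
from typing import Callable, List, Set


def resolve_once(host: str) -> Set[str]:
    """Return the set of IP addresses the system resolver reports for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    return {info[4][0] for info in infos}


def distinct_addresses(
    host: str,
    attempts: int = 5,
    resolver: Callable[[str], Set[str]] = resolve_once,
) -> List[str]:
    """Resolve host `attempts` times and return every distinct address seen."""
    seen: Set[str] = set()
    for _ in range(attempts):
        seen |= resolver(host)
    return sorted(seen)


if __name__ == "__main__":
    hostname = socket.gethostname()
    # More than one address for the local hostname would be consistent with
    # curl connecting to a different IP on each attempt.
    print(hostname, distinct_addresses(hostname))
    # The loopback address needs no DNS lookup and should always be stable.
    print("127.0.0.1", distinct_addresses("127.0.0.1"))
```

If the hostname yields several addresses while `127.0.0.1` yields exactly one, that would support switching the test to the loopback address.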


> Resuming Savepoint (hashmap, async, no parallelism change) end-to-end test timeout on azure
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25307
>                 URL: https://issues.apache.org/jira/browse/FLINK-25307
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Runtime / Coordination
>    Affects Versions: 1.13.3, 1.15.0
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Blocker
>              Labels: pull-request-available, stale-critical, test-stability
>             Fix For: 1.15.0
>
>
> {code:java}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line 860: kill: (93166) - No such process
> Dec 14 10:30:13 Stopping job timeout watchdog (with pid=93166)
> Dec 14 10:30:13 [FAIL] Test script contains errors.
> Dec 14 10:30:13 Checking for errors...
> Dec 14 10:30:14 No errors in log files.
> Dec 14 10:30:14 Checking for exceptions...
> Dec 14 10:30:14 No exceptions in log files.
> Dec 14 10:30:14 Checking for non-empty .out files...
> Dec 14 10:30:14 No non-empty .out files.
> Dec 14 10:30:14 
> Dec 14 10:30:14 [FAIL] 'Resuming Savepoint (hashmap, async, no parallelism change) end-to-end test' failed after 15 minutes and 0 seconds! Test exited with exit code 1
> Dec 14 10:30:14 
> 10:30:14 ##[group]Environment Information
> Dec 14 10:30:15 Searching for .dump, .dumpstream and related files in '/home/vsts/work/1/s'
> dmesg: read kernel buffer failed: Operation not permitted
> Dec 14 10:30:16 Stopping taskexecutor daemon (pid: 93751) on host fv-az43-70.
> Dec 14 10:30:17 Stopping standalonesession daemon (pid: 93500) on host fv-az43-70.
> The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.
> ##[error]Bash exited with code '1'.
> Finishing: Run e2e tests
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28088&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=79112


