Posted to issues@flink.apache.org by "Guowei Ma (JIRA)" <ji...@apache.org> on 2019/07/20 11:57:00 UTC

[jira] [Comment Edited] (FLINK-10819) JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure is unstable

    [ https://issues.apache.org/jira/browse/FLINK-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889486#comment-16889486 ] 

Guowei Ma edited comment on FLINK-10819 at 7/20/19 11:56 AM:
-------------------------------------------------------------

This might be a bug, based on the log at https://api.travis-ci.org/v3/job/508500560/log.txt

1. The test times out because two "READY_MARKER_FILE_PREFIX" files are missing.

2. The two tasks responsible for creating those files can't be deployed because no resources are available.

!image-2019-07-19-17-01-19-758.png!

3. The slots from one TM (34dbf0f8264469af49be8e1dbc2ad811) are not recognized by the SlotManager. Because of this, the two tasks can't be deployed.

!image-2019-07-19-17-00-17-194.png!

4. The TM (34dbf0f8264469af49be8e1dbc2ad811) registers with the RM twice.

!image-2019-07-19-16-59-10-178.png!

The RM sends two RegistrationResponses to the TM, but the TM handles RegistrationResponses on different threads. As a result, the registrationId from the old RegistrationResponse overrides the registrationId from the new one.

A simple fix would be to process the responses on the main thread on the TM side. I am still thinking about whether there is another approach.
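
To illustrate the idea, here is a minimal sketch and not Flink code. All names in it (RegistrationSketch, Response, attempt, onResponse) are hypothetical, and the single-threaded executor only stands in for the TM main thread. The attempt counter used to drop stale responses is an additional assumption on top of the main-thread idea, not something confirmed from the logs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names are actual Flink classes. */
class RegistrationSketch {

    /** Hypothetical response: the registration attempt it answers and the id assigned by the RM. */
    static final class Response {
        final int attempt;
        final String registrationId;

        Response(int attempt, String registrationId) {
            this.attempt = attempt;
            this.registrationId = registrationId;
        }
    }

    // Single-threaded executor standing in for the TM main thread.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // Both fields are only read and written on mainThread, so no locking is needed.
    private int latestAttempt = -1;
    private String currentRegistrationId;

    /** May be called from any thread; the state change itself runs on the main thread. */
    void onResponse(Response response) {
        mainThread.execute(() -> {
            // Assumption: a response to an older attempt is stale and must not win.
            if (response.attempt > latestAttempt) {
                latestAttempt = response.attempt;
                currentRegistrationId = response.registrationId;
            }
        });
    }

    /** Reads the id on the main thread, after all previously queued responses. */
    String awaitCurrentRegistrationId() {
        try {
            return mainThread.submit(() -> currentRegistrationId).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        mainThread.shutdown();
    }

    public static void main(String[] args) {
        RegistrationSketch tm = new RegistrationSketch();
        // The response to the second registration arrives before the first one.
        tm.onResponse(new Response(2, "new-id"));
        tm.onResponse(new Response(1, "old-id"));
        System.out.println(tm.awaitCurrentRegistrationId()); // prints "new-id"
        tm.shutdown();
    }
}
```

Because every update goes through one thread, the two responses can no longer interleave, and the attempt check keeps a late response to the first registration from replacing the id of the second.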

was (Author: maguowei):
It might be a bug. 

1. The test times out because two "READY_MARKER_FILE_PREFIX" files are missing.

2. The two tasks responsible for creating those files can't be deployed because no resources are available.

!image-2019-07-19-17-01-19-758.png!

3. The slots from one TM (34dbf0f8264469af49be8e1dbc2ad811) are not recognized by the SlotManager. Because of this, the two tasks can't be deployed.

!image-2019-07-19-17-00-17-194.png!

4. The TM (34dbf0f8264469af49be8e1dbc2ad811) registers with the RM twice.

!image-2019-07-19-16-59-10-178.png!

The RM sends two RegistrationResponses to the TM, but the TM handles RegistrationResponses on different threads. As a result, the registrationId from the old RegistrationResponse overrides the registrationId from the new one.

A simple fix would be to process the responses on the main thread on the TM side. I am still thinking about whether there is another approach.

> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure is unstable
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-10819
>                 URL: https://issues.apache.org/jira/browse/FLINK-10819
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.7.0
>            Reporter: sunjincheng
>            Assignee: Guowei Ma
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.9.0
>
>         Attachments: image-2019-07-19-16-59-10-178.png, image-2019-07-19-17-00-17-194.png, image-2019-07-19-17-01-19-758.png
>
>
> Found the following error during a CI run:
> Results :
> Tests in error: 
>  JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:331 » IllegalArgument
> Tests run: 1463, Failures: 0, Errors: 1, Skipped: 29
> 18:40:55.828 [INFO] ------------------------------------------------------------------------
> 18:40:55.829 [INFO] BUILD FAILURE
> 18:40:55.829 [INFO] ------------------------------------------------------------------------
> 18:40:55.830 [INFO] Total time: 30:19 min
> 18:40:55.830 [INFO] Finished at: 2018-11-07T18:40:55+00:00
> 18:40:56.294 [INFO] Final Memory: 92M/678M
> 18:40:56.294 [INFO] ------------------------------------------------------------------------
> 18:40:56.294 [WARNING] The requested profile "include-kinesis" could not be activated because it does not exist.
> 18:40:56.295 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (integration-tests) on project flink-tests_2.11: There are test failures.
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] Please refer to /home/travis/build/sunjincheng121/flink/flink-tests/target/surefire-reports for the individual test results.
> 18:40:56.295 [ERROR] -> [Help 1]
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> 18:40:56.295 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] For more information about the errors and possible solutions, please read the following articles:
> 18:40:56.295 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> MVN exited with EXIT CODE: 1.
> Trying to KILL watchdog (11329).
> ./tools/travis_mvn_watchdog.sh: line 269: 11329 Terminated watchdog
> PRODUCED build artifacts.
> After a rerun, the error disappeared.
> No specific cause has been found so far; will continue to monitor.


