You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by zentol <gi...@git.apache.org> on 2018/06/27 11:13:46 UTC

[GitHub] flink pull request #6216: [FLINK-9674][tests] Replace hard-coded sleeps in Q...

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/6216

    [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E test 

    ## What is the purpose of the change
    
    This PR replaces a hard-coded sleep in a QueryableState end-to-end test. The test was sleeping for 65 seconds after the TM was restarted, to wait until the restarted.
    
    Instead we now first wait for the job restart procedure to kick in (using a new method to check for state transitions) and then waiting for the job to run.
    
    Additionally
    * reduce `heartbeat.timeout` to speed up TM loss detection
    * fix `common#wait_job_running` returning true for scheduled jobs, we now pass the `-r` flag
    * simplify jar paths in QS test (now relative to END_TO_END_DIR)
    * change `test-runner-common#cleanup` to use existing `common#clean_[log|stdout]_files` functions
    
    ## Verifying this change
    
    * manually verified

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 9674

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6216.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6216
    
----
commit b6ab955173e1813b9903bf1fb2eb1345ff17abd3
Author: zentol <ch...@...>
Date:   2018-06-27T10:58:49Z

    [hotfix][tests] Reuse existing functions for cleaning logs

commit b4866c55bf64710aa5f21d214c86caa4702dd73e
Author: zentol <ch...@...>
Date:   2018-06-27T10:59:16Z

    [hotfix][tests] Simplify jar paths to QS tests

commit 2e4788f5c6840780c4ff0152a91435087802a14e
Author: zentol <ch...@...>
Date:   2018-06-27T11:01:06Z

    [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E test

----


---

[GitHub] flink issue #6216: [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E t...

Posted by florianschmidt1994 <gi...@git.apache.org>.

Github user florianschmidt1994 commented on the issue:

    https://github.com/apache/flink/pull/6216
  
    > (Did you start a review, but wrote your comment separately? Then your review would still be in progress!)
    
    Yes :D I submitted it now
    
    >The issues you identified are certainly valid, but I'm wondering if it really makes sense to invest time into this now.
    >There've been discussions about writing a python/java based framework for the tests (and I've already started tinkering on a java version); so any significant changes we make to the bash-scripts might be subsumed soon™.
    
    Alright, then let's leave it this way and we can still take care of it should we run into problems with this in the future



---

[GitHub] flink issue #6216: [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E t...

Posted by zentol <gi...@git.apache.org>.

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/6216
  
    @florianschmidt1994 I don't see the comment you're referring to :(
    (Did you start a review, but wrote your comment separately? Then your review would still be in progress!)
    
    The issues you identified are certainly valid, but I'm wondering if it really makes sense to invest time into this now.
    There've been discussions about writing a python/java based framework for the tests (and I've already started tinkering on a java version); so any _significant_ changes we make to the bash-scripts might be subsumed soon&#8482;.


---

[GitHub] flink pull request #6216: [FLINK-9674][tests] Replace hard-coded sleeps in Q...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/6216


---

[GitHub] flink issue #6216: [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E t...

Posted by florianschmidt1994 <gi...@git.apache.org>.

Github user florianschmidt1994 commented on the issue:

    https://github.com/apache/flink/pull/6216
  
    Thanks @zentol, I have one remark (see above), besides that looks good to me!
    
    Additionally I had some ideas that came to mind that I think we could discuss:
    - We have a common pattern of `wait_for_sth` functions, that either
      - get stuck in a loop for ever if the desired event doesn't happen (I think `wait_for_job_state_transition` also behaves like that, right?)
      - or iterate a fixed number of times and then continue execution, whereas instead they should fail.
    
    I think we should add an issue for that to refactor that over all the tests to have consistent and useful behaviour
    - Also I think that we could have the backup config and revert config as part of the test runner  and always do that, so we avoid running into a corrupted flink-dist if tests don't behave correctly?
    
    What do you think!



---

[GitHub] flink issue #6216: [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E t...

Posted by zentol <gi...@git.apache.org>.

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/6216
  
    merging.


---

[GitHub] flink issue #6216: [FLINK-9674][tests] Replace hard-coded sleeps in QS E2E t...

Posted by yanghua <gi...@git.apache.org>.

Github user yanghua commented on the issue:

    https://github.com/apache/flink/pull/6216
  
    +1


---

[GitHub] flink pull request #6216: [FLINK-9674][tests] Replace hard-coded sleeps in Q...

Posted by zentol <gi...@git.apache.org>.

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6216#discussion_r199738568
  
    --- Diff: flink-end-to-end-tests/test-scripts/test_queryable_state_restart_tm.sh ---
    @@ -85,20 +90,23 @@ function run_test() {
             exit 1
         fi
     
    -    local current_num_checkpoints=current_num_checkpoints$(get_completed_number_of_checkpoints ${JOB_ID})
    -
         kill_random_taskmanager
     
         latest_snapshot_count=$(cat $FLINK_DIR/log/*out* | grep "on snapshot" | tail -n 1 | awk '{print $4}')
         echo "Latest snapshot count was ${latest_snapshot_count}"
     
    -    sleep 65 # this is a little longer than the heartbeat timeout so that the TM is gone
    +    # wait until the TM loss was detected
    +    wait_for_job_state_transition ${JOB_ID} "RESTARTING" "CREATED"
     
         start_and_wait_for_tm
     
    +    wait_job_running ${JOB_ID}
    +
    +    local current_num_checkpoints="$(get_completed_number_of_checkpoints ${JOB_ID})"
    --- End diff --
    
    So when I ran this test locally it frequently occurred that we never actually waited for checkpoints to complete; the number of checkpoint that we waited for had already occurred and we exited early.
    The job just switches to running, and right away we got 3 completed checkpoints?
    This was a bit, well, _odd_. 
    
    In the previous version the checkpoint count could further increase while we shut down the TM.
    Technically there's no guarantee how up-to-date the result if `get_completed_number_of_checkpoints` or how long `kill_random_taskmanager` actually takes, so we may have fulfilled the checkpoint count condition before the job even restarts. It's admittedly unlikely.
    
    This change simply guarantees that the job actually completes N checkpoints after it was restarted, and just serves to eliminate doubts about the correctness of the test.


---

[GitHub] flink pull request #6216: [FLINK-9674][tests] Replace hard-coded sleeps in Q...

Posted by florianschmidt1994 <gi...@git.apache.org>.

Github user florianschmidt1994 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6216#discussion_r199719951
  
    --- Diff: flink-end-to-end-tests/test-scripts/test_queryable_state_restart_tm.sh ---
    @@ -85,20 +90,23 @@ function run_test() {
             exit 1
         fi
     
    -    local current_num_checkpoints=current_num_checkpoints$(get_completed_number_of_checkpoints ${JOB_ID})
    -
         kill_random_taskmanager
     
         latest_snapshot_count=$(cat $FLINK_DIR/log/*out* | grep "on snapshot" | tail -n 1 | awk '{print $4}')
         echo "Latest snapshot count was ${latest_snapshot_count}"
     
    -    sleep 65 # this is a little longer than the heartbeat timeout so that the TM is gone
    +    # wait until the TM loss was detected
    +    wait_for_job_state_transition ${JOB_ID} "RESTARTING" "CREATED"
     
         start_and_wait_for_tm
     
    +    wait_job_running ${JOB_ID}
    +
    +    local current_num_checkpoints="$(get_completed_number_of_checkpoints ${JOB_ID})"
    --- End diff --
    
    Why did you move this from `before killing the TM` to `after having the TM restarted again` ?


---