Posted to issues@flink.apache.org by "PengFei Li (Jira)" <ji...@apache.org> on 2019/12/24 07:02:00 UTC

[jira] [Comment Edited] (FLINK-14843) Streaming bucketing end-to-end test can fail with Output hash mismatch

    [ https://issues.apache.org/jira/browse/FLINK-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002690#comment-17002690 ] 

PengFei Li edited comment on FLINK-14843 at 12/24/19 7:01 AM:
--------------------------------------------------------------

I think this is not a bug but a consequence of how the test works. The output of test_streaming_bucketing.sh shows that the number of produced values is 62530, which is more than the expected 60000, so the checksum fails. The duplicated data comes from pending files that aren't included in a checkpoint and therefore can't be truncated to remove the duplicates when the job is restored. The purpose of "sleep 10" is to wait for at least one completed checkpoint before triggering another failover, so that the pending files generated when the job is closing are in the restored checkpoint. 10 seconds is enough because the checkpoint interval is set to 4s in BucketingSinkTestProgram. Maybe we need to add a comment on "sleep 10". What do you think? [~gjy] [~kkl0u]
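
To make the timing argument concrete, here is a rough sketch of the arithmetic behind "10 seconds is enough". The helper name checkpoint_intervals_within and both variables are hypothetical and not part of test_streaming_bucketing.sh. The sketch also assumes that checkpoints are triggered at a fixed interval and complete quickly, which may not hold on a loaded machine.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the Flink test scripts): number of full
# checkpoint intervals that fit into a wait of the given length.
# Assumption: checkpoints trigger at a fixed interval and complete quickly.
checkpoint_intervals_within() {
    local wait_seconds=$1
    local interval_seconds=$2
    # Bash integer division rounds down, so partial intervals are not counted.
    echo $(( wait_seconds / interval_seconds ))
}

# Values taken from the discussion: "sleep 10" and a 4s checkpoint interval.
WAIT_SECONDS=10
CHECKPOINT_INTERVAL_SECONDS=4

intervals=$(checkpoint_intervals_within "$WAIT_SECONDS" "$CHECKPOINT_INTERVAL_SECONDS")
echo "Full checkpoint intervals within the wait: $intervals"

# Surplus in the failing run: 62530 produced values vs. 60000 expected.
echo "Duplicated values: $(( 62530 - 60000 ))"
```

Under these assumptions the wait spans two full checkpoint intervals, so at least one checkpoint should complete before the next failover. With the sleep commented out, no such guarantee exists, which is consistent with the 2530 surplus values in the failing run.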


was (Author: banmoy):
I think it is not a bug, but how the test works. The output of test_streaming_bucketing.sh tells us that number of produced values is 62530, which is more than the expected 60000, so checksum fails. The duplicated data is from those pending files which isn't included in a checkpoint, and can't be truncated to remove duplicated data when job is restored. The meaning of "sleep 10" is waiting for at least one completed checkpoint before triggering another failover, so that pending files generated when job is closing are in the restored checkpoint. 10 seconds is enough because checkpoint interval is set to 4s in BucketingSinkTestProgram. Maybe we just need to add a comment on "sleep 10". What do you think? [~gjy] [~kkl0u]

> Streaming bucketing end-to-end test can fail with Output hash mismatch
> ----------------------------------------------------------------------
>
>                 Key: FLINK-14843
>                 URL: https://issues.apache.org/jira/browse/FLINK-14843
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem, Tests
>    Affects Versions: 1.10.0
>         Environment: rev: dcc1330375826b779e4902176bb2473704dabb11
>            Reporter: Gary Yao
>            Assignee: PengFei Li
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.10.0
>
>         Attachments: complete_result, flink-gary-standalonesession-0-gyao-desktop.log, flink-gary-taskexecutor-0-gyao-desktop.log, flink-gary-taskexecutor-1-gyao-desktop.log, flink-gary-taskexecutor-2-gyao-desktop.log, flink-gary-taskexecutor-3-gyao-desktop.log, flink-gary-taskexecutor-4-gyao-desktop.log, flink-gary-taskexecutor-5-gyao-desktop.log, flink-gary-taskexecutor-6-gyao-desktop.log
>
>
> *Description*
> Streaming bucketing end-to-end test ({{test_streaming_bucketing.sh}}) can fail with Output hash mismatch.
> {noformat}
> Number of running task managers has reached 4.
> Job (e0b7a86e4d4111f3947baa3d004e083a) is running.
> Waiting until all values have been produced
> Truncating buckets
> Number of produced values 26930/60000
> Truncating buckets
> Number of produced values 30890/60000
> Truncating buckets
> Number of produced values 37340/60000
> Truncating buckets
> Number of produced values 41290/60000
> Truncating buckets
> Number of produced values 46710/60000
> Truncating buckets
> Number of produced values 52120/60000
> Truncating buckets
> Number of produced values 57110/60000
> Truncating buckets
> Number of produced values 62530/60000
> Cancelling job e0b7a86e4d4111f3947baa3d004e083a.
> Cancelled job e0b7a86e4d4111f3947baa3d004e083a.
> Waiting for job (e0b7a86e4d4111f3947baa3d004e083a) to reach terminal state CANCELED ...
> Job (e0b7a86e4d4111f3947baa3d004e083a) reached terminal state CANCELED
> Job e0b7a86e4d4111f3947baa3d004e083a was cancelled, time to verify
> FAIL Bucketing Sink: Output hash mismatch.  Got 9e00429abfb30eea4f459eb812b470ad, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
> head hexdump of actual:
> 0000000   (   2   ,   1   0   ,   0   ,   S   o   m   e       p   a   y
> 0000010   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   1
> 0000020   ,   S   o   m   e       p   a   y   l   o   a   d   .   .   .
> 0000030   )  \n   (   2   ,   1   0   ,   2   ,   S   o   m   e       p
> 0000040   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0
> 0000050   ,   3   ,   S   o   m   e       p   a   y   l   o   a   d   .
> 0000060   .   .   )  \n   (   2   ,   1   0   ,   4   ,   S   o   m   e
> 0000070       p   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,
> 0000080   1   0   ,   5   ,   S   o   m   e       p   a   y   l   o   a
> 0000090   d   .   .   .   )  \n   (   2   ,   1   0   ,   6   ,   S   o
> 00000a0   m   e       p   a   y   l   o   a   d   .   .   .   )  \n   (
> 00000b0   2   ,   1   0   ,   7   ,   S   o   m   e       p   a   y   l
> 00000c0   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   8   ,
> 00000d0   S   o   m   e       p   a   y   l   o   a   d   .   .   .   )
> 00000e0  \n   (   2   ,   1   0   ,   9   ,   S   o   m   e       p   a
> 00000f0   y   l   o   a   d   .   .   .   )  \n                        
> 00000fa
> Stopping taskexecutor daemon (pid: 55164) on host gyao-desktop.
> Stopping standalonesession daemon (pid: 51073) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 51504) on host gyao-desktop.
> Skipping taskexecutor daemon (pid: 52034), because it is not running anymore on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52472), because it is not running anymore on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52916), because it is not running anymore on gyao-desktop.
> Stopping taskexecutor daemon (pid: 54121) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 54726) on host gyao-desktop.
> [FAIL] Test script contains errors.
> Checking of logs skipped.
> [FAIL] 'flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh' failed after 2 minutes and 3 seconds! Test exited with exit code 1
> {noformat}
> *How to reproduce*
> Comment out the delay of 10s after the 1st TM is restarted to provoke the issue:
> {code:bash}
> echo "Restarting 1 TM"
> $FLINK_DIR/bin/taskmanager.sh start
> wait_for_number_of_running_tms 4
> #sleep 10
> echo "Killing 2 TMs"
> kill_random_taskmanager
> kill_random_taskmanager
> wait_for_number_of_running_tms 2
> {code}
> Command to run the test:
> {noformat}
> FLINK_DIR=build-target/ flink-end-to-end-tests/run-single-test.sh skip flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh
> {noformat}


