Posted to commits@beam.apache.org by "Amit Sela (JIRA)" <ji...@apache.org> on 2017/03/17 22:14:41 UTC

[jira] [Comment Edited] (BEAM-1582) ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.

    [ https://issues.apache.org/jira/browse/BEAM-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895861#comment-15895861 ] 

Amit Sela edited comment on BEAM-1582 at 3/17/17 10:14 PM:
-----------------------------------------------------------

Could be related to SPARK-16480 so that the last {{CheckpointMark}} is not properly checkpointed.
If for some reason the runtime environment was so slow that execution failed to start before the timeout was hit, the graceful stop would force at least the first batch to finish. If that first batch included the read from Kafka but failed to checkpoint the {{Reader}} mark, resuming from the checkpoint would read the entire Kafka backlog again, causing the failures we see.

I'll have a look at the execution times of the failed tests to figure out whether that seems to be the case. If so, I will simply move this test to post-commit, because this issue in Spark was only resolved for v2.0.
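
The suspected failure mode can be sketched as a toy simulation. This is only an illustration under stated assumptions: {{ToySource}}, {{ToyJob}} and {{simulate}} are hypothetical names, not Beam, Spark or Kafka APIs, and the model reduces a "firing" to "one non-empty batch was processed". It shows that if the first batch reads the records but the reader mark is never persisted, the resumed run reads the same records again and fires a second time.

```python
# Toy model only: none of these names are Beam, Spark, or Kafka APIs.
from dataclasses import dataclass, field


@dataclass
class ToySource:
    """Stands in for a Kafka-like log that is read by offset."""
    records: list

    def read_from(self, offset):
        return self.records[offset:]


@dataclass
class ToyJob:
    source: ToySource
    checkpointed_offset: int = 0  # the only state that survives a restart
    fired: list = field(default_factory=list)

    def run_batch(self, persist_mark):
        batch = self.source.read_from(self.checkpointed_offset)
        if batch:
            # One firing per non-empty batch.
            self.fired.append(list(batch))
        if persist_mark:
            # Persist the reader position so a resume skips these records.
            self.checkpointed_offset += len(batch)


def simulate(persist_mark_on_first_batch):
    job = ToyJob(ToySource(["k1", "k2", "k3"]))
    # First batch: forced to finish by the graceful stop.
    job.run_batch(persist_mark=persist_mark_on_first_batch)
    # Second batch: the run that resumes from the checkpoint.
    job.run_batch(persist_mark=True)
    return job.fired
```

With {{simulate(True)}} the resumed run finds nothing new and there is a single firing; with {{simulate(False)}} the whole backlog is read again and the same records fire twice, which matches the symptom in the issue title.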


was (Author: amitsela):
Could be related to SPARK-16480 so that the last {{CheckpointMark}} is not properly checkpointed.

> ResumeFromCheckpointStreamingTest flakes with what appears as a second firing.
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-1582
>                 URL: https://issues.apache.org/jira/browse/BEAM-1582
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>             Fix For: First stable release
>
>
> See: https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/org.apache.beam$beam-runners-spark/2788/testReport/junit/org.apache.beam.runners.spark.translation.streaming/ResumeFromCheckpointStreamingTest/testWithResume/
> After some digging, it appears that a second firing occurs (though only one is expected), but it doesn't come from stale state (the state is empty before it fires).
> Might be a retry happening for some reason, which is OK in terms of fault-tolerance guarantees (at-least-once), but not so much in terms of flaky tests. 
> I'm looking into this and hope to fix it ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)