Posted to issues@beam.apache.org by "Luke Cwik (Jira)" <ji...@apache.org> on 2020/10/07 16:34:00 UTC

[jira] [Updated] (BEAM-10942) beam_PostCommit_Java_Nexmark_Flink failing due to loss of state when executing splittable DoFn

     [ https://issues.apache.org/jira/browse/BEAM-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Cwik updated BEAM-10942:
-----------------------------
    Status: Resolved  (was: Open)

> beam_PostCommit_Java_Nexmark_Flink failing due to loss of state when executing splittable DoFn
> ----------------------------------------------------------------------------------------------
>
>                 Key: BEAM-10942
>                 URL: https://issues.apache.org/jira/browse/BEAM-10942
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>    Affects Versions: 2.25.0
>            Reporter: Luke Cwik
>            Assignee: Luke Cwik
>            Priority: P0
>
> Nexmark fails with an NPE caused by a failure to load the element/restriction state for a splittable DoFn.
> Additional logging in https://github.com/lukecwik/incubator-beam/tree/beam10670.2 shows that the element/restriction pair is null/null:
> {noformat}
> Sep 21, 2020 5:26:23 PM org.apache.beam.runners.core.SplittableParDoViaKeyedWorkItems$ProcessFn processElement
> INFO: before: KV{null, null}
> {noformat}
> Earlier logging shows that the element and restriction were set and read back from state successfully, then a timer was set, and after that the same state reads returned null:
> {noformat}
> INFO: Setting: //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 100000), pane=PaneInfo.NO_FIRING}
> INFO: Setting: //:713916102:restriction:UnboundedSourceRestriction{source=UnboundedEventSource(33000, 34000), checkpoint=GeneratorCheckpoint{numEvents=1000, wallclockBaseTime=1600734382646}, watermark=294247-01-10T04:00:54.775Z}
> INFO: Reading: //:713916102:element:ValueInGlobalWindow{value=UnboundedEventSource(0, 100000), pane=PaneInfo.NO_FIRING}
> INFO: Reading: //:713916102:restriction:UnboundedSourceRestriction{source=UnboundedEventSource(33000, 34000), checkpoint=GeneratorCheckpoint{numEvents=1000, wallclockBaseTime=1600734382646}, watermark=294247-01-10T04:00:54.775Z}
> INFO: Setting: //:713916102:watermarkEstimatorState:294247-01-10T04:00:54.775Z
> INFO: Setting timer: 713916102 1:1600734383019// at 1600734383019 with output time 1600734383019
> INFO: Reading: //:713916102:element:null
> INFO: Reading: //:713916102:restriction:null
> INFO: Reading: //:713916102:watermarkEstimatorState:null
> {noformat}
> This is for the byte[] key with hash 713916102 in the global window namespace.
> Note that in a rerun the keys will be different, since they are UUIDs, but the pattern is the same.
> To rerun:
> Check out https://github.com/lukecwik/incubator-beam/tree/beam10670.2 for the additional log output.
> Set up a Gradle run with the task:
> {noformat}
> :sdks:java:testing:nexmark:run
> {noformat}
> with args:
> {noformat}
> -Pnexmark.runner=":runners:flink:1.11"
> -Pnexmark.args="     --runner=FlinkRunner     --suite=SMOKE     --streamTimeout=60     --streaming=true     --manageResources=false     --monitorJobs=true     --flinkMaster=[local] --numEvents=100000 --query=4 --checkpointingInterval=3000 --shutdownSourcesAfterIdleMs=60000"
> {noformat}
> Note: you may need to increase your console size (Editor>General>Console) to capture enough output before it fails. I usually add a breakpoint on the UserCodeException constructor to pause the program and inspect the state at that point in time.
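
Since the keys change on every rerun, it can help to spot the "set, then read back as null" pattern mechanically instead of by eye. The following is a minimal sketch, not part of Beam: the class name LostStateScanner and the method findLostState are hypothetical, and it assumes the log lines keep the "Setting: //:<key hash>:<state tag>:<value>" and "Reading: ..." shape shown in the description. It also cannot tell a lost value from a legitimate state clear, so treat a hit as a hint only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Beam) that scans the extra log output for lost state. */
public class LostStateScanner {
  // Matches lines such as "INFO: Setting: //:713916102:element:<value>".
  // Group 1 = operation, group 2 = key hash, group 3 = state tag, group 4 = logged value.
  // "Setting timer: ..." lines do not match because "Setting" is not followed by ": //:".
  private static final Pattern STATE_LINE =
      Pattern.compile("INFO: (Setting|Reading): //:(-?\\d+):(\\w+):(.*)");

  /** Returns "keyHash:stateTag" for every cell that was set to a non-null value and later read as null. */
  public static List<String> findLostState(List<String> logLines) {
    Map<String, String> lastSet = new HashMap<>();
    List<String> lost = new ArrayList<>();
    for (String line : logLines) {
      Matcher m = STATE_LINE.matcher(line);
      if (!m.find()) {
        continue; // not a state read/write line
      }
      String cell = m.group(2) + ":" + m.group(3);
      String value = m.group(4).trim();
      if (m.group(1).equals("Setting")) {
        lastSet.put(cell, value);
      } else if (value.equals("null")
          && lastSet.containsKey(cell)
          && !lastSet.get(cell).equals("null")
          && !lost.contains(cell)) {
        lost.add(cell); // read returned null although a value had been set earlier
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    // Three lines copied from the log excerpt in the description.
    List<String> sample = new ArrayList<>();
    sample.add("INFO: Setting: //:713916102:watermarkEstimatorState:294247-01-10T04:00:54.775Z");
    sample.add("INFO: Setting timer: 713916102 1:1600734383019// at 1600734383019 with output time 1600734383019");
    sample.add("INFO: Reading: //:713916102:watermarkEstimatorState:null");
    System.out.println(findLostState(sample)); // prints [713916102:watermarkEstimatorState]
  }
}
```

Feeding the captured console output through findLostState line by line lists each key hash and state tag that shows the failure pattern, which gives a concrete key to filter on when the breakpoint hits.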



--
This message was sent by Atlassian Jira
(v8.3.4#803005)