Posted to issues@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/01/02 20:05:00 UTC
[jira] [Work logged] (BEAM-3499) Watch can make no progress if a single poll takes more than checkpoint interval

     [ https://issues.apache.org/jira/browse/BEAM-3499?focusedWorklogId=180390&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-180390 ]

ASF GitHub Bot logged work on BEAM-3499:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Jan/19 20:04
            Start Date: 02/Jan/19 20:04
    Worklog Time Spent: 10m 
      Work Description: kennknowles commented on issue #4483: [BEAM-3499, BEAM-2607] Gives the runner access to positions of SDF claimed blocks
URL: https://github.com/apache/beam/pull/4483#issuecomment-450969455
 
 
   Noting here, too, that `InMemoryStateInternals` is not part of the direct runner, but a general utility. The cloning changes caused perf regressions in multiple other contexts and need to be reverted and reinstated only in the direct runner.
 


Issue Time Tracking
-------------------

    Worklog Id:     (was: 180390)
    Time Spent: 0.5h  (was: 20m)

> Watch can make no progress if a single poll takes more than checkpoint interval
> -------------------------------------------------------------------------------
>
>                 Key: BEAM-3499
>                 URL: https://issues.apache.org/jira/browse/BEAM-3499
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>            Priority: Major
>             Fix For: 2.3.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> E.g. when using it to poll a filepattern with hundreds of thousands of files, a single poll may take >10 seconds (the default checkpoint interval in OutputAndTimeBoundedSplittableProcessElementInvoker). Because of that, the tracker (GrowthTracker) gets checkpointed before anything is added to it, i.e. before https://github.com/apache/beam/blob/0d918b7cab8c4ccb2b5e050501327912161d40a7/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L727, at a moment when it doesn't contain any useful information, so the residual checkpoint state is as empty as the initial one. When we resume from the residual checkpoint, the situation simply repeats, until we get lucky enough to either take <10s to poll, or to not be asked to checkpoint for >10s (e.g. because the checkpointing thread isn't scheduled).
> One possible fix to this is to change the SDF checkpointing strategy to have a progress guarantee: e.g., start counting time from the moment the first block is claimed, or allow the tracker to refuse checkpointing if nothing is claimed yet, or something like that.
>  
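To make the stall and the first proposed fix concrete, here is a minimal toy model in plain Java. It is not Beam code: the class, the enum, and the method are all hypothetical names, and the model assumes that each process call performs exactly one poll and that a claim can only happen once that poll returns. It compares starting the checkpoint timer at the start of the call with starting it at the first claim.

```java
/** Toy model of the stall described above. Not Beam code; all names are hypothetical. */
final class WatchCheckpointSketch {

  /** When the checkpoint timer starts counting. */
  enum Strategy {
    FROM_CALL_START,  // behavior described in the issue
    FROM_FIRST_CLAIM  // proposed fix: count from the first claimed block
  }

  /**
   * Returns the 1-based attempt in which a poll result is claimed before the
   * checkpoint fires, or -1 if that never happens within maxAttempts.
   * Times are virtual milliseconds measured from the start of each attempt.
   */
  static int attemptsUntilProgress(
      Strategy strategy, long pollMillis, long checkpointMillis, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // The claim can only happen once the poll has returned.
      long claimTime = pollMillis;
      long checkpointTime =
          strategy == Strategy.FROM_CALL_START
              ? checkpointMillis
              : claimTime + checkpointMillis;
      if (claimTime < checkpointTime) {
        return attempt; // something was claimed, so the residual state is useful
      }
      // The checkpoint fired first: the residual state equals the initial
      // state, so the next attempt starts from exactly the same place.
    }
    return -1;
  }

  public static void main(String[] args) {
    // 15 s poll, 10 s checkpoint interval: never progresses with the old timer.
    System.out.println(attemptsUntilProgress(Strategy.FROM_CALL_START, 15_000, 10_000, 100));
    // Same poll, timer started at the first claim: progresses immediately.
    System.out.println(attemptsUntilProgress(Strategy.FROM_FIRST_CLAIM, 15_000, 10_000, 100));
  }
}
```

The model is deterministic, so it does not capture the "get lucky" case where the checkpointing thread happens to be delayed; it only shows why identical retries cannot make progress when the poll is slower than the interval.
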
> A workaround for users of this (primarily via FileIO.match().continuously()) is to shard their filepattern into a set of finer-granularity filepatterns matching fewer files, so that each match call takes less than 10 seconds.
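
The sharding workaround could look roughly like the sketch below. The helper is hypothetical and makes two assumptions that are not in the issue: the filepattern ends in a trailing "*", and every file name starts with one of a known set of characters (hex digits in the example). The bucket path is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for the sharding workaround; not part of Beam. */
final class FilepatternSharding {

  /**
   * Splits a pattern such as "dir/*" into one narrower pattern per leading
   * character, e.g. "dir/0*", "dir/1*", and so on.
   * Assumes the pattern ends with "*" and that every matching file name
   * starts with one of the characters in leadingChars.
   */
  static List<String> shardByLeadingChar(String filepattern, String leadingChars) {
    if (!filepattern.endsWith("*")) {
      throw new IllegalArgumentException("expected a trailing '*': " + filepattern);
    }
    String prefix = filepattern.substring(0, filepattern.length() - 1);
    List<String> shards = new ArrayList<>();
    for (char c : leadingChars.toCharArray()) {
      shards.add(prefix + c + "*");
    }
    return shards;
  }

  public static void main(String[] args) {
    // Made-up path; prints 16 narrower patterns, one per hex digit.
    for (String shard : shardByLeadingChar("gs://bucket/logs/*", "0123456789abcdef")) {
      System.out.println(shard);
    }
  }
}
```

In a pipeline, the resulting patterns would presumably be supplied as a PCollection of strings to FileIO.matchAll().continuously(...) rather than passing a single pattern to FileIO.match(). Whether each shard then polls in under 10 seconds depends on how evenly the files spread across the chosen leading characters.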


