Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/10/06 22:17:00 UTC

[jira] [Commented] (BEAM-3030) watchForNewFiles() can emit a file multiple times if it's growing

    [ https://issues.apache.org/jira/browse/BEAM-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195328#comment-16195328 ] 

ASF GitHub Bot commented on BEAM-3030:
--------------------------------------

GitHub user jkff opened a pull request:

    https://github.com/apache/beam/pull/3957

    Fixes TextIO and AvroIO tests of watchForNewFiles

    * AvroIO: The test needs to specify a trigger so that files are actually generated continuously; otherwise the test of watchForNewFiles is vacuous.
    
    * TextIO: The test files were generated by hand-written code, so writing a file could sometimes race with TextIO reading it. TextIO might then see the same file with two different sizes, count it as two different files (two Metadata objects for the same filename with different sizes are not equal), and read the file twice.
    
    The underlying problem should be addressed separately, e.g. by allowing the Watch transform to take a key extractor, but that is outside the scope of this PR and is tracked in https://issues.apache.org/jira/browse/BEAM-3030.
    
    R: @reuvenlax 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkff/incubator-beam read-watch-test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3957.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3957
    
----
commit 59b450d82917707a0802c60cf910c998215cbca4
Author: Eugene Kirpichov <ki...@google.com>
Date:   2017-10-06T20:29:10Z

    Fixes TextIO and AvroIO tests of watchForNewFiles
    
    * AvroIO: The test needs to specify a trigger so that files are
    actually generated continuously; otherwise the test of
    watchForNewFiles is vacuous.
    
    * TextIO: The test files were generated by hand-written code, so
    writing a file could sometimes race with TextIO reading it. TextIO
    might then see the same file with two different sizes, count it as
    two different files (two Metadata objects for the same filename with
    different sizes are not equal), and read the file twice.
    
    The underlying problem should be addressed separately, e.g. by
    allowing the Watch transform to take a key extractor, but that is
    outside the scope of this PR.

----


> watchForNewFiles() can emit a file multiple times if it's growing
> -----------------------------------------------------------------
>
>                 Key: BEAM-3030
>                 URL: https://issues.apache.org/jira/browse/BEAM-3030
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>             Fix For: 2.3.0
>
>
> TextIO and AvroIO watchForNewFiles(), as well as FileIO.match().continuously(), use the Watch transform under the hood to watch the set of Metadata objects matching a filepattern.
> Two Metadata objects with the same filename but different sizes are not considered equal, so if these transforms observe the same file multiple times with different sizes, they will read the file multiple times.
> This is likely not yet a problem for production users: these features require SDF, which is supported only in the Dataflow runner, and users of the Dataflow runner are likely to use only files on GCS, which doesn't support appends. However, it still needs to be fixed.
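
The duplicate-emission behavior described in the issue can be modeled without Beam. What follows is a minimal sketch, not Beam's implementation: `FileMeta`, `emitNewByEquality`, and `emitNewByKey` are hypothetical names introduced only for illustration, and `FileMeta` is an assumed stand-in for a metadata value whose equality covers both filename and size. The sketch contrasts deduplicating by whole-value equality with deduplicating by an extracted key (here, the filename), which is the kind of key extractor the PR description suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

public class GrowingFileDedup {
    /** Hypothetical stand-in for file metadata: equality covers name AND size. */
    static final class FileMeta {
        final String name;
        final long sizeBytes;

        FileMeta(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FileMeta)) {
                return false;
            }
            FileMeta other = (FileMeta) o;
            return name.equals(other.name) && sizeBytes == other.sizeBytes;
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, sizeBytes);
        }

        @Override
        public String toString() {
            return name + "(" + sizeBytes + ")";
        }
    }

    /** Emits every observation whose whole value (name and size) has not been seen before. */
    static List<FileMeta> emitNewByEquality(List<FileMeta> observations) {
        Set<FileMeta> seen = new HashSet<>();
        List<FileMeta> emitted = new ArrayList<>();
        for (FileMeta m : observations) {
            if (seen.add(m)) {
                emitted.add(m);
            }
        }
        return emitted;
    }

    /** Emits an observation only the first time its extracted key is seen. */
    static <K> List<FileMeta> emitNewByKey(
            List<FileMeta> observations, Function<FileMeta, K> keyFn) {
        Set<K> seenKeys = new HashSet<>();
        List<FileMeta> emitted = new ArrayList<>();
        for (FileMeta m : observations) {
            if (seenKeys.add(keyFn.apply(m))) {
                emitted.add(m);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // The same file observed in two polls while it is still being written.
        List<FileMeta> polls =
                Arrays.asList(new FileMeta("a.txt", 10), new FileMeta("a.txt", 25));

        // Whole-value equality treats the two observations as different files.
        System.out.println(emitNewByEquality(polls)); // [a.txt(10), a.txt(25)]

        // Keying by filename emits the file once, however often its size changes.
        System.out.println(emitNewByKey(polls, m -> m.name)); // [a.txt(10)]
    }
}
```

Note that keying by filename alone emits the file at the size of its first observation; whether that is the desired behavior for a growing file is a design question this sketch does not answer.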


