Posted to commits@beam.apache.org by "Chamikara Jayalath (JIRA)" <ji...@apache.org> on 2016/08/15 19:21:22 UTC

[jira] [Resolved] (BEAM-522) Update FileSink.finalize_write() to be idempotent

     [ https://issues.apache.org/jira/browse/BEAM-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chamikara Jayalath resolved BEAM-522.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: Not applicable

> Update FileSink.finalize_write() to be idempotent
> -------------------------------------------------
>
>                 Key: BEAM-522
>                 URL: https://issues.apache.org/jira/browse/BEAM-522
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Chamikara Jayalath
>            Assignee: Chamikara Jayalath
>             Fix For: Not applicable
>
>
> Currently, FileSink.finalize_write() in fileio.py [1] performs the following operations:
> (1) Obtains a list of temporary files as a side input
> (2) Renames each temporary file to the location where the final output should be stored.
> The iobase.Sink.finalize_write() operation should be idempotent, since runner implementations may call it multiple times due to task failures.
> The current implementation is not idempotent: if the operation is re-run after only a subset of the files has been renamed, it may fail because some files can no longer be found at the source location (for example, [2] for GCS files).
> We can fix this by checking whether the destination file already exists before performing the rename, and skipping the rename for files that are already present at the destination.
> [1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503
> [2] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187 
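
The fix proposed in the description can be sketched roughly as follows. This is only an illustration on the local filesystem using the standard os module, not the actual change made in fileio.py; the names idempotent_rename and finalize_renames are hypothetical, and a real FileSink would go through Beam's own file and GCS helpers instead.

```python
import os


def idempotent_rename(src, dst):
    """Rename src to dst unless dst already exists.

    Hypothetical helper: returns True if a rename was performed and
    False if it was skipped because an earlier attempt already
    produced dst.
    """
    if os.path.exists(dst):
        # A previous (partially completed) run already moved this file.
        return False
    os.rename(src, dst)
    return True


def finalize_renames(pairs):
    """Apply idempotent_rename to each (temp_path, final_path) pair.

    Returns the final paths that were renamed during this call, so a
    repeated call after a full or partial success returns fewer (or no)
    paths instead of failing on a missing source file.
    """
    renamed = []
    for src, dst in pairs:
        if idempotent_rename(src, dst):
            renamed.append(dst)
    return renamed
```

Note that this sketch treats any existing destination as the result of an earlier attempt. It does not check that the destination's contents match the temporary file, which a production implementation might need to consider.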



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)