Posted to commits@beam.apache.org by "Chamikara Jayalath (JIRA)" <ji...@apache.org> on 2016/08/03 21:04:20 UTC

[jira] [Created] (BEAM-522) Update FileSink.finalize_write() to be idempotent

Chamikara Jayalath created BEAM-522:
---------------------------------------

             Summary: Update FileSink.finalize_write() to be idempotent
                 Key: BEAM-522
                 URL: https://issues.apache.org/jira/browse/BEAM-522
             Project: Beam
          Issue Type: Bug
          Components: sdk-py
            Reporter: Chamikara Jayalath
            Assignee: Chamikara Jayalath


Currently, FileSink.finalize_write() in fileio.py [1] performs the following operations:

(1) Obtains a list of temporary files as a side input.
(2) Renames each temporary file to the location where the final output should be stored.

The iobase.Sink.finalize_write() operation should be idempotent, since runner implementations may call it multiple times due to task failures.

The current implementation is not idempotent: if the operation is re-run after a subset of the files has been renamed, it may fail because some files can no longer be found at the source location (for example, see [2] for GCS files).

We can fix this by checking whether the destination file already exists before performing each rename, and skipping the rename for files that are already at the destination.
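
A minimal sketch of the proposed check, using the local filesystem in place of GCS. The helper name rename_idempotently and its (source, destination) pair argument are hypothetical and chosen only for illustration; this is not the FileSink code.

```python
import os


def rename_idempotently(renames):
    """Rename each (source, destination) pair, skipping completed ones.

    Hypothetical helper for illustration only; it is not the Beam
    implementation and works on local paths rather than GCS objects.
    Returns the destinations that were renamed during this call.
    """
    renamed = []
    for source, destination in renames:
        if os.path.exists(destination):
            # An earlier attempt already moved this file, so the source
            # may no longer exist. Skip it instead of failing.
            continue
        os.rename(source, destination)
        renamed.append(destination)
    return renamed
```

Calling the helper a second time with the same pairs renames nothing and raises no error, which is the behavior a retried finalize_write() needs. The existence check and the rename are two separate steps, so the sketch assumes that retries do not run concurrently.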

[1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503

[2] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)