Posted to commits@beam.apache.org by "Chamikara Jayalath (JIRA)" <ji...@apache.org> on 2016/08/03 21:04:20 UTC
[jira] [Created] (BEAM-522) Update FileSink.finalize_write() to be idempotent
Chamikara Jayalath created BEAM-522:
---------------------------------------
Summary: Update FileSink.finalize_write() to be idempotent
Key: BEAM-522
URL: https://issues.apache.org/jira/browse/BEAM-522
Project: Beam
Issue Type: Bug
Components: sdk-py
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath
Currently, FileSink.finalize_write() in fileio.py [1] performs the following operations:
(1) Obtains a list of temporary files as a side input.
(2) Renames each temporary file to the location where final output should be stored.
The iobase.Sink.finalize_write() operation should be idempotent, since runner implementations may call it multiple times because of task failures.
The current implementation is not idempotent: if the operation is re-run after a subset of the files has already been renamed, it may fail because those files can no longer be found at the source location (for example, [2] for GCS files).
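The failure mode can be reproduced with a rough sketch on the local filesystem. This is an illustration only and not the code in fileio.py: naive_finalize and demo_retry_failure are hypothetical names, and os.rename stands in for the GCS rename referenced in [2].

```python
import os
import tempfile


def naive_finalize(temp_paths, final_paths):
    """Hypothetical stand-in: rename every temp file, with no existence checks."""
    for src, dst in zip(temp_paths, final_paths):
        # Raises FileNotFoundError if src was already moved by an earlier attempt.
        os.rename(src, dst)


def demo_retry_failure():
    """Simulate a finalize attempt that dies after renaming one of two files."""
    workdir = tempfile.mkdtemp()
    temp_paths = [os.path.join(workdir, 'temp-%d' % i) for i in range(2)]
    final_paths = [os.path.join(workdir, 'out-%d' % i) for i in range(2)]
    for path in temp_paths:
        with open(path, 'w') as f:
            f.write('data')

    # First attempt: only the first file is renamed before the "task failure".
    os.rename(temp_paths[0], final_paths[0])

    # Retry: the whole operation runs again and trips over the missing source.
    try:
        naive_finalize(temp_paths, final_paths)
    except FileNotFoundError:
        return True
    return False
```

Calling demo_retry_failure() returns True, because the retry tries to rename a temporary file that the first attempt has already moved.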
We can fix this by checking whether the destination file already exists before performing each rename, and skipping the rename for files that are already present at the destination.
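A minimal sketch of that check follows, again using the local filesystem rather than the actual FileSink or GCS code. The names idempotent_finalize and demo_idempotent_retry are hypothetical, and the sketch assumes that a file existing at the destination means an earlier attempt fully completed that rename.

```python
import os
import tempfile


def idempotent_finalize(temp_paths, final_paths):
    """Hypothetical sketch: rename temp files, skipping ones already finalized."""
    for src, dst in zip(temp_paths, final_paths):
        if os.path.exists(dst):
            # Assumption: an existing destination was produced by an earlier,
            # completed rename, so there is nothing left to do for this file.
            continue
        os.rename(src, dst)


def demo_idempotent_retry():
    """Run finalize after a partial first attempt, then run it once more."""
    workdir = tempfile.mkdtemp()
    temp_paths = [os.path.join(workdir, 'temp-%d' % i) for i in range(3)]
    final_paths = [os.path.join(workdir, 'out-%d' % i) for i in range(3)]
    for path in temp_paths:
        with open(path, 'w') as f:
            f.write('data')

    # Partial first attempt: one file is renamed before the "task failure".
    os.rename(temp_paths[0], final_paths[0])

    # Both retries succeed; the second one finds nothing left to rename.
    idempotent_finalize(temp_paths, final_paths)
    idempotent_finalize(temp_paths, final_paths)
    return final_paths, temp_paths
```

After demo_idempotent_retry() runs, every final path exists and no temporary file remains, regardless of how many times the finalize step is repeated.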
[1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L503
[2] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/gcsio.py#L187
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)