You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Kyle Weaver (JIRA)" <ji...@apache.org> on 2019/05/30 03:26:00 UTC

[jira] [Resolved] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

     [ https://issues.apache.org/jira/browse/BEAM-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Weaver resolved BEAM-7131.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.13.0

> Spark portable runner appears to be repeating work (in TFX example)
> -------------------------------------------------------------------
>
>                 Key: BEAM-7131
>                 URL: https://issues.apache.org/jira/browse/BEAM-7131
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Weaver
>            Assignee: Kyle Weaver
>            Priority: Major
>             Fix For: 2.13.0
>
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark portable runner. TFDV works fine, but the preprocess step (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which is okay, but the error occurs when they are copied to the same permanent output directory. Specifically, the copy tree operation in transform_fn_io.py [3] is run twice with the same output directory. The error doesn't occur when that code is modified to allow overwriting existing files, but that's only a shallow fix. While the TF transform should probably be made idempotent, this is also an issue with the Spark runner, which shouldn't be repeating work like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)