Posted to issues@beam.apache.org by "Kyle Weaver (JIRA)" <ji...@apache.org> on 2019/04/22 20:14:00 UTC

[jira] [Created] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

Kyle Weaver created BEAM-7131:
---------------------------------

             Summary: Spark portable runner appears to be repeating work (in TFX example)
                 Key: BEAM-7131
                 URL: https://issues.apache.org/jira/browse/BEAM-7131
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
            Reporter: Kyle Weaver
            Assignee: Kyle Weaver


I've been trying to run the TFX Chicago taxi example [1] on the Spark portable runner. TFDV works fine, but the preprocess step (preprocess_flink.sh [2]) fails with the following error:

RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']

The copy tree operation in transform_fn_io.py [3] appears to be running twice. The error goes away when that code is modified to allow overwriting existing files, but that's only a shallow fix. The deeper problem seems to be that the Spark runner is repeating work for some reason.
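To illustrate the failure mode, here is a minimal sketch using only the Python standard library. It is not the code from transform_fn_io.py; the helper names (copy_tree_strict, copy_tree_overwrite, run_twice) and the file name are hypothetical, and it assumes Python 3.8+ for dirs_exist_ok. It only shows why a copy that refuses to overwrite fails when the same step is executed a second time, and why allowing overwrites hides the repeat rather than preventing it.

```python
import os
import shutil
import tempfile


def copy_tree_strict(src, dst):
    """Copy src to dst; raises FileExistsError if dst already exists."""
    shutil.copytree(src, dst)


def copy_tree_overwrite(src, dst):
    """Copy src to dst, tolerating an existing dst (Python 3.8+)."""
    shutil.copytree(src, dst, dirs_exist_ok=True)


def run_twice(copy_fn):
    """Run copy_fn twice on the same paths, as a repeated step would.

    Returns the name of the exception raised by the second run, or None.
    """
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        dst = os.path.join(root, "dst")
        os.makedirs(src)
        # Hypothetical placeholder file standing in for the real output.
        with open(os.path.join(src, "saved_model.pb"), "w") as f:
            f.write("placeholder")
        copy_fn(src, dst)  # first execution succeeds
        try:
            copy_fn(src, dst)  # simulated repeat of the same work
        except FileExistsError as e:
            return type(e).__name__
        return None


if __name__ == "__main__":
    print("strict copy, second run:", run_twice(copy_tree_strict))
    print("overwriting copy, second run:", run_twice(copy_tree_overwrite))
```

The overwriting variant makes the step idempotent, which is why it masks the symptom, but the second execution still happens and still costs time.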

[1] https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi

[2] https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh

[3] https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)