Posted to commits@beam.apache.org by "Pei He (JIRA)" <ji...@apache.org> on 2017/01/10 00:57:58 UTC

[jira] [Created] (BEAM-1252) BigQueryIO.Read: validate exported files with GCS glob.

Pei He created BEAM-1252:
----------------------------

             Summary: BigQueryIO.Read: validate exported files with GCS glob.
                 Key: BEAM-1252
                 URL: https://issues.apache.org/jira/browse/BEAM-1252
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-gcp
            Reporter: Pei He
            Assignee: Pei He


BigQuery has started creating user-visible temp files during export. We notice them when listing files and start reading from them, but BigQuery then moves them. This can cause job failures and data duplication.

On the Beam side, we can add stronger validation:
1. When listing files, validate that each one matches the expected URI.
2. Once the BQ export job has finished, run an integrity check to verify that # files read from == # files BQ reports having written.
3. If possible, add a prefix to the filename in the glob (e.g. change *.avro to step*.avro). Which prefix to use (step name? other?) is still open. This might be as easy as dropping a '/' in the middle of the path. A la #7.
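
A minimal sketch of checks 1 and 2 above, not the actual BigQueryIO change. It assumes that exported shards are named <prefix> followed by a 12-digit zero-padded index and ".avro", and that the export job reports how many files it wrote. The class name, method names, and the example paths are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam: validates files listed from GCS
// against the URI pattern the export was asked to write.
final class ExportFileValidator {

  // Assumption: shards look like <prefix>000000000000.avro. Anything else
  // under the same directory (e.g. a temp file) is rejected.
  static Pattern expectedPattern(String prefix) {
    return Pattern.compile(Pattern.quote(prefix) + "\\d{12}\\.avro");
  }

  // Check 1: keep only the listed URIs that match the expected pattern.
  static List<String> filterExpected(List<String> listedUris, String prefix) {
    Pattern pattern = expectedPattern(prefix);
    List<String> matched = new ArrayList<>();
    for (String uri : listedUris) {
      if (pattern.matcher(uri).matches()) {
        matched.add(uri);
      }
    }
    return matched;
  }

  // Check 2: fail if the number of files we are about to read differs from
  // the number the export job reported. The reported count is passed in by
  // the caller; where it comes from is outside this sketch.
  static void checkFileCount(List<String> matchedUris, long reportedCount) {
    if (matchedUris.size() != reportedCount) {
      throw new IllegalStateException(
          "Expected " + reportedCount + " exported files but found "
              + matchedUris.size());
    }
  }
}
```

A caller would list the export directory, pass the result through filterExpected, and call checkFileCount before starting to read, so that a stray temp file is skipped and a missing or extra shard fails the job early instead of producing duplicated data.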



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)