Posted to commits@beam.apache.org by "Dmitry Bigunyak (JIRA)" <ji...@apache.org> on 2017/10/17 08:53:02 UTC
[jira] [Created] (BEAM-3067) BigQueryIO.Write fails on empty PCollection with DirectRunner (batch job)
Dmitry Bigunyak created BEAM-3067:
-------------------------------------
Summary: BigQueryIO.Write fails on empty PCollection with DirectRunner (batch job)
Key: BEAM-3067
URL: https://issues.apache.org/jira/browse/BEAM-3067
Project: Beam
Issue Type: Bug
Components: runner-direct, sdk-java-gcp
Affects Versions: 2.1.0
Environment: Arch Linux, Java 1.8.0_144
Reporter: Dmitry Bigunyak
Assignee: Thomas Groh
I'm using the side output feature to filter out malformed events (errors) from a stream of valid events. I then save the valid events to one BigQuery table, and the errors go to a separate, dedicated table.
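To make the routing concrete, here is a minimal plain-Java model of that split, written without any Beam dependency so it can be run on its own. The class name {{SideOutputSketch}}, the {{Routed}} holder, and the validity rule in {{isValid}} are all hypothetical and are not taken from the actual pipeline; in Beam the same split would be done by a ParDo with a main output and an additional (side) output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SideOutputSketch {

    /** Holds the two outputs of the split: valid events and error records. */
    public static final class Routed {
        public final List<String> valid = new ArrayList<>();
        public final List<String> errors = new ArrayList<>();
    }

    /**
     * Hypothetical validity rule, for illustration only: an event is
     * well-formed if it is non-blank and contains a '=' separator.
     */
    public static boolean isValid(String event) {
        return event != null && !event.trim().isEmpty() && event.contains("=");
    }

    /** Sends each event to exactly one of the two outputs. */
    public static Routed route(List<String> events) {
        Routed out = new Routed();
        for (String event : events) {
            if (isValid(event)) {
                out.valid.add(event);
            } else {
                out.errors.add(event);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // All events are valid here, so the error output stays empty.
        Routed r = route(Arrays.asList("user=1", "user=2"));
        System.out.println("valid=" + r.valid.size() + ", errors=" + r.errors.size());
        // prints "valid=2, errors=0"
    }
}
```

The case in {{main}}, where every input is valid and the error output is empty, corresponds to the situation that triggers the failure described in this report.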
Here is the code for outputting error rows:
{code:java}
invalidEventRows.apply("WriteErrors", BigQueryIO.writeTableRows()
    .to(errorTableRef)
    .withSchema(ProcessEvents.getErrorSchema())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
{code}
The problem is that when running on DirectRunner in batch mode (reading input from a file), if the {{invalidEventRows}} PCollection ends up empty (all events are valid, so there are no errors), I get the following error:
{code}
[ERROR] "status" : {
[ERROR] "errorResult" : {
[ERROR] "message" : "No schema specified on job or table.",
[ERROR] "reason" : "invalid"
[ERROR] },
[ERROR] "errors" : [ {
[ERROR] "message" : "No schema specified on job or table.",
[ERROR] "reason" : "invalid"
[ERROR] } ],
[ERROR] "state" : "DONE"
[ERROR] },
{code}
There are no errors when the same code runs and the {{invalidEventRows}} PCollection is not empty: the BigQuery table is created and the data is inserted correctly.
Also, everything seems to work fine in streaming mode (reading from Pub/Sub) on both DirectRunner and DataflowRunner.
Looks like a bug?
Or should I open an issue in the GoogleCloudPlatform/DataflowJavaSDK GitHub project?