Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2022/03/20 17:26:00 UTC

[jira] [Commented] (BEAM-10826) Expose BigQuery schema autodetect in Java SDK

    [ https://issues.apache.org/jira/browse/BEAM-10826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509486#comment-17509486 ] 

Beam JIRA Bot commented on BEAM-10826:
--------------------------------------

This issue is P2 but has been unassigned without any comment for 60 days so it has been labeled "stale-P2". If this issue is still affecting you, we care! Please comment and remove the label. Otherwise, in 14 days the issue will be moved to P3.

Please see https://beam.apache.org/contribute/jira-priorities/ for a detailed explanation of what these priorities mean.


> Expose BigQuery schema autodetect in Java SDK
> ---------------------------------------------
>
>                 Key: BEAM-10826
>                 URL: https://issues.apache.org/jira/browse/BEAM-10826
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Marcus Truscello
>            Priority: P2
>              Labels: Clarified, bigquery, schema, stale-P2
>
> The Beam Java SDK's BigQueryIO transform currently doesn't expose the [schema autodetect job configuration|https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#setAutodetect-java.lang.Boolean-].  The feature is exposed by the current [Python SDK|https://github.com/apache/beam/blob/a99c6826a067f49ebb60e625c8652900c7d0e810/sdks/python/apache_beam/io/gcp/bigquery.py#L1593], but not the Java SDK.
> Although Java is stricter about types and schemas, the BigQueryIO transform supports writing TableRows, which don't inherently have a schema. This provides a convenient path for loading JSON data into BigQuery, but it is severely hampered by the fact that a schema is required in order to use the SchemaUpdateOption values ALLOW_FIELD_ADDITION and ALLOW_FIELD_RELAXATION.
> The BigQuery schema autodetection feature must be enabled at the JobConfigurationLoad level. BigQueryIO creates the JobConfigurationLoad in only one place: [WriteTables.java|https://github.com/apache/beam/blob/752bdfd09bc4175dd9f51a096f81c9e5b0805913/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L370]. Exposing the autodetection option would mean adding it there, then propagating the change upwards until it's exposed at the BigQueryIO.Write level.
> A bit of context on this issue:
>  * Google Cloud's blog [has an article|https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix] on handling mutating JSON schemas in Dataflow using a black-box "Validate and Mutate BQ Schema" step.
>  * Suggested workarounds include creating a stateful DoFn to dynamically generate a schema, loading it as a side input to create a PCollectionView, then passing it to BigQueryIO using withSchemaFromView: https://stackoverflow.com/a/58809875/477563
>  * [Entire projects|https://github.com/the-dagger/dataflow-dynamic-schema] have been created to try and work around this issue.
> All of the above would be rendered moot (and many headaches spared!) if only the schema autodetection were exposed in the Java SDK _like it already is_ in the Python SDK.
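
For readers unfamiliar with the two SchemaUpdateOption values named in the description, the following is a minimal, standard-library-only Java sketch of the kind of schema change each one permits, which is roughly what the hand-rolled workarounds listed above have to compute themselves. It is a simplified model, not Beam or BigQuery code: the class SchemaEvolutionSketch and the method evolve are hypothetical names, a schema is reduced to a map from field name to mode, and field types and nested records are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Beam or the BigQuery client library.
public class SchemaEvolutionSketch {

    /** Returns a new schema (field name -> mode) that can accept the given row. */
    public static Map<String, String> evolve(Map<String, String> schema, Map<String, Object> row) {
        // Copy so the caller's schema is left untouched.
        Map<String, String> updated = new LinkedHashMap<>(schema);

        // Models ALLOW_FIELD_ADDITION: fields present in the row but not in the
        // schema are appended as NULLABLE.
        for (String field : row.keySet()) {
            updated.putIfAbsent(field, "NULLABLE");
        }

        // Models ALLOW_FIELD_RELAXATION: a REQUIRED field that is absent or null
        // in the row is relaxed to NULLABLE.
        for (Map.Entry<String, String> entry : updated.entrySet()) {
            if ("REQUIRED".equals(entry.getValue()) && row.get(entry.getKey()) == null) {
                entry.setValue("NULLABLE");
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "REQUIRED");
        schema.put("name", "REQUIRED");

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 42);
        row.put("email", "a@example.com");

        // prints {id=REQUIRED, name=NULLABLE, email=NULLABLE}
        System.out.println(evolve(schema, row));
    }
}
```

With autodetect exposed on BigQueryIO.Write, as the issue requests, this bookkeeping would presumably be left to the BigQuery load job rather than maintained in pipeline code.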



--
This message was sent by Atlassian Jira
(v8.20.1#820001)