Posted to dev@beam.apache.org by Charith Ellawala <ch...@gmail.com> on 2019/05/19 15:12:46 UTC

BigQueryIO TableRow ARRAY fields

Hi,

I am working on adding schema support to BigQuery reads (BEAM-6673) and I
am a bit confused by two contradictory code paths that deal with ARRAY type
fields in TableRow objects.

The TableRowParser implementation in BigQueryIO ultimately calls
BigQueryAvroUtils#convertRepeatedField (
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java#L214)
and that code simply treats ARRAY fields as lists containing objects of
the underlying element type. This is consistent with the documentation I
have found [1].
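To make that concrete, here is a rough sketch of the shape the Avro path
produces (using plain Maps and Lists as stand-ins for the actual TableRow
class, which is itself a Map subclass; the field name "scores" is just an
illustrative example, not from the Beam code):

```java
import java.util.List;
import java.util.Map;

public class PlainArrayShape {
    public static void main(String[] args) {
        // Sketch only: a REPEATED INTEGER column comes back as a plain
        // list of element values under the field name.
        Map<String, Object> row = Map.of("scores", List.of(1L, 2L, 3L));

        // Consumers can read the elements directly, with no wrapping.
        List<?> scores = (List<?>) row.get("scores");
        System.out.println(scores); // [1, 2, 3]
    }
}
```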

However, when I look at the code to convert a TableRow to a Beam row (
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java#L315),
it expects ARRAY fields to contain a List of Maps, where each Map has a
single entry keyed by "v" whose value is of the underlying element type of
the array.
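For contrast, this is my reading of the shape that code path expects, again
sketched with plain Maps and Lists rather than the real TableRow type (the
"scores" field name is illustrative; the "v" key is the one the code looks
up, and it resembles the cell encoding used in BigQuery REST API responses):

```java
import java.util.List;
import java.util.Map;

public class WrappedArrayShape {
    public static void main(String[] args) {
        // Sketch only: each array element is wrapped in a single-entry
        // map keyed by "v".
        Map<String, Object> row = Map.of(
            "scores",
            List.of(Map.of("v", 1L), Map.of("v", 2L), Map.of("v", 3L)));

        // Reading an element requires unwrapping the "v" entry.
        List<?> wrapped = (List<?>) row.get("scores");
        Object first = ((Map<?, ?>) wrapped.get(0)).get("v");
        System.out.println(first); // 1
    }
}
```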

I think that this nested Map representation for arrays of scalar types is
not correct, and I would really appreciate it if someone familiar with
BigQuery internals could chime in to confirm whether I am right or wrong.

(All the unit tests pass even after I comment out the Map value extraction
on line 323, though that is not conclusive.)

Thank you.

[1] I could not find any official documentation about the JSON format of
BigQuery rows in the API docs but this seems to be the best description of
it:
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#to_json_string.
This description matches the JSON output produced by the BigQuery query
editor.

Re: BigQueryIO TableRow ARRAY fields

Posted by Reuven Lax <re...@google.com>.
The code to convert a TableRow to a Beam row is very new, and hasn't been
tested extensively. It would not surprise me if there are bugs in it.
