Posted to github@beam.apache.org by "joachim-isaksson-centiro (via GitHub)" <gi...@apache.org> on 2023/02/17 10:52:13 UTC

[GitHub] [beam] joachim-isaksson-centiro opened a new issue, #25526: [Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert

joachim-isaksson-centiro opened a new issue, #25526:
URL: https://github.com/apache/beam/issues/25526

   ### What happened?
   
   I will add a failing test below, but basically we have a structure in our system that looks something like this:
   
   class1 { identifier: record1 }
   class2 { identifier: record2, class1: class1 }
   
   That is, we have two separate members with the name "identifier" in two different parts of the type we're trying to write to BigQuery.
   
   When BigQueryIO calls BigQueryAvroUtils.toGenericAvroSchema() on the type, it generates a schema for the structure, but unfortunately calling toString() on the resulting Avro schema crashes with:
   
   Method threw 'org.apache.avro.SchemaParseException' exception.
   
   There seem to be two causes:
   
   * BigQueryAvroUtils.toGenericAvroSchema uses a _static_ namespace of "org.apache.beam.sdk.io.gcp.bigquery" for all types, no matter where in the type structure they are located. If it instead added the enclosing type to the namespace (in this case, for example, org.apache.beam.sdk.io.gcp.bigquery.class1.identifier), there should be no problem.
   
   * It seems to handle the member _name_ (identifier) as a type name in the schema, so it thinks the two members with the same _name_ are trying to redefine a _type_.
   
   I'm not quite clear on the terminology here, so I may be using it wrong, but basically it tries to register org.apache.beam.sdk.io.gcp.bigquery.identifier twice in org.apache.avro.Schema$Names.put, and that crashes the write to BQ.
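   
   A minimal sketch of the suspected collision, in plain Kotlin with no Avro or Beam dependency. It is only a model of the behaviour described above, not Avro's actual implementation: NameRegistry, Record, registerAll and collides are hypothetical names, and the path-qualified namespace is the workaround suggested in this report, not a confirmed fix.
   
   ```kotlin
   // Simplified stand-in for a registry of named record types, keyed by full name.
   // Assumption: an identical definition may be reused, a different one is rejected.
   class NameRegistry {
       private val known = mutableMapOf<String, List<String>>()

       fun register(fullName: String, fieldNames: List<String>) {
           val existing = known[fullName]
           if (existing != null && existing != fieldNames) {
               throw IllegalStateException("Can't redefine: $fullName")
           }
           known[fullName] = fieldNames
       }
   }

   // Hypothetical description of a nested record: member name, leaf fields, nested records.
   data class Record(
       val name: String,
       val leaves: List<String> = emptyList(),
       val children: List<Record> = emptyList()
   )

   const val STATIC_NAMESPACE = "org.apache.beam.sdk.io.gcp.bigquery"

   // Registers every record in the tree, either under one static namespace
   // or under a namespace extended with the names of the enclosing records.
   fun registerAll(record: Record, registry: NameRegistry, namespace: String, qualifyByPath: Boolean) {
       registry.register("$namespace.${record.name}", record.leaves + record.children.map { it.name })
       val childNamespace = if (qualifyByPath) "$namespace.${record.name}" else namespace
       record.children.forEach { registerAll(it, registry, childNamespace, qualifyByPath) }
   }

   // Returns true when registering the tree hits a redefinition.
   fun collides(root: Record, qualifyByPath: Boolean): Boolean =
       try {
           registerAll(root, NameRegistry(), STATIC_NAMESPACE, qualifyByPath)
           false
       } catch (e: IllegalStateException) {
           true
       }

   fun main() {
       // Two different "identifier" records at different depths of the same type.
       val root = Record(
           "root",
           children = listOf(
               Record("record", children = listOf(Record("identifier", leaves = listOf("id1")))),
               Record("identifier", leaves = listOf("id2"))
           )
       )
       println("static namespace collides: ${collides(root, qualifyByPath = false)}")
       println("path-qualified namespace collides: ${collides(root, qualifyByPath = true)}")
   }
   ```
   
   In this model the static namespace maps both members to org.apache.beam.sdk.io.gcp.bigquery.identifier with different field lists, so the second registration is rejected. With the enclosing names added to the namespace, the two records get distinct full names and no redefinition occurs.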
   
   The structure works without any issues up to Beam 2.42 but fails on 2.43 and 2.44.
   
   To maybe make it clearer, here's a very basic unit test (in Kotlin, but it should translate to Java fairly easily, I hope) that fails on the toString() call. It builds the TableSchema manually, but in the same structure as it seems to be built by BigQueryIO for our type.
   
   ```
   // Same package as BigQueryAvroUtils so the test can call toGenericAvroSchema directly.
   package org.apache.beam.sdk.io.gcp.bigquery;
   
   import com.google.api.services.bigquery.model.TableFieldSchema
   import org.junit.jupiter.api.Test
   
   class SchemaTest {
   
       @Test
       fun test() {
   
           // Two leaf fields with different names, so the two "identifier" records differ.
           val stringSchema1 = TableFieldSchema().setName("id1").setType("STRING")
           val stringSchema2 = TableFieldSchema().setName("id2").setType("STRING")
   
           // First record named "identifier", nested one level down (root.record.identifier).
           val identifier1Schema = TableFieldSchema().setName("identifier").setType("RECORD")
               .setFields(listOf(stringSchema1))
   
           // Second record named "identifier", directly under the root (root.identifier).
           val identifier2Schema = TableFieldSchema().setName("identifier").setType("RECORD")
               .setFields(listOf(stringSchema2))
   
           val recordSchema = TableFieldSchema().setName("record").setType("RECORD")
               .setFields(listOf(identifier1Schema))
   
           val rootSchema = TableFieldSchema().setName("root").setType("RECORD")
               .setFields(listOf(recordSchema, identifier2Schema))
   
           // Building the Avro schema succeeds.
           val output = BigQueryAvroUtils.toGenericAvroSchema("root", rootSchema.fields)
   
           // Serializing it throws org.apache.avro.SchemaParseException.
           val outputAsString = output.toString()
       }
   }
   ```
   
   The test fails as is, but renaming the member id2 to id1 so that both instances of the member with the name "identifier" are seen as the same type makes the test pass.
   
   If it helps, I'll try to make a more complete example that builds the TableSchema from the type in the same way BigQueryIO does, but I hope this makes the problem clear.
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [X] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert [beam]

Posted by "joachim-isaksson-centiro (via GitHub)" <gi...@apache.org>.
joachim-isaksson-centiro commented on issue #25526:
URL: https://github.com/apache/beam/issues/25526#issuecomment-1758967910

   My code started working again with 2.50; every version between 2.42 (working) and 2.50 (working) was broken.
   Leaving the bug open in case other people are still having problems with the new version.
   




Re: [I] [Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert [beam]

Posted by "stiv1qaz1 (via GitHub)" <gi...@apache.org>.
stiv1qaz1 commented on issue #25526:
URL: https://github.com/apache/beam/issues/25526#issuecomment-1754748673

   Same for me. When BigQueryIO converts a structure to an Avro schema, it generates the schema successfully, but it then crashes with an error when it tries to convert that schema to a string.




[GitHub] [beam] brodseba commented on issue #25526: [Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert

Posted by "brodseba (via GitHub)" <gi...@apache.org>.
brodseba commented on issue #25526:
URL: https://github.com/apache/beam/issues/25526#issuecomment-1545733272

   This also affects 2.46.0. The same issue occurs when reading a BigQuery table, not just when writing/inserting.
   
   I encountered this issue while trying to read a Firebase/Google Analytics 4 table. Here is a public example of such a table: https://developers.google.com/analytics/bigquery/web-ecommerce-demo-dataset
   
   Same issue as: https://github.com/apache/beam/issues/26318
   
   




Re: [I] [Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert [beam]

Posted by "joachim-isaksson-centiro (via GitHub)" <gi...@apache.org>.
joachim-isaksson-centiro closed issue #25526: [Bug]: BigqueryAvroUtils builds invalid Avro schema which prevents insert
URL: https://github.com/apache/beam/issues/25526

