Posted to github@beam.apache.org by "bastewart (via GitHub)" <gi...@apache.org> on 2023/02/23 16:59:38 UTC

[GitHub] [beam] bastewart commented on issue #25228: [Feature Request]: Less strict numeric conversions writing to BigQuery with Storage API

bastewart commented on issue #25228:
URL: https://github.com/apache/beam/issues/25228#issuecomment-1442119847

   Sorry, I had missed your reply!
   
   Summary of this message: I think in our case it's very hard to do the conversion ourselves. I also think it's reasonable to treat `1.0` as an integer; JSON Schema [does, for example](https://json-schema.org/understanding-json-schema/reference/numeric.html#integer). It may also be a regression compared to the Streaming API implementation (at the very least, this combined with #25227 and #25233 means there is a regression).
   
   > Even in the above example, not all INT64 values can be converted to Java double. e.g. the following code:
   
   Sorry, I may be misunderstanding, but that's not the way round we have the issue.
   
   Our issue is Java doubles arriving and needing to be converted to an `INT64` for BigQuery. In that case - by my understanding ([see my fix commit](https://github.com/bastewart/beam/commit/38601213f81896444c60dd9e590f8a795358d09a)) - it's trivial to optimistically convert the value to a long and fail if precision has been lost.
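   
   Something like this minimal sketch (not the exact code in the commit; the helper name here is made up):
   
   ```java
   // Sketch only: optimistically convert a double to a long, failing loudly
   // if the value cannot be represented exactly as an INT64.
   static long toInt64OrThrow(double value) {
     long asLong = (long) value;
     if ((double) asLong != value) {
       throw new IllegalArgumentException(
           "Value " + value + " cannot be converted to INT64 without losing precision");
     }
     return asLong;
   }
   ```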
   
   > It also should be easy for any user to write a ParDo (or MapElements) that does their own numeric conversions before calling BigQueryIO.
   
   Unfortunately, in our case this is very much non-trivial 😅
   
   We write to 1000+ tables which have dynamic schemas*. This means we're relying on Beam to load the schema and convert the values for us. I think we'd have to re-implement, or use directly, most of `StorageApiDynamicDestinations` to grab the schema info, and then do our own traversal of all the data before handing it over to Beam (a rough sketch of what that would involve is below the footnote).
   
   *I am aware that automatic schema reloading in the Beam Storage Writes API is not yet released...
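   
   For illustration, a hypothetical sketch of the kind of preprocessing the ParDo suggestion implies, assuming we somehow already had the destination `TableSchema` in hand (which is exactly the part Beam handles for us via dynamic destinations). The helper name is invented, and it ignores nested and repeated fields entirely:
   
   ```java
   import com.google.api.services.bigquery.model.TableFieldSchema;
   import com.google.api.services.bigquery.model.TableRow;
   import com.google.api.services.bigquery.model.TableSchema;
   
   // Hypothetical sketch only: coerce whole-numbered doubles to longs for
   // integer columns. Requires the destination schema, and does not handle
   // nested (RECORD) or repeated fields, NUMERIC, etc.
   static TableRow coerceIntegerFields(TableRow row, TableSchema schema) {
     for (TableFieldSchema field : schema.getFields()) {
       Object value = row.get(field.getName());
       boolean isIntegerColumn =
           "INTEGER".equals(field.getType()) || "INT64".equals(field.getType());
       if (isIntegerColumn && value instanceof Double) {
         double d = (Double) value;
         long asLong = (long) d;
         if ((double) asLong == d) {
           row.set(field.getName(), asLong);
         }
       }
     }
     return row;
   }
   ```
   
   Doing that per element, across 1000+ dynamically-determined destinations, effectively means loading and tracking every table's schema ourselves.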
   
   We also don't have direct/easy control over the types as they arrive. We're consuming a JSON blob off Kafka and bundling the data into BigQuery. Those JSON blobs _have_ been validated against the BigQuery schema using [JSON Schema](https://json-schema.org/). JSON Schema [validates numbers with a zero fractional part as integers](https://json-schema.org/understanding-json-schema/reference/numeric.html#integer), so this would be hard for us to guard against.
   
   More generally, and personally, I feel like `1.0` should be treated as an integer. It's extremely easy for that kind of mis-conversion to occur, and being overly strict just leads to unexpected consequences. This issue, coupled with #25227 and #25233, means that a lot of rows can be dropped "silently".
   
   ### Streaming API Behaviour
   
   I am reasonably sure the Streaming API allows this type of coercion to occur, but I need to double-check; I'll get back to you later.
   

