Posted to dev@beam.apache.org by Etienne Chauchot <ec...@apache.org> on 2018/06/13 08:54:11 UTC

[BigQuery] TableRowJsonCoder question

Hi all,

While playing with BigQueryIO I noticed something. 

When we create a TableRow (e.g. in a row function in BigQueryIO) using new TableRow().set(), a long value, for example,
gets boxed into a Long. But when the row is encoded with TableRowJsonCoder and then re-read, the value may be decoded as
an Integer if it fits into an Integer. This causes assertion failures in write-then-read tests.
What I did for now is downcast the long to an int at TableRow creation, so that it is boxed into an Integer (the test
value fits into an Integer).
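
Here is a minimal, self-contained sketch of what I observe (the field name and class name are made up for the
illustration, and the exact decoded type may depend on the Jackson version used by the coder):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.util.CoderUtils;

public class TableRowRoundTrip {
  public static void main(String[] args) throws CoderException {
    // Build the row the way the write test does: the long literal is boxed into a Long.
    TableRow row = new TableRow().set("counter", 42L);

    // Encode and decode with the coder BigQueryIO uses for TableRows.
    TableRow roundTripped = CoderUtils.clone(TableRowJsonCoder.of(), row);

    System.out.println(row.get("counter").getClass());          // class java.lang.Long
    System.out.println(roundTripped.get("counter").getClass()); // class java.lang.Integer (observed)
    // So an assertEquals(row, roundTripped) in a write-then-read test fails on the value type.
  }
}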

Is there a way to fix this in TableRowJsonCoder, or is there a better workaround?

Etienne

Re: [BigQuery] TableRowJsonCoder question

Posted by Etienne Chauchot <ec...@apache.org>.
Thanks Reuven. Using SchemaCoder is indeed better, since it avoids losing the type information.
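
For reference, a schema-based round trip would keep the INT64 type. A minimal sketch, assuming the value goes
through the Schema/Row API and RowCoder (the field name is made up):

import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.Row;

public class RowRoundTrip {
  public static void main(String[] args) throws CoderException {
    // The schema records that "counter" is an INT64, so the coder can restore a Long.
    Schema schema = Schema.builder().addInt64Field("counter").build();
    Row row = Row.withSchema(schema).addValue(42L).build();

    Row roundTripped = CoderUtils.clone(RowCoder.of(schema), row);
    Long counter = roundTripped.getInt64("counter"); // a Long, no matter how small the value is
    System.out.println(counter.getClass());          // class java.lang.Long
  }
}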
Etienne
On Thursday, June 14, 2018 at 10:04 -0700, Reuven Lax wrote:
> I think Thomas Groh hit this issue and might know a workaround.
> In general, TableRowJsonCoder has been a huge pain, partly because JSON itself cannot always represent all types
> (numeric types are a constant source of trouble in JSON). In addition, I've found that encoding all data into JSON
> (which is space inefficient) is quite expensive when shuffling that data (and BigQueryIO does do a GroupByKey on
> TableRows). I'm working on a PR that will extract schema information and allow BigQueryIO to use SchemaCoder instead
> of TableRowJsonCoder, but it is not quite ready to be merged yet.
> 
> Reuven
> On Wed, Jun 13, 2018 at 1:54 AM Etienne Chauchot <ec...@apache.org> wrote:
> > Hi all,
> > 
> > While playing with BigQueryIO I noticed something. 
> > 
> > When we create a TableRow (e.g. in a row function in BigQueryIO) using new TableRow().set(), a long value, for
> > example, gets boxed into a Long. But when the row is encoded with TableRowJsonCoder and then re-read, the value may
> > be decoded as an Integer if it fits into an Integer. This causes assertion failures in write-then-read tests.
> > What I did for now is downcast the long to an int at TableRow creation, so that it is boxed into an Integer (the
> > test value fits into an Integer).
> > 
> > Is there a way to fix this in TableRowJsonCoder, or is there a better workaround?
> > 
> > Etienne

Re: [BigQuery] TableRowJsonCoder question

Posted by Reuven Lax <re...@google.com>.
I think Thomas Groh hit this issue and might know a workaround.

In general, TableRowJsonCoder has been a huge pain, partly because JSON
itself cannot always represent all types (numeric types are a constant
source of trouble in JSON). In addition, I've found that encoding all data
into JSON (which is space inefficient) is quite expensive when shuffling
that data (and BigQueryIO does do a GroupByKey on TableRows). I'm working
on a PR that will extract schema information and allow BigQueryIO to use
SchemaCoder instead of TableRowJsonCoder, but it is not quite ready to be
merged yet.
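
To give a rough idea of the size difference, a back-of-the-envelope sketch (field names and values are invented,
and RowCoder stands in here for the schema-based coder since the PR isn't merged yet):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.RowCoder;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.Row;

public class CoderSizeComparison {
  public static void main(String[] args) throws CoderException {
    // The JSON coder spells out every field name and every digit for every single element...
    TableRow tableRow = new TableRow().set("user", "alice").set("clicks", 42L);
    byte[] jsonBytes = CoderUtils.encodeToByteArray(TableRowJsonCoder.of(), tableRow);

    // ...while a schema-aware coder keeps the field names once, in the schema.
    Schema schema = Schema.builder().addStringField("user").addInt64Field("clicks").build();
    Row row = Row.withSchema(schema).addValues("alice", 42L).build();
    byte[] rowBytes = CoderUtils.encodeToByteArray(RowCoder.of(schema), row);

    System.out.println("TableRowJsonCoder: " + jsonBytes.length + " bytes per element");
    System.out.println("RowCoder:          " + rowBytes.length + " bytes per element");
  }
}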

Reuven

On Wed, Jun 13, 2018 at 1:54 AM Etienne Chauchot <ec...@apache.org>
wrote:

> Hi all,
>
> While playing with BigQueryIO I noticed something.
>
> When we create a TableRow (e.g. in a row function in BigQueryIO) using new
> TableRow().set(), a long value, for example, gets boxed into a Long. But
> when the row is encoded with TableRowJsonCoder and then re-read, the value
> may be decoded as an Integer if it fits into an Integer. This causes
> assertion failures in write-then-read tests.
> What I did for now is downcast the long to an int at TableRow creation, so
> that it is boxed into an Integer (the test value fits into an Integer).
>
> Is there a way to fix this in TableRowJsonCoder, or is there a better workaround?
>
> Etienne
>