Posted to dev@beam.apache.org by Chuck Yang <ch...@getcruise.com> on 2019/11/27 03:06:52 UTC

Re: [EXT] Re: using avro instead of json for BigQueryIO.Write

Has anyone looked into implementing this for the Python SDK? It would
be nice to have, if only for the ability to write float values such as
NaN and infinity. I didn't see anything in Jira; happy to
create a ticket, but wanted to ask around first.
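To make the NaN/infinity limitation concrete: Python's standard json module either emits bare NaN/Infinity tokens, which are not valid JSON and are rejected by BigQuery's JSON load jobs, or refuses to serialize such floats at all. A minimal demonstration (illustrative field names, not Beam code):

```python
import json

# By default, Python emits bare NaN/Infinity tokens, which are not part
# of the JSON spec and so fail BigQuery's JSON-based file loads.
row = {"score": float("nan")}
print(json.dumps(row))  # {"score": NaN}  <- not valid JSON

# With strict JSON enabled, serialization fails outright.
try:
    json.dumps(row, allow_nan=False)
except ValueError as err:
    print("cannot serialize:", err)
```

So with JSON-based file loads there is no way to get a NaN or infinite float into BigQuery; an Avro-based path does not have this problem.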

On Thu, Oct 17, 2019 at 12:53 PM Reuven Lax <re...@google.com> wrote:
>
> I'll take a look as well. Thanks for doing this!
>
> On Fri, Oct 4, 2019 at 9:16 PM Pablo Estrada <pa...@google.com> wrote:
>>
>> Thanks Steve!
>> I'll take a look next week. Sorry about the delay so far.
>> Best
>> -P.
>>
>> On Fri, Sep 27, 2019 at 10:37 AM Steve Niemitz <sn...@apache.org> wrote:
>>>
>>> I put up a semi-WIP pull request https://github.com/apache/beam/pull/9665 for this.  The initial results look good.  I'll spend some time soon adding unit tests and documentation, but I'd appreciate it if someone could take a first pass over it.
>>>
>>> On Wed, Sep 18, 2019 at 6:14 PM Pablo Estrada <pa...@google.com> wrote:
>>>>
>>>> Thanks for offering to work on this! It would be awesome to have it. I can say that we don't have that for Python ATM.
>>>>
>>>> On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz <sn...@apache.org> wrote:
>>>>>
>>>>> Our experience has actually been that Avro is more efficient than even Parquet, but that might also be skewed by our datasets.
>>>>>
>>>>> I might try to take a crack at this, I found https://issues.apache.org/jira/browse/BEAM-2879 tracking it (which coincidentally references my thread from a couple years ago on the read side of this :) ).
>>>>>
>>>>> On Mon, Sep 16, 2019 at 1:38 PM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>> It's been talked about, but nobody's done anything. There are some difficulties related to type conversion (JSON and Avro don't support the same types), but if those are overcome then an Avro version would be much more efficient. I believe Parquet files would be even more efficient if you wanted to go that path, but there might be more code to write (as we already have some code in the codebase to convert between TableRows and Avro).
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Mon, Sep 16, 2019 at 10:33 AM Steve Niemitz <sn...@apache.org> wrote:
>>>>>>>
>>>>>>> Has anyone investigated using avro rather than json to load data into BigQuery using BigQueryIO (+ FILE_LOADS)?
>>>>>>>
>>>>>>> I'd be interested in enhancing it to support this, but I'm curious if there's any prior work here.
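The type-conversion difficulty Reuven raises in the quoted thread can be sketched as a schema-mapping problem. The mapping below is a hand-written illustration, not Beam's actual BigQuery-to-Avro converter; the table contents and logical-type choices are assumptions for the sake of the example:

```python
# Illustrative mapping from BigQuery standard SQL types to Avro types.
# TIMESTAMP has no JSON equivalent at all; Avro handles it via a
# logical type layered on a long.
BQ_TO_AVRO = {
    "STRING": "string",
    "BYTES": "bytes",
    "INTEGER": "long",
    "FLOAT": "double",
    "BOOLEAN": "boolean",
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-micros"},
}

def field_to_avro(name, bq_type, mode="NULLABLE"):
    """Turn one BigQuery schema field into an Avro schema field (sketch)."""
    avro_type = BQ_TO_AVRO[bq_type]
    if mode == "NULLABLE":
        # Avro models optional fields as a union with "null".
        avro_type = ["null", avro_type]
    elif mode == "REPEATED":
        avro_type = {"type": "array", "items": avro_type}
    return {"name": name, "type": avro_type}

print(field_to_avro("score", "FLOAT"))
# {'name': 'score', 'type': ['null', 'double']}
```

The mismatches show up exactly here: JSON has no bytes or timestamp type and no NaN/Infinity, so an Avro-based writer has to commit to encodings like the ones sketched above.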

-- 


*Confidentiality Note:* We care about protecting our proprietary 
information, confidential material, and trade secrets. This message may 
contain some or all of those things. Cruise will suffer material harm if 
anyone other than the intended recipient disseminates or takes any action 
based on this message. If you have received this message (including any 
attachments) in error, please delete it immediately and notify the sender 
promptly.

Re: [EXT] Re: using avro instead of json for BigQueryIO.Write

Posted by Pablo Estrada <pa...@google.com>.
This is great. I'll take a look today.

On Wed, Feb 26, 2020 at 9:42 AM Chuck Yang <ch...@getcruise.com> wrote:

> Hi Devs,
>
> I was able to get around to working on Avro file loads to BigQuery in
> Python SDK and now have a PR available at
> https://github.com/apache/beam/pull/10979 . Comments appreciated :)
>
> Thanks,
> Chuck
>
> On Wed, Nov 27, 2019 at 10:10 AM Chuck Yang <ch...@getcruise.com>
> wrote:
> >
> > I would love to fix this, but not sure if I have the bandwidth at the
> > moment. Anyway, created the jira here:
> > https://jira.apache.org/jira/browse/BEAM-8841
> >
> > Thanks!
> > Chuck
>

Re: [EXT] Re: using avro instead of json for BigQueryIO.Write

Posted by Chuck Yang <ch...@getcruise.com>.
Hi Devs,

I was able to get around to working on Avro file loads to BigQuery in
Python SDK and now have a PR available at
https://github.com/apache/beam/pull/10979 . Comments appreciated :)

Thanks,
Chuck

On Wed, Nov 27, 2019 at 10:10 AM Chuck Yang <ch...@getcruise.com> wrote:
>
> I would love to fix this, but not sure if I have the bandwidth at the
> moment. Anyway, created the jira here:
> https://jira.apache.org/jira/browse/BEAM-8841
>
> Thanks!
> Chuck
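For anyone wondering why switching the load-file format fixes the NaN/infinity problem: the Avro spec encodes a double as 8 little-endian IEEE-754 bytes, so special float values round-trip exactly, unlike JSON, which has no token for them. A stdlib-only sketch of that encoding (not Beam or Avro-library code):

```python
import math
import struct

# Per the Avro spec, a double is written as its raw 64-bit IEEE-754
# representation in little-endian byte order, so NaN and Infinity
# survive serialization unchanged.
def avro_encode_double(x: float) -> bytes:
    return struct.pack("<d", x)

def avro_decode_double(b: bytes) -> float:
    return struct.unpack("<d", b)[0]

for v in (float("nan"), float("inf"), 1.5):
    out = avro_decode_double(avro_encode_double(v))
    print(v, "->", out)
```

In released Python SDKs this work surfaces, if I recall correctly, as the temp_file_format='AVRO' argument to WriteToBigQuery when using file loads; check the SDK docs for your Beam version.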


Re: [EXT] Re: using avro instead of json for BigQueryIO.Write

Posted by Chuck Yang <ch...@getcruise.com>.
I would love to fix this, but not sure if I have the bandwidth at the
moment. Anyway, created the jira here:
https://jira.apache.org/jira/browse/BEAM-8841

Thanks!
Chuck


Re: [EXT] Re: using avro instead of json for BigQueryIO.Write

Posted by Chamikara Jayalath <ch...@google.com>.
I don't believe so, please create one (we can dedup if we happen to find
another issue).

Even better if you can contribute to fix this :)

Thanks,
Cham

On Tue, Nov 26, 2019 at 7:07 PM Chuck Yang <ch...@getcruise.com> wrote:

> Has anyone looked into implementing this for the Python SDK? It would
> be nice to have, if only for the ability to write float values such as
> NaN and infinity. I didn't see anything in Jira; happy to
> create a ticket, but wanted to ask around first.
>
> On Thu, Oct 17, 2019 at 12:53 PM Reuven Lax <re...@google.com> wrote:
> >
> > I'll take a look as well. Thanks for doing this!
> >
> > On Fri, Oct 4, 2019 at 9:16 PM Pablo Estrada <pa...@google.com> wrote:
> >>
> >> Thanks Steve!
> >> I'll take a look next week. Sorry about the delay so far.
> >> Best
> >> -P.
> >>
> >> On Fri, Sep 27, 2019 at 10:37 AM Steve Niemitz <sn...@apache.org>
> wrote:
> >>>
> >>> I put up a semi-WIP pull request
> https://github.com/apache/beam/pull/9665 for this.  The initial results
> look good.  I'll spend some time soon adding unit tests and documentation,
> but I'd appreciate it if someone could take a first pass over it.
> >>>
> >>> On Wed, Sep 18, 2019 at 6:14 PM Pablo Estrada <pa...@google.com>
> wrote:
> >>>>
> >>>> Thanks for offering to work on this! It would be awesome to have it.
> I can say that we don't have that for Python ATM.
> >>>>
> >>>> On Mon, Sep 16, 2019 at 10:56 AM Steve Niemitz <sn...@apache.org>
> wrote:
> >>>>>
> >>>>> Our experience has actually been that Avro is more efficient than
> even Parquet, but that might also be skewed by our datasets.
> >>>>>
> >>>>> I might try to take a crack at this, I found
> https://issues.apache.org/jira/browse/BEAM-2879 tracking it (which
> coincidentally references my thread from a couple years ago on the read
> side of this :) ).
> >>>>>
> >>>>> On Mon, Sep 16, 2019 at 1:38 PM Reuven Lax <re...@google.com> wrote:
> >>>>>>
> >>>>>> It's been talked about, but nobody's done anything. There are some
> difficulties related to type conversion (JSON and Avro don't support the
> same types), but if those are overcome then an Avro version would be much
> more efficient. I believe Parquet files would be even more efficient if you
> wanted to go that path, but there might be more code to write (as we
> already have some code in the codebase to convert between TableRows and
> Avro).
> >>>>>>
> >>>>>> Reuven
> >>>>>>
> >>>>>> On Mon, Sep 16, 2019 at 10:33 AM Steve Niemitz <sn...@apache.org>
> wrote:
> >>>>>>>
> >>>>>>> Has anyone investigated using avro rather than json to load data
> into BigQuery using BigQueryIO (+ FILE_LOADS)?
> >>>>>>>
> >>>>>>> I'd be interested in enhancing it to support this, but I'm curious
> if there's any prior work here.