Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2022/03/02 00:21:00 UTC

[jira] [Commented] (BEAM-11587) Support pd.read_gbq and DataFrame.to_gbq

    [ https://issues.apache.org/jira/browse/BEAM-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499818#comment-17499818 ] 

Brian Hulette commented on BEAM-11587:
--------------------------------------

Hi Svetak, I discussed this with [~robertwb] offline. We agreed that actually using the pandas read_gbq and to_gbq (in the same way that we use read_csv and to_csv) is more trouble than it's worth. The reason is that the files read by read_csv (and other read_* functions) are relatively easy to split into partitions that distributed worker nodes can read in parallel. Splitting up a BigQuery read is more complicated, and we'd need to implement a bunch of logic for it.

The traditional BigQueryIO already has all of this splitting logic. The gap for BigQueryIO is that it only produces/consumes dictionaries, while the DataFrame API needs a PCollection with a schema (that's why we need to specify the schema manually in [this example|https://github.com/apache/beam/blob/3cd1f7f949bd476abb11bdb0b368a2f12a496cd1/sdks/python/apache_beam/examples/dataframe/flight_delays.py#L91]).

A better approach to this would be:
# Add support for producing/consuming a PCollection with a schema with BigQueryIO (this would mean looking up the schema in BQ, then adding logic to the pipeline to make PCollections with a schema).
# Optionally add read_gbq and to_gbq functions that use BigQueryIO under the hood. These methods would still be nice to have so users familiar with DataFrames don't need to use classic Beam at all.
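
To make the first step more concrete, here is a rough, standard-library-only sketch of the "look up the schema, then build typed rows" idea. It is not Beam code and not how BigQueryIO is implemented. The type mapping BQ_TO_PY, the helper names row_type_from_bq_schema and dict_to_row, and the list-of-dicts schema layout (name/type/mode keys) are all assumptions made for illustration.

```python
from typing import Any, Dict, List, NamedTuple, Optional

# Hypothetical, deliberately incomplete mapping from BigQuery column
# types to Python types. A real implementation would need to cover
# many more types (TIMESTAMP, NUMERIC, RECORD, repeated fields, ...).
BQ_TO_PY = {
    "STRING": str,
    "INTEGER": int,
    "INT64": int,
    "FLOAT": float,
    "FLOAT64": float,
    "BOOLEAN": bool,
    "BOOL": bool,
}


def row_type_from_bq_schema(name: str, fields: List[Dict[str, str]]) -> type:
    """Build a NamedTuple type from a BigQuery-style list of field dicts."""
    typed = []
    for field in fields:
        py_type = BQ_TO_PY[field["type"]]
        # Assumption: a field with no explicit mode is treated as NULLABLE.
        if field.get("mode", "NULLABLE") == "NULLABLE":
            py_type = Optional[py_type]
        typed.append((field["name"], py_type))
    return NamedTuple(name, typed)


def dict_to_row(row_type: type, record: Dict[str, Any]) -> Any:
    """Convert one dictionary record into an instance of row_type."""
    # Keys missing from the record become None.
    return row_type(**{f: record.get(f) for f in row_type._fields})


# Example with a made-up schema and a made-up record.
schema = [
    {"name": "airline", "type": "STRING", "mode": "REQUIRED"},
    {"name": "departure_delay", "type": "FLOAT"},
]
Flight = row_type_from_bq_schema("Flight", schema)
row = dict_to_row(Flight, {"airline": "AA", "departure_delay": 3.5})
```

In a pipeline, the dictionaries coming out of BigQueryIO could presumably be mapped through something like dict_to_row with the generated type declared as the output type, so that the resulting PCollection carries a schema and the manual schema step from the linked example is no longer needed.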

> Support pd.read_gbq and DataFrame.to_gbq
> ----------------------------------------
>
>                 Key: BEAM-11587
>                 URL: https://issues.apache.org/jira/browse/BEAM-11587
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-dataframe, io-py-gcp, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>              Labels: dataframe-api
>
> We should install pandas-gbq as part of the gcp extras and use it for querying BigQuery.
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html]
>  
> and 
>  
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)