Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2022/03/02 00:21:00 UTC

[jira] [Comment Edited] (BEAM-11587) Support pd.read_gbq and DataFrame.to_gbq

    [ https://issues.apache.org/jira/browse/BEAM-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499818#comment-17499818 ] 

Brian Hulette edited comment on BEAM-11587 at 3/2/22, 12:20 AM:
----------------------------------------------------------------

Hi Svetak, I discussed this with [~robertwb] offline. We agreed that actually using the pandas read_gbq and to_gbq (in the same way that we use read_csv and to_csv) is more trouble than it's worth. The reason is that the files read by read_csv (and the other read_* functions) are relatively easy to split into partitions for reading from distributed worker nodes, but splitting up a BigQuery read is more complicated, and we'd need to implement a lot of logic for it.
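For concreteness, this is roughly how the DataFrame API wires up the pandas-style read_csv/to_csv today (the bucket paths here are placeholders, not from the issue). Doing the same for read_gbq/to_gbq is the part that's hard to split:

{code:python}
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
  # read_csv produces a deferred DataFrame; the file glob underneath is
  # straightforward to split into partitions across workers.
  df = p | read_csv('gs://my-bucket/input*.csv')  # placeholder path
  df.to_csv('gs://my-bucket/output')  # placeholder path
{code}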

The traditional BigQueryIO already has all of this splitting logic. The gap for BigQueryIO is that it only produces/consumes dictionaries, while the DataFrame API requires a schema (that's why the schema has to be specified manually in [this example|https://github.com/apache/beam/blob/3cd1f7f949bd476abb11bdb0b368a2f12a496cd1/sdks/python/apache_beam/examples/dataframe/flight_delays.py#L91]).
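Condensed from the linked example, this is the kind of manual schema assignment we have to do today (the query string is a placeholder; the field names are from that example):

{code:python}
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe

with beam.Pipeline() as p:
  pc = (
      p
      | beam.io.ReadFromBigQuery(
          query='SELECT airline, departure_delay FROM `project.dataset.table`',  # placeholder
          use_standard_sql=True)
      # ReadFromBigQuery yields plain dicts, so a schema has to be
      # attached by hand before the DataFrame API can consume it.
      | beam.Select(
          airline=lambda x: str(x['airline']),
          departure_delay=lambda x: float(x['departure_delay'])))
  df = to_dataframe(pc)
{code}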

A better approach would be:
# Add support for producing/consuming schema'd PCollections in BigQueryIO (this would mean looking up the schema in BQ, then adding logic to the pipeline to produce PCollections with that schema).
# Optionally add read_gbq and to_gbq functions that use BigQueryIO under the hood (see the sketch below). These would still be nice to have so users familiar with DataFrames don't need to use classic Beam at all.
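Purely illustrative, a read_gbq along the lines of (2) might look something like this; the name, signature, and hard-coded schema step are hypothetical, not an existing API:

{code:python}
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe

def read_gbq(pipeline, table):
  # Hypothetical sketch only: read with the existing BigQueryIO, attach
  # a schema (per step 1, ideally looked up from BQ itself), and hand
  # the result to the DataFrame API.
  pc = (
      pipeline
      | beam.io.ReadFromBigQuery(table=table)
      # Placeholder schema step: a real implementation would derive
      # these fields from the BQ table's schema rather than hard-coding.
      | beam.Select(id=lambda x: int(x['id'])))
  return to_dataframe(pc)
{code}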



> Support pd.read_gbq and DataFrame.to_gbq
> ----------------------------------------
>
>                 Key: BEAM-11587
>                 URL: https://issues.apache.org/jira/browse/BEAM-11587
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-dataframe, io-py-gcp, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P3
>              Labels: dataframe-api
>
> We should install pandas-gbq as part of the gcp extras and use it for querying BigQuery:
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html]
> and
> [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)