Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2022/03/02 00:21:00 UTC
[jira] [Commented] (BEAM-11587) Support pd.read_gbq and DataFrame.to_gbq
[ https://issues.apache.org/jira/browse/BEAM-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499818#comment-17499818 ]
Brian Hulette commented on BEAM-11587:
--------------------------------------
Hi Svetak, I discussed this with [~robertwb] offline. We agreed that actually using the pandas read_gbq and to_gbq (in the same way that we use read_csv and to_csv) is more trouble than it's worth. The files read by read_csv (and the other read_* functions) are relatively easy to split into partitions for reading from distributed worker nodes, but splitting up a BigQuery read is more complicated, and we would need to implement a lot of logic for it.
The traditional BigQueryIO already has all of this splitting logic. The gap for BigQueryIO is that it only produces/consumes dictionaries, while we need a schema in order to use it with the DataFrame API (that's why we have to specify the schema manually in [this example|https://github.com/apache/beam/blob/3cd1f7f949bd476abb11bdb0b368a2f12a496cd1/sdks/python/apache_beam/examples/dataframe/flight_delays.py#L91]).
A better approach to this would be:
# Add support for producing/consuming a PCollection with a schema with BigQueryIO (this would mean looking up the schema in BQ, then adding logic to the pipeline to make PCollections with a schema).
# Optionally add read_gbq and to_gbq functions that use BigQueryIO under the hood. These methods would still be nice to have so users familiar with DataFrames don't need to use classic Beam at all.
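For step 1, looking up the schema in BQ would presumably involve mapping BigQuery field types to Python types so the PCollection's schema can be constructed automatically. A minimal sketch of that mapping (the table below is my assumption for illustration, not anything in BigQueryIO today):

```python
# Hypothetical mapping from BigQuery standard SQL field type names to
# Python types, as a schema-lookup step in BigQueryIO might use.
_BQ_TO_PY = {
    'STRING': str,
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'FLOAT64': float,
    'BOOLEAN': bool,
    'BOOL': bool,
    'BYTES': bytes,
}

def bq_field_to_py(field_type):
    """Map a BigQuery field type name to a Python type (default: str)."""
    return _BQ_TO_PY.get(field_type.upper(), str)
```

With something like this, read_gbq could fetch the table's schema from BQ, build the corresponding row type, and hand users a schema'd PCollection (or DataFrame) without any manual annotation.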
> Support pd.read_gbq and DataFrame.to_gbq
> ----------------------------------------
>
> Key: BEAM-11587
> URL: https://issues.apache.org/jira/browse/BEAM-11587
> Project: Beam
> Issue Type: New Feature
> Components: dsl-dataframe, io-py-gcp, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Svetak Vihaan Sundhar
> Priority: P3
> Labels: dataframe-api
>
> We should install pandas-gbq as part of the gcp extras and use it for querying BigQuery.
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html]
>
> and
>
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html
--
This message was sent by Atlassian Jira
(v8.20.1#820001)