Posted to user@flink.apache.org by Richard Deurwaarder <ri...@xeli.eu> on 2019/06/04 12:01:04 UTC

Re: BigQuery source ?

I looked into this briefly a while ago out of interest and read about how
Beam handles this. I've never actually implemented it, but the concept
sounds reasonable to me.

What I read from their code is that Beam exports the BigQuery data to
Google Cloud Storage. This export shards the data into files with a maximum
size of 1 GB, and these files are then processed by the 'source functions'
in Beam.

I think implementing this in Flink would require the following:

* Before starting the Flink job, run the BigQuery to Google Cloud Storage
export (https://cloud.google.com/bigquery/docs/exporting-data); a sketch of
this step follows the list
* Start the Flink job and point it at the exported files (using
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage to
easily read from Google Cloud Storage buckets)
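
As a rough sketch, the export step could be driven with the BigQuery Java
client (google-cloud-bigquery). I haven't run this; credentials are assumed
to come from the environment, and the dataset/table/bucket names are
placeholders:

> import com.google.cloud.bigquery.*;
>
> BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
> ExtractJobConfiguration config = ExtractJobConfiguration
>         .newBuilder(TableId.of("my_dataset", "my_table"),
>                 "gs://my-bucket/export/part-*.json") // wildcard => sharded files
>         .setFormat("NEWLINE_DELIMITED_JSON")
>         .build();
> // Blocks until the extract job finishes; waitFor() returns null if the
> // job no longer exists.
> Job job = bigquery.create(JobInfo.of(config)).waitFor();
> if (job == null || job.getStatus().getError() != null) {
>     throw new RuntimeException("BigQuery export failed");
> }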

So the job might look something like this:

> // doBigQueryExportJob() wraps the export step above and returns the
> // resulting gs:// paths; doWork() stands in for your actual processing.
> List<String> files = doBigQueryExportJob();
> DataSet<String> records = environment.fromCollection(files)
>         .flatMap(new ReadFromFile()) // one file path in, many records out
>         .map(doWork());
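
ReadFromFile could then be a plain flatMap that opens each gs:// path
through the Hadoop FileSystem API, which is what the GCS connector above
plugs into. Again just a sketch, assuming line-delimited output like the
JSON export:

> import org.apache.flink.api.common.functions.RichFlatMapFunction;
> import org.apache.flink.util.Collector;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
>
> public class ReadFromFile extends RichFlatMapFunction<String, String> {
>     @Override
>     public void flatMap(String pathString, Collector<String> out) throws Exception {
>         Path path = new Path(pathString);
>         FileSystem fs = path.getFileSystem(new Configuration());
>         try (BufferedReader reader =
>                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
>             String line;
>             while ((line = reader.readLine()) != null) {
>                 out.collect(line); // one record per exported line
>             }
>         }
>     }
> }

Since fromCollection runs with parallelism 1, the file paths get
redistributed to the parallel flatMap subtasks, which is where the
distributed read actually happens.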


On Fri, May 31, 2019 at 10:15 AM Niels Basjes <Ni...@basjes.nl> wrote:

> Hi,
>
> Has anyone created a source to READ from BigQuery into Flink yet (we
> have Flink running on K8S in the Google Cloud)?
> I would like to retrieve a DataSet in a distributed way (the data ...
> it's kinda big) and process that with Flink running on k8s (which we
> have running already).
>
> So far I have not been able to find anything yet.
> Any pointers/hints/code fragments are welcome.
>
> Thanks
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>