Posted to user@beam.apache.org by Matt Terwilliger <ma...@mailchimp.com> on 2020/07/02 13:28:23 UTC

BigQuery query caching?

Hello,

I'm writing a Beam pipeline that does some relatively expensive reads from
BigQuery. I want to be able to run the pipeline in a development loop
without racking up a huge bill.

I know BigQuery has support for query caching, but from the docs, that only
works if you don't specify a destination table.

For the purposes of development, I don't mind trading off stale data (i.e.
reusing an existing destination table if it exists) to save money.

Is there any way to do this now, or any relevant open issues? I did a quick
pass through JIRA but couldn't find anything.

Thanks,
Matt

Re: BigQuery query caching?

Posted by Jeff Klukas <jk...@mozilla.com>.
I don't have any particularly exciting recommendations for keeping the
query up to date. It would be up to the developer to remember to rerun the
query to populate the dev results table as they make changes to the query.

I'd likely be copying and pasting the query into the BQ console, wrapping
it with something like:

```
DROP TABLE IF EXISTS myproj.mydataset.my_temp_results;
CREATE TABLE myproj.mydataset.my_temp_results AS
<paste the query here>
```

And then the pipeline would reference myproj.mydataset.my_temp_results in
BigQueryIO.read.
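
Roughly, a self-contained sketch of that read in the Java SDK (only the table name
comes from the example above; everything else here is illustrative):

```
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadTempResults {
  public static void main(String[] args) {
    // Export-based reads need --tempLocation set to a GCS path in the options.
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the pre-materialized dev results table directly; no query job is issued,
    // so no query cost. The table spec uses the project:dataset.table form.
    PCollection<TableRow> rows =
        pipeline.apply(BigQueryIO.readTableRows().from("myproj:mydataset.my_temp_results"));

    // ... rest of the pipeline goes here ...
    pipeline.run();
  }
}
```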

On Thu, Jul 2, 2020 at 1:49 PM Matthew Terwilliger <mt...@mailchimp.com> wrote:

> How are you materializing the query results for the first time and keeping
> the table up to date as your queries change?
>
> Thanks,
> Matt
>
> On Thu, Jul 2, 2020 at 10:06 AM Matthew Terwilliger <mt...@mailchimp.com>
> wrote:
>
>> Hi Jeff,
>>
>> Using Java.
>>
>> Yeah - we are issuing a query rather than reading a table. Materializing
>> the results myself and reading them back seems simple enough. I will give
>> that a try!
>>
>> Thanks,
>> Matt
>>
>> On Thu, Jul 2, 2020 at 9:42 AM Jeff Klukas <jk...@mozilla.com> wrote:
>>
>>> It sounds like your pipeline is issuing a query rather than reading a
>>> whole table.
>>>
>>> Are you using Java or Python? I'm only familiar with the Java SDK so my
>>> answer may be Java-biased.
>>>
>>> I would recommend materializing the query results to a table, and then
>>> configuring your pipeline to read that table rather than reading from a
>>> query. In that case, no query job is involved so you incur no query cost.
>>>
>>> By default, the read from a table will do an export to avro files. There
>>> is no GCP cost associated with that export, but there is a quota involved,
>>> which you may run into if you run your pipeline repeatedly. So an even
>>> better loop would be to do the export to GCS out of band, and then
>>> reference those avro files. But that would require much more extensive code
>>> changes in your pipeline whereas the switch of reading from a query to
>>> reading from a table is a one-line code change.
>>>
>>> You can also avoid the export to avro files by configuring BigQueryIO to
>>> use direct reads from your temporary table rather than file exports. There
>>> is a cost associated with direct reads, but it should generally be much
>>> smaller than the cost of repeatedly running a query.
>>>
>>> On Thu, Jul 2, 2020 at 9:28 AM Matt Terwilliger <
>>> matt.terwilliger@mailchimp.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm writing a Beam pipeline that does some relatively expensive reads
>>>> from BigQuery. I want to be able to run the pipeline in a development loop
>>>> without racking up a huge bill.
>>>>
>>>> I know BigQuery has support for query caching, but from the docs, that
>>>> only works if you don't specify a destination table.
>>>>
>>>> For the purposes of development, I don't mind trading off stale data
>>>> (i.e. reusing an existing destination table if it exists) to save money.
>>>>
>>>> Is there any way to do this now, or any relevant open issues? I did a
>>>> quick pass through JIRA but couldn't find anything.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>

Re: BigQuery query caching?

Posted by Matthew Terwilliger <mt...@mailchimp.com>.
How are you materializing the query results for the first time and keeping
the table up to date as your queries change?

Thanks,
Matt

On Thu, Jul 2, 2020 at 10:06 AM Matthew Terwilliger <mt...@mailchimp.com>
wrote:

> Hi Jeff,
>
> Using Java.
>
> Yeah - we are issuing a query rather than reading a table. Materializing
> the results myself and reading them back seems simple enough. I will give
> that a try!
>
> Thanks,
> Matt
>
> On Thu, Jul 2, 2020 at 9:42 AM Jeff Klukas <jk...@mozilla.com> wrote:
>
>> It sounds like your pipeline is issuing a query rather than reading a
>> whole table.
>>
>> Are you using Java or Python? I'm only familiar with the Java SDK so my
>> answer may be Java-biased.
>>
>> I would recommend materializing the query results to a table, and then
>> configuring your pipeline to read that table rather than reading from a
>> query. In that case, no query job is involved so you incur no query cost.
>>
>> By default, the read from a table will do an export to avro files. There
>> is no GCP cost associated with that export, but there is a quota involved,
>> which you may run into if you run your pipeline repeatedly. So an even
>> better loop would be to do the export to GCS out of band, and then
>> reference those avro files. But that would require much more extensive code
>> changes in your pipeline whereas the switch of reading from a query to
>> reading from a table is a one-line code change.
>>
>> You can also avoid the export to avro files by configuring BigQueryIO to
>> use direct reads from your temporary table rather than file exports. There
>> is a cost associated with direct reads, but it should generally be much
>> smaller than the cost of repeatedly running a query.
>>
>> On Thu, Jul 2, 2020 at 9:28 AM Matt Terwilliger <
>> matt.terwilliger@mailchimp.com> wrote:
>>
>>> Hello,
>>>
>>> I'm writing a Beam pipeline that does some relatively expensive reads
>>> from BigQuery. I want to be able to run the pipeline in a development loop
>>> without racking up a huge bill.
>>>
>>> I know BigQuery has support for query caching, but from the docs, that
>>> only works if you don't specify a destination table.
>>>
>>> For the purposes of development, I don't mind trading off stale data
>>> (i.e. reusing an existing destination table if it exists) to save money.
>>>
>>> Is there any way to do this now, or any relevant open issues? I did a
>>> quick pass through JIRA but couldn't find anything.
>>>
>>> Thanks,
>>> Matt
>>>
>>

Re: BigQuery query caching?

Posted by Matthew Terwilliger <mt...@mailchimp.com>.
Hi Jeff,

Using Java.

Yeah - we are issuing a query rather than reading a table. Materializing
the results myself and reading them back seems simple enough. I will give
that a try!

Thanks,
Matt

On Thu, Jul 2, 2020 at 9:42 AM Jeff Klukas <jk...@mozilla.com> wrote:

> It sounds like your pipeline is issuing a query rather than reading a
> whole table.
>
> Are you using Java or Python? I'm only familiar with the Java SDK so my
> answer may be Java-biased.
>
> I would recommend materializing the query results to a table, and then
> configuring your pipeline to read that table rather than reading from a
> query. In that case, no query job is involved so you incur no query cost.
>
> By default, the read from a table will do an export to avro files. There
> is no GCP cost associated with that export, but there is a quota involved,
> which you may run into if you run your pipeline repeatedly. So an even
> better loop would be to do the export to GCS out of band, and then
> reference those avro files. But that would require much more extensive code
> changes in your pipeline whereas the switch of reading from a query to
> reading from a table is a one-line code change.
>
> You can also avoid the export to avro files by configuring BigQueryIO to
> use direct reads from your temporary table rather than file exports. There
> is a cost associated with direct reads, but it should generally be much
> smaller than the cost of repeatedly running a query.
>
> On Thu, Jul 2, 2020 at 9:28 AM Matt Terwilliger <
> matt.terwilliger@mailchimp.com> wrote:
>
>> Hello,
>>
>> I'm writing a Beam pipeline that does some relatively expensive reads
>> from BigQuery. I want to be able to run the pipeline in a development loop
>> without racking up a huge bill.
>>
>> I know BigQuery has support for query caching, but from the docs, that
>> only works if you don't specify a destination table.
>>
>> For the purposes of development, I don't mind trading off stale data
>> (i.e. reusing an existing destination table if it exists) to save money.
>>
>> Is there any way to do this now, or any relevant open issues? I did a
>> quick pass through JIRA but couldn't find anything.
>>
>> Thanks,
>> Matt
>>
>

Re: BigQuery query caching?

Posted by Jeff Klukas <jk...@mozilla.com>.
It sounds like your pipeline is issuing a query rather than reading a whole
table.

Are you using Java or Python? I'm only familiar with the Java SDK so my
answer may be Java-biased.

I would recommend materializing the query results to a table, and then
configuring your pipeline to read that table rather than reading from a
query. In that case, no query job is involved so you incur no query cost.
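
To make the shape of that change concrete, a rough sketch (the query and table names
below are placeholders, not anything from your pipeline):

```
// Before: every run issues (and pays for) the query.
pipeline.apply(
    BigQueryIO.readTableRows()
        .fromQuery("SELECT ... FROM `myproj.mydataset.source_table`")
        .usingStandardSql());

// After: read the table you materialized once; no query job runs.
pipeline.apply(
    BigQueryIO.readTableRows()
        .from("myproj:mydataset.my_temp_results"));
```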

By default, the read from a table will do an export to avro files. There is
no GCP cost associated with that export, but there is a quota involved,
which you may run into if you run your pipeline repeatedly. So an even
better loop would be to do the export to GCS out of band, and then
reference those avro files. But that would require much more extensive code
changes in your pipeline whereas the switch of reading from a query to
reading from a table is a one-line code change.
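
As a rough sketch of that flavor, assuming you've already exported the table to Avro
files on GCS out of band (the bucket, path, and schema below are all made up):

```
// Read the previously exported Avro files instead of touching BigQuery at all.
String schemaJson =
    "{\"type\":\"record\",\"name\":\"my_temp_results\",\"fields\":["
        + "{\"name\":\"user_id\",\"type\":\"long\"},"
        + "{\"name\":\"total\",\"type\":\"double\"}]}";

PCollection<GenericRecord> rows =
    pipeline.apply(
        AvroIO.readGenericRecords(schemaJson)
            .from("gs://my-dev-bucket/my_temp_results/*.avro"));
```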

You can also avoid the export to avro files by configuring BigQueryIO to
use direct reads from your temporary table rather than file exports. There
is a cost associated with direct reads, but it should generally be much
smaller than the cost of repeatedly running a query.
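
That's just an extra setting on the same read, something like:

```
// Direct read via the BigQuery Storage API: no file export (so no export quota)
// and no query job; there is a per-byte read cost, but typically far less than
// re-running the query on every iteration.
pipeline.apply(
    BigQueryIO.readTableRows()
        .from("myproj:mydataset.my_temp_results")
        .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ));
```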

On Thu, Jul 2, 2020 at 9:28 AM Matt Terwilliger <
matt.terwilliger@mailchimp.com> wrote:

> Hello,
>
> I'm writing a Beam pipeline that does some relatively expensive reads from
> BigQuery. I want to be able to run the pipeline in a development loop
> without racking up a huge bill.
>
> I know BigQuery has support for query caching, but from the docs, that
> only works if you don't specify a destination table.
>
> For the purposes of development, I don't mind trading off stale data (i.e.
> reusing an existing destination table if it exists) to save money.
>
> Is there any way to do this now, or any relevant open issues? I did a
> quick pass through JIRA but couldn't find anything.
>
> Thanks,
> Matt
>