Posted to dev@beam.apache.org by Chamikara Jayalath <ch...@google.com> on 2020/12/03 03:30:35 UTC

Re: Caching issue in BigQueryIO

The state of a Dataflow pipeline is not maintained across different runs of
the pipeline. I think here, too, you can add a custom ParDo that stores such
state in an external storage system and retrieves it when a fresh pipeline
starts up.
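
A minimal sketch of that idea, assuming a Redis instance reachable from the
workers. The host name, cache key, and the tableSpecFor/createTableIfNeeded
helpers are hypothetical placeholders, not part of Beam:

  import java.util.HashSet;
  import java.util.Set;
  import com.google.api.services.bigquery.model.TableRow;
  import org.apache.beam.sdk.transforms.DoFn;
  import redis.clients.jedis.Jedis;

  // Warms its table cache from Redis at worker startup and records newly
  // created tables back, so a freshly deployed pipeline does not re-check
  // tables that already exist in BigQuery.
  class ExternallyCachedTableCreator extends DoFn<TableRow, TableRow> {
    private static final String REDIS_HOST = "redis.internal";   // hypothetical
    private static final String CACHE_KEY = "bq-created-tables"; // hypothetical

    private transient Jedis jedis;
    private transient Set<String> knownTables;

    @Setup
    public void setup() {
      jedis = new Jedis(REDIS_HOST);
      // Load the table names persisted by earlier runs of the pipeline.
      knownTables = new HashSet<>(jedis.smembers(CACHE_KEY));
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String tableSpec = tableSpecFor(c.element());
      if (!knownTables.contains(tableSpec)) {
        createTableIfNeeded(tableSpec);
        knownTables.add(tableSpec);
        jedis.sadd(CACHE_KEY, tableSpec); // survives redeployments
      }
      c.output(c.element());
    }

    @Teardown
    public void teardown() {
      if (jedis != null) {
        jedis.close();
      }
    }

    // Hypothetical: derive "project:dataset.table" from the element.
    private String tableSpecFor(TableRow row) {
      return (String) row.get("tableSpec");
    }

    // Hypothetical: tables.get, then tables.insert on "not found".
    private void createTableIfNeeded(String tableSpec) {}
  }

The same @Setup/@Teardown pattern would work with Memcached or any other
external store the workers can reach.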

Thanks,
Cham

On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta <de...@gmail.com> wrote:

> Hey folks,
>
> While using BigQueryIO to insert into 10k tables, I found an issue in its
> local caching technique for table creation. A table is first looked up in
> BigQueryIO's local cache, which decides whether the table needs to be
> created. The main issue arises when inserting into thousands of tables:
> suppose we have 10k tables to insert into in real time, and we deploy a
> fresh Dataflow pipeline once a week. The local cache will be empty, and it
> will take a huge amount of time just to rebuild that cache for 10k tables,
> even though these tables were already created in BigQuery.
>
> The solution I would propose is to provide an option for using external
> caching services such as Redis/Memcached, so that we don't have to rebuild
> the cache from scratch after every fresh deployment of the pipeline.
>
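
For context, here is a simplified sketch of the process-local caching
pattern described above. It is illustrative only, not the actual BigQueryIO
source: because the set of known tables is held in static worker memory,
every fresh deployment starts cold and pays one existence check per table.

  import java.util.Collections;
  import java.util.Set;
  import java.util.concurrent.ConcurrentHashMap;

  // Illustrative only: a process-local table cache like the one described.
  class LocalTableCache {
    // Static, in-memory state is lost whenever workers are replaced by a
    // fresh deployment, which is why the warm-up for 10k tables recurs.
    private static final Set<String> createdTables =
        Collections.newSetFromMap(new ConcurrentHashMap<>());

    static void ensureExists(String tableSpec, Runnable createTable) {
      if (createdTables.contains(tableSpec)) {
        return; // cache hit: no BigQuery round trip
      }
      createTable.run(); // cache miss: one tables.get/tables.insert RPC
      createdTables.add(tableSpec);
    }
  }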