Posted to dev@beam.apache.org by Vasu Gupta <de...@gmail.com> on 2020/12/02 21:20:04 UTC

Caching issue in BigQueryIO

Hey folks, 

While using BigQueryIO to insert into 10k tables, I found an issue in its local caching technique for table creation. A table is first looked up in BigQueryIO's local cache, and only on a miss does BigQueryIO check whether the table needs to be created. The main issue arises when inserting into thousands of tables: suppose we have 10k tables to insert into in realtime, and we deploy a fresh Dataflow pipeline once a week. The local cache starts empty, so it takes a long time just to rebuild the cache for those 10k tables, even though they were already created in BigQuery.
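As I understand the current behavior (a simplified Python sketch of the pattern, not the actual Java implementation; all names here are made up):

```python
# Models BigQueryIO's per-worker table cache: one creation call per
# table on a cache miss, no call at all on a hit.
api_calls = []               # records simulated BigQuery API calls
created_tables = set()       # in-process cache, lost on redeploy

def bq_create_table(table_ref):
    """Stand-in for the real tables.insert API call."""
    api_calls.append(table_ref)

def ensure_table(table_ref):
    if table_ref in created_tables:   # cache hit: no API traffic
        return
    bq_create_table(table_ref)        # cache miss: one call per table
    created_tables.add(table_ref)

ensure_table("project.dataset.table_001")  # first sight: API call
ensure_table("project.dataset.table_001")  # cached: no API call
```

On a fresh deployment `created_tables` starts empty, so every one of the 10k tables costs one call before its first insert.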

The solution I would propose is to provide an option to use an external caching service such as Redis or Memcached, so that we don't have to rebuild the cache from scratch after every fresh deployment of the pipeline.
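Concretely, the option might look like this (purely illustrative; `ExternalTableCache` and the Redis-style client are hypothetical, not an existing Beam API):

```python
class FakeRedis:
    """Minimal stand-in for a Redis client (SADD/SISMEMBER semantics)."""
    def __init__(self):
        self._sets = {}
    def sadd(self, key, member):
        self._sets.setdefault(key, set()).add(member)
    def sismember(self, key, member):
        return member in self._sets.get(key, set())

class ExternalTableCache:
    """Table-creation cache backed by a shared store, so a freshly
    deployed pipeline sees the entries written by the previous one."""
    KEY = "bq_created_tables"
    def __init__(self, client):
        self._client = client
    def contains(self, table):
        return self._client.sismember(self.KEY, table)
    def add(self, table):
        self._client.sadd(self.KEY, table)

def ensure_table(cache, create_fn, table):
    if not cache.contains(table):
        create_fn(table)
        cache.add(table)

# First pipeline run creates the table once...
shared = FakeRedis()
calls = []
ensure_table(ExternalTableCache(shared), calls.append, "p.d.t1")
# ...and a redeployed pipeline (new cache object, same store) skips it.
ensure_table(ExternalTableCache(shared), calls.append, "p.d.t1")
```

The point of the design is only that the cache's lifetime is decoupled from the worker process; the lookup/insert logic itself is unchanged.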

Re: Caching issue in BigQueryIO

Posted by Reuven Lax <re...@google.com>.
How long does it take to rebuild? Even for thousands of tables I would not
expect it to take very long, unless you are hitting quota rate limits with
BigQuery. If that's the case, maybe a better solution is to see if those
quotas could be raised?
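For a rough sense of scale: a cold cache costs one metadata/creation call per table, so the rebuild time is about tables divided by the sustainable request rate. With made-up numbers (actual quotas vary by project):

```python
tables = 10_000
requests_per_sec = 100      # assumed sustainable rate, illustrative only
rebuild_seconds = tables / requests_per_sec
print(rebuild_seconds)      # under this assumption: 100.0 seconds
```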

On Fri, Dec 4, 2020 at 9:57 AM Vasu Gupta <de...@gmail.com> wrote:

> Hey Reuven, yes, you are correct that BigQueryIO is working as intended,
> but the issue is that since it's a local cache, it has to be rebuilt from
> scratch when the pipeline is redeployed, which is very time consuming for
> thousands of tables.

Re: Caching issue in BigQueryIO

Posted by Vasu Gupta <de...@gmail.com>.
Hey Reuven, yes, you are correct that BigQueryIO is working as intended, but the issue is that since it's a local cache, it has to be rebuilt from scratch when the pipeline is redeployed, which is very time consuming for thousands of tables.

On 2020/12/03 17:58:04, Reuven Lax <re...@google.com> wrote: 
> What exactly is the issue? If the cache is empty, then BigQueryIO will try
> and create the table again, and the creation will fail since the table
> exists. This is working as intended.
> 
> The only reason for the cache is so that BigQueryIO doesn't continuously
> hammer BigQuery with creation requests every second.

Re: Caching issue in BigQueryIO

Posted by Reuven Lax <re...@google.com>.
What exactly is the issue? If the cache is empty, then BigQueryIO will try
and create the table again, and the creation will fail since the table
exists. This is working as intended.

The only reason for the cache is so that BigQueryIO doesn't continuously
hammer BigQuery with creation requests every second.
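A sketch of that intended behavior in Python (hypothetical names; the real code is Java inside BigQueryIO): the creation attempt is effectively idempotent, so a cold cache only wastes calls, it doesn't break correctness.

```python
class AlreadyExists(Exception):
    """Models BigQuery's 'duplicate' error on a create call."""

existing = {"project.dataset.t1"}    # tables already present in BigQuery
created_cache = set()                # in-process cache

def bq_create(table):
    """Stand-in for the creation API call."""
    if table in existing:
        raise AlreadyExists(table)
    existing.add(table)

def ensure_table(table):
    if table in created_cache:
        return "cache-hit"           # no API traffic
    try:
        bq_create(table)
        outcome = "created"
    except AlreadyExists:
        outcome = "already-exists"   # treated as success; just a wasted call
    created_cache.add(table)
    return outcome
```

With a cold cache, `ensure_table("project.dataset.t1")` returns "already-exists" once, then "cache-hit" for every later element.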


Re: Caching issue in BigQueryIO

Posted by Chamikara Jayalath <ch...@google.com>.
The state of a Dataflow pipeline is not maintained across different runs of
the pipeline. I think here, too, you could add a custom ParDo that stores
such state in an external storage system and retrieves it when a fresh
pipeline starts up.
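Sketched in Python (a plain class modelling the DoFn lifecycle, not a real Beam transform; the external store here is just a dict stand-in):

```python
class EnsureTableFn:
    """Models a DoFn that pre-warms its table cache from external
    storage in setup(), instead of rediscovering 10k tables per run."""
    STORE_KEY = "created_tables"

    def __init__(self, store):
        self._store = store      # survives across pipeline runs
        self._cache = set()      # per-worker, rebuilt each run

    def setup(self):
        # One bulk read on startup replaces thousands of per-table checks.
        self._cache = set(self._store.get(self.STORE_KEY, set()))

    def process(self, table):
        if table not in self._cache:
            # (create the table via the BigQuery API here)
            self._cache.add(table)
            self._store.setdefault(self.STORE_KEY, set()).add(table)
        return table

# A store populated by a previous run of the pipeline.
store = {"created_tables": {"p.d.t1"}}
fn = EnsureTableFn(store)
fn.setup()                     # fresh deploy: cache warmed from the store
fn.process("p.d.t1")           # known table: no creation attempted
fn.process("p.d.t2")           # new table: created and recorded
```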

Thanks,
Cham
