Posted to dev@beam.apache.org by Chaim Turkel <ch...@behalf.com> on 2017/09/26 13:40:42 UTC

BigQueryIO Partitions

Hi,

   Does BigQueryIO support Partitions when writing? Will it improve my
performance?


chaim

Re: BigQueryIO Partitions

Posted by Reuven Lax <re...@google.com.INVALID>.
When I glanced at this before, it was due to having to create many separate
load jobs - one for each partition. I'm not sure there's anything Beam can do
here. I believe there may be some upcoming features in BigQuery that will
make this better.
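
For concreteness, a minimal sketch of what a per-partition write looks like
with BigQueryIO's dynamic destinations; the project, dataset, table name, and
the "date" field here are assumptions, not anything from this thread. Each
distinct "table$YYYYMMDD" destination turns into its own load job, which is
where the per-partition overhead comes from.

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

public class PartitionDecoratorWrite {
  // Routes each row to a per-day partition decorator such as
  // "my-project:my_dataset.events$20170926". One distinct destination
  // means (at least) one BigQuery load job.
  static void writeToDailyPartitions(PCollection<TableRow> rows, TableSchema schema) {
    rows.apply("WriteToDailyPartitions",
        BigQueryIO.writeTableRows()
            .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
              @Override
              public TableDestination apply(ValueInSingleWindow<TableRow> row) {
                // Assumes each TableRow carries a "date" field formatted as yyyyMMdd.
                String day = (String) row.getValue().get("date");
                return new TableDestination(
                    "my-project:my_dataset.events$" + day, "daily partition");
              }
            })
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}

(CREATE_NEVER assumes the partitioned table was created up front, which also
sidesteps the 2.1.0 table-creation bug mentioned elsewhere in the thread.)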

Reuven

On Tue, Sep 26, 2017 at 6:57 AM, Chaim Turkel <ch...@behalf.com> wrote:

> by the way currently the performance on bigquery partitions is very bad.
> Is there a repository where i can test with 2.2.0?
>
> chaim
>
> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid>
> wrote:
> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
> table
> > containing the partitions is not pre created (fixed in 2.2.0).
> >
> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
> >
> >> Hi,
> >>
> >>    Does BigQueryIO support Partitions when writing? will it improve my
> >> performance?
> >>
> >>
> >> chaim
> >>
>

Re: BigQueryIO Partitions

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
What is your concern about this job? It doesn't seem that slow, and it's not
really bottlenecked by writing to BigQuery (<50% of the wall-clock time is
in the step that writes to BigQuery).

On Thu, Sep 28, 2017 at 12:38 AM Chaim Turkel <ch...@behalf.com> wrote:

> you can see my job at:
>
> https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410
>
>
> On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <re...@google.com.invalid>
> wrote:
> > There are a couple of options, and if you provide a job id (since you are
> > using the Dataflow runner) we can better advise.
> >
> > If you are writing to more than 2000 partitions, this won't work -
> BigQuery
> > has a hard quota of 1000 partition updates per table per day.
> >
> > If you have fewer than 1000 jobs, there are a few possibilities. It's
> > possible that BigQuery is taking a while to schedule some of those jobs;
> > they'll end up sitting in a queue waiting to be scheduled. We can look at
> > one of the jobs in detail to see if that's happening. Eugene's suggestion
> > of using your pipeline to load into a single table might be the best one.
> > You can write the date into a separate column, and then write a shell
> > script to copy each date to it's own partition (see
> >
> https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> > for some examples).
> >
> > On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <
> > kirpichov@google.com.invalid> wrote:
> >
> >> I see. Then Reuven's answer above applies.
> >> Maybe you could write to a non-partitioned table, and then split it into
> >> smaller partitioned tables. See https://stackoverflow.com/a/
> >> 39001706/278042
> >> <https://stackoverflow.com/a/39001706/278042> for a discussion of
> the
> >> current options - granted, it seems like there currently don't exist
> very
> >> good options for creating a very large number of table partitions from
> >> existing data.
> >>
> >> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <ch...@behalf.com> wrote:
> >>
> >> > thank you for your detailed response.
> >> > Currently i am a bit stuck.
> >> > I need to migrate data from mongo to bigquery, we have about 1 terra
> >> > of data. It is history data, so i want to use bigquery partitions.
> >> > It seems that the io connector creates a job per partition so it takes
> >> > a very long time, and i hit the quota in bigquery of the amount of
> >> > jobs per day.
> >> > I would like to use streaming but you cannot stream old data more
> than 30
> >> > day
> >> >
> >> > So I thought of partitions to see if i can do more parraleism
> >> >
> >> > chaim
> >> >
> >> >
> >> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
> >> > <ki...@google.com.invalid> wrote:
> >> > > Okay, I see - there's about 3 different meanings of the word
> >> "partition"
> >> > > that could have been involved here (BigQuery partitions,
> >> runner-specific
> >> > > bundles, and the Partition transform), hence my request for
> >> > clarification.
> >> > >
> >> > > If you mean the Partition transform - then I'm confused what do you
> >> mean
> >> > by
> >> > > BigQueryIO "supporting" it? The Partition transform takes a
> PCollection
> >> > and
> >> > > produces a bunch of PCollections; these are ordinary PCollection's
> and
> >> > you
> >> > > can apply any Beam transforms to them, and BigQueryIO.write() is no
> >> > > exception to this - you can apply it too.
> >> > >
> >> > > To answer whether using Partition would improve your performance,
> I'd
> >> > need
> >> > > to understand exactly what you're comparing against what. I suppose
> >> > you're
> >> > > comparing the following:
> >> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
> >> > table
> >> > > 2) Splitting a PCollection into several smaller PCollection's using
> >> > > Partition, and applying BigQueryIO.write() to each of them, writing
> to
> >> > > different tables I suppose? (or do you want to write to different
> >> > BigQuery
> >> > > partitions of the same table using a table partition decorator?)
> >> > > I would expect #2 to perform strictly worse than #1, because it
> writes
> >> > the
> >> > > same amount of data but increases the number of BigQuery load jobs
> >> > involved
> >> > > (thus increases per-job overhead and consumes BigQuery quota).
> >> > >
> >> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com>
> >> wrote:
> >> > >
> >> > >> https://beam.apache.org/documentation/programming-guide/#partition
> >> > >>
> >> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> >> > >> <ki...@google.com.invalid> wrote:
> >> > >> > What do you mean by Beam partitions?
> >> > >> >
> >> > >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com>
> >> wrote:
> >> > >> >
> >> > >> >> by the way currently the performance on bigquery partitions is
> very
> >> > bad.
> >> > >> >> Is there a repository where i can test with 2.2.0?
> >> > >> >>
> >> > >> >> chaim
> >> > >> >>
> >> > >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax
> >> <relax@google.com.invalid
> >> > >
> >> > >> >> wrote:
> >> > >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug
> if
> >> > the
> >> > >> >> table
> >> > >> >> > containing the partitions is not pre created (fixed in 2.2.0).
> >> > >> >> >
> >> > >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <
> chaim@behalf.com>
> >> > >> wrote:
> >> > >> >> >
> >> > >> >> >> Hi,
> >> > >> >> >>
> >> > >> >> >>    Does BigQueryIO support Partitions when writing? will it
> >> > improve
> >> > >> my
> >> > >> >> >> performance?
> >> > >> >> >>
> >> > >> >> >>
> >> > >> >> >> chaim
> >> > >> >> >>
> >> > >> >>
> >> > >>
> >> >
> >>
>

Re: BigQueryIO Partitions

Posted by Chaim Turkel <ch...@behalf.com>.
you can see my job at:
https://console.cloud.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2017-09-26_03_17_44-4821512213867199289?project=ordinal-ember-163410


On Wed, Sep 27, 2017 at 10:47 PM, Reuven Lax <re...@google.com.invalid> wrote:
> There are a couple of options, and if you provide a job id (since you are
> using the Dataflow runner) we can better advise.
>
> If you are writing to more than 2000 partitions, this won't work - BigQuery
> has a hard quota of 1000 partition updates per table per day.
>
> If you have fewer than 1000 jobs, there are a few possibilities. It's
> possible that BigQuery is taking a while to schedule some of those jobs;
> they'll end up sitting in a queue waiting to be scheduled. We can look at
> one of the jobs in detail to see if that's happening. Eugene's suggestion
> of using your pipeline to load into a single table might be the best one.
> You can write the date into a separate column, and then write a shell
> script to copy each date to it's own partition (see
> https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
> for some examples).
>
> On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <
> kirpichov@google.com.invalid> wrote:
>
>> I see. Then Reuven's answer above applies.
>> Maybe you could write to a non-partitioned table, and then split it into
>> smaller partitioned tables. See https://stackoverflow.com/a/
>> 39001706/278042
>> <https://stackoverflow.com/a/39001706/278042> for a discussion of the
>> current options - granted, it seems like there currently don't exist very
>> good options for creating a very large number of table partitions from
>> existing data.
>>
>> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <ch...@behalf.com> wrote:
>>
>> > thank you for your detailed response.
>> > Currently i am a bit stuck.
>> > I need to migrate data from mongo to bigquery, we have about 1 terra
>> > of data. It is history data, so i want to use bigquery partitions.
>> > It seems that the io connector creates a job per partition so it takes
>> > a very long time, and i hit the quota in bigquery of the amount of
>> > jobs per day.
>> > I would like to use streaming but you cannot stream old data more than 30
>> > day
>> >
>> > So I thought of partitions to see if i can do more parraleism
>> >
>> > chaim
>> >
>> >
>> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
>> > <ki...@google.com.invalid> wrote:
>> > > Okay, I see - there's about 3 different meanings of the word
>> "partition"
>> > > that could have been involved here (BigQuery partitions,
>> runner-specific
>> > > bundles, and the Partition transform), hence my request for
>> > clarification.
>> > >
>> > > If you mean the Partition transform - then I'm confused what do you
>> mean
>> > by
>> > > BigQueryIO "supporting" it? The Partition transform takes a PCollection
>> > and
>> > > produces a bunch of PCollections; these are ordinary PCollection's and
>> > you
>> > > can apply any Beam transforms to them, and BigQueryIO.write() is no
>> > > exception to this - you can apply it too.
>> > >
>> > > To answer whether using Partition would improve your performance, I'd
>> > need
>> > > to understand exactly what you're comparing against what. I suppose
>> > you're
>> > > comparing the following:
>> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
>> > table
>> > > 2) Splitting a PCollection into several smaller PCollection's using
>> > > Partition, and applying BigQueryIO.write() to each of them, writing to
>> > > different tables I suppose? (or do you want to write to different
>> > BigQuery
>> > > partitions of the same table using a table partition decorator?)
>> > > I would expect #2 to perform strictly worse than #1, because it writes
>> > the
>> > > same amount of data but increases the number of BigQuery load jobs
>> > involved
>> > > (thus increases per-job overhead and consumes BigQuery quota).
>> > >
>> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com>
>> wrote:
>> > >
>> > >> https://beam.apache.org/documentation/programming-guide/#partition
>> > >>
>> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> > >> <ki...@google.com.invalid> wrote:
>> > >> > What do you mean by Beam partitions?
>> > >> >
>> > >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com>
>> wrote:
>> > >> >
>> > >> >> by the way currently the performance on bigquery partitions is very
>> > bad.
>> > >> >> Is there a repository where i can test with 2.2.0?
>> > >> >>
>> > >> >> chaim
>> > >> >>
>> > >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax
>> <relax@google.com.invalid
>> > >
>> > >> >> wrote:
>> > >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if
>> > the
>> > >> >> table
>> > >> >> > containing the partitions is not pre created (fixed in 2.2.0).
>> > >> >> >
>> > >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com>
>> > >> wrote:
>> > >> >> >
>> > >> >> >> Hi,
>> > >> >> >>
>> > >> >> >>    Does BigQueryIO support Partitions when writing? will it
>> > improve
>> > >> my
>> > >> >> >> performance?
>> > >> >> >>
>> > >> >> >>
>> > >> >> >> chaim
>> > >> >> >>
>> > >> >>
>> > >>
>> >
>>

Re: BigQueryIO Partitions

Posted by Reuven Lax <re...@google.com.INVALID>.
There are a couple of options, and if you provide a job id (since you are
using the Dataflow runner) we can better advise.

If you are writing to more than 2000 partitions, this won't work - BigQuery
has a hard quota of 1000 partition updates per table per day.

If you have fewer than 1000 jobs, there are a few possibilities. It's
possible that BigQuery is taking a while to schedule some of those jobs;
they'll end up sitting in a queue waiting to be scheduled. We can look at
one of the jobs in detail to see if that's happening. Eugene's suggestion
of using your pipeline to load into a single table might be the best one.
You can write the date into a separate column, and then write a shell
script to copy each date to its own partition (see
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
for some examples).
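
A minimal sketch of the staging-table half of that approach (the dataset,
table, and field names are assumptions); the per-day copy into partitions
then happens outside the pipeline, e.g. a bq query whose destination table is
my_dataset.events$YYYYMMDD with a WHERE filter on the date column, as on the
linked page.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.PCollection;

public class StagingTableWrite {
  // Hypothetical staging schema: the payload plus an explicit event_date
  // column that the later per-day copy keys on.
  static final TableSchema STAGING_SCHEMA = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("event_date").setType("DATE"),
      new TableFieldSchema().setName("payload").setType("STRING")));

  static void writeToStaging(PCollection<TableRow> rows) {
    rows.apply("LoadIntoStagingTable",
        BigQueryIO.writeTableRows()
            // A single destination table keeps the number of load jobs small,
            // instead of one load job per daily partition.
            .to("my-project:my_dataset.events_staging")
            .withSchema(STAGING_SCHEMA)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}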

On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <
kirpichov@google.com.invalid> wrote:

> I see. Then Reuven's answer above applies.
> Maybe you could write to a non-partitioned table, and then split it into
> smaller partitioned tables. See https://stackoverflow.com/a/
> 39001706/278042
> <https://stackoverflow.com/a/39001706/278042> for a discussion of the
> current options - granted, it seems like there currently don't exist very
> good options for creating a very large number of table partitions from
> existing data.
>
> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <ch...@behalf.com> wrote:
>
> > thank you for your detailed response.
> > Currently i am a bit stuck.
> > I need to migrate data from mongo to bigquery, we have about 1 terra
> > of data. It is history data, so i want to use bigquery partitions.
> > It seems that the io connector creates a job per partition so it takes
> > a very long time, and i hit the quota in bigquery of the amount of
> > jobs per day.
> > I would like to use streaming but you cannot stream old data more than 30
> > day
> >
> > So I thought of partitions to see if i can do more parraleism
> >
> > chaim
> >
> >
> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> > > Okay, I see - there's about 3 different meanings of the word
> "partition"
> > > that could have been involved here (BigQuery partitions,
> runner-specific
> > > bundles, and the Partition transform), hence my request for
> > clarification.
> > >
> > > If you mean the Partition transform - then I'm confused what do you
> mean
> > by
> > > BigQueryIO "supporting" it? The Partition transform takes a PCollection
> > and
> > > produces a bunch of PCollections; these are ordinary PCollection's and
> > you
> > > can apply any Beam transforms to them, and BigQueryIO.write() is no
> > > exception to this - you can apply it too.
> > >
> > > To answer whether using Partition would improve your performance, I'd
> > need
> > > to understand exactly what you're comparing against what. I suppose
> > you're
> > > comparing the following:
> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
> > table
> > > 2) Splitting a PCollection into several smaller PCollection's using
> > > Partition, and applying BigQueryIO.write() to each of them, writing to
> > > different tables I suppose? (or do you want to write to different
> > BigQuery
> > > partitions of the same table using a table partition decorator?)
> > > I would expect #2 to perform strictly worse than #1, because it writes
> > the
> > > same amount of data but increases the number of BigQuery load jobs
> > involved
> > > (thus increases per-job overhead and consumes BigQuery quota).
> > >
> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com>
> wrote:
> > >
> > >> https://beam.apache.org/documentation/programming-guide/#partition
> > >>
> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> > >> <ki...@google.com.invalid> wrote:
> > >> > What do you mean by Beam partitions?
> > >> >
> > >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com>
> wrote:
> > >> >
> > >> >> by the way currently the performance on bigquery partitions is very
> > bad.
> > >> >> Is there a repository where i can test with 2.2.0?
> > >> >>
> > >> >> chaim
> > >> >>
> > >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax
> <relax@google.com.invalid
> > >
> > >> >> wrote:
> > >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if
> > the
> > >> >> table
> > >> >> > containing the partitions is not pre created (fixed in 2.2.0).
> > >> >> >
> > >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com>
> > >> wrote:
> > >> >> >
> > >> >> >> Hi,
> > >> >> >>
> > >> >> >>    Does BigQueryIO support Partitions when writing? will it
> > improve
> > >> my
> > >> >> >> performance?
> > >> >> >>
> > >> >> >>
> > >> >> >> chaim
> > >> >> >>
> > >> >>
> > >>
> >
>

Re: BigQueryIO Partitions

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
I see. Then Reuven's answer above applies.
Maybe you could write to a non-partitioned table, and then split it into
smaller partitioned tables. See https://stackoverflow.com/a/39001706/278042
<https://stackoverflow.com/a/39001706/278042> for a discussion of the
current options - granted, it seems like there currently don't exist very
good options for creating a very large number of table partitions from
existing data.

On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <ch...@behalf.com> wrote:

> thank you for your detailed response.
> Currently i am a bit stuck.
> I need to migrate data from mongo to bigquery, we have about 1 terra
> of data. It is history data, so i want to use bigquery partitions.
> It seems that the io connector creates a job per partition so it takes
> a very long time, and i hit the quota in bigquery of the amount of
> jobs per day.
> I would like to use streaming but you cannot stream old data more than 30
> day
>
> So I thought of partitions to see if i can do more parraleism
>
> chaim
>
>
> On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
> > Okay, I see - there's about 3 different meanings of the word "partition"
> > that could have been involved here (BigQuery partitions, runner-specific
> > bundles, and the Partition transform), hence my request for
> clarification.
> >
> > If you mean the Partition transform - then I'm confused what do you mean
> by
> > BigQueryIO "supporting" it? The Partition transform takes a PCollection
> and
> > produces a bunch of PCollections; these are ordinary PCollection's and
> you
> > can apply any Beam transforms to them, and BigQueryIO.write() is no
> > exception to this - you can apply it too.
> >
> > To answer whether using Partition would improve your performance, I'd
> need
> > to understand exactly what you're comparing against what. I suppose
> you're
> > comparing the following:
> > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
> table
> > 2) Splitting a PCollection into several smaller PCollection's using
> > Partition, and applying BigQueryIO.write() to each of them, writing to
> > different tables I suppose? (or do you want to write to different
> BigQuery
> > partitions of the same table using a table partition decorator?)
> > I would expect #2 to perform strictly worse than #1, because it writes
> the
> > same amount of data but increases the number of BigQuery load jobs
> involved
> > (thus increases per-job overhead and consumes BigQuery quota).
> >
> > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com> wrote:
> >
> >> https://beam.apache.org/documentation/programming-guide/#partition
> >>
> >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> >> <ki...@google.com.invalid> wrote:
> >> > What do you mean by Beam partitions?
> >> >
> >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com> wrote:
> >> >
> >> >> by the way currently the performance on bigquery partitions is very
> bad.
> >> >> Is there a repository where i can test with 2.2.0?
> >> >>
> >> >> chaim
> >> >>
> >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <relax@google.com.invalid
> >
> >> >> wrote:
> >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if
> the
> >> >> table
> >> >> > containing the partitions is not pre created (fixed in 2.2.0).
> >> >> >
> >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com>
> >> wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >>
> >> >> >>    Does BigQueryIO support Partitions when writing? will it
> improve
> >> my
> >> >> >> performance?
> >> >> >>
> >> >> >>
> >> >> >> chaim
> >> >> >>
> >> >>
> >>
>

Re: BigQueryIO Partitions

Posted by Chaim Turkel <ch...@behalf.com>.
Thank you for your detailed response.
Currently I am a bit stuck.
I need to migrate data from MongoDB to BigQuery; we have about 1 terabyte
of data. It is historical data, so I want to use BigQuery partitions.
It seems that the IO connector creates a load job per partition, so it takes
a very long time, and I hit the BigQuery quota on the number of load
jobs per day.
I would like to use streaming inserts, but you cannot stream data older than
30 days.

So I thought of partitions to see if I can get more parallelism.
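
Since the pipeline itself isn't shown in the thread, here is a minimal sketch
of the read side under stated assumptions: the connection string, database,
collection, and the created_at field are all hypothetical, and the "date"
field it produces is what a partition-decorator write (as sketched earlier)
would key on.

import com.google.api.services.bigquery.model.TableRow;
import java.text.SimpleDateFormat;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.mongodb.MongoDbIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.bson.Document;

public class MongoToBigQueryRead {
  static PCollection<TableRow> readHistory(Pipeline p) {
    return p
        .apply("ReadFromMongo", MongoDbIO.read()
            .withUri("mongodb://mongo-host:27017")
            .withDatabase("history_db")
            .withCollection("events"))
        .apply("ToTableRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via((Document doc) -> new TableRow()
                // "date" (yyyyMMdd) is what the write side uses to pick a
                // partition decorator or partition column.
                .set("date", new SimpleDateFormat("yyyyMMdd")
                    .format(doc.getDate("created_at")))
                .set("payload", doc.toJson())));
  }
}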

chaim


On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
<ki...@google.com.invalid> wrote:
> Okay, I see - there's about 3 different meanings of the word "partition"
> that could have been involved here (BigQuery partitions, runner-specific
> bundles, and the Partition transform), hence my request for clarification.
>
> If you mean the Partition transform - then I'm confused what do you mean by
> BigQueryIO "supporting" it? The Partition transform takes a PCollection and
> produces a bunch of PCollections; these are ordinary PCollection's and you
> can apply any Beam transforms to them, and BigQueryIO.write() is no
> exception to this - you can apply it too.
>
> To answer whether using Partition would improve your performance, I'd need
> to understand exactly what you're comparing against what. I suppose you're
> comparing the following:
> 1) Applying BigQueryIO.write() to a PCollection, writing to a single table
> 2) Splitting a PCollection into several smaller PCollection's using
> Partition, and applying BigQueryIO.write() to each of them, writing to
> different tables I suppose? (or do you want to write to different BigQuery
> partitions of the same table using a table partition decorator?)
> I would expect #2 to perform strictly worse than #1, because it writes the
> same amount of data but increases the number of BigQuery load jobs involved
> (thus increases per-job overhead and consumes BigQuery quota).
>
> On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com> wrote:
>
>> https://beam.apache.org/documentation/programming-guide/#partition
>>
>> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> <ki...@google.com.invalid> wrote:
>> > What do you mean by Beam partitions?
>> >
>> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> by the way currently the performance on bigquery partitions is very bad.
>> >> Is there a repository where i can test with 2.2.0?
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid>
>> >> wrote:
>> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
>> >> table
>> >> > containing the partitions is not pre created (fixed in 2.2.0).
>> >> >
>> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com>
>> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >>    Does BigQueryIO support Partitions when writing? will it improve
>> my
>> >> >> performance?
>> >> >>
>> >> >>
>> >> >> chaim
>> >> >>
>> >>
>>

Re: BigQueryIO Partitions

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Okay, I see - there are about three different meanings of the word "partition"
that could have been involved here (BigQuery partitions, runner-specific
bundles, and the Partition transform), hence my request for clarification.

If you mean the Partition transform - then I'm confused about what you mean by
BigQueryIO "supporting" it. The Partition transform takes a PCollection and
produces a bunch of PCollections; these are ordinary PCollections, and you
can apply any Beam transforms to them, and BigQueryIO.write() is no
exception - you can apply it too.

To answer whether using Partition would improve your performance, I'd need
to understand exactly what you're comparing against what. I suppose you're
comparing the following:
1) Applying BigQueryIO.write() to a PCollection, writing to a single table
2) Splitting a PCollection into several smaller PCollections using
Partition, and applying BigQueryIO.write() to each of them, writing to
different tables, I suppose? (Or do you want to write to different BigQuery
partitions of the same table using a table partition decorator?)
I would expect #2 to perform strictly worse than #1, because it writes the
same amount of data but increases the number of BigQuery load jobs involved
(thus increasing per-job overhead and consuming BigQuery quota).
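
For reference, a minimal sketch of what option #2 would look like (the table
names, shard count, and partitioning key are assumptions made up for the
example):

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionThenWrite {
  // Option #2: split the PCollection with the Partition transform, then
  // apply BigQueryIO.write() to each piece.
  static void partitionThenWrite(PCollection<TableRow> rows, TableSchema schema) {
    final int numShards = 4;  // arbitrary example count

    PCollectionList<TableRow> shards = rows.apply(
        Partition.of(numShards,
            (Partition.PartitionFn<TableRow>) (row, n) ->
                // A hash keeps the sketch short; in practice you would key on
                // something meaningful, such as a date field.
                Math.floorMod(row.hashCode(), n)));

    // Same total data as a single write, but numShards times as many load jobs.
    for (int i = 0; i < shards.size(); i++) {
      shards.get(i).apply("WriteShard" + i,
          BigQueryIO.writeTableRows()
              .to("my-project:my_dataset.events_shard_" + i)
              .withSchema(schema)
              .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
              .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    }
  }
}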

On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <ch...@behalf.com> wrote:

> https://beam.apache.org/documentation/programming-guide/#partition
>
> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
> > What do you mean by Beam partitions?
> >
> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com> wrote:
> >
> >> by the way currently the performance on bigquery partitions is very bad.
> >> Is there a repository where i can test with 2.2.0?
> >>
> >> chaim
> >>
> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid>
> >> wrote:
> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
> >> table
> >> > containing the partitions is not pre created (fixed in 2.2.0).
> >> >
> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com>
> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >>    Does BigQueryIO support Partitions when writing? will it improve
> my
> >> >> performance?
> >> >>
> >> >>
> >> >> chaim
> >> >>
> >>
>

Re: BigQueryIO Partitions

Posted by Chaim Turkel <ch...@behalf.com>.
https://beam.apache.org/documentation/programming-guide/#partition

On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
<ki...@google.com.invalid> wrote:
> What do you mean by Beam partitions?
>
> On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com> wrote:
>
>> by the way currently the performance on bigquery partitions is very bad.
>> Is there a repository where i can test with 2.2.0?
>>
>> chaim
>>
>> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid>
>> wrote:
>> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
>> table
>> > containing the partitions is not pre created (fixed in 2.2.0).
>> >
>> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
>> >
>> >> Hi,
>> >>
>> >>    Does BigQueryIO support Partitions when writing? will it improve my
>> >> performance?
>> >>
>> >>
>> >> chaim
>> >>
>>

Re: BigQueryIO Partitions

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
What do you mean by Beam partitions?

On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <ch...@behalf.com> wrote:

> by the way currently the performance on bigquery partitions is very bad.
> Is there a repository where i can test with 2.2.0?
>
> chaim
>
> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid>
> wrote:
> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
> table
> > containing the partitions is not pre created (fixed in 2.2.0).
> >
> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
> >
> >> Hi,
> >>
> >>    Does BigQueryIO support Partitions when writing? will it improve my
> >> performance?
> >>
> >>
> >> chaim
> >>
>

Re: BigQueryIO Partitions

Posted by Chaim Turkel <ch...@behalf.com>.
By the way, currently the performance on BigQuery partitions is very bad.
Is there a repository where I can test with 2.2.0?

chaim

On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the table
> containing the partitions is not pre created (fixed in 2.2.0).
>
> On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>
>>    Does BigQueryIO support Partitions when writing? will it improve my
>> performance?
>>
>>
>> chaim
>>

Re: BigQueryIO Partitions

Posted by Chaim Turkel <ch...@behalf.com>.
No, I meant Beam partitions.

On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <re...@google.com.invalid> wrote:
> Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the table
> containing the partitions is not pre created (fixed in 2.2.0).
>
> On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:
>
>> Hi,
>>
>>    Does BigQueryIO support Partitions when writing? will it improve my
>> performance?
>>
>>
>> chaim
>>

Re: BigQueryIO Partitions

Posted by Reuven Lax <re...@google.com.INVALID>.
Do you mean BigQuery partitions? Yes; however, 2.1.0 has a bug if the table
containing the partitions is not pre-created (fixed in 2.2.0).
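
For the pre-created-table route, a minimal sketch using the BigQuery Java
client library (the dataset, table name, and schema are assumptions); the
pipeline can then write with CREATE_NEVER:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;

public class PreCreatePartitionedTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hypothetical schema; replace with the real one.
    Schema schema = Schema.of(
        Field.of("event_date", LegacySQLTypeName.DATE),
        Field.of("payload", LegacySQLTypeName.STRING));

    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
        .setSchema(schema)
        // Ingestion-time day partitioning; partitions are addressable as
        // "events$YYYYMMDD".
        .setTimePartitioning(TimePartitioning.of(TimePartitioning.Type.DAY))
        .build();

    bigquery.create(TableInfo.of(TableId.of("my_dataset", "events"), definition));
  }
}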

On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <ch...@behalf.com> wrote:

> Hi,
>
>    Does BigQueryIO support Partitions when writing? will it improve my
> performance?
>
>
> chaim
>