Posted to user@beam.apache.org by Unais Thachuparambil <un...@careem.com> on 2018/01/16 09:15:20 UTC

Why is the beam.io.BigQuerySource() transform taking so long to read a BigQuery table into a PCollection?

I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB) using
beam.io.BigQuerySource() by running a simple query:

 """
SELECT
  filed1,
  filed2,
  filed3,
  filed4,
  filed5,
  filed6
FROM
  TABLE_DATE_RANGE([dataset:table_name_], TIMESTAMP('{start_date}'),
TIMESTAMP('{end_date}'))
WHERE
  filed1 IS NOT NULL
"""

After that, I'm partitioning the source data based on the field2 date and
converting it into date-partitioned PCollections.
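
For reference, this is roughly how the pipeline is wired up (a simplified
sketch; the field names, date range, partition count, and date format below
are placeholders rather than the real values):

import apache_beam as beam

START_DATE, END_DATE = '2017-07-20', '2018-01-15'
NUM_DAYS = 180

QUERY = """
SELECT field1, field2, field3, field4, field5, field6
FROM TABLE_DATE_RANGE([dataset:table_name_],
                      TIMESTAMP('{start}'), TIMESTAMP('{end}'))
WHERE field1 IS NOT NULL
""".format(start=START_DATE, end=END_DATE)


def day_index(row, num_partitions):
    # Map a row to a partition index from its field2 date
    # (assumes field2 comes back as a 'YYYY-MM-DD...' string).
    from datetime import date, datetime
    day = datetime.strptime(row['field2'][:10], '%Y-%m-%d').date()
    offset = (day - date(2017, 7, 20)).days
    return min(max(offset, 0), num_partitions - 1)


with beam.Pipeline() as p:
    rows = p | 'BQRead' >> beam.io.Read(beam.io.BigQuerySource(query=QUERY))
    daily = rows | 'PartitionByDate' >> beam.Partition(day_index, NUM_DAYS)
    # daily[i] is the PCollection for day i; the per-day transforms hang off these.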

But while monitoring the Dataflow console I noticed that the BQRead
operation takes more than 1 hr 40 min of the 2 hr 54 min total execution time.

Why is the BQ I/O read taking so long? Is there any method in Dataflow
(I'm using the Python API) to speed up this process?

How can I reduce the read I/O execution time?

A screenshot of the graph is attached (the time shown on the graph is
wrong; it took 2 hr 54 min to finish).



Re: Why is the beam.io.BigQuerySource() transform taking so long to read a BigQuery table into a PCollection?

Posted by Eugene Kirpichov <ki...@google.com>.
I'm looking at this job and it appears to apply hundreds of transforms
after reading from BigQuery inside the BQRead step, and that's what's
taking the time. If you suspect that the problem is still in pure reading
from BigQuery, can you simplify the pipeline so that it only uses
beam.io.BigQuerySource() and nothing else, so we can look at just the
performance of that?
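
Something like this minimal sketch would be enough to isolate the read (the
query placeholder below is hypothetical; use the same query as the real job):

import apache_beam as beam

# Placeholder: substitute the real query from the production pipeline.
QUERY = "SELECT field1 FROM [dataset:table_name_20180115]"

with beam.Pipeline() as p:
    (p
     | 'BQRead' >> beam.io.Read(beam.io.BigQuerySource(query=QUERY))
     # Optional: a trivial count just to confirm how many rows came back.
     | 'Count' >> beam.combiners.Count.Globally())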

On Tue, Jan 16, 2018 at 10:56 PM Unais Thachuparambil <
unais.thachuparambil@careem.com> wrote:

> This is my Dataflow job ID: 2018-01-15_03_45_25-16237456828769339983
>
> On Wed, Jan 17, 2018 at 12:10 AM, Eugene Kirpichov <ki...@google.com>
> wrote:
>
>> After the BigQuery export job, Beam also needs to ingest the results of
>> the export (Avro files); Beam uses the standard Python avro library for
>> that, which has notoriously awful performance; taking 2 hours to read 44 GB
>> of data would not be surprising. It's possible that this is the root of the
>> issue, but we'd need a job ID to tell for sure. There are several forks of
>> the avro library with better performance, but the forks are unmaintained.
>>
>> On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> The BQ Read step involves running a query job in BQ followed by an
>>> export job to export the resulting table to GCS. Is it possible that these
>>> jobs took a long time for some reason?
>>>
>>> The Dataflow job should log the BQ job IDs of these jobs, and it should be
>>> possible to check their status using the following command:
>>>
>>> bq show -j --project_id=<GCP project ID> <BQ job ID>
>>>
>>> Feel free to mention your job ID in the Dataflow SDK's Stack Overflow channel
>>> if you want the Dataflow team to take a look.
>>> https://stackoverflow.com/questions/tagged/dataflow
>>>
>>> - Cham
>>>
>>>
>>>
>>> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
>>> unais.thachuparambil@careem.com> wrote:
>>>
>>>>
>>>> I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB)
>>>> using beam.io.BigQuerySource() by running a simple query:
>>>>
>>>> """
>>>> SELECT
>>>>   field1,
>>>>   field2,
>>>>   field3,
>>>>   field4,
>>>>   field5,
>>>>   field6
>>>> FROM
>>>>   TABLE_DATE_RANGE([dataset:table_name_],
>>>>                    TIMESTAMP('{start_date}'), TIMESTAMP('{end_date}'))
>>>> WHERE
>>>>   field1 IS NOT NULL
>>>> """
>>>>
>>>> After that, I'm partitioning the source data based on the field2 date and
>>>> converting it into date-partitioned PCollections.
>>>>
>>>> But while monitoring the Dataflow console I noticed that the BQRead
>>>> operation takes more than 1 hr 40 min of the 2 hr 54 min total execution time.
>>>>
>>>> Why is the BQ I/O read taking so long? Is there any method in Dataflow
>>>> (I'm using the Python API) to speed up this process?
>>>>
>>>> How can I reduce the read I/O execution time?
>>>>
>>>> A screenshot of the graph is attached (the time shown on the graph is
>>>> wrong; it took 2 hr 54 min to finish).
>>>>
>>>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
>>>>
>>>
>

Re: Why is the beam.io.BigQuerySource() transform taking so long to read a BigQuery table into a PCollection?

Posted by Unais Thachuparambil <un...@careem.com>.
This is my Dataflow job ID: 2018-01-15_03_45_25-16237456828769339983

On Wed, Jan 17, 2018 at 12:10 AM, Eugene Kirpichov <ki...@google.com>
wrote:

> After the BigQuery export job, Beam also needs to ingest the results of
> the export (Avro files); Beam uses the standard Python avro library for
> that, which has notoriously awful performance; taking 2 hours to read 44 GB
> of data would not be surprising. It's possible that this is the root of the
> issue, but we'd need a job ID to tell for sure. There are several forks of
> the avro library with better performance, but the forks are unmaintained.
>
> On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> The BQ Read step involves running a query job in BQ followed by an
>> export job to export the resulting table to GCS. Is it possible that these
>> jobs took a long time for some reason?
>>
>> The Dataflow job should log the BQ job IDs of these jobs, and it should be
>> possible to check their status using the following command:
>>
>> bq show -j --project_id=<GCP project ID> <BQ job ID>
>>
>> Feel free to mention your job ID in the Dataflow SDK's Stack Overflow channel
>> if you want the Dataflow team to take a look.
>> https://stackoverflow.com/questions/tagged/dataflow
>>
>> - Cham
>>
>>
>>
>> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
>> unais.thachuparambil@careem.com> wrote:
>>
>>>
>>> I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB)
>>> using beam.io.BigQuerySource() by running a simple query:
>>>
>>> """
>>> SELECT
>>>   field1,
>>>   field2,
>>>   field3,
>>>   field4,
>>>   field5,
>>>   field6
>>> FROM
>>>   TABLE_DATE_RANGE([dataset:table_name_],
>>>                    TIMESTAMP('{start_date}'), TIMESTAMP('{end_date}'))
>>> WHERE
>>>   field1 IS NOT NULL
>>> """
>>>
>>> After that, I'm partitioning the source data based on the field2 date and
>>> converting it into date-partitioned PCollections.
>>>
>>> But while monitoring the Dataflow console I noticed that the BQRead
>>> operation takes more than 1 hr 40 min of the 2 hr 54 min total execution time.
>>>
>>> Why is the BQ I/O read taking so long? Is there any method in Dataflow
>>> (I'm using the Python API) to speed up this process?
>>>
>>> How can I reduce the read I/O execution time?
>>>
>>> A screenshot of the graph is attached (the time shown on the graph is
>>> wrong; it took 2 hr 54 min to finish).
>>>
>>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
>>>
>>

Re: Why is the beam.io.BigQuerySource() transform taking so long to read a BigQuery table into a PCollection?

Posted by Eugene Kirpichov <ki...@google.com>.
After the BigQuery export job, Beam also needs to ingest the results of the
export (Avro files); Beam uses the standard Python avro library for that,
which has notoriously awful performance; taking 2 hours to read 44 GB of
data would not be surprising. It's possible that this is the root of the
issue, but we'd need a job ID to tell for sure. There are several forks of
the avro library with better performance, but the forks are unmaintained.
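
One rough way to check whether Avro decoding itself is the bottleneck is to
copy one of the exported Avro shards down from the GCS temp location and time
it locally with the standard avro library, and for comparison with fastavro
(a separate, faster Avro library, used here only as a baseline; the file name
below is a placeholder):

import time

AVRO_SHARD = 'bq-export-shard.avro'  # placeholder: one shard copied down with gsutil cp


def time_standard_avro(path):
    # The standard avro library reader (the one Beam relies on, per the note above).
    from avro.datafile import DataFileReader
    from avro.io import DatumReader
    start = time.time()
    reader = DataFileReader(open(path, 'rb'), DatumReader())
    count = sum(1 for _ in reader)
    reader.close()
    return count, time.time() - start


def time_fastavro(path):
    import fastavro
    start = time.time()
    with open(path, 'rb') as f:
        count = sum(1 for _ in fastavro.reader(f))
    return count, time.time() - start


for name, fn in [('avro', time_standard_avro), ('fastavro', time_fastavro)]:
    records, secs = fn(AVRO_SHARD)
    print('%s: %d records in %.1f s' % (name, records, secs))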

On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <ch...@google.com>
wrote:

> The BQ Read step involves running a query job in BQ followed by an
> export job to export the resulting table to GCS. Is it possible that these
> jobs took a long time for some reason?
>
> The Dataflow job should log the BQ job IDs of these jobs, and it should be
> possible to check their status using the following command:
>
> bq show -j --project_id=<GCP project ID> <BQ job ID>
>
> Feel free to mention your job ID in the Dataflow SDK's Stack Overflow channel
> if you want the Dataflow team to take a look.
> https://stackoverflow.com/questions/tagged/dataflow
>
> - Cham
>
>
>
> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
> unais.thachuparambil@careem.com> wrote:
>
>>
>> I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB) using
>> beam.io.BigQuerySource() by running a simple query:
>>
>> """
>> SELECT
>>   field1,
>>   field2,
>>   field3,
>>   field4,
>>   field5,
>>   field6
>> FROM
>>   TABLE_DATE_RANGE([dataset:table_name_],
>>                    TIMESTAMP('{start_date}'), TIMESTAMP('{end_date}'))
>> WHERE
>>   field1 IS NOT NULL
>> """
>>
>> After that, I'm partitioning the source data based on the field2 date and
>> converting it into date-partitioned PCollections.
>>
>> But while monitoring the Dataflow console I noticed that the BQRead
>> operation takes more than 1 hr 40 min of the 2 hr 54 min total execution time.
>>
>> Why is the BQ I/O read taking so long? Is there any method in Dataflow
>> (I'm using the Python API) to speed up this process?
>>
>> How can I reduce the read I/O execution time?
>>
>> A screenshot of the graph is attached (the time shown on the graph is
>> wrong; it took 2 hr 54 min to finish).
>>
>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
>>
>

Re: Why is the beam.io.BigQuerySource() transform taking so long to read a BigQuery table into a PCollection?

Posted by Chamikara Jayalath <ch...@google.com>.
The BQ Read step involves running a query job in BQ followed by an export
job to export the resulting table to GCS. Is it possible that these jobs
took a long time for some reason?

The Dataflow job should log the BQ job IDs of these jobs, and it should be
possible to check their status using the following command:

bq show -j --project_id=<GCP project ID> <BQ job ID>
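
For example (the project ID and job ID below are placeholders):

bq ls -j -n 20 --project_id=my-gcp-project
bq show -j --project_id=my-gcp-project bqjob_r1a2b3c4d5e6f_0001

The first command lists recent BQ jobs so you can find the IDs; the show
output includes each job's start time and duration, which should make it
clear whether the query job or the export job was the slow part.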

Feel free to mention your job ID in the Dataflow SDK's Stack Overflow channel
if you want the Dataflow team to take a look.
https://stackoverflow.com/questions/tagged/dataflow

- Cham



On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
unais.thachuparambil@careem.com> wrote:

>
> I'm reading a date-sharded table from BigQuery (180 days, ~44.26 GB) using
> beam.io.BigQuerySource() by running a simple query:
>
> """
> SELECT
>   field1,
>   field2,
>   field3,
>   field4,
>   field5,
>   field6
> FROM
>   TABLE_DATE_RANGE([dataset:table_name_],
>                    TIMESTAMP('{start_date}'), TIMESTAMP('{end_date}'))
> WHERE
>   field1 IS NOT NULL
> """
>
> After that, I'm partitioning the source data based on the field2 date and
> converting it into date-partitioned PCollections.
>
> But while monitoring the Dataflow console I noticed that the BQRead
> operation takes more than 1 hr 40 min of the 2 hr 54 min total execution time.
>
> Why is the BQ I/O read taking so long? Is there any method in Dataflow
> (I'm using the Python API) to speed up this process?
>
> How can I reduce the read I/O execution time?
>
> A screenshot of the graph is attached (the time shown on the graph is
> wrong; it took 2 hr 54 min to finish).
>
> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
>