Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/14 23:30:03 UTC
[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java
siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1155805410
If it helps, we ran the Python code today, and in our case it completed in 13 minutes. The original code gives errors, so we had to modify it. We used Apache Beam Python 3.7 SDK 2.39.0:
```
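import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

parser = argparse.ArgumentParser()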
known_args, pipeline_args = parser.parse_known_args()
print(pipeline_args)
# We use the save_main_session option because one or more DoFn's in this
# workflow rely on global context (e.g., a module imported at module level).
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True
pipeline = beam.Pipeline(options=pipeline_options)
(pipeline
 # The bracketed [project:dataset.table] form is legacy SQL syntax.
 | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
 | 'MapToFloat' >> beam.Map(lambda elem: 0 if elem['weight_pounds'] is None else elem['weight_pounds'])
 | 'Top' >> beam.combiners.Top.Largest(100)
 | 'MapToString' >> beam.Map(lambda elem: str(elem))
 | 'Write' >> beam.io.WriteToText("<output-file>"))
# When an Apache Beam Python program runs a pipeline on a service such as Dataflow, it is typically executed asynchronously.
# To block until pipeline completion, use the wait_until_finish() method of the PipelineResult object, returned from the run() method of the runner.
pipeline.run().wait_until_finish()
```
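For anyone who wants to sanity-check what the transforms compute without launching a job, here is a rough plain-Python sketch of the same logic. It is not Beam code: the `rows` list is made-up sample data, `top_weights` is a hypothetical helper name, and `heapq.nlargest` stands in for `Top.Largest`.

```python
import heapq


def top_weights(rows, n=100):
    """Local stand-in for the MapToFloat -> Top -> MapToString steps."""
    # MapToFloat: treat a missing weight as 0, as the pipeline does.
    weights = (0 if row['weight_pounds'] is None else row['weight_pounds']
               for row in rows)
    # Top: keep the n largest values, largest first.
    largest = heapq.nlargest(n, weights)
    # MapToString: the top-n result is a single list, so the output is one line.
    return str(largest)


# Hypothetical sample rows, shaped like the dicts the BigQuery read yields.
rows = [
    {'weight_pounds': 7.5},
    {'weight_pounds': None},
    {'weight_pounds': 8.25},
    {'weight_pounds': 6.0},
]
print(top_weights(rows, n=3))  # prints [8.25, 7.5, 6.0]
```

The point of the sketch is only that the output file should contain a single line holding a list of up to 100 numbers, not 100 separate lines.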
```
real 13m9.053s
user 0m8.418s
sys 0m2.222s
```
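A note on reading those numbers: `real` is wall-clock time, while `user` and `sys` are CPU time spent by the local process. The small sketch below converts the values to seconds and compares them; `to_seconds` is my own hypothetical helper, not part of any tool.

```python
import re


def to_seconds(value):
    """Convert a shell `time` value such as '13m9.053s' to seconds."""
    match = re.fullmatch(r'(\d+)m(\d+(?:\.\d+)?)s', value)
    if match is None:
        raise ValueError(f'unrecognized time value: {value!r}')
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


real = to_seconds('13m9.053s')                          # wall-clock time
cpu = to_seconds('0m8.418s') + to_seconds('0m2.222s')   # user + sys
print(f'wall clock: {real:.3f}s, local CPU: {cpu:.3f}s ({cpu / real:.1%})')
```

Local CPU time is only a small fraction of the 13 minutes, which suggests the launching process spent most of the run blocked in `wait_until_finish()` rather than doing work itself.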
<img width="244" alt="Screen Shot 2022-06-14 at 4 27 26 PM" src="https://user-images.githubusercontent.com/758321/173705791-45cf7817-1350-42ad-82a8-28fcf7bf60c2.png">