Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/14 23:30:03 UTC

[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java

siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1155805410

   If it helps, we ran the Python code today, and in our case it completed in 13 minutes. The original code gives errors, so we had to modify it. We used the Apache Beam Python 3.7 SDK 2.39.0:
   
   ```
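   import argparse

   import apache_beam as beam
   from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

   # No script-specific flags are defined; everything on the command line
   # is passed through to the pipeline as pipeline_args.
   parser = argparse.ArgumentParser()
   known_args, pipeline_args = parser.parse_known_args()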
   print(pipeline_args)
   # We use the save_main_session option because one or more DoFn's in this
   # workflow rely on global context (e.g., a module imported at module level).
   pipeline_options = PipelineOptions(pipeline_args)
   pipeline_options.view_as(SetupOptions).save_main_session = True
   
   pipeline = beam.Pipeline(options=pipeline_options)
   (pipeline
       | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
       | 'MapToFloat' >> beam.Map(lambda elem: 0 if elem['weight_pounds'] is None else elem['weight_pounds'])
       | 'Top' >> beam.combiners.Top.Largest(100)
       | 'MapToString' >> beam.Map(lambda elem: str(elem))
       | 'Write' >> beam.io.WriteToText("<output-file>"))
   
   # When an Apache Beam Python program runs a pipeline on a service such as
   # Dataflow, it is typically executed asynchronously. To block until pipeline
   # completion, use the wait_until_finish() method of the PipelineResult
   # object, returned from the run() method of the runner.
   pipeline.run().wait_until_finish()
   ```
   
   ```
   real	13m9.053s
   user	0m8.418s
   sys	0m2.222s
   ```
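
As a rough local sanity check of what the transform chain above computes (this was not part of the timed run), here is a minimal plain-Python sketch of the `MapToFloat`, `Top` and `MapToString` steps. The `top_weights` helper and the sample rows are hypothetical and only for illustration; the real job reads its rows from BigQuery. It assumes `Top.Largest(n)` emits a single list sorted in descending order, so the text output ends up being one stringified list.

```python
import heapq


def top_weights(rows, n=100):
    """Plain-Python stand-in for the MapToFloat -> Top -> MapToString steps."""
    # MapToFloat: treat a missing weight as 0, like the lambda in the pipeline.
    weights = (
        0 if row['weight_pounds'] is None else row['weight_pounds']
        for row in rows
    )
    # Top: keep the n largest values, largest first.
    largest = heapq.nlargest(n, weights)
    # MapToString: the whole list is converted to a single string.
    return str(largest)


# Hypothetical sample rows (assumption); not data from the natality table.
sample_rows = [
    {'weight_pounds': 7.5},
    {'weight_pounds': None},
    {'weight_pounds': 9.25},
    {'weight_pounds': 6.0},
]
print(top_weights(sample_rows, n=2))  # prints [9.25, 7.5]
```

Running something like this on a small sample is a quick way to confirm the `None` handling before paying for a full Dataflow run.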
   
   <img width="244" alt="Screen Shot 2022-06-14 at 4 27 26 PM" src="https://user-images.githubusercontent.com/758321/173705791-45cf7817-1350-42ad-82a8-28fcf7bf60c2.png">
   

