Posted to issues@beam.apache.org by "Chamikara Jayalath (JIRA)" <ji...@apache.org> on 2019/01/10 16:28:00 UTC
[jira] [Resolved] (BEAM-6064) Python BigQuery performance much worse than Java
[ https://issues.apache.org/jira/browse/BEAM-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chamikara Jayalath resolved BEAM-6064.
--------------------------------------
Resolution: Fixed
Fix Version/s: 2.9.0
> Python BigQuery performance much worse than Java
> ------------------------------------------------
>
> Key: BEAM-6064
> URL: https://issues.apache.org/jira/browse/BEAM-6064
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.8.0
> Reporter: Jan Kuipers
> Assignee: Chamikara Jayalath
> Priority: Major
> Fix For: 2.9.0
>
> Attachments: results-java.png, results-python.png
>
>
> Reading from BigQuery in Python seems to be much slower than in Java.
> To reproduce this, I ran the following two programs on Google Cloud. Both read the weights from the public dataset "natality" and output the 100 largest weights.
> Python:
> {code:python}
> # <cut imports>
> options = PipelineOptions()
> options.view_as(StandardOptions).runner = 'DataflowRunner'
> # <cut more options>
> pipeline = Pipeline(options=options)
> (pipeline
>  | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
>  | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
>  | 'Top' >> beam.combiners.Top.Largest(100)
>  | 'MapToString' >> beam.Map(lambda elem: str(elem))
>  | 'Write' >> beam.io.WriteToText("<output-file>"))
> pipeline.run()
> {code}
> Java:
> {code:java}
> // <cut imports>
> public class Natality {
>   public static void main(String[] args) {
>     DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
>     options.setRunner(DataflowRunner.class);
>     // <cut more options>
>
>     Pipeline pipeline = Pipeline.create(options);
>     pipeline.apply("Read", BigQueryIO.readTableRows()
>             .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
>         .apply("MapToDouble", MapElements
>             .into(TypeDescriptors.doubles())
>             .via(row -> {
>               Object obj = row.get("weight_pounds");
>               return (obj == null ? 0.0 : (Double) obj);
>             }))
>         .apply("Top", Top.largest(100))
>         .apply("MapToString", MapElements
>             .into(TypeDescriptors.strings())
>             .via(weight -> weight.toString()))
>         .apply("Write", TextIO.write().to("<output-file>"));
>     pipeline.run().waitUntilFinish();
>   }
> }
> {code}
> The "<cut more options>" placeholders stand for basic options such as project, job name, and temp location. Both programs produce identical outputs.
> Running these programs launches a Dataflow job on Google Cloud with the following results (data from the Google Cloud Platform web interface; screenshots attached).
> Python:
> {noformat}
> Read         Succeeded  1 hr 40 min 40 sec
> MapToFloat   Succeeded  2 min 43 sec
> Top          Succeeded  5 min 25 sec
> MapToString  Succeeded  0 sec
> Write        Succeeded  3 sec
> {noformat}
> Java:
> {noformat}
> Read         Succeeded  4 min 45 sec
> MapToDouble  Succeeded  45 sec
> Top          Succeeded  52 sec
> MapToString  Succeeded  0 sec
> Write        Succeeded  1 sec
> {noformat}
> As you can see, reading from BigQuery takes an enormous performance hit in Python: 1 hr 40 min versus less than 5 minutes in Java.
> Furthermore, the other standard operations (like Top) are also much slower in Python than in Java.
>
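For a sense of scale, the step timings quoted in the report can be turned into per-step ratios. The following is a minimal sketch that assumes the durations are exactly as listed in the two tables; the names `to_seconds` and `slowdown` are hypothetical helpers for this illustration and are not part of Beam. The MapToString step is left out because both SDKs report 0 sec for it.

```python
import re

# Step durations copied from the tables in the report (Dataflow web UI).
# "Map" stands for MapToFloat (Python) and MapToDouble (Java).
PYTHON_STEPS = {
    "Read": "1 hr 40 min 40 sec",
    "Map": "2 min 43 sec",
    "Top": "5 min 25 sec",
    "Write": "3 sec",
}
JAVA_STEPS = {
    "Read": "4 min 45 sec",
    "Map": "45 sec",
    "Top": "52 sec",
    "Write": "1 sec",
}

UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def to_seconds(duration):
    """Convert a string such as '1 hr 40 min 40 sec' to a number of seconds."""
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def slowdown(step):
    """Return how many times longer the Python step took than the Java step."""
    return to_seconds(PYTHON_STEPS[step]) / to_seconds(JAVA_STEPS[step])


for step in PYTHON_STEPS:
    print(f"{step}: Python took {slowdown(step):.1f}x as long as Java")
```

On the reported numbers, the Read step accounts for most of the gap, at roughly 21 times the Java duration, while the remaining steps are slower by single-digit factors.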
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)