Posted to issues@beam.apache.org by "Chamikara Jayalath (JIRA)" <ji...@apache.org> on 2019/01/10 16:28:00 UTC
[jira] [Resolved] (BEAM-6064) Python BigQuery performance much worse than Java

     [ https://issues.apache.org/jira/browse/BEAM-6064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chamikara Jayalath resolved BEAM-6064.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.9.0

> Python BigQuery performance much worse than Java
> ------------------------------------------------
>
>                 Key: BEAM-6064
>                 URL: https://issues.apache.org/jira/browse/BEAM-6064
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.8.0
>            Reporter: Jan Kuipers
>            Assignee: Chamikara Jayalath
>            Priority: Major
>             Fix For: 2.9.0
>
>         Attachments: results-java.png, results-python.png
>
>
> Reading from BigQuery in Python appears to be much slower than in Java.
> To reproduce this, I ran the following two programs on Google Cloud. Both read the weights from the public data set "natality" and output the 100 largest weights.
> Python:
> {code:python}
> # <cut imports>
> options = PipelineOptions()
> options.view_as(StandardOptions).runner = 'DataflowRunner'
> # <cut more options>
> pipeline = Pipeline(options=options)
> (pipeline
>     | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
>     | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
>     | 'Top' >> beam.combiners.Top.Largest(100)
>     | 'MapToString' >> beam.Map(lambda elem: str(elem))
>     | 'Write' >> beam.io.WriteToText("<output-file>"))
> pipeline.run()
> {code}
> Java:
> {code:java}
> // <cut imports>
> public class Natality {
>     public static void main(String[] args) {
>         DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
>         options.setRunner(DataflowRunner.class);
>         // <cut more options>
>         
>         Pipeline pipeline = Pipeline.create(options);
>         pipeline.apply("Read", BigQueryIO.readTableRows()
>             .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
>             .apply("MapToDouble", MapElements
>                 .into(TypeDescriptors.doubles())
>                 .via(row -> {
>                      Object obj = row.get("weight_pounds");
>                      return (obj == null ? 0.0 : (Double) obj);
>                 }))
>             .apply("Top", Top.largest(100))
>             .apply("MapToString", MapElements
>                 .into(TypeDescriptors.strings())
>                 .via(weight -> weight.toString()))
>             .apply("Write", TextIO.write().to("<output-file>"));
>         pipeline.run().waitUntilFinish();
>     }
> }
> {code}
> The "<cut more options>" lines set basic options such as project, job name, and temp location. Both programs produce identical outputs.
> Running these programs launches a Dataflow job on Google Cloud with the following results (data from the Google Cloud Platform web interface; screenshots attached).
> Python:
> {noformat}
> Read Succeeded 1 hr 40 min 40 sec
> MapToFloat Succeeded 2 min 43 sec
> Top Succeeded 5 min 25 sec
> MapToString Succeeded 0 sec
> Write Succeeded 3 sec
> {noformat}
> Java:
> {noformat}
> Read Succeeded 4 min 45 sec
> MapToDouble Succeeded 45 sec
> Top Succeeded 52 sec
> MapToString Succeeded 0 sec
> Write Succeeded 1 sec
> {noformat}
> As you can see, reading from BigQuery is dramatically slower in Python: the Read step takes 1 hr 40 min, versus less than 5 minutes in Java.
> Furthermore, the other standard operations (such as Top) are also much slower in Python than in Java.
>  
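
To put numbers on the gap, the step timings reported above can be converted to seconds and compared per step. The following is a minimal sketch: the helper `to_seconds` and the dictionary names are illustrative and not part of the original report, and it assumes that MapToFloat (Python) and MapToDouble (Java) are equivalent steps.

```python
import re

# Seconds per unit, matching the unit labels used in the Dataflow UI timings.
UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def to_seconds(duration: str) -> int:
    """Convert a string such as '1 hr 40 min 40 sec' to whole seconds."""
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", duration)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


# Timings copied from the report. The map step is keyed as "Map" on both
# sides (an assumption that MapToFloat and MapToDouble are comparable).
python_steps = {
    "Read": "1 hr 40 min 40 sec",
    "Map": "2 min 43 sec",
    "Top": "5 min 25 sec",
}
java_steps = {
    "Read": "4 min 45 sec",
    "Map": "45 sec",
    "Top": "52 sec",
}

# Ratio of Python time to Java time for each step.
slowdown = {
    step: to_seconds(python_steps[step]) / to_seconds(java_steps[step])
    for step in python_steps
}

for step, ratio in slowdown.items():
    print(f"{step}: Python is {ratio:.1f}x slower than Java")
```

The steps that report only a few seconds (MapToString, Write) are left out because their timings are too small for a meaningful ratio.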



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)