Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 23:03:03 UTC

[GitHub] [beam] kennknowles opened a new issue, #19239: Python BigQuery performance much worse than Java

kennknowles opened a new issue, #19239:
URL: https://github.com/apache/beam/issues/19239

   The performance of reading from BigQuery in Python seems to be much worse than the performance of it in Java.
   
   To reproduce this, I ran the following two programs on Google Cloud. Each reads the weights from the public data set "natality" and outputs the 100 largest weights.
   
   Python:
   ```
   
   # <cut imports>
   
   options = PipelineOptions()
   options.view_as(StandardOptions).runner = 'DataflowRunner'
   # <cut more options>
   
   pipeline = Pipeline(options=options)
   (pipeline
       | 'Read' >> beam.io.Read(beam.io.BigQuerySource(
           query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
       | 'MapToFloat' >> beam.Map(lambda elem: elem['weight_pounds'])
       | 'Top' >> beam.combiners.Top.Largest(100)
       | 'MapToString' >> beam.Map(lambda elem: str(elem))
       | 'Write' >> beam.io.WriteToText("<output-file>"))
   
   pipeline.run()
   
   ```
   
    Java:
   ```
   
   // <cut imports>
   
   public class Natality {
       public static void main(String[] args) {
           DataflowPipelineOptions options =
               PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
           options.setRunner(DataflowRunner.class);
           // <cut more options>
   
           Pipeline pipeline = Pipeline.create(options);
   
           pipeline.apply("Read", BigQueryIO.readTableRows()
                   .fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]"))
               .apply("MapToDouble", MapElements
                   .into(TypeDescriptors.doubles())
                   .via(row -> {
                       Object obj = row.get("weight_pounds");
                       return (obj == null ? 0.0 : (Double) obj);
                   }))
               .apply("Top", Top.largest(100))
               .apply("MapToString", MapElements
                   .into(TypeDescriptors.strings())
                   .via(weight -> weight.toString()))
               .apply("Write", TextIO.write().to("<output-file>"));
   
           pipeline.run().waitUntilFinish();
       }
   }
   
   ```
   
   The "<cut more options>" placeholders stand for basic options like project, job name, temp location, etc. Both programs produce identical outputs.
   
   Running these programs launches a Dataflow job on Google Cloud with the following results (data from the Google Cloud Platform web interface; screenshots attached).
   
   Python:
   ```
   
   Read Succeeded 1 hr 40 min 40 sec
   MapToFloat Succeeded 2 min 43 sec
   Top Succeeded 5 min 25 sec
   MapToString Succeeded 0 sec
   Write Succeeded 3 sec
   ```
   
   Java:
   ```
   
   Read Succeeded 4 min 45 sec
   MapToDouble Succeeded 45 sec
   Top Succeeded 52 sec
   MapToString Succeeded 0 sec
   Write Succeeded 1 sec
   
   ```
   
   As you can see, reading from BigQuery is enormously slower in Python: 1 hr 40 min versus less than 5 minutes in Java.
   
   Furthermore, the other standard operations (like Top) are also much slower in Python than in Java.
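
To put numbers on the gap, the stage durations listed above can be converted to seconds and compared. This is only a rough sketch: the helper names are made up for illustration, the "Map" key pairs MapToFloat with MapToDouble, and MapToString is left out because both runs report 0 sec.

```python
import re

# Seconds per unit, as the units appear in the Dataflow stage listing.
_UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}


def to_seconds(duration: str) -> int:
    """Convert a string such as '1 hr 40 min 40 sec' to whole seconds."""
    parts = re.findall(r"(\d+)\s*(hr|min|sec)", duration)
    return sum(int(value) * _UNIT_SECONDS[unit] for value, unit in parts)


# Stage durations copied from the two listings in this report.
python_stages = {
    "Read": "1 hr 40 min 40 sec",
    "Map": "2 min 43 sec",
    "Top": "5 min 25 sec",
    "Write": "3 sec",
}
java_stages = {
    "Read": "4 min 45 sec",
    "Map": "45 sec",
    "Top": "52 sec",
    "Write": "1 sec",
}

# Ratio of Python stage time to Java stage time, per stage.
ratios = {
    stage: to_seconds(python_stages[stage]) / to_seconds(java_stages[stage])
    for stage in python_stages
}

for stage, ratio in ratios.items():
    print(f"{stage}: Python took {ratio:.1f}x as long as Java")
```

With the figures above this works out to roughly 21x for Read, 3.6x for the map step, a bit over 6x for Top, and 3x for Write. The per-stage times are not necessarily additive, so they should not be summed into a total job time.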
   
    
   
   Imported from Jira [BEAM-6064](https://issues.apache.org/jira/browse/BEAM-6064). Original Jira may contain additional context.
   Reported by: jankuipers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] kennknowles commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1156660024

   A factor of 10 in CPU and memory is pretty bad but also not totally unheard of. I would suggest framing this as simply "trying to make BigQueryIO cheaper in Python". At some point, will using xlang be the best option?




[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1159353731

   btw, I also tried converting the code to Scala using Spotify Scio. The performance is identical to Java, if anyone is interested.
   
   Apache Beam SDK for Java 2.38.0
   
   <img width="285" alt="Screen Shot 2022-06-17 at 8 40 31 PM" src="https://user-images.githubusercontent.com/758321/174421548-c19ae167-47b0-459c-ab42-df4d9bf83a27.png">
   
   scala code:
   
   ```
   @BigQueryType.fromQuery("SELECT weight_pounds FROM [bigquery-public-data:samples.natality]")
     class Row
   
     @BigQueryType.toTable
     case class Result(weight_pounds: Double)
   
     def pipeline(cmdlineArgs: Array[String]): ScioContext = {
       val (sc, args) = ContextAndArgs(cmdlineArgs)
   
       sc.typedBigQuery[Row]()
         .map(r => r.weight_pounds.getOrElse(0.0)) // return 0 if weight_pounds is None
         .top(100)     // select top 100. this returns a SCollection[Iterable[Double]]
         .flatMap(x => x)
         .map(x => Result(x))
         .saveAsTypedBigQueryTable(
           Table.Spec(args("output")),
           writeDisposition = WRITE_TRUNCATE,
           createDisposition = CREATE_IF_NEEDED
         )
   
       sc
     }
   ```
   




[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1156916204

   thanks @kennknowles 




[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1155805410

   if it helps, we ran the python code today and in our case it completed in 13 mins. the original code gives errors and we had to make modifications to it. using Apache Beam Python 3.7 SDK 2.39.0:
   
   ```
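   import argparse
   
   import apache_beam as beam
   from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
   
   parser = argparse.ArgumentParser()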
   known_args, pipeline_args = parser.parse_known_args()
   print(pipeline_args)
   # We use the save_main_session option because one or more DoFn's in this
   # workflow rely on global context (e.g., a module imported at module level).
   pipeline_options = PipelineOptions(pipeline_args)
   pipeline_options.view_as(SetupOptions).save_main_session = True
   
   pipeline = beam.Pipeline(options=pipeline_options)
   (pipeline
       | 'Read' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT weight_pounds FROM [bigquery-public-data:samples.natality]'))
       | 'MapToFloat' >> beam.Map(lambda elem: 0 if elem['weight_pounds'] is None else elem['weight_pounds'])
       | 'Top' >> beam.combiners.Top.Largest(100)
       | 'MapToString' >> beam.Map(lambda elem: str(elem))
       | 'Write' >> beam.io.WriteToText("<output-file>"))
   
   # When an Apache Beam Python program runs a pipeline on a service such as Dataflow, it is typically executed asynchronously. 
   # To block until pipeline completion, use the wait_until_finish() method of the PipelineResult object, returned from the run() method of the runner.
   pipeline.run().wait_until_finish()
   ```
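
One likely reason the unmodified lambda runs into trouble (an assumption, since the actual errors are not quoted here) is that some rows have no `weight_pounds`, and Python 3 refuses to order `None` against a float. The sketch below shows this without Beam, using `heapq.nlargest` as a stand-in for `Top.Largest`; the sample rows and the `weight_or_zero` helper are hypothetical.

```python
import heapq


def weight_or_zero(row):
    """Treat a missing weight as 0.0, like the null check in the Java version."""
    weight = row["weight_pounds"]
    return 0.0 if weight is None else weight


# Hypothetical sample rows; one has no weight recorded.
rows = [
    {"weight_pounds": 7.5},
    {"weight_pounds": None},
    {"weight_pounds": 6.2},
    {"weight_pounds": 8.1},
]

# Without the None check, ordering comparisons fail in Python 3.
try:
    heapq.nlargest(2, (row["weight_pounds"] for row in rows))
    raw_values_comparable = True
except TypeError:
    raw_values_comparable = False

# With the check, the largest weights come out in descending order.
top_two = heapq.nlargest(2, (weight_or_zero(row) for row in rows))

print(raw_values_comparable, top_two)
```

Mapping missing values to zero keeps them out of the top 100 as long as at least 100 real weights exist, which matches what the Java program does.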
   
   ```
   real	13m9.053s
   user	0m8.418s
   sys	0m2.222s
   ```
   
   <img width="244" alt="Screen Shot 2022-06-14 at 4 27 26 PM" src="https://user-images.githubusercontent.com/758321/173705791-45cf7817-1350-42ad-82a8-28fcf7bf60c2.png">
   




[GitHub] [beam] kennknowles commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1156885794

   The PTransforms in a graph contain "user code" in the form of Java and Python DoFns (things like BoundedSource and UnboundedSource also are DoFns when you get to execution). A connector like BigQueryIO is "just" a PTransform, in other words a small subgraph that performs the data reading.
   
   For most connectors, we expect to have just one implementation. Usually this will probably be Java. Other SDKs will use the connector via Beam's portability APIs. But some connectors are implemented more than once, in different SDKs. For example FileIO is implemented in all SDKs largely to get the SDK started. Another example is BigQueryIO which is implemented in both Java and Python. So the DoFns that are executed in each case are different. The Python implementation of BigQueryIO is also missing features that are available in Java.
   
   Hope that helps!




[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1156782591

   Can someone explain why the performance depends on the SDK chosen - Java vs Python? I am not familiar with internals of Beam and thought no matter what SDK one chooses, internally all of them would make HTTP calls to the backend and the backend would perform same way.




[GitHub] [beam] siddjain commented on issue #19239: Python BigQuery performance much worse than Java

Posted by GitBox <gi...@apache.org>.
siddjain commented on issue #19239:
URL: https://github.com/apache/beam/issues/19239#issuecomment-1155880557

   I take it back. The wall clock time of python was 13mins but the total vCPU and memory time is much higher (10x difference).
   
   Python (Apache Beam Python 3.7 SDK 2.39.0):
   
   <img width="304" alt="Screen Shot 2022-06-14 at 6 37 34 PM" src="https://user-images.githubusercontent.com/758321/173717775-cada2b06-01f0-4a07-85f9-3a5037463cf2.png">
   
   
   Java (Apache Beam SDK for Java 2.39.0):
   
   <img width="293" alt="Screen Shot 2022-06-14 at 6 37 02 PM" src="https://user-images.githubusercontent.com/758321/173717736-3e703817-f186-47aa-91f1-06acfb96fabb.png">
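
For anyone wondering how a job can finish in similar wall-clock time yet use far more resources: aggregate vCPU time sums over all workers, so it grows with the size of the worker pool even when elapsed time stays put. A minimal sketch with invented numbers (the worker counts and machine sizes below are hypothetical, not read from the screenshots):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerPhase:
    """A stretch of a job during which the worker pool size is constant."""
    workers: int
    vcpus_per_worker: int
    minutes: float


def wall_clock_minutes(phases: List[WorkerPhase]) -> float:
    """Elapsed time: phases run one after another."""
    return sum(phase.minutes for phase in phases)


def vcpu_hours(phases: List[WorkerPhase]) -> float:
    """Aggregate vCPU time: every vCPU on every worker counts."""
    vcpu_minutes = sum(
        phase.workers * phase.vcpus_per_worker * phase.minutes
        for phase in phases
    )
    return vcpu_minutes / 60.0


# Two hypothetical jobs with the same elapsed time but different pool sizes.
small_pool_job = [WorkerPhase(workers=1, vcpus_per_worker=4, minutes=13)]
large_pool_job = [WorkerPhase(workers=10, vcpus_per_worker=4, minutes=13)]

for name, job in [("small pool", small_pool_job), ("large pool", large_pool_job)]:
    print(f"{name}: {wall_clock_minutes(job):.0f} min wall clock, "
          f"{vcpu_hours(job):.2f} vCPU hr")
```

Both hypothetical jobs take 13 minutes, but the second uses ten times the vCPU hours. Whether that is what happened in the runs above cannot be told from this sketch alone.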
   

