Posted to github@beam.apache.org by "Abacn (via GitHub)" <gi...@apache.org> on 2023/03/23 14:03:40 UTC

[GitHub] [beam] Abacn opened a new issue, #25944: [Task]: Improve the performance of Python Synthetic Source

Abacn opened a new issue, #25944:
URL: https://github.com/apache/beam/issues/25944

   ### What needs to happen?
   
   A Python IO performance test (https://github.com/apache/beam/issues/19084#issuecomment-1343373709) found that generating the synthetic source costs as much as writing to the sink. This prevents the benchmark from reporting accurate performance data.
   
   I ran a pipeline with only the synthetic source. The cloud profile shows:
   
   - assigning the random seed alone costs 30% of total CPU time
   - generating bytes costs 20% of total CPU time
   
   <img width="1695" alt="image" src="https://user-images.githubusercontent.com/8010435/227224125-846dcd7b-7afa-4aa7-b6ff-689a4f3782ad.png">
    
   This is because Python's built-in random generator uses a Mersenne Twister with a fairly large state ([doc](https://docs.python.org/3/library/random.html)), so assigning a seed is slow. Generating bytes is also slow because it involves many memory allocations. In contrast, Java's built-in random generator (used by the Java SDK's synthetic source) uses a linear congruential generator (LCG) as described by Donald Knuth ([doc](https://docs.oracle.com/javase/8/docs/api/java/util/Random.html)), which is much faster.
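   
   A rough way to see the two costs separately is to time reseeding and byte generation with the standard library. This is only a sketch: `seed_then_generate` and `measure` are hypothetical helper names, and the code uses `random.Random` directly rather than the actual Beam synthetic source, so absolute numbers will differ from the cloud profile.

```python
import random
import timeit


def seed_then_generate(seed: int, size: int = 1024) -> bytes:
    """Reseed a Mersenne Twister generator, then draw `size` random bytes."""
    rng = random.Random()
    rng.seed(seed)  # re-initializes the whole Mersenne Twister state
    return rng.getrandbits(8 * size).to_bytes(size, "little")


def measure(n: int = 10_000, size: int = 1024) -> dict:
    """Time `n` reseeds and `n` byte generations separately (in seconds)."""
    rng = random.Random()
    seed_s = timeit.timeit(lambda: rng.seed(12345), number=n)
    bytes_s = timeit.timeit(
        lambda: rng.getrandbits(8 * size).to_bytes(size, "little"), number=n
    )
    return {"seed_s": seed_s, "bytes_s": bytes_s}


if __name__ == "__main__":
    print(measure())
```

   The split between the two timings depends on the Python version and machine, so treat the output as indicative only.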
   
   I compared the performance of the built-in random byte generation against a cythonized LCG implementation by generating 1M random byte strings of 1024 bytes each. The latter shows a more than 10x performance gain (run time 10 s vs. < 1 s). This doubles the performance of the synthetic pipeline. We should be able to switch to the LCG.
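   
   For illustration, a minimal pure-Python sketch of such an LCG follows. It is not the cythonized implementation measured above: the class name `LcgGenerator` and its methods are hypothetical, and the constants are the 48-bit ones documented for `java.util.Random`. Without Cython this version will not show the quoted speedup; it only shows why seeding is cheap (one XOR and one mask instead of rebuilding a large state).

```python
# Constants documented for java.util.Random (48-bit LCG).
_MULTIPLIER = 0x5DEECE66D
_ADDEND = 0xB
_MASK = (1 << 48) - 1


class LcgGenerator:
    """Hypothetical pure-Python LCG; a sketch, not the Beam implementation."""

    def __init__(self, seed: int = 0) -> None:
        self.seed(seed)

    def seed(self, seed: int) -> None:
        # Seeding scrambles the seed with a single XOR and mask.
        self._state = (seed ^ _MULTIPLIER) & _MASK

    def next_int32(self) -> int:
        # Advance the 48-bit state and return its top 32 bits (unsigned,
        # whereas Java's nextInt() reinterprets them as a signed int).
        self._state = (self._state * _MULTIPLIER + _ADDEND) & _MASK
        return self._state >> 16

    def rand_bytes(self, length: int) -> bytes:
        # Concatenate 4-byte chunks, then trim to the requested length.
        out = bytearray()
        while len(out) < length:
            out += self.next_int32().to_bytes(4, "little")
        return bytes(out[:length])


if __name__ == "__main__":
    gen = LcgGenerator(seed=42)
    print(len(gen.rand_bytes(1024)))
```

   The same seed always yields the same bytes, which is the property a synthetic source needs for reproducible records.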
   
   Once this is done, the Python synthetic pipeline will have minimal overhead beyond generating the bytes themselves, and can then be used to benchmark the performance of SDF.
   
   
   
   
   
   ### Issue Priority
   
   Priority: 3 (nice-to-have improvement)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] Abacn closed issue #25944: [Task]: Improve the performance of Python Synthetic Source

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn closed issue #25944: [Task]: Improve the performance of Python Synthetic Source
URL: https://github.com/apache/beam/issues/25944




[GitHub] [beam] lostluck commented on issue #25944: [Task]: Improve the performance of Python Synthetic Source

Posted by "lostluck (via GitHub)" <gi...@apache.org>.
lostluck commented on issue #25944:
URL: https://github.com/apache/beam/issues/25944#issuecomment-1496353458

   This is also true for the Go SDK's load tests. We would largely be best off dictating the random generation algorithm, so that the synthetic sources are consistent as performance measures.
   
   Switching to a cheaper RNG with lower-quality randomness is among many changes to improve the load test metrics for the Go SDK:
   https://github.com/apache/beam/pull/17698/files

