You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by GitBox <gi...@apache.org> on 2020/03/10 15:58:25 UTC

[GitHub] [beam] kamilwu opened a new pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

kamilwu opened a new pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092
 
 
   Beam was using `numpy.random.RandomState` during generation of Synthetic input, which, starting from NumPy 1.17, is considered as a legacy generator. This caused the slowdown on Python 3. The solution was using `random` module instead of Numpy.
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
   Python | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/) | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
   XLang | --- | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/)
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) 
   Portable | --- | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-610858295
 
 
   Hi @tvalentyn, sorry for the delay. Changes are ready, I will push them in a second.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597583707
 
 
   Run Python PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-598220736
 
 
   Alright. This is still quite huge slowdown. @tvalentyn Do you have any other ideas how to resolve the issue? Our primary goal is to move all performance tests to Python 3 eventually. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r405921080
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n),
+        *map(self.getrandbits, itertools.repeat(64, n)))[:length]
 
 Review comment:
   Since we don't need py2 compatibility anymore, consider using `to_bytes`. Here's an equivalent for chunk_size=8. It seems to be somewhat slower than current method (perhaps since I'm not using `map+repeat()`), but with larger `chuck_size`, seems to be more efficient. Large chunk size may be less efficient for short bytesequences.
   
   ```
   chunk_size_bytes = 8 // TBD - larger chunks seem to improve performance.
   chunk_size_bits = chunk_size_bytes * 8
   num_chunks = length // chunk_size_bytes + 1
   
   return b''.join([self.getrandbits(chunk_size_bits).to_bytes(chunk_size_bytes, sys.byteorder) for _ in range(num_chunks)])[:length]
   ```
   
   If you decide to keep current implementation - please add a comment explaining the mechanics for readers not familiar with this code (we generate 8-byte stings, and then fit them into a representation of C++'s long-long, which also takes up 8 bytes).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597510807
 
 
   8x slowdown is huge. Can you share synthetic source parameters used in this benchmark?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn merged pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource on Python 3

Posted by GitBox <gi...@apache.org>.
tvalentyn merged pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource on Python 3
URL: https://github.com/apache/beam/pull/11092
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r405711175
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n), *map(self.getrandbits, itertools.repeat(64,
+                                                                 n)))[:length]
+
+
+Generator = _Random
+
+# TODO: Remove this when Beam drops Python 2.
 
 Review comment:
   yes - https://issues.apache.org/jira/browse/BEAM-7372

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-611659283
 
 
   Thanks, @kamilwu!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597583630
 
 
   Run Python2_PVR_Flink PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r406144726
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n),
+        *map(self.getrandbits, itertools.repeat(64, n)))[:length]
 
 Review comment:
   I'll push a commit with that so you can see how it looks in the code.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597583567
 
 
   Looks like we have problem with "no space left" on apache-beam-jenkins-7 worker once again...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-605117264
 
 
   Also encountered a test internally that is slower on Py3 due to Synthetic source issue. Choosing an optimal implementation per python version should unblock this, so looking forward to your change.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597394055
 
 
   Thanks, @kamilwu. I'll have to take a closer look - I see that this pr causes a 8x slowdown in one of the benchmarks we run internally. I don't know yet where it comes from.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn edited a comment on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn edited a comment on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-598036518
 
 
   Yes, looks like it is 3x slowdown on Py2 on the internal test we have, not 8x.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn edited a comment on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn edited a comment on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-600364689
 
 
   Yes, I agree; we should be migrating the tests to Python 3. I think we'll pick the option that works well on Py3 and move on, or branch the generation on Py2 vs Py3, if we don't find one-fits-all solution. Afterall, the input generation should not be the slowest part of the pipeline.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-600364689
 
 
   Yes, I agree; we should be migrating the tests to Python 3. I think we'll pick the option that works well on Py3 and move or branch the generation on Py2 vs Py3, if we don't find one-fit-all solution. Afterall, the input generation should not be the slowest part of the pipeline.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597180032
 
 
   R: @tvalentyn 
   
   I did many tests and gathered the results in one single document: https://docs.google.com/document/d/1AegjUCc5w4B90_rvR8WAfIeL65PHUi4oGYkodgPS0o0/edit?usp=sharing. The previous solution (https://github.com/apache/beam/pull/10885) didn't work very well on big `element_size`, which was most probably the cause of failures and slowdowns.
   
   Although this PR is still a bit slower than we have now using numpy 1.16 and Python 2.7, I'd rather avoid downgrading numpy via supplying an additional requirements.txt file to a Dataflow job. According to this document (https://numpy.org/neps/nep-0029-deprecation_policy.html), the support for numpy 1.16 will be dropped on Jan 13, 2021 - and this looks like a temporary solution.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597953191
 
 
   It is testing a pipeline with a side input:
    
    '--sideinput_num_records=1',
    '--sideinput_key_size=100000',
    '--maininput_key_size=100000',
    '--maininput_num_records=100000',
   
   I reran the test, and the regression was on the order of 3x; running the tests again 5 times to see the difference. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r405397492
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n), *map(self.getrandbits, itertools.repeat(64,
+                                                                 n)))[:length]
+
+
+Generator = _Random
+
+# TODO: Remove this when Beam drops Python 2.
 
 Review comment:
   @tvalentyn Do we have a JIRA issue that accumulates all things that have to be done when Beam drops py2? I know there is a https://issues.apache.org/jira/browse/BEAM-5949, but this refers only to `__ne__`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-610644707
 
 
   Hi @kamilwu , do you still have time to work on this? Let me know if you need help. Thank you! 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-611537686
 
 
   Run Python2_PVR_Flink PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r406369262
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n),
+        *map(self.getrandbits, itertools.repeat(64, n)))[:length]
 
 Review comment:
   This is great. I really like how we simplified this code.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-611488835
 
 
   Run Python2_PVR_Flink PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-610865238
 
 
   I reran some tests and it seems the performance on py2 is almost the same as before. @tvalentyn Could you check if slowdown on one of your internal tests still exists?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597617171
 
 
   Run Python PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597180443
 
 
   pylint complains about using `map`, I'll try to fix this tomorrow

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-597606439
 
 
   Run Python2_PVR_Flink PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r406054568
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n), *map(self.getrandbits, itertools.repeat(64,
+                                                                 n)))[:length]
+
+
+Generator = _Random
+
+# TODO: Remove this when Beam drops Python 2.
 
 Review comment:
   Thanks, this is what I was looking for

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r406142245
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n),
+        *map(self.getrandbits, itertools.repeat(64, n)))[:length]
 
 Review comment:
   How about not using chunks at all? I did some tests again and, surprisingly, it looks like we don't need them. 
   Here's a test with a set of different chunk sizes:
   `for CHUNK_SIZE in {4,8,32,64}; do python -m timeit -s "import random; import sys; chunk_size=$CHUNK_SIZE; len=10; num_chunks=len//chunk_size+1" 'b"".join([random.getrandbits(chunk_size * 8).to_bytes(chunk_size, sys.byteorder) for _ in range(num_chunks)])[:len]'; done
   `
   
   Results:
   for len==10:
   200000 loops, best of 5: 1.62 usec per loop
   200000 loops, best of 5: 1.34 usec per loop
   200000 loops, best of 5: 1.02 usec per loop
   200000 loops, best of 5: 1.19 usec per loop
   
   for len==1000:
   5000 loops, best of 5: 87.7 usec per loop
   5000 loops, best of 5: 50.7 usec per loop
   20000 loops, best of 5: 16.7 usec per loop
   20000 loops, best of 5: 11 usec per loop
   
   And without chunks:
   `python -m timeit -s "import random; import sys; len=10" 'random.getrandbits(len * 8).to_bytes(len, sys.byteorder)'`
   
   for len==10:
   1000000 loops, best of 5: 358 nsec per loop
   
   for len==1000:
   50000 loops, best of 5: 4.5 usec per loop

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-598036518
 
 
   Yes, looks like it is 3x slowdown on Py2, not 8x. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on a change in pull request #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#discussion_r405711343
 
 

 ##########
 File path: sdks/python/apache_beam/testing/synthetic_pipeline.py
 ##########
 @@ -61,6 +65,35 @@
   np = None
 
 
+class _Random(Random):
+  """A subclass of `random.Random` from the Python Standard Library that
+  provides a method returning random bytes of arbitrary length.
+  """
+
+  # `numpy.random.RandomState` does not provide `random()` method, we keep this
+  # for compatibility reasons.
+  random_sample = Random.random
+
+  def bytes(self, length):
+    """Returns random bytes.
+
+    Args:
+      length (int): Number of random bytes.
+    """
+    n = length // 8 + 1
+    # pylint: disable=map-builtin-not-iterating
+    return struct.pack(
+        '{}Q'.format(n), *map(self.getrandbits, itertools.repeat(64,
+                                                                 n)))[:length]
+
+
+Generator = _Random
+
+# TODO: Remove this when Beam drops Python 2.
 
 Review comment:
   Commented on BEAM-5459, thank you!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
kamilwu commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-601218678
 
 
   I think branching the generation makes the most sense, since Py2 deprecation in Beam is closer than ever. I'll try to go back with a solution in the next week. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #11092: [BEAM-9085] Fix performance regression in SyntheticSource
URL: https://github.com/apache/beam/pull/11092#issuecomment-600365872
 
 
   What do you think about branching the generation based on Py version as a workaround until we deprecate Py2 support in Beam?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services