Posted to issues@beam.apache.org by "Valentyn Tymofieiev (Jira)" <ji...@apache.org> on 2019/09/11 21:38:00 UTC
[jira] [Comment Edited] (BEAM-8198) Investigate possible performance regression of Wordcount 1GB batch benchmark on Py3.

    [ https://issues.apache.org/jira/browse/BEAM-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928020#comment-16928020 ] 

Valentyn Tymofieiev edited comment on BEAM-8198 at 9/11/19 9:37 PM:
--------------------------------------------------------------------

Based on the Jenkins job for the Wordcount 1 GB benchmark (https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py37), these runs can be reproduced as follows.

1) Clone PKB (PerfKitBenchmarker) and install its dependencies in a virtual environment with Python 2.7. It looks like we run PerfKitBenchmarker in a Python 2.7 environment, but the benchmark pipeline is triggered via Gradle and can use a different runtime.


{noformat}
git clone https://github.com/GoogleCloudPlatform/PerfKitBenchmarker.git
pip install -r ./PerfKitBenchmarker/requirements.txt

{noformat}
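The commands above do not show how the Python 2.7 virtual environment is created. A minimal sketch, assuming the virtualenv tool and a python2.7 interpreter are on PATH; the PKB_VENV variable and the create_pkb_venv helper are hypothetical names used only for illustration:

```shell
# Assumed location for the environment; any writable path works.
PKB_VENV="${PKB_VENV:-$HOME/pkb_py27_env}"

create_pkb_venv() {
  # Needs the virtualenv tool and a python2.7 interpreter on PATH.
  virtualenv --python=python2.7 "$PKB_VENV" || return 1
  # Activate it so that a later pip install targets this environment.
  . "$PKB_VENV/bin/activate"
}

# create_pkb_venv   # uncomment to run before the pip install step
```

The helper is left uncommented-out on purpose, so sourcing the snippet has no side effects until it is called.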

2) Clone the Beam SDK and build the SDK tarball at the desired commit.
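Step 2 could look roughly like the sketch below. It assumes the tarball is produced by running "python setup.py sdist" in sdks/python, which is consistent with the sdks/python/dist path used for BEAM_TARBALL in step 3. BEAM_COMMIT, build_beam_sdk_tarball and find_beam_tarball are hypothetical names, not part of Beam or PKB:

```shell
# Placeholder values: substitute the clone location and the commit under test.
BEAM_LOCATION="${BEAM_LOCATION:-/path/to/clone/of/beam}"
BEAM_COMMIT="${BEAM_COMMIT:-master}"

build_beam_sdk_tarball() {
  git clone https://github.com/apache/beam.git "$BEAM_LOCATION" || return 1
  (
    cd "$BEAM_LOCATION" || exit 1
    git checkout "$BEAM_COMMIT" || exit 1
    cd sdks/python || exit 1
    # Writes the source distribution into sdks/python/dist/.
    python setup.py sdist
  )
}

# Print the newest SDK tarball under a Beam clone ($1), or nothing if none exists.
find_beam_tarball() {
  ls -t "$1"/sdks/python/dist/apache-beam-*.tar.gz 2>/dev/null | head -n 1
}

# build_beam_sdk_tarball            # uncomment to run; needs network access
# find_beam_tarball "$BEAM_LOCATION"
```

The path printed by find_beam_tarball is what BEAM_TARBALL should point to.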

3) Configure the benchmark parameters:


{noformat}
PROJECT=my_gcp_project
PKB_DIR=/path/to/PerfKitBenchmarker
PKB_BQ_TABLE=bq_dataset_to_save_results.wordcount_py36_beam216_pkb_results
BEAM_LOCATION=/path/to/clone/of/beam
BEAM_TARBALL=$BEAM_LOCATION/sdks/python/dist/apache-beam-2.16.0.dev0.tar.gz
TEMP_LOCATION=gs://some/temp/location/

{noformat}
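Two of these values are easy to get wrong. The run command in step 4 builds the --output path by appending directly to ${TEMP_LOCATION}, so TEMP_LOCATION needs its trailing slash, and BEAM_TARBALL has to exist before the benchmark starts. A small sanity check, as a sketch only; check_benchmark_config is a hypothetical helper, not something PKB provides:

```shell
# Usage: check_benchmark_config "$TEMP_LOCATION" "$BEAM_TARBALL"
check_benchmark_config() {
  # The temp location must be a gs:// path that ends with '/'.
  case "$1" in
    gs://*/) ;;
    *) echo "TEMP_LOCATION must be a gs:// path ending in '/': $1" >&2
       return 1 ;;
  esac
  # The SDK tarball must already have been built.
  if [ ! -f "$2" ]; then
    echo "BEAM_TARBALL not found: $2" >&2
    return 1
  fi
}
```

Calling it right after setting the variables fails fast, before any Dataflow job is launched.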

4) Run the benchmark:


{noformat}
bash -c "python $PKB_DIR/pkb.py \
--project=${PROJECT} --dpb_log_level=INFO --bigquery_table=${PKB_BQ_TABLE} \
--k8s_get_retry_count=36 --k8s_get_wait_interval=10 --temp_dir=/tmp \
--beam_location=${BEAM_LOCATION} --official=true --dpb_service_zone=fake_zone --beam_sdk=python \
--benchmarks=beam_integration_benchmark \
--beam_it_class=apache_beam.examples.wordcount_it_test:WordCountIT.test_wordcount_it \
--beam_it_module=:sdks:python:test-suites:dataflow:py36 \
--beam_prebuilt=true --beam_python_sdk_location=${BEAM_TARBALL} \
--beam_runner=TestDataflowRunner --beam_it_timeout=12000 \
'--beam_it_args=--project=${PROJECT},\
--staging_location=${TEMP_LOCATION},\
--temp_location=${TEMP_LOCATION},\
--input=gs://apache-beam-samples/input_small_files/ascii_sort_1MB_input.0000*,\
--output=${TEMP_LOCATION}temp-storage-for-end-to-end-tests/py-it-cloud/output,\
--expect_checksum=ea0ca2e5ee4ea5f218790f28d0b9fe7d09d8d710,\
--num_workers=10,--autoscaling_algorithm=NONE'"
{noformat}
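A note on the quoting in the command above: the single-quoted --beam_it_args value sits inside a double-quoted string, so the outer shell still expands ${PROJECT} and ${TEMP_LOCATION} and removes the backslash-newlines, while the inner shell sees the single quotes and leaves the * in --input unexpanded. The reduced example below illustrates this; printf stands in for pkb.py, and only three of the arguments are kept:

```shell
PROJECT=my_gcp_project
TEMP_LOCATION=gs://some/temp/location/

# printf replaces pkb.py here so the argument the inner shell receives can be inspected.
received=$(bash -c "printf '%s\n' '--beam_it_args=--project=${PROJECT},\
--temp_location=${TEMP_LOCATION},\
--input=gs://apache-beam-samples/input_small_files/ascii_sort_1MB_input.0000*'")
echo "$received"
```

This prints a single line with both variables substituted and the glob pattern left as a literal *.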




> Investigate possible performance regression of Wordcount 1GB batch benchmark on Py3.
> ------------------------------------------------------------------------------------
>
>                 Key: BEAM-8198
>                 URL: https://issues.apache.org/jira/browse/BEAM-8198
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core, testing
>            Reporter: Valentyn Tymofieiev
>            Assignee: Valentyn Tymofieiev
>            Priority: Major
>             Fix For: 2.16.0
>
>
> context: https://lists.apache.org/thread.html/51e000f16481451c207c00ac5e881aa4a46fa020922eddffd00ad527@%3Cdev.beam.apache.org%3E
> Setting fix version to 2.16.0 to understand the cause, hopefully before the vote.
> cc: [~altay] [~thw] [~markflyhigh]


