Posted to github@beam.apache.org by "Abacn (via GitHub)" <gi...@apache.org> on 2023/03/24 15:47:59 UTC

[GitHub] [beam] Abacn opened a new issue, #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Abacn opened a new issue, #25966:
URL: https://github.com/apache/beam/issues/25966

   ### What needs to happen?
   
   Currently, the Python PostCommit suites use `--sdk_location` to upload the Python SDK under test to Dataflow. Building the wheel for the SDK on the worker turns out to be very slow:
   ```
   2023/03/24 15:31:16 Executing: /usr/local/bin/pip install --disable-pip-version-check /var/opt/google/staged/dataflow_python_sdk.tar[gcp]
   2023/03/24 15:31:16 Processing /var/opt/google/staged/dataflow_python_sdk.tar
   2023/03/24 15:31:17 Preparing metadata (setup.py): started
   2023/03/24 15:31:35 Preparing metadata (setup.py): finished with status 'done'
   2023/03/24 15:31:37 Building wheels for collected packages: apache-beam
   2023/03/24 15:31:37 Building wheel for apache-beam (setup.py): started
   2023/03/24 15:32:42 Building wheel for apache-beam (setup.py): still running...
   2023/03/24 15:33:44 Building wheel for apache-beam (setup.py): still running...
   2023/03/24 15:34:49 Building wheel for apache-beam (setup.py): still running...
   2023/03/24 15:35:04 Building wheel for apache-beam (setup.py): finished with status 'done'
   2023/03/24 15:35:04 Successfully built apache-beam
   2023/03/24 15:35:04 Installing collected packages: apache-beam
   2023/03/24 15:35:07 Successfully installed apache-beam-2.47.0.dev0
   ```
   
   It takes about 4 minutes to install Apache Beam from source, of which about 3.5 minutes are spent building the wheel. We should build the wheel locally whenever possible (that is, when the host machine can generate a manylinux wheel, which is the case on Jenkins). This would cut the running time of Python PostCommit on Dataflow by half.
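
A minimal sketch of the "wheel whenever possible" idea, assuming the build leaves its artifacts in a single directory. The helper names (`pick_sdk_artifact`, `sdk_location_args`) and the directory layout are hypothetical, and whether `--sdk_location` accepts a wheel path should be checked against the Beam version in use:

```python
from pathlib import Path
from typing import List


def pick_sdk_artifact(dist_dir: Path) -> Path:
    """Prefer a prebuilt wheel in dist_dir; fall back to a source tarball."""
    wheels = sorted(dist_dir.glob("*.whl"))
    if wheels:
        return wheels[-1]
    tarballs = sorted(dist_dir.glob("*.tar.gz"))
    if tarballs:
        return tarballs[-1]
    raise FileNotFoundError(f"no SDK wheel or tarball found in {dist_dir}")


def sdk_location_args(dist_dir: Path) -> List[str]:
    """Build the pipeline argument that points at the chosen artifact."""
    # Assumption: the runner accepts a wheel path here, as this issue proposes.
    return [f"--sdk_location={pick_sdk_artifact(dist_dir)}"]
```

The fallback keeps local runs working on hosts that cannot produce a compatible wheel, while CI hosts that can build one skip the slow build on the worker.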
   
   
   ### Issue Priority
   
   Priority: 2 (default / most normal work should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486137969

   on my machine it took:
   
   real	1m24.951s
   user	1m19.878s
   sys	0m5.468s




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486169484

   Yes, the number of cores likely matters. Also, if the `sibling_sdk_worker` experiment is not enabled for the project, performance would be worse, since compilation would happen in each container separately. This experiment is in the process of being rolled out at the moment.




[GitHub] [beam] Abacn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1483311233

   Hi @AnandInguva, do you think this is a reasonable request? If so, I would like to work on this. Also CC: @tvalentyn




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486129377

   I see, thanks for the correction. I also ran my own tests before replying above; they looked like this:
   
   ```
   docker run --rm -it --entrypoint=/bin/bash apache/beam_python3.7_sdk:2.45.0
   root@577f14daaa3f:/# pip uninstall apache-beam[gcp]
   root@577f14daaa3f:/# wget https://files.pythonhosted.org/packages/09/07/a8cef9d9193a65f7d7a35d72b46c97cc3684eea7b7728a89f5accbb5f297/apache-beam-2.45.0.zip
   root@577f14daaa3f:/# time pip install ./apache-beam-2.45.0.zip 
   
   Successfully installed apache-beam-2.45.0
   WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
   
   [notice] A new release of pip is available: 23.0 -> 23.0.1
   [notice] To update, run: pip install --upgrade pip
   
   real	0m13.258s
   user	0m11.930s
   sys	0m2.265s
   ```
   




[GitHub] [beam] Abacn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1485122257

   Thanks @tvalentyn. Given that
   - replicating #9966 in Dataflow-owned containers would reduce the time, and
   - there are plans to switch Dataflow Python to Beam-provided containers next quarter,
   there is no need to change Dataflow-owned containers at this moment. We can come back once the switch is done to see whether this task is still worthwhile.
   




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1483590079

   Thanks Yi for the suggestion. Building a wheel for a Dataflow integration test suite that builds an SDK once and runs a suite of pipelines makes sense and can save some time, although the above suggestion is a shorter path to achieving more of the gain. We should have done this earlier.
   
   Care should be taken to build the wheel for the correct target platform and the correct Python version. If the build happens on Jenkins or in a GitHub Action, the build environment is somewhat predetermined. However, forcing users to build the wheel locally every time they run a Gradle task may add some friction (it needs extra dependencies and increases the build time; will it work well on Macs, or are different deps needed?).
   
   Note that there are also slight differences in the wheel naming pattern between py37 and py38.
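
To illustrate the naming difference: CPython 3.7 wheels carry an `m` flag in the ABI tag (`cp37m`), which CPython 3.8 dropped (`cp38`). The sketch below computes the expected tags. The function names and the default platform tag are hypothetical, chosen for illustration only, and the logic covers CPython 3.x only:

```python
def cpython_wheel_tags(major: int, minor: int) -> str:
    """Return the '<python tag>-<abi tag>' part of a CPython 3.x wheel filename."""
    python_tag = f"cp{major}{minor}"
    # CPython 3.7 and earlier 3.x releases use the 'm' ABI flag; 3.8 removed it.
    abi_flag = "m" if (major, minor) < (3, 8) else ""
    return f"{python_tag}-{python_tag}{abi_flag}"


def expected_wheel_name(
    version: str,
    major: int,
    minor: int,
    platform: str = "manylinux2014_x86_64",  # assumed platform tag
) -> str:
    """Compose the wheel filename a build for this interpreter would produce."""
    return f"apache_beam-{version}-{cpython_wheel_tags(major, minor)}-{platform}.whl"


print(cpython_wheel_tags(3, 7))   # cp37-cp37m
print(cpython_wheel_tags(3, 8))   # cp38-cp38
```

A test harness that locates the wheel by filename has to account for this suffix when it targets both interpreter versions.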
   




[GitHub] [beam] Abacn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486155905

   Hi @tvalentyn, thanks for sharing the experiments. For my experiments in https://github.com/apache/beam/issues/25966#issuecomment-1485890579 I checked out the latest master, installed it in a local virtual env, and also packaged it as a tarball, so the difference between the `--sdk_location` SDK and the `:latest` image was minimal (less than 6 hours apart). And afaik there has been no Cython change today. For some reason the Dataflow VM builds wheels notably slower than a local machine; maybe the fact that it has fewer (2) cores matters?




[GitHub] [beam] AnandInguva commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "AnandInguva (via GitHub)" <gi...@apache.org>.
AnandInguva commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1483132501

   @Abacn are you working on this?




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486137629

   I can't reproduce the fast behavior on `gcr.io/apache-beam-testing/beam-sdk/beam_python3.10_sdk:latest`.
   




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486132687

   I am not sure where the discrepancy comes from; perhaps the cache needs to be primed first to be more efficient, or maybe it's not as fast with Python 3.7.
   
   Repeating the same with `docker run --rm -it --entrypoint=/bin/bash gcr.io/cloud-dataflow/v1beta3/python310:beam-master-20230322`
   
   shows a much slower installation:
   
   ```
   real	2m43.945s
   user	2m36.349s
   sys	0m7.157s
   
   ```
    




[GitHub] [beam] Abacn closed issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn closed issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit
URL: https://github.com/apache/beam/issues/25966




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1483571884

   I think replicating https://github.com/apache/beam/commit/81af13c1027ae3ff9ae12a56c9f38c6ad602f3c1 in Dataflow-owned containers would reduce the time to build a wheel to 20 seconds. (Installation from a wheel might still be much faster, perhaps around 3 seconds.)
   
   Note: We are also planning to switch Dataflow Python to use Beam-provided containers next quarter.




[GitHub] [beam] Abacn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1485890579

   Using the Beam-provided container image (`--sdk_container_image=gcr.io/apache-beam-testing/beam-sdk/beam_python3.10_sdk:latest`), re-installation of the SDK is somewhat faster,
   but it still takes 2min30s (1 minute faster than with the default, Dataflow-provided image). Building the wheel takes 2min20s.
   
   jobId: `2023-03-27_14_11_55-1971041453326172975`
   ```
   2023-03-27 17:15:14.903 EDT 2023/03/27 21:15:14 Found artifact: dataflow_python_sdk.tar
   2023-03-27 17:15:14.903 EDT 2023/03/27 21:15:14 Installing setup packages ...
   2023-03-27 17:15:15.444 EDT Processing /var/opt/google/staged/dataflow_python_sdk.tar
   2023-03-27 17:15:15.937 EDT Preparing metadata (setup.py): started
   2023-03-27 17:15:19.149 EDT Preparing metadata (setup.py): finished with status 'done'
   ...
   2023-03-27 17:15:20.042 EDT Building wheels for collected packages: apache-beam
   2023-03-27 17:15:20.043 EDT Building wheel for apache-beam (setup.py): started
   2023-03-27 17:16:40.346 EDT Building wheel for apache-beam (setup.py): still running...
   2023-03-27 17:17:39.535 EDT Building wheel for apache-beam (setup.py): finished with status 'done'
   ...
   2023-03-27 17:17:39.559 EDT Successfully built apache-beam
   2023-03-27 17:17:40.178 EDT Installing collected packages: apache-beam
   ...
   2023-03-27 17:17:42.247 EDT Successfully installed apache-beam-2.47.0.dev0
   ```
   
   Based on this experiment, #25970 could still save 2.5 minutes per test.
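
The durations quoted above can be re-derived from the log timestamps. This is only a sketch of the arithmetic; the helper name `elapsed_seconds` is illustrative, and the timestamps are copied from the "Found artifact", "Building wheel ... started", "Building wheel ... finished" and "Successfully installed" lines of the log:

```python
from datetime import datetime

# Timestamp layout used by the Cloud Logging lines in the log above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps given in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# From "Found artifact" to "Successfully installed": about 2 min 27 s.
install_total = elapsed_seconds("2023-03-27 17:15:14.903", "2023-03-27 17:17:42.247")

# From "Building wheel ... started" to "... finished": about 2 min 19 s.
wheel_build = elapsed_seconds("2023-03-27 17:15:20.043", "2023-03-27 17:17:39.535")

print(f"install: {install_total:.1f}s, wheel build: {wheel_build:.1f}s")
```

Nearly all of the installation time is the wheel build, which is the part a prebuilt wheel would remove.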
   




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486138246

   which, as you mention, is still faster than 3 min
   




[GitHub] [beam] tvalentyn commented on issue #25966: [Task]: Pass sdk wheel instead of tarball for Python Dataflow PostCommit

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25966:
URL: https://github.com/apache/beam/issues/25966#issuecomment-1486144666

   Subsequent uninstallation and reinstallation is faster. A possible explanation: the cache contents matter, and when there are recent changes in the cythonized code path, installation is slower.
   Regardless, I am open to the idea of installing wheels in tests, with the caveats mentioned in https://github.com/apache/beam/issues/25966#issuecomment-1483590079

