Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2023/01/19 16:21:32 UTC
[GitHub] [beam] RobMcKiernan opened a new issue, #25085: [Bug]: Dependencies from private repositories unable to be seen
RobMcKiernan opened a new issue, #25085:
URL: https://github.com/apache/beam/issues/25085
### What happened?
When running a GCP Dataflow job with the Python SDK 2.44.0, I can no longer access my private repositories. It works on 2.43.0.
My setup is as follows:
```docker
FROM my.private.repourl/python:3.8-slim-builder as builder
FROM my.private.repourl/python:3.8-slim
COPY --from=apache/beam_python3.8_sdk:2.43.0 /opt/apache/beam /opt/apache/beam
# this virtual env has all the dependencies I need pre-installed on it
COPY --from=builder $VENV_PATH $VENV_PATH
ENTRYPOINT ["/opt/apache/beam/boot"]
```
This is my run command:
```sh
poetry run python -m projname.main \
--project="$PROJECT_ID" \
--runner=DataFlowRunner \
--temp_location=gs://"$BUCKET_NAME"/temp \
--region="$REGION" \
--job_name="$JOB_NAME" \
--setup_file=./setup.py \
--subnetwork https://www.googleapis.com/compute/v1/projects/"$PROJECT_ID"/regions/"$REGION"/subnetworks/"$SUBNET" \
--experiment=use_runner_v2 \
--sdk_container_image="$IMAGE_NAME" \
--template_location=gs://"$BUCKET_NAME"/templates/"$JOB_NAME"
```
According to my Dataflow worker logs, it fails to see my private repos:
```
ERROR: Could not find a version that satisfies the requirement package-i-want<3.0.0,>=2.2.0 (from name-of-my-dataflow) (from versions: none)
```
I think this is the culprit PR: https://github.com/apache/beam/pull/23684/files#diff-cc1f3d7f808c692a6102847bec78809f2e4350c5ee34278100ce0f55d8c23d68R234
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@beam.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [beam] RobMcKiernan commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "RobMcKiernan (via GitHub)" <gi...@apache.org>.
RobMcKiernan commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527212805
Yep, that worked! My new Dockerfile, in case it helps anyone:
```
# This image is just a thin wrapper around the standard Python 3.10 slim image. It should work just fine using the standard image.
FROM eu.gcr.io/my-proj/python:3.10-slim
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
COPY --from=apache/beam_python3.10_sdk:2.46.0 /opt/apache/beam /opt/apache/beam
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1 \
PIP_NO_CACHE_DIR=off \
PIP_DISABLE_PIP_VERSION_CHECK=on \
PIP_DEFAULT_TIMEOUT=100 \
POETRY_NO_INTERACTION=1 \
PATH=/usr/lib/google-cloud-sdk/bin:$PATH
WORKDIR /app
# -- Omitted Section to sort out my gcloud authentication, which I'm not including out of paranoia --
RUN pip install --no-cache-dir \
poetry \
keyring \
keyrings.google-artifactregistry-auth
COPY ./pyproject.toml ./poetry.lock ./
# setting virtualenvs.create to false prevents poetry using venvs as
# beam >2.43 uses global python packages only
RUN poetry config virtualenvs.create false \
&& poetry install --no-cache --no-root --only main \
&& rm -rf /root/.cache
ENTRYPOINT ["/opt/apache/beam/boot"]
```
tl;dr for anyone skipping to the end: Make sure your python packages are installed in `/usr/local/lib/python<version number>/site-packages` in your docker container.
Cheers for your help everyone! Should I close, or would you like it kept open? I guess at a minimum this should be documented somewhere.
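One way to check the tl;dr above from inside the container is to ask Python where a package actually lives and compare that with the interpreter's global site-packages directories. The following is a minimal sketch and not part of Beam: the helper names `is_under` and `installed_globally` are made up for illustration, and `apache_beam` is only an example module to look up. It assumes the script is run with the image's default interpreter, outside any virtualenv, because inside a venv `site.getsitepackages()` reports the venv's own directories.

```python
import importlib.util
import site
from pathlib import Path


def is_under(path, roots):
    """Return True if `path` is inside (or equal to) any directory in `roots`."""
    resolved = Path(path).resolve()
    for root in roots:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


def installed_globally(module_name):
    """Hypothetical helper: is `module_name` importable from global site-packages?"""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        # Module not found, or a namespace package without a single origin file.
        return False
    return is_under(spec.origin, site.getsitepackages())


if __name__ == "__main__":
    print("global site-packages:", site.getsitepackages())
    print("apache_beam installed globally:", installed_globally("apache_beam"))
```

Running this as the last build step of the image (or via `docker run` with a Python entrypoint) should show quickly whether a package ended up in a virtualenv rather than in the global environment.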
[GitHub] [beam] tvalentyn closed issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn closed issue #25085: [Bug]: Dependencies from private repositories unable to be seen
URL: https://github.com/apache/beam/issues/25085
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1426531049
I see. It looks like you may be copying the site-packages directory from a different virtual environment. A recent change creates one virtual environment per SDK process: https://github.com/apache/beam/pull/16658
It could be that you were impacted by this change, if you have been using a non-default virtual environment to store your packages.
Note that dependencies installed in the global python environment should still be accessible in individual python environments, which are created after https://github.com/apache/beam/pull/16658.
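To illustrate how a per-process virtual environment can still see globally installed packages, here is a minimal sketch using the standard library `venv` module with `system_site_packages=True`. Whether Beam's boot process creates its environments in exactly this way is an assumption; the function names are hypothetical and exist only for this example.

```python
import tempfile
import venv
from pathlib import Path


def create_env_with_global_packages(env_dir):
    """Create a venv that can also import packages from the global environment."""
    # with_pip=False keeps the example fast and avoids needing ensurepip.
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"


def sees_global_packages(cfg_path):
    """Read pyvenv.cfg and report whether system site-packages are included."""
    for line in Path(cfg_path).read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "include-system-site-packages":
            return value.strip().lower() == "true"
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env_with_global_packages(Path(tmp) / "env")
        print("include-system-site-packages:", sees_global_packages(cfg))
```

An environment created like this falls back to the base interpreter's site-packages, but it knows nothing about packages that live in some other, unrelated venv such as `/venv`.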
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1398865657
Ack, thanks. I'll try to get some eyes on this.
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1400747109
FWIW, if there is a regression between versions, it should be possible to bisect the regression to an exact commit.
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527717699
Looks like I missed the second diff.
[GitHub] [beam] riteshghorse commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "riteshghorse (via GitHub)" <gi...@apache.org>.
riteshghorse commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1400644538
Are your private dependencies listed in `requirements.txt` somehow and not pulled locally when running the job?
[GitHub] [beam] RobMcKiernan commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "RobMcKiernan (via GitHub)" <gi...@apache.org>.
RobMcKiernan commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1433105556
Ah ok, yep that sounds like it could be the culprit then.
I've noticed that the dataflow docs use `pip` to install python packages whereas I'm using poetry. I wonder if that plays into this? https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild
The envvar `VENV_PATH` is set to `/venv` in my `COPY --from=builder $VENV_PATH $VENV_PATH` if that helps illuminate anything
[GitHub] [beam] RobMcKiernan commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "RobMcKiernan (via GitHub)" <gi...@apache.org>.
RobMcKiernan commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1408729605
Sorry I've been away for the past week.
> are your private dependencies listed in `requirements.txt` somehow and not pulled locally when running the job?
I don't use a requirements.txt. Instead I use [poetry](https://python-poetry.org/), which creates a `poetry.lock` file, which serves a similar purpose as a `requirements.txt`. I have verified that my local poetry virtual env has my private python repos installed in it.
The other part to this is that I've created a base docker container for my workers on gcp to use. The private docker image referred to in my Dockerfile `FROM my.private.image-repo/python:3.8-slim-builder as builder` has access to my private python repositories (I've verified this by pulling the docker image myself and exec-ing into it). It seems it is at this point that it fails to have access to my private repos.
@tvalentyn no, I'm afraid my python version has remained constant.
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527410075
You could modify CHANGES.md to further document suggestions/instructions pertaining to the change in behavior in 2.44.0 if you'd like.
[GitHub] [beam] RobMcKiernan commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "RobMcKiernan (via GitHub)" <gi...@apache.org>.
RobMcKiernan commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527495787
I just tried raising a PR, but it appears that I don't have the needed permissions to push to this repo. This is the diff of my PR:
```diff
diff --git a/CHANGES.md b/CHANGES.md
index 871f24bf9d..c7578a8a61 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -254,6 +254,8 @@
runner (such as Dataflow Runner v2) will need to provide this package and its dependencies.
* Slices now use the Beam Iterable Coder. This enables cross language use, but breaks pipeline updates
if a Slice type is used as a PCollection element or State API element. (Go)[#24339](https://github.com/apache/beam/issues/24339)
+* Custom worker Dockerfiles must now install their dependencies in the global python environment. For example, when using poetry
+ you must use `poetry config virtualenvs.create false` before installing deps [#25085](https://github.com/apache/beam/issues/25085)
## Deprecations
diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index 17ee452a57..46a7f69209 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -198,6 +198,7 @@ Beam offers a way to provide your own custom container image. The easiest way to
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br>
>**Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.**
+>**NOTE**: When using version >=2.44.0 you must ensure dependencies are installed in the global python environment in the resulting image
2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.
```
```
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527716927
Thanks a lot!
[GitHub] [beam] RobMcKiernan commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "RobMcKiernan (via GitHub)" <gi...@apache.org>.
RobMcKiernan commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1525791480
I'm back working on this now. I tried altering my `PYTHONPATH` in my Dockerfile, but that didn't seem to work, although I'm not quite sure why.
I'm now experimenting using `poetry config virtualenvs.create false` to install my packages in the global python environment. I'll let you know how it goes.
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527716787
No problem. You might have to fork the repo first to create PRs. Sent you https://github.com/apache/beam/pull/26471
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1527411570
Glad to hear you resolved the issue.
[GitHub] [beam] Abacn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "Abacn (via GitHub)" <gi...@apache.org>.
Abacn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1398850520
CC: @robertwb
CC: @tvalentyn
Sounds like a regression. Is there a workaround to mitigate this?
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1400748999
> ERROR: Could not find a version that satisfies the requirement package-i-want<3.0.0,>=2.2.0 (from name-of-my-dataflow) (from versions: none)
Re: 'from versions: none' - just to double check, when you changed the Beam version, did you by chance also change the version of the Python interpreter? Could you confirm that it didn't change?
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1426534655
I think 2.44.0 is the first release that includes https://github.com/apache/beam/pull/16658, which matches the timing you describe.
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1433943540
The global environment has its packages installed in `/usr/local/lib/python3.8/site-packages`. If you activate a custom venv, I think it will be ignored now that the code path has changed in https://github.com/apache/beam/pull/16658 and each Python process creates its own environment.
I suppose you could try to manipulate the `PYTHONPATH` variable to include your environment, but that may be brittle if you have package mismatches.
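To make the package-mismatch concern concrete, the sketch below compares the distributions installed in two site-packages directories and lists those present in both with different versions. It is only an illustration using the standard library `importlib.metadata`; the function names are hypothetical, and the venv path `/venv/lib/python3.8/site-packages` is an assumed layout based on the `/venv` location mentioned in this thread.

```python
import re
from importlib import metadata


def installed_versions(site_packages_dir):
    """Map normalized distribution name -> version for one site-packages directory."""
    versions = {}
    for dist in metadata.distributions(path=[str(site_packages_dir)]):
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip entries with broken or missing metadata
            normalized = re.sub(r"[-_.]+", "-", name).lower()
            versions[normalized] = dist.version
    return versions


def version_mismatches(global_dir, venv_dir):
    """Return {name: (global_version, venv_version)} for packages whose versions differ."""
    global_versions = installed_versions(global_dir)
    venv_versions = installed_versions(venv_dir)
    return {
        name: (global_versions[name], venv_versions[name])
        for name in sorted(global_versions.keys() & venv_versions.keys())
        if global_versions[name] != venv_versions[name]
    }


if __name__ == "__main__":
    # Example paths only; adjust them to match your image.
    mismatches = version_mismatches(
        "/usr/local/lib/python3.8/site-packages",
        "/venv/lib/python3.8/site-packages",
    )
    for name, (global_version, venv_version) in mismatches.items():
        print(f"{name}: global={global_version} venv={venv_version}")
```

Any package reported here could be imported in either version depending on the order of entries in `sys.path`, which is what makes the `PYTHONPATH` approach fragile.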
[GitHub] [beam] riteshghorse commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "riteshghorse (via GitHub)" <gi...@apache.org>.
riteshghorse commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1398944434
I looked at the mentioned culprit PR and I don't think it's quite the culprit, because it is not discarding anything that used to work earlier. I'll take a closer look at the bug for other possibilities.
[GitHub] [beam] tvalentyn commented on issue #25085: [Bug]: Dependencies from private repositories unable to be seen
Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on issue #25085:
URL: https://github.com/apache/beam/issues/25085#issuecomment-1525828919
sg, thanks.