Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/11/22 07:59:17 UTC

[PR] [SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

zhengruifeng opened a new pull request, #43953:
URL: https://github.com/apache/spark/pull/43953

   ### What changes were proposed in this pull request?
   Cache python deps for linter and documentation
   
   
   ### Why are the changes needed?
   1. To avoid unnecessary installations: some packages were installed multiple times.
   2. To centralize the installations: in the future, only the Dockerfile should need to be modified (see the sketch below).
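   
   A rough sketch of the idea (version pins copied from the existing workflow step; the authoritative list lives in `dev/infra/Dockerfile`):
   
       # bake the linter/doc deps into the testing image once, so CI reuses
       # the cached image layer instead of re-installing them on every run
       RUN python3.9 -m pip install 'flake8==3.9.0' 'mypy==0.982' 'black==22.6.0'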
   
   
   ### Does this PR introduce _any_ user-facing change?
   no, infra-only
   
   
   ### How was this patch tested?
   CI
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43953:
URL: https://github.com/apache/spark/pull/43953#issuecomment-1822373647

   Please hold on; it seems there are env conflicts between `pyspark` and `lint`.
   I have converted it to a draft and will look into it.



Re: [PR] [SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1401642950


##########
.github/workflows/build_and_test.yml:
##########
@@ -743,8 +734,6 @@ jobs:
         Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'roxygen2', 'ggplot2', 'mvtnorm', 'statmod'), repos='https://cloud.r-project.org/')"
         Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')"
         Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')"
-    - name: Install dependencies for documentation generation

Review Comment:
   This step is also used in branches 3.3/3.4/3.5, so move it to `Install dependencies for documentation generation for branch-3.3, branch-3.4, branch-3.5`.




Re: [PR] [SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1401642302


##########
.github/workflows/build_and_test.yml:
##########
@@ -689,15 +689,6 @@ jobs:
         # Should delete this section after SPARK 3.5 EOL.
         python3.9 -m pip install 'flake8==3.9.0' pydata_sphinx_theme 'mypy==0.982' 'pytest==7.1.3' 'pytest-mypy-plugins==1.9.3' numpydoc 'jinja2<3.0.0' 'black==22.6.0'
         python3.9 -m pip install 'pandas-stubs==1.2.0.53' ipython 'grpcio==1.59.3' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0'
-    - name: Install Python linter dependencies

Review Comment:
   This step is only used in the master branch.




Re: [PR] [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1426263880


##########
dev/infra/Dockerfile:
##########
@@ -139,3 +139,60 @@ RUN python3.12 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf
 # TODO(SPARK-46078) Use official one instead of nightly build when it's ready
 RUN python3.12 -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
 RUN python3.12 -m pip install torcheval
+
+
+# Refer to https://github.com/ContinuumIO/docker-images/blob/main/miniconda3/debian/Dockerfile
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -q && \
+    bash miniconda.sh -b -p /opt/miniconda3 && \
+    rm miniconda.sh && \
+    ln -s /opt/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
+    ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda && \
+    find /opt/miniconda3/ -follow -type f -name '*.a' -delete && \
+    find /opt/miniconda3/ -follow -type f -name '*.js.map' -delete && \
+    conda clean -afy
+
+# Additional Python deps for linter and documentation, delete this section if another Python version is used
+# Since there may be conflicts between envs, conda is used to manage them.
+# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
+#   See also https://github.com/sphinx-doc/sphinx/issues/7551.
+# Jinja2 3.0.0+ causes error when building with Sphinx.
+#   See also https://issues.apache.org/jira/browse/SPARK-35375.
+RUN conda create -n doc python=3.9
+
+RUN conda run -n doc pip install \

Review Comment:
   Why are we listing out individual dependencies here vs. listing them in a requirements file (or equivalent) that we can use across Docker, GitHub Actions, and local build scripts? Don't we want our build and test dependencies to be consistent?
   
   I made a past attempt at this over in #27928. It failed because, in addition to building a shared set of build and test dependencies, it also pinned transitive build and test dependencies, which the reviewers weren't keen on. But we can separate the two ideas from each other.
   
   IMO there should be a single requirements file for build and test dependencies (whether or not it pins transitive dependencies is a separate issue), and that file should be used everywhere. What do you think?
   
   I also don't follow why we need to pull in conda. What is it getting us over vanilla pip?
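   
   For illustration, a minimal sketch of what I have in mind (the file name `dev/requirements-build.txt` is hypothetical):
   
       # dev/infra/Dockerfile
       COPY dev/requirements-build.txt /tmp/requirements-build.txt
       RUN python3.9 -m pip install -r /tmp/requirements-build.txt
   
   GitHub Actions and local scripts would then install from the same file (`pip install -r dev/requirements-build.txt`), keeping all three environments in sync.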




Re: [PR] [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1427903514


##########
dev/infra/Dockerfile:
##########
@@ -139,3 +139,60 @@ RUN python3.12 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf
 # TODO(SPARK-46078) Use official one instead of nightly build when it's ready
 RUN python3.12 -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
 RUN python3.12 -m pip install torcheval
+
+
+# Refer to https://github.com/ContinuumIO/docker-images/blob/main/miniconda3/debian/Dockerfile
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -q && \
+    bash miniconda.sh -b -p /opt/miniconda3 && \
+    rm miniconda.sh && \
+    ln -s /opt/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
+    ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda && \
+    find /opt/miniconda3/ -follow -type f -name '*.a' -delete && \
+    find /opt/miniconda3/ -follow -type f -name '*.js.map' -delete && \
+    conda clean -afy
+
+# Additional Python deps for linter and documentation, delete this section if another Python version is used
+# Since there may be conflicts between envs, conda is used to manage them.
+# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
+#   See also https://github.com/sphinx-doc/sphinx/issues/7551.
+# Jinja2 3.0.0+ causes error when building with Sphinx.
+#   See also https://issues.apache.org/jira/browse/SPARK-35375.
+RUN conda create -n doc python=3.9
+
+RUN conda run -n doc pip install \

Review Comment:
   Good to know. You are right.




Re: [PR] [SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1401642950


##########
.github/workflows/build_and_test.yml:
##########
@@ -743,8 +734,6 @@ jobs:
         Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'roxygen2', 'ggplot2', 'mvtnorm', 'statmod'), repos='https://cloud.r-project.org/')"
         Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')"
         Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')"
-    - name: Install dependencies for documentation generation

Review Comment:
   This step is also used in branches 3.3/3.4/3.5, so move it to `Install R linter dependencies for branch-3.3, branch-3.4, branch-3.5`.





Re: [PR] [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #43953: [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation
URL: https://github.com/apache/spark/pull/43953



Re: [PR] [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1426576702


##########
dev/infra/Dockerfile:
##########
@@ -139,3 +139,60 @@ RUN python3.12 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf
 # TODO(SPARK-46078) Use official one instead of nightly build when it's ready
 RUN python3.12 -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
 RUN python3.12 -m pip install torcheval
+
+
+# Refer to https://github.com/ContinuumIO/docker-images/blob/main/miniconda3/debian/Dockerfile
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -q && \
+    bash miniconda.sh -b -p /opt/miniconda3 && \
+    rm miniconda.sh && \
+    ln -s /opt/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
+    ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda && \
+    find /opt/miniconda3/ -follow -type f -name '*.a' -delete && \
+    find /opt/miniconda3/ -follow -type f -name '*.js.map' -delete && \
+    conda clean -afy
+
+# Additional Python deps for linter and documentation, delete this section if another Python version is used
+# Since there may be conflicts between envs, conda is used to manage them.
+# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
+#   See also https://github.com/sphinx-doc/sphinx/issues/7551.
+# Jinja2 3.0.0+ causes error when building with Sphinx.
+#   See also https://issues.apache.org/jira/browse/SPARK-35375.
+RUN conda create -n doc python=3.9
+
+RUN conda run -n doc pip install \

Review Comment:
   hi @nchammas, actually, I want to give up on this PR for caching the dependencies for linter and documentation, since there is still a weird issue when building the documentation with conda in CI (while it works well on my local machine).
   
   The reason this PR tried `conda` was that there were env conflicts between PySpark and lint/doc; that is, installing the lint/doc dependencies would break some tests.
   
   As to why not use a requirements file in CI, I guess one problem may be that a modification to the requirements file won't automatically refresh the cached testing image?
   
   




Re: [PR] [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #43953:
URL: https://github.com/apache/spark/pull/43953#discussion_r1426980392


##########
dev/infra/Dockerfile:
##########
@@ -139,3 +139,60 @@ RUN python3.12 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf
 # TODO(SPARK-46078) Use official one instead of nightly build when it's ready
 RUN python3.12 -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
 RUN python3.12 -m pip install torcheval
+
+
+# Refer to https://github.com/ContinuumIO/docker-images/blob/main/miniconda3/debian/Dockerfile
+RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -q && \
+    bash miniconda.sh -b -p /opt/miniconda3 && \
+    rm miniconda.sh && \
+    ln -s /opt/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
+    ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda && \
+    find /opt/miniconda3/ -follow -type f -name '*.a' -delete && \
+    find /opt/miniconda3/ -follow -type f -name '*.js.map' -delete && \
+    conda clean -afy
+
+# Additional Python deps for linter and documentation, delete this section if another Python version is used
+# Since there may be conflicts between envs, conda is used to manage them.
+# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
+#   See also https://github.com/sphinx-doc/sphinx/issues/7551.
+# Jinja2 3.0.0+ causes error when building with Sphinx.
+#   See also https://issues.apache.org/jira/browse/SPARK-35375.
+RUN conda create -n doc python=3.9
+
+RUN conda run -n doc pip install \

Review Comment:
   > As to why not use a requirements file in CI, I guess one problem may be that a modification to the requirements file won't automatically refresh the cached testing image?
   
   That shouldn't be the case. Assuming you `COPY` the requirements file into the image, changing the file will [invalidate the cache][1]:
   
   > The first encountered `COPY` instruction will invalidate the cache for all following instructions from the Dockerfile if the contents of `<src>` have changed. This includes invalidating the cache for `RUN` instructions. 
   
   [Also][2]:
   
   > For the `ADD` and `COPY` instructions, the modification time and size file metadata is used to determine whether cache is valid. During cache lookup, cache is invalidated if the file metadata has changed for any of the files involved.
   
   [1]: https://docs.docker.com/engine/reference/builder/#copy
   [2]: https://docs.docker.com/develop/develop-images/guidelines/#leverage-build-cache
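   
   A minimal sketch of how that plays out (file name and base image are hypothetical):
   
       FROM python:3.9-slim
       # editing the requirements file changes its contents/metadata, which
       # invalidates the cache for this COPY and every instruction after it
       COPY dev/requirements.txt /tmp/requirements.txt
       # so this step re-runs whenever the requirements file changes
       RUN pip install -r /tmp/requirements.txt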




Re: [PR] [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43953:
URL: https://github.com/apache/spark/pull/43953#issuecomment-1822687237

   Unfortunately, there is a conflict in `jinja2` that causes multiple PySpark tests to fail.
   I plan to test miniconda later.
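   
   For reference, the isolation being attempted, as a rough sketch (the `jinja2<3.0.0` pin matches the existing workflow's; `sphinx<3.1.0` is my assumption, following the SPARK-32407 note in the Dockerfile comments):
   
       RUN conda create -y -n doc python=3.9
       # the doc env keeps the old Jinja2 pin without touching the
       # python3.9 interpreter that the PySpark tests run against
       RUN conda run -n doc pip install 'jinja2<3.0.0' 'sphinx<3.1.0'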

