Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/12/31 01:02:28 UTC

[PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

zhengruifeng opened a new pull request, #44546:
URL: https://github.com/apache/spark/pull/44546

   ### What changes were proposed in this pull request?
   Enable the caching provided by [`setup-python`](https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages)
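   
   As a rough sketch (the action version, Python version, and `cache-dependency-path` value below are illustrative, not the exact change in this PR), the built-in cache can be enabled like this:
   
   ```yaml
   # Illustrative sketch: enable pip caching in actions/setup-python.
   - uses: actions/setup-python@v5
     with:
       python-version: '3.9'
       # Cache the pip download cache; the cache key is derived from the hash
       # of the file(s) matched by cache-dependency-path.
       cache: 'pip'
       cache-dependency-path: dev/py-tests/requirements-sql.txt
   ```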
   
   
   
   
   ### Why are the changes needed?
   Avoid downloading the Python dependencies when possible.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, this change is infra-only.
   
   
   ### How was this patch tested?
   CI, plus a manual check:
   
   First run, which populates the dependency cache:
   https://github.com/zhengruifeng/spark/actions/runs/7363727839/job/20043467880
   
   ![image](https://github.com/apache/spark/assets/7322292/898d3029-fb67-4513-ae4c-46ed0eb155ee)
   
   ![image](https://github.com/apache/spark/assets/7322292/5d8c2862-acc8-4e2e-b832-4e42027961f8)
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #44546: [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests
URL: https://github.com/apache/spark/pull/44546


Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #44546:
URL: https://github.com/apache/spark/pull/44546#issuecomment-2052721068

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44546:
URL: https://github.com/apache/spark/pull/44546#discussion_r1438914015


##########
dev/py-tests/requirements-sql.txt:
##########
@@ -0,0 +1,11 @@
+# PySpark dependencies for SQL tests
+
+numpy==1.26.2

Review Comment:
   Related prior discussion on pinning development dependencies: https://github.com/apache/spark/pull/27928#pullrequestreview-375712684



##########
dev/py-tests/requirements-sql.txt:
##########
@@ -0,0 +1,11 @@
+# PySpark dependencies for SQL tests
+
+numpy==1.26.2

Review Comment:
   > actually, I think maybe we should always specify the versions
   
   I agree with this, and this is something I tried to do in the PR I linked to just above, but several committers were against it.
   
   When I look at the [number of PRs related to pinning dev dependencies][1] over the past three years, I wonder if committers still feel the same way today.
   
   Not pinning development dependencies creates constant breakages that can pop up whenever an upstream library releases a new version. When we pin dependencies, by contrast, we choose when to upgrade and deal with the potential breakage.
   
   [1]: https://github.com/apache/spark/pulls?q=is%3Apr+infra+pin



Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "nchammas (via GitHub)" <gi...@apache.org>.
nchammas commented on code in PR #44546:
URL: https://github.com/apache/spark/pull/44546#discussion_r1439674215


##########
dev/py-tests/requirements-sql.txt:
##########
@@ -0,0 +1,11 @@
+# PySpark dependencies for SQL tests
+
+numpy==1.26.2

Review Comment:
   Would you be open to my making another attempt at the approach in #27928? (@zhengruifeng can also take this on if they prefer, of course.)
   
   Basically, we have two sets of development dependencies:
   - `requirements.txt`: direct dependencies only that are as flexible as possible; this is what devs install on their laptops
   - `requirements-pinned.txt`: pinned dependencies derived automatically from `requirements.txt` using [pip-tools][1]; this is used for CI
   
   I know this adds a new tool that non-Python developers may not be familiar with (pip-tools), but it's extremely easy to use, has been around a long time, and is in use by many large projects, the most notable of which is [Warehouse][2], the project that backs PyPI.
   
   [1]: https://github.com/jazzband/pip-tools
   [2]: https://github.com/pypi/warehouse
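   
   As a minimal sketch of how the pinned file could be kept in sync (the file names and step below are hypothetical, not part of this PR), CI could regenerate it with pip-compile and fail if it drifts:
   
   ```yaml
   # Hypothetical CI step; pip-compile (from pip-tools) resolves the flexible
   # requirements into a fully pinned lock-style file.
   - name: Check that pinned requirements are up to date
     run: |
       python -m pip install pip-tools
       pip-compile --output-file=dev/requirements-pinned.txt dev/requirements.txt
       git diff --exit-code dev/requirements-pinned.txt
   ```
   
   Developers would keep installing from the flexible `requirements.txt`, while CI would install from the pinned file, so the `setup-python` cache key only changes when the pins are deliberately updated.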



Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44546:
URL: https://github.com/apache/spark/pull/44546#discussion_r1438768278


##########
dev/py-tests/requirements-sql.txt:
##########
@@ -0,0 +1,11 @@
+# PySpark dependencies for SQL tests
+
+numpy==1.26.2

Review Comment:
   > The requirements file format allows for specifying dependency versions using logical operators (for example chardet>=3.0.4) or specifying dependencies without any versions. In this case the pip install -r requirements.txt command will always try to install the latest available package version. To be sure that the cache will be used, please stick to a specific dependency version and update it manually if necessary.
   
   I specified the versions according to the suggestion in https://github.com/actions/setup-python?tab=readme-ov-file#caching-packages-dependencies.
   
   Actually, I think maybe we should always specify the versions.



Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #44546:
URL: https://github.com/apache/spark/pull/44546#discussion_r1439110439


##########
dev/py-tests/requirements-sql.txt:
##########
@@ -0,0 +1,11 @@
+# PySpark dependencies for SQL tests
+
+numpy==1.26.2

Review Comment:
   Yeah, I don't have a great solution for this. Maybe we could have CI-dedicated dependency files, since we now have too many dependencies in CI across too many matrix combinations, but I'm not sure. At least I am no longer strongly against this idea.



Re: [PR] [WIP][SPARK-46549][INFRA] Cache the Python dependencies for SQL tests [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44546:
URL: https://github.com/apache/spark/pull/44546#discussion_r1439989732


##########
dev/py-tests/requirements-sql.txt:
##########
@@ -0,0 +1,11 @@
+# PySpark dependencies for SQL tests
+
+numpy==1.26.2

Review Comment:
   @nchammas I just noticed the previous discussion in https://github.com/apache/spark/pull/27928.
   
   I personally prefer using `requirements.txt` files with pinned versions. One reason is that the dependencies are currently baked into a cached Docker image, and I was confused from time to time about which versions were actually used in CI. For example, we relied on the cached `RUN python3.9 -m pip install numpy pyarrow ...` layer, so when PyArrow 13 was released on 2023-08-23, I didn't realize that it broke PySpark until the cached image was refreshed (on 2023-09-13).
   
   But I don't feel very strongly about it and defer to @HyukjinKwon and @dongjoon-hyun on this.


