You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/16 09:43:21 UTC
[GitHub] [spark] bozhang2820 opened a new pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
bozhang2820 opened a new pull request #31849:
URL: https://github.com/apache/spark/pull/31849
### What changes were proposed in this pull request?
This change is to ignore cache for SNAPSHOT dependencies in spark-submit.
### Why are the changes needed?
When spark-submit is executed with --packages, it will not download the dependency jars when they are available in cache (e.g. ivy cache), even when the dependencies are SNAPSHOT.
This might block developers who work on external modules in Spark (e.g. spark-avro), since they need to remove the cache manually every time when they update the code during developments (which generates SNAPSHOT jars). Without knowing this, they could be blocked wondering why their code changes are not reflected in spark-submit executions.
### Does this PR introduce _any_ user-facing change?
Yes. With this change, developers/users who run spark-submit with SNAPSHOT dependencies do not need to remove the cache every time when the SNAPSHOT dependencies are updated.
### How was this patch tested?
Added a unit test.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] bozhang2820 commented on pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
bozhang2820 commented on pull request #31849:
URL: https://github.com/apache/spark/pull/31849#issuecomment-803724404
> Hi, @bozhang2820 .
> It seems that this code and the added test case is really flaky. Could you take a look, please?
>
> * https://github.com/apache/spark/runs/2158443719
>
> * https://github.com/apache/spark/runs/2161516532
Hi @dongjoon-hyun, sorry for the trouble. Let's first revert the change: https://github.com/apache/spark/pull/31918
Also CC @HyukjinKwon
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31849:
URL: https://github.com/apache/spark/pull/31849#issuecomment-803675231
Hi, @bozhang2820 .
It seems that this code and the added test case is really flaky. Could you take a look, please?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31849:
URL: https://github.com/apache/spark/pull/31849#issuecomment-800121124
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40693/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31849:
URL: https://github.com/apache/spark/pull/31849#discussion_r596549816
##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
##########
@@ -1153,6 +1153,8 @@ private[spark] object SparkSubmitUtils extends Logging {
// We need a chain resolver if we want to check multiple repositories
val cr = new ChainResolver
cr.setName("spark-list")
+ cr.setChangingMatcher(PatternMatcher.REGEXP)
+ cr.setChangingPattern(".*-SNAPSHOT")
Review comment:
While I agree with the intention (I faced the same problem and had to manually remove cache too), I concern that it will actually make users more confused in a way because it's Maven or Ivy's standard behaviour, and now we're changing how they work by default in Spark, which probably users wouldn't know.
I know it's very unlikely but some users might want to use one cached snapshot (presumably as they know the behaviours of Maven or Ivy resolvers work). Let's say, one CI regularly publishes snapshot, and users want to test one specific version created at the specific time. After this PR, they are forced to use the newest snapshot always.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31849:
URL: https://github.com/apache/spark/pull/31849#discussion_r596584465
##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
##########
@@ -1153,6 +1153,8 @@ private[spark] object SparkSubmitUtils extends Logging {
// We need a chain resolver if we want to check multiple repositories
val cr = new ChainResolver
cr.setName("spark-list")
+ cr.setChangingMatcher(PatternMatcher.REGEXP)
+ cr.setChangingPattern(".*-SNAPSHOT")
Review comment:
Thanks @bozhang2820 for correcting me. Yeah, I think it makes sense.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #31849:
URL: https://github.com/apache/spark/pull/31849#issuecomment-803675231
Hi, @bozhang2820 .
It seems that this code and the added test case is really flaky. Could you take a look, please?
- https://github.com/apache/spark/runs/2158443719
- https://github.com/apache/spark/runs/2161516532
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31849:
URL: https://github.com/apache/spark/pull/31849#issuecomment-801668801
Merged to master.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31849:
URL: https://github.com/apache/spark/pull/31849#issuecomment-800119756
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136110/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] bozhang2820 commented on a change in pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
bozhang2820 commented on a change in pull request #31849:
URL: https://github.com/apache/spark/pull/31849#discussion_r596582901
##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
##########
@@ -1153,6 +1153,8 @@ private[spark] object SparkSubmitUtils extends Logging {
// We need a chain resolver if we want to check multiple repositories
val cr = new ChainResolver
cr.setName("spark-list")
+ cr.setChangingMatcher(PatternMatcher.REGEXP)
+ cr.setChangingPattern(".*-SNAPSHOT")
Review comment:
The major reason for this change is to reduce frictions for developers, especially new ones, to contribute to Spark external modules.
I think the behavior proposed here matches the standard ones in Ivy. See "use a naming convention like a special suffix" in https://ant.apache.org/ivy/history/latest-milestone/bestpractices.html, and the implementation in ivy's `IBiblioResolver`: https://github.com/apache/ant-ivy/blob/master/src/java/org/apache/ivy/plugins/resolver/IBiblioResolver.java#L88-L92
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #31849:
URL: https://github.com/apache/spark/pull/31849
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #31849: [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31849:
URL: https://github.com/apache/spark/pull/31849#discussion_r596549816
##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
##########
@@ -1153,6 +1153,8 @@ private[spark] object SparkSubmitUtils extends Logging {
// We need a chain resolver if we want to check multiple repositories
val cr = new ChainResolver
cr.setName("spark-list")
+ cr.setChangingMatcher(PatternMatcher.REGEXP)
+ cr.setChangingPattern(".*-SNAPSHOT")
Review comment:
While I agree with the intention (I faced the same problem and had to manually remove cache too), I concern that it will actually make users more confused in a way because it's Maven or Ivy's standard behaviour, and now we're changing how they work by default (which probably users don't know).
I know it's very unlikely but some users might want to use one cached snapshot (presumably as they know the behaviours of Maven or Ivy resolvers work). Let's say, one CI regularly publishes snapshot, and users want to test one specific version created at the specific time. After this PR, they are forced to use the newest snapshot always.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org