Posted to issues@spark.apache.org by "Shannon Carey (Jira)" <ji...@apache.org> on 2021/02/18 20:13:00 UTC

[jira] [Comment Edited] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

    [ https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286704#comment-17286704 ] 

Shannon Carey edited comment on SPARK-32385 at 2/18/21, 8:12 PM:
-----------------------------------------------------------------

Here's another reason that either a BOM or a move away from dependency management on the Spark side would be helpful.

Problems such as this one [https://stackoverflow.com/questions/42352091/spark-sql-fails-with-java-lang-noclassdeffounderror-org-codehaus-commons-compil] occur even when the user has apparently done everything right. The Spark top-level POM specifies version 3.0.9 of janino in its <dependencyManagement>, but when Maven pulls that dependency in transitively via an artifact such as spark-sql, it resolves the latest available version instead (such as 3.1.2). This happens because of surprising behavior in how Maven applies dependency management to transitive dependencies, recorded in https://issues.apache.org/jira/browse/MNG-5761 and https://issues.apache.org/jira/browse/MNG-6141 .

This problem forces people to add direct dependencies on specific versions of transitive dependencies, often without understanding the cause of the issue, and it makes their POMs more fragile.
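
For illustration, here is the kind of workaround a downstream pom.xml ends up carrying (a sketch only; the 3.0.9 version comes from Spark's own <dependencyManagement> as noted above, and commons-compiler is pinned alongside janino because the two must stay on matching versions):

    <dependencies>
      <!-- Pin janino and its companion commons-compiler explicitly so that
           Maven's "nearest wins" resolution picks the version Spark was
           built against instead of a newer, incompatible release. -->
      <dependency>
        <groupId>org.codehaus.janino</groupId>
        <artifactId>janino</artifactId>
        <version>3.0.9</version>
      </dependency>
      <dependency>
        <groupId>org.codehaus.janino</groupId>
        <artifactId>commons-compiler</artifactId>
        <version>3.0.9</version>
      </dependency>
    </dependencies>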

Providing a BOM would help with this, as long as the versions are specified in it. Alternatively, Spark could declare library versions directly on its dependencies rather than relying purely on Maven <dependencyManagement>.


> Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32385
>                 URL: https://issues.apache.org/jira/browse/SPARK-32385
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.1.0
>            Reporter: Vladimir Matveev
>            Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, Jackson). Also, versions of these dependencies are not updated as frequently as they are released upstream, which is totally understandable and natural, but it also means that Spark often depends on a lower version of a library that is incompatible with a higher, more recent version of the same library. This incompatibility can manifest in different ways, e.g. as classpath errors or runtime check errors (like with Jackson).
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them explicitly in its {{pom.xml}} file. However, this approach, while somewhat workable if the Spark-using project itself uses Maven, breaks down if another build system such as Gradle is used. The reason is that Maven uses an unconventional "nearest wins" conflict resolution strategy, while many other tools, like Gradle, use a "highest version wins" strategy, which resolves to the highest version number found anywhere in the dependency graph. This means that other dependencies of the project can pull in a higher version of some library which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher version of Jackson in the project. Spark itself depends on several Jackson modules; if only one of them gets a higher version while the others remain on the lower one, the result is runtime exceptions caused by an internal version check in Jackson.
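>
> As an aside, Jackson itself already publishes a BOM of its own (com.fasterxml.jackson:jackson-bom) that keeps its modules aligned; a downstream Maven project can import it to avoid exactly this mixed-version failure. A sketch (the version shown is a placeholder and should match whatever Spark ships):
>
>     <dependencyManagement>
>       <dependencies>
>         <!-- Importing the Jackson BOM aligns jackson-core, jackson-databind,
>              jackson-annotations, the Scala module, etc. on one consistent
>              version, avoiding Jackson's internal version-check failures. -->
>         <dependency>
>           <groupId>com.fasterxml.jackson</groupId>
>           <artifactId>jackson-bom</artifactId>
>           <version>2.10.0</version>
>           <type>pom</type>
>           <scope>import</scope>
>         </dependency>
>       </dependencies>
>     </dependencyManagement>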
>  
> A widely used solution to this kind of version issue is publishing a "bill of materials" (BOM) descriptor (see here: [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html] and here: [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). This descriptor would contain the versions of all of Spark's dependencies; downstream projects would then be able to use their build system's BOM support to enforce the version constraints required for Spark to function correctly.
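>
> As a sketch of what consuming such a descriptor could look like on the Maven side (the spark-bom coordinates below are hypothetical; no such artifact is published today, which is what this ticket proposes):
>
>     <dependencyManagement>
>       <dependencies>
>         <!-- Hypothetical Spark BOM: importing it would pin Spark's modules
>              and their transitive dependencies (Guava, Jackson, janino, ...)
>              to the versions Spark was built and tested against. -->
>         <dependency>
>           <groupId>org.apache.spark</groupId>
>           <artifactId>spark-bom</artifactId>
>           <version>3.1.0</version>
>           <type>pom</type>
>           <scope>import</scope>
>         </dependency>
>       </dependencies>
>     </dependencyManagement>
>
> Gradle users would get the same effect by importing the BOM through Gradle's platform(...) dependency mechanism, per the second link above.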
>  
> One example of successful implementation of the BOM-based approach is Spring: [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring projects, e.g. Spring Boot, there are BOM descriptors published which can be used in downstream projects to fix the versions of Spring components and their dependencies, significantly reducing confusion around proper version numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org