Posted to issues@spark.apache.org by "Erik Krogen (Jira)" <ji...@apache.org> on 2022/10/28 23:09:00 UTC

[jira] [Commented] (SPARK-40939) Release a shaded version of Apache Spark / shade jars on main jar

    [ https://issues.apache.org/jira/browse/SPARK-40939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625953#comment-17625953 ] 

Erik Krogen commented on SPARK-40939:
-------------------------------------

As a reference for prior work, there is also HADOOP-11656, in which Hadoop began publishing a new {{hadoop-client-runtime}} JAR into which all of the transitive dependencies are shaded. In the [proposal|https://issues.apache.org/jira/secure/attachment/12709266/HADOOP-11656_proposal.md], a technique similar to Flink's was considered and eventually rejected because of the higher maintenance burden of publishing a separate artifact for each downstream library that is shaded.
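To make the Hadoop approach concrete, consuming that shaded client pair looks roughly like the following in sbt. This is a sketch, not an endorsed setup: the version number is illustrative, and whether the {{Runtime}} scoping fits depends on the build.

```scala
// build.sbt (sketch): depend on Hadoop's shaded client artifacts instead of
// hadoop-client. hadoop-client-api carries only the Hadoop API classes, while
// hadoop-client-runtime carries the transitive dependencies relocated under a
// Hadoop-private package, so they cannot clash with the application's own.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"     % "3.3.4",           // compile-time API
  "org.apache.hadoop" % "hadoop-client-runtime" % "3.3.4" % Runtime  // shaded runtime deps
)
```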

There are some pitfalls that come with Spark being a Scala project, unlike Hadoop and Flink, which are Java-based. Most shading tools cannot handle certain Scala language elements; in particular, {{ScalaSig}} causes problems because shading tools that are not Scala-aware do not perform relocations within the {{ScalaSig}} (see examples [one|https://github.com/coursier/coursier/issues/454#issuecomment-288969207] and [two|https://lists.apache.org/thread/x7b4z0os9zbzzprb5scft7b4wnr7c3mv], and [this previous Spark PR that tried to shade Jackson|https://github.com/apache/spark/pull/10931]). That being said, {{sbt}}'s [assembly plugin has had support for this since 2020|https://github.com/sbt/sbt-assembly/pull/393], and this functionality was subsequently pulled out into a standalone library, [Jar Jar Abrams|http://eed3si9n.com/jarjar-abrams/]. So there is hope that this is more achievable now than it was back in 2016 when that PR was filed. There has also been [interest in shading all of Spark's dependencies on the Spark dev list|https://lists.apache.org/thread/vkkx8s2zv0ln7j7oo46k30x084mn163p].
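For illustration, a minimal sbt-assembly shading rule might look like the sketch below. The target packages and the {{org.sparkproject}} prefix are assumptions chosen for the example, not an agreed convention; the point is only that recent sbt-assembly (backed by Jar Jar Abrams) rewrites {{ScalaSig}} entries along with the bytecode, which is exactly what the older Scala-unaware tools missed.

```scala
// build.sbt (sketch, assumes the sbt-assembly plugin is enabled)
// Relocate Jackson and Guava into a private namespace inside the fat JAR.
// "@1" substitutes the matched remainder of the original package path.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "org.sparkproject.jackson.@1").inAll,
  ShadeRule.rename("com.google.common.**"     -> "org.sparkproject.guava.@1").inAll
)
```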

I would love to hear what the community thinks of pursuing this earnestly with the tools available in 2022, though [~almogtavor] I'll note that this type of large change is better discussed on the dev mailing list (and probably an accompanying SPIP).

> Release a shaded version of Apache Spark / shade jars on main jar
> -----------------------------------------------------------------
>
>                 Key: SPARK-40939
>                 URL: https://issues.apache.org/jira/browse/SPARK-40939
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 3.4.0
>            Reporter: Almog Tavor
>            Priority: Major
>
> I suggest shading dependencies in Apache Spark to resolve the dependency hell that can occur when building or deploying Apache Spark. This mainly affects Java projects and Hadoop environments, but shading would also help when using Spark from Scala and even Python.
> Flink has a similar solution, delivering [flink-shaded|https://github.com/apache/flink-shaded/blob/master/README.md].
> The dependencies I think are relevant for shading are Jackson, Guava, Netty, and, if possible, anything from the Hadoop ecosystem.
> As for releasing sources for the shaded version, I think the [issue that was raised in Flink|https://github.com/apache/flink-shaded/issues/25] is relevant and unanswered here too, so I don't think that's an option currently (personally I don't see any value in it either).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org