Posted to issues@spark.apache.org by "Erik Krogen (Jira)" <ji...@apache.org> on 2023/02/23 17:53:00 UTC

[jira] [Created] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"

Erik Krogen created SPARK-42539:
-----------------------------------

             Summary: User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
                 Key: SPARK-42539
                 URL: https://issues.apache.org/jira/browse/SPARK-42539
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.2, 3.2.3, 3.1.3
            Reporter: Erik Krogen


Recently we observed that on version 3.2.0, it is possible for user-provided Hive JARs to break Spark's ability to communicate with the Hive Metastore via the Hive metadata client / {{IsolatedClientLoader}}, when using the default "builtin" Hive version. After SPARK-35321, when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client version is used, we call the method {{Hive.getWithoutRegisterFns()}} (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for example, {{hive-exec-2.3.8.jar}} on their classpath, the client will fail with a {{NoSuchMethodError}}. This particular failure mode was resolved in 3.2.1 by SPARK-37446, but while investigating, we found a more general issue: user JARs can override Spark's own JARs, but only inside the {{IsolatedClientLoader}} when using "builtin".

This happens because even when Spark is configured to use the "builtin" Hive classes, it still creates a separate {{URLClassLoader}} for the {{HiveClientImpl}} used for HMS communication. To get the set of JAR URLs for this classloader, Spark [collects all of the JARs used by the user classloader (and its parent, and that classloader's parent, and so on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. Thus the newly created classloader has all of the same JARs as the user classloader, but the ordering is reversed! User JARs are prioritized ahead of system JARs, because the classloader hierarchy is traversed from bottom to top.

For example, say we have user JARs "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this:
{code}
MutableURLClassLoader
-- foo.jar
-- hive-exec-2.3.8.jar
-- parent: URLClassLoader
----- spark-core_2.12-3.2.0.jar
----- ...
----- hive-exec-2.3.9.jar
----- ...
{code}

This setup provides the expected behavior within the user classloader: it checks the parent first, so hive-exec-2.3.9.jar takes precedence, and the MutableURLClassLoader is consulted only if a class doesn't exist in the parent. But when the JAR list is constructed for the IsolatedClientLoader, the URLs from the MutableURLClassLoader are traversed first, then its parent's, so the final list looks like this (in order):
{code}
URLClassLoader [IsolatedClientLoader]
-- foo.jar
-- hive-exec-2.3.8.jar
-- spark-core_2.12-3.2.0.jar
-- ...
-- hive-exec-2.3.9.jar
-- ...
-- parent: boot classloader (JVM classes)
{code}
Now when a lookup happens, all of the JARs are within the same URLClassLoader, and the user JARs come before the Spark ones, so the user JARs are prioritized. This is the opposite of the expected behavior of the default user/application classloader in Spark, which is parent-first, prioritizing Spark/system classes over user classes. (Note that this ordering is correct when using the {{ChildFirstURLClassLoader}}.)
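The inversion described above can be reproduced in isolation. The sketch below uses a hypothetical {{collectJars}} helper that mirrors the assumed shape of the JAR-collection logic (walk from the given classloader up through its parents, appending each loader's URLs child-first); the names and paths are illustrative, not Spark's actual API:

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical helper mirroring the flattening described above: collect
// each loader's URLs starting from the child, then its parent, and so on.
def collectJars(loader: ClassLoader): Seq[URL] = loader match {
  case null => Seq.empty
  case u: URLClassLoader => u.getURLs.toSeq ++ collectJars(u.getParent)
  case other => collectJars(other.getParent)
}

// Parent loader holds the "system" JARs; the child holds the user JARs.
val parent = new URLClassLoader(
  Array(new URL("file:/spark/jars/hive-exec-2.3.9.jar")), null)
val child = new URLClassLoader(
  Array(new URL("file:/user/jars/hive-exec-2.3.8.jar")), parent)

// Normal lookup on `child` is parent-first, so hive-exec-2.3.9.jar wins.
// But the flattened list puts the user JAR first, so a single
// URLClassLoader built from it would resolve classes from 2.3.8 instead.
val flattened = collectJars(child)
println(flattened.map(_.getPath.split('/').last).mkString(", "))
// hive-exec-2.3.8.jar, hive-exec-2.3.9.jar
```

The flattening loses the parent/child boundary that gave the system JARs their priority, which is exactly the bug.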

After SPARK-37446, the {{NoSuchMethodError}} is no longer an issue, but this ordering still breaks assumptions about how user JARs should be treated relative to system JARs, and leaves the client open to breaking in other ways. For example, SPARK-37446 describes a scenario in which Hive 2.3.8 JARs have been included; the changes in Hive 2.3.9 were needed to improve compatibility with older HMS versions, so a user who accidentally included these older JARs could break Spark's ability to communicate with HMS 1.x.

I see two solutions to this:

*(A) Remove the separate classloader entirely when using "builtin"*
Starting from 3.0.0, due to SPARK-26839, when running on Java 9+ we don't even create a new classloader when using "builtin". This makes sense, as [called out in this comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], since the point of "builtin" is to use the existing JARs on the classpath anyway. Option (A) simply extends the change from SPARK-26839 to all Java versions, instead of restricting it to Java 9+ only.
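A minimal sketch of what (A) implies, with illustrative names (neither {{clientClassLoader}} nor {{resolveJarUrls}} is Spark's actual API): when the setting is "builtin", return the caller's classloader unchanged on every Java version, rather than flattening its JARs into a new {{URLClassLoader}}.

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical stand-in for however a non-"builtin" setting ("path",
// "maven", ...) locates its JARs; empty here for illustration only.
def resolveJarUrls(setting: String): Array[URL] = Array.empty

// Sketch of option (A): for "builtin", reuse the current classloader
// directly (as SPARK-26839 already does on Java 9+) instead of
// constructing a separate loader from a flattened JAR list.
def clientClassLoader(setting: String, current: ClassLoader): ClassLoader =
  if (setting == "builtin") current
  else new URLClassLoader(resolveJarUrls(setting), current)

val cl: ClassLoader = new URLClassLoader(Array.empty[URL], null)
assert(clientClassLoader("builtin", cl) eq cl)  // no new loader created
```

Because no new loader exists, there is no flattened URL list whose ordering could invert user/system priority in the first place.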

*(B) Reverse the ordering of parent/child JARs when constructing the URL list*
The most targeted fix is to simply reverse the ordering on [this line in HiveUtils|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L419], which currently prioritizes child-classloader JARs over parent-classloader JARs, reversing the expected ordering. There is already special handling for {{ChildFirstURLClassLoader}}, so all that needs to change is this order.
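In code terms, (B) amounts to flipping the concatenation order when flattening the hierarchy. The helper below is an illustrative sketch of the fixed traversal, not the actual HiveUtils code:

```scala
import java.net.{URL, URLClassLoader}

// Parent-first flattening: the parent's JARs are emitted before the
// child's, matching normal parent-first delegation semantics.
def collectParentFirst(loader: ClassLoader): Seq[URL] = loader match {
  case null => Seq.empty
  case u: URLClassLoader => collectParentFirst(u.getParent) ++ u.getURLs.toSeq
  case other => collectParentFirst(other.getParent)
}

val parent = new URLClassLoader(
  Array(new URL("file:/spark/jars/hive-exec-2.3.9.jar")), null)
val child = new URLClassLoader(
  Array(new URL("file:/user/jars/hive-exec-2.3.8.jar")), parent)

// Spark's hive-exec-2.3.9.jar now precedes the user's 2.3.8 copy.
println(collectParentFirst(child).map(_.getPath.split('/').last).mkString(", "))
// hive-exec-2.3.9.jar, hive-exec-2.3.8.jar
```

With the order flipped, the single flattened URLClassLoader resolves classes from the system JARs first, restoring the same precedence the user classloader hierarchy already provides.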

I prefer (A) because it is a clean solution: it both simplifies the classloader setup and reduces divergence / special handling across Java versions. At the time SPARK-26839 went in (2019), Java 9+ support was newer and less well-tested. Now, after a few years, we can see that the approach in SPARK-26839 clearly works well, so I see no reason _not_ to extend it to the other Java versions as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
