Posted to reviews@spark.apache.org by "pan3793 (via GitHub)" <gi...@apache.org> on 2023/08/15 02:07:12 UTC

[GitHub] [spark] pan3793 opened a new pull request, #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

pan3793 opened a new pull request, #42493:
URL: https://github.com/apache/spark/pull/42493

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   This PR demonstrates the possibility and steps of upgrading Spark's built-in Guava from 14 to 32+.
   
   Note: this is not a simple version bump; it requires some pre-steps.
   
   - It requires HIVE-27560 (https://github.com/apache/hive/pull/4542), which makes Hive 2.3.10 compatible with Guava 14+ (thanks to @LuciferYang).
   - It requires removing `IsolatedClientLoader`; see details at https://www.mail-archive.com/dev@spark.apache.org/msg30708.html and https://github.com/apache/spark/pull/33989#issuecomment-928616327
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   This is a long-standing issue; see prior discussions at https://github.com/apache/spark/pull/35584, https://github.com/apache/spark/pull/36231, and https://github.com/apache/spark/pull/33989
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No.
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   GitHub Actions passed after disabling/tweaking some tests. Also tested manually by running some queries against Hive tables on my internal Hadoop cluster.




Re: [PR] [SPARK-44811][BUILD] Upgrade Guava to 32+ [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+
URL: https://github.com/apache/spark/pull/42493




[GitHub] [spark] pan3793 commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1679932605

   > Perhaps it's worth checking out in the dev mailing list, if it's not done already.
   
   @sunchao I posted the proposal in the mailing list thread [What else could be removed in Spark 4?](https://www.mail-archive.com/dev@spark.apache.org/msg30708.html).




[GitHub] [spark] pan3793 commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1679946305

   > I do agree that Hive 2.3.9/10 should be compatible with most Hive versions.
   
   Yes, and I can contribute some use cases:
   
   - We have used Hive 2.3.9 to access HMS 2.1/2.3 internally for several years.
   - In the default setup of [E-MapReduce on Alibaba Cloud](https://www.alibabacloud.com/help/en/e-mapreduce/latest/emr-5-13-x-version-notes), Spark 3.3.1 (with Hive 2.3.9) accesses HMS 3.1.




[GitHub] [spark] pan3793 commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1678728911

   cc @sunchao @JoshRosen @HyukjinKwon @dongjoon-hyun @srowen @mridulm, what do you think about this approach?




[GitHub] [spark] sunchao commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "sunchao (via GitHub)" <gi...@apache.org>.
sunchao commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1679842677

   Dropping `IsolatedClientLoader` is also my main concern here - I'm not sure whether anyone in the community still relies on the feature for some reason (cc @xkrogen). I do agree that Hive 2.3.9/10 should be compatible with _most_ Hive versions.
   
   Perhaps it's worth checking out in the dev mailing list, if it's not done already.




Re: [PR] [SPARK-44811][BUILD] Upgrade Guava to 32+ [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1859495547

   UPDATE
   
   Dropping `IsolatedClientLoader` is no longer necessary after the support removal in SPARK-45328; see the technical details at https://github.com/apache/spark/pull/42599.
   
   I think there are no known technical issues blocking us from upgrading Guava while keeping `IsolatedClientLoader` in Spark 4.0.0.
   
   Summary of the Guava upgrade steps:
   
   1. SPARK-45292: Remove Guava from the shared classes of `IsolatedClientLoader` (https://github.com/apache/spark/pull/42599)
   2. Wait for a Hive 2.3.10 release that includes HIVE-27560
   3. Upgrade Spark's built-in Guava to 32+




[GitHub] [spark] cxzl25 commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "cxzl25 (via GitHub)" <gi...@apache.org>.
cxzl25 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1685871961

   Sharing a real case from my company.
   We run Spark 3.2 in production with the built-in Hive client 2.3.9 plus two patches, communicating with Hive Metastore 1.1.0, and it has been working fine for three years now.
   
   - [HIVE-25500](https://issues.apache.org/jira/browse/HIVE-25500): Switch back to alter_partition(s) in HMS client for Hive 2.3.x
   - [[SPARK-44454](https://issues.apache.org/jira/browse/SPARK-44454)][SQL][HIVE] HiveShim getTablesByType support fallback
   




Re: [PR] [SPARK-44811][BUILD] Upgrade Guava to 32+ [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1835054166

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] pan3793 commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1678857243

   > We should probably do something like this for 4.0, yes.
   
   @srowen yes, it targets 4.0.
   
   > Why does it require removing the isolated classloader?
   
   Guava is marked as a shared class, which means Guava is not fully isolated. (I don't fully understand the background, but it is in fact required; excluding it from the shared classes breaks the tests.)
   
   https://github.com/apache/spark/blob/afcccb42c96cc7785a85c9463523f34d7a900a1d/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L205C3-L219
   
   So technically, if we want to upgrade Guava while keeping `IsolatedClientLoader`, we would have to make all supported Hive versions compatible with the new Guava, which is impossible.
   
   And as I commented in https://www.mail-archive.com/dev@spark.apache.org/msg30708.html, I think `IsolatedClientLoader` is no longer required after dropping Java 8:
   
   > Dropping support for Java 8 means dropping support for Hive lower than 2.0 (exclusive) [1].
   > 
   > IsolatedClientLoader aims to allow using different Hive jars to communicate with different versions of HMS. AFAIK, the current built-in Hive 2.3.9 client works well communicating with Hive Metastore servers from 2.1 through 3.1 (maybe 2.0 too, not sure). This brings a new question: is IsolatedClientLoader still required then?
   > 
   > I think we should drop IsolatedClientLoader because
   > 
   > 1. As explained above, we can use the built-in Hive 2.3.9 client to communicate with HMS 2.1+.
   > 2. Since SPARK-42539 [2], the default Hive 2.3.9 client does not use IsolatedClientLoader, and as explained in SPARK-42539, IsolatedClientLoader causes some inconsistent behaviors.
   > 3. It blocks the Guava upgrade. HIVE-27560 [3] aims to make Hive 2.3.10 (unreleased) compatible with all Guava 14+ versions, but unfortunately, Guava is marked as a shared class via `isSharedClass` [4] in IsolatedClientLoader, so technically, if we want to upgrade Guava we would need to make all supported Hive versions (2.1.x through 3.1.x) support a high version of Guava, which I think is impossible.
   > 
   > [1] sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientVersions.scala
   > [2] https://issues.apache.org/jira/browse/SPARK-42539
   > [3] https://issues.apache.org/jira/browse/HIVE-27560
   > [4] https://github.com/apache/spark/pull/33989#issuecomment-926277286
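   
   For context on the `isSharedClass` reference above, here is a minimal sketch of how a shared-class rule in an isolated classloader typically works, and why keeping `com.google.common` in the shared list ties every isolated Hive client to Spark's built-in Guava. The class name and prefix list below are simplified assumptions for illustration, not Spark's actual `IsolatedClientLoader` code.
   
   ```scala
   import java.net.{URL, URLClassLoader}
   
   // Simplified sketch of an isolated classloader with a "shared class" rule.
   // Class names and prefixes are illustrative assumptions only.
   class IsolatedLoaderSketch(hiveJars: Array[URL], sparkLoader: ClassLoader) {
   
     // Classes matching these prefixes are always loaded from Spark's own
     // classloader, so both sides see exactly the same class definitions.
     private val sharedPrefixes = Seq(
       "org.slf4j",
       "org.apache.log4j",
       "com.google.common" // sharing Guava forces the Hive-side jars to be
                           // binary compatible with Spark's Guava version
     )
   
     private def isSharedClass(name: String): Boolean =
       sharedPrefixes.exists(name.startsWith)
   
     val classLoader: ClassLoader = new URLClassLoader(hiveJars, null) {
       override def loadClass(name: String, resolve: Boolean): Class[_] = {
         if (isSharedClass(name)) {
           sparkLoader.loadClass(name) // shared: delegate to Spark's classloader
         } else {
           super.loadClass(name, resolve) // isolated: resolve from the Hive jars
         }
       }
     }
   }
   ```
   
   Under this model, removing the Guava prefix from the shared list (the approach later tracked as SPARK-45292 in https://github.com/apache/spark/pull/42599) lets the isolated side resolve Guava from the Hive client jars instead, which is what decouples the Guava upgrade from `IsolatedClientLoader`.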




[GitHub] [spark] LuciferYang commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1678389086

   Thank you for your efforts on this. @pan3793 




[GitHub] [spark] pan3793 commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1687562520

   @JoshRosen thanks for the comments. https://github.com/apache/spark/pull/42599 has been opened to try removing Guava from `sharedClasses`; let's wait for the CI results.
   
   > If Spark 4.0 drops JDK8 support then we'd already lose support for Hive < 2.0 using out-of-the-box Hive metastore clients.
   
   Yes, definitely. From the community's perspective, that means Spark 4.0 drops support for HMS < 2.0.
   
   > However, some users run forked versions of old Hive which do support newer JDKs and for them the `IsolatedClientLoader` removal would create pains for their Spark 4.0 upgrade. Since those users are already forking Hive, though, I suppose they could simply recompile their Hive fork against Spark's newer Guava version.
   
   I think that is out of scope for the Spark community's responsibility. It is also very unusual to combine Hive 0.13 (released in 2014) with Spark 4.0 (to be released in 2024); considering that Spark 4.0 has already decided to drop support for Java 8 and Hadoop 2, I don't think we should insist on supporting such an old Hive version.
   
   I definitely see lots of users fork and maintain their internal Hadoop/Hive/Spark versions, but as you said, they could simply recompile their Hive fork against Spark's newer Guava version; I suppose they could also recompile their Spark fork against their Hive fork.
   
   In short, I don't think it makes sense for a user to expect Spark 4.0.0 to work out of the box with a forked Hive released a decade ago. They should expect some code changes on the Spark side to make such a combination work.




[GitHub] [spark] pan3793 commented on a diff in pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #42493:
URL: https://github.com/apache/spark/pull/42493#discussion_r1294121980


##########
pom.xml:
##########
@@ -343,6 +343,15 @@
         <enabled>false</enabled>
       </snapshots>
     </repository>
+
+    <repository>
+      <id>mvn-repo</id>
+      <url>https://raw.github.com/pan3793/mvn-repo/master</url>

Review Comment:
   This is a temporary repo for https://github.com/apache/hive/pull/4542





[GitHub] [spark] srowen commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1678839305

   We should probably do something like this for 4.0, yes.
   Why does it require removing the isolated classloader? That's the only part where I'm concerned about breaking something else.




[GitHub] [spark] JoshRosen commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "JoshRosen (via GitHub)" <gi...@apache.org>.
JoshRosen commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1687092403

   Regarding the proposed removal of `IsolatedClientLoader`, I am concerned that doing so might break some users' ability to interface with very old Hive metastores:
   
   Although newer Hive client versions are compatible with a wider range of metastore versions, I am unsure whether they cover _all_ old versions. If users are relying on Hive 0.13, for example, then I am concerned that removing `IsolatedClientLoader` might prevent those users from being able to upgrade Spark without first upgrading their metastore.
   
   As was pointed out upthread in https://github.com/apache/spark/pull/42493#issuecomment-1678857243, though, those old Hive versions require Java 8. If Spark 4.0 drops JDK8 support then we'd already lose support for Hive < 2.0 using out-of-the-box Hive metastore clients.
   
   However, some users run forked versions of old Hive which _do_ support newer JDKs and for them the `IsolatedClientLoader` removal would create pains for their Spark 4.0 upgrade. Since those users are _already_ forking Hive, though, I suppose they could simply recompile their Hive fork against Spark's newer Guava version.
   
   ---
   
   That said, I wonder whether we can decouple the `IsolatedClientLoader` removal decision from the Guava upgrade by removing Guava from `sharedClasses` (previously discussed at  https://github.com/apache/spark/pull/33989#issuecomment-928616327). This could also prove useful in case we uncover some other reason why we might want to continue using `IsolatedClientLoader`.
   
   I think Guava would only need to be shared if we were passing instances of Guava classes across the classloader boundary, but I don't think we're doing that.
   
   I tried removing Guava from `sharedClasses` and used SBT to run `org.apache.spark.sql.hive.client.HiveClientSuites`, and the tests seem to pass. Given this, maybe we could try kicking off a full CI run to see whether we get any failures there? I think it may be worth re-testing here in order to verify our assumptions about Guava and the `IsolatedClientLoader`.
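   
   To make the classloader-boundary point concrete, here is a hypothetical, self-contained sketch (not Spark or Hive code; the Guava jar path is a placeholder command-line argument) showing that two isolated loaders can each hold their own copy of Guava, and that this only becomes a problem if a Guava-typed instance is handed across the boundary:
   
   ```scala
   import java.net.URLClassLoader
   
   // Hypothetical illustration: a class only needs to be "shared" when its
   // instances cross the classloader boundary. Pass a path to any Guava jar
   // as the first argument (placeholder assumption for this sketch).
   object ClassLoaderBoundarySketch {
     def main(args: Array[String]): Unit = {
       val guavaJar = new java.io.File(args(0)).toURI.toURL
   
       // Two loaders with a null parent: each resolves Guava on its own and
       // shares nothing with the application classloader.
       val loaderA = new URLClassLoader(Array(guavaJar), null)
       val loaderB = new URLClassLoader(Array(guavaJar), null)
   
       val clsA = loaderA.loadClass("com.google.common.collect.ImmutableList")
       val clsB = loaderB.loadClass("com.google.common.collect.ImmutableList")
   
       // Same fully qualified name, but two distinct Class objects, so an
       // instance created via clsA cannot be cast to clsB's type. That only
       // matters if such an instance crosses the boundary; if only JDK types
       // (String, java.util.List, ...) cross it, each side can keep its own
       // Guava version.
       println(clsA == clsB) // prints: false
     }
   }
   ```
   
   If, as assumed above, no Guava instances are exchanged between Spark and the isolated Hive client, this suggests dropping Guava from `sharedClasses` is safe, which is consistent with `HiveClientSuites` passing locally.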




[GitHub] [spark] xkrogen commented on pull request #42493: [SPARK-44811][BUILD] Upgrade Guava to 32+

Posted by "xkrogen (via GitHub)" <gi...@apache.org>.
xkrogen commented on PR #42493:
URL: https://github.com/apache/spark/pull/42493#issuecomment-1681039320

   > Dropping `IsolatedClientLoader` is also my main concern here - I'm not sure whether anyone in the community still relies on the feature for some reason (cc @xkrogen). I do agree that Hive 2.3.9/10 should be compatible with _most_ Hive versions.
   
   Thanks for the ping @sunchao! We're still using it in Spark 3.1.1 to access Hive 1.1, but we're working to get rid of it after the compatibility work you did on the HMS side (thanks for that!). Our only blocker has been some rollout concerns, nothing from the compatibility angle. So, good from my perspective.

