Posted to reviews@spark.apache.org by "attilapiros (via GitHub)" <gi...@apache.org> on 2023/09/22 21:25:37 UTC
[GitHub] [spark] attilapiros opened a new pull request, #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
attilapiros opened a new pull request, #43064:
URL: https://github.com/apache/spark/pull/43064
### What changes were proposed in this pull request?
Supporting Hive 4.0 metastore, where partition filters can be pushed down even for CHAR and VARCHAR types.
**Hive 4.0 is still in beta! This is why this is a work-in-progress PR.**
### Why are the changes needed?
Supporting more Hive versions (with extra performance improvement) is good for our users.
### Does this PR introduce _any_ user-facing change?
Yes. The documentation is updated to reflect Hive 4.0 metastore support.
### How was this patch tested?
#### Manually
I used the docker image of apache/hive:4.0.0-beta-1 for starting a metastore and a hiveserver2 (along with a hadoop3 docker image).
Created a table:
```
CREATE EXTERNAL TABLE testTable1 (
column1 String
) PARTITIONED BY (partColumn1 CHAR(30), partColumn2 VARCHAR(30)) LOCATION 'hdfs://hadoop3:8020/tmp/hive_external/';
```
Inserted some values in beeline:
```
insert into table testtable1 values ("column1_v1", "partcolumn1_v1", "partcolumn2_v1"), ("column1_v2", "partcolumn1_v2", "partcolumn2_v2");
```
Started Spark in the hiveserver2 container:
```
./bin/spark-shell --conf spark.sql.hive.metastore.version=4.0.0 --conf spark.sql.hive.metastore.jars="/opt/hive/lib/*"
```
Ran the query:
```
scala> sql("select * from testtable1 where partcolumn1 = 'partcolumn1_v1' and partcolumn2 = 'partcolumn2_v1'").show
Hive Session ID = 6846fe0e-968a-474d-afec-4f67b3a2a274
+----------+--------------------+--------------+
|   column1|         partcolumn1|   partcolumn2|
+----------+--------------------+--------------+
|column1_v1|partcolumn1_v1   ...|partcolumn2_v1|
+----------+--------------------+--------------+
```
Then checked the HMS calls in the metastore container, in the file `/tmp/hive/hive.log`:
```
...
2023-09-22T21:06:34,293 INFO [Metastore-Handler-Pool: Thread-1356] HiveMetaStore.audit: ugi=hive ip=172.30.0.5 cmd=source:172.30.0.5 get_partitions_by_filter : tbl=hive.default.testtable1
...
```
The log contains the expected `get_partitions_by_filter` call.
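The manual log check can also be scripted. The following is a minimal sketch, assuming the audit-line format of the log excerpt shown above (a `HiveMetaStore.audit` entry in which the call name is followed by ` : `). The function name `count_hms_calls` is hypothetical and not part of Hive or Spark.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hive or Spark): count the metastore audit
# entries for one HMS call in a log file.
# Assumption: audit lines look like the excerpt in this PR description, i.e.
# they contain "HiveMetaStore.audit" and the call name followed by " : ".
# Usage: count_hms_calls <log-file> <call-name>
count_hms_calls() {
    # The first grep keeps only audit lines; the second counts those that
    # mention the call. grep -c exits non-zero when the count is 0, so
    # "|| true" keeps the function usable under "set -e".
    grep -F -e "HiveMetaStore.audit" "$1" | grep -c -F -e " $2 : " || true
}

# Demo on the single log line quoted in this PR description.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2023-09-22T21:06:34,293 INFO [Metastore-Handler-Pool: Thread-1356] HiveMetaStore.audit: ugi=hive ip=172.30.0.5 cmd=source:172.30.0.5 get_partitions_by_filter : tbl=hive.default.testtable1
EOF
count_hms_calls "$log_file" get_partitions_by_filter   # prints 1
rm -f "$log_file"
```

In the setup described here, the same function could be pointed at `/tmp/hive/hive.log` in the metastore container. A non-zero count after running the query would suggest that the partition filter was pushed down to the metastore.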
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
URL: https://github.com/apache/spark/pull/43064
[GitHub] [spark] pan3793 commented on a diff in pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43064:
URL: https://github.com/apache/spark/pull/43064#discussion_r1335221474
##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/package.scala:
##########
@@ -131,8 +131,16 @@ package object client {
"org.pentaho:pentaho-aggdesigner-algorithm",
"org.apache.hive:hive-vector-code-gen"))
+ // Since HIVE-14496, Hive.java uses calcite-core
+ case object v4_0 extends HiveVersion("4.0.0",
+ extraDeps = Seq("org.apache.derby:derby:10.14.2.0"),
+ exclusions = Seq("org.apache.calcite:calcite-druid",
+ "org.apache.curator:*",
+ "org.pentaho:pentaho-aggdesigner-algorithm",
Review Comment:
Hive 4.0-beta1 depends on Calcite 1.25. Since CALCITE-1474 (Calcite 1.11), `org.pentaho:pentaho-aggdesigner` (available in Conjars, which has already been sunset) has been replaced by `net.hydromatic:aggdesigner` (available in Maven Central). So I think this exclusion is no longer valid, and simply removing it should be fine.
Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1817590589
Is there any update for Apache Hive 4.0, @attilapiros ?
Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1821771472
Thank you for the updates and the link, @attilapiros .
[GitHub] [spark] attilapiros commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1741525251
@dongjoon-hyun
Regarding Hive 4.0 there is the [Test with the TPC-DS benchmark](https://issues.apache.org/jira/browse/HIVE-26654) to be done but when the release is out I will update this PR.
[GitHub] [spark] attilapiros commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1732139192
@dongjoon-hyun
Thanks!
> Are you using the current beta-1?
Yes.
> Is there a timeline for Hive 4.0 GA?
I will ask around, but as far as I know they still have some blockers.
> Although I know that you filed this as a Bug for some old releases, I believe this PR should be a subtask for Apache Spark 4.0.0 because there are no existing Spark users with an Apache Hive 4.0.0 Metastore.
Sorry, that was a mistake of mine. Thanks for fixing that in Jira.
[GitHub] [spark] dongjoon-hyun commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1732140284
Thank you. And, if you are fine with Apache Spark 4.0, that's great! I was worried. 😄
[GitHub] [spark] dongjoon-hyun commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1741575567
Thank you so much for keeping us up-to-date, @attilapiros !
> Regarding Hive 4.0 there is the [Test with the TPC-DS benchmark](https://issues.apache.org/jira/browse/HIVE-26654) to be done but when the release is out I will update this PR.
[GitHub] [spark] HyukjinKwon commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1733000616
cc @wangyum too
Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]
Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1821770267
> Is there any update for Apache Hive 4.0, @attilapiros ?
@dongjoon-hyun they still have some more issues to solve (as far as I can see, performance issues with some TPC-DS queries):
https://lists.apache.org/thread/3okjgw3y6tso7l2rg3hhy8lccp6d6mmy
Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1972199715
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!