Posted to reviews@spark.apache.org by "attilapiros (via GitHub)" <gi...@apache.org> on 2023/09/22 21:25:37 UTC

[GitHub] [spark] attilapiros opened a new pull request, #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

attilapiros opened a new pull request, #43064:
URL: https://github.com/apache/spark/pull/43064

   
   ### What changes were proposed in this pull request?
   
   Supporting the Hive 4.0 metastore, where partition filters can be pushed down even for CHAR and VARCHAR types.
   
   **Hive 4.0 is still in beta! This is why this PR is a work in progress.** 
   
   ### Why are the changes needed?
   
   Supporting more Hive versions (with extra performance improvement) is good for our users.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The documentation is updated to cover Hive 4.0 metastore support.
   
   ### How was this patch tested?
   
   #### Manually
   
   I used the docker image of apache/hive:4.0.0-beta-1 for starting a metastore and a hiveserver2 (along with a hadoop3 docker image).
   
   Created a table:
   ```
   CREATE EXTERNAL TABLE testTable1 ( 
     column1 String 
   ) PARTITIONED BY (partColumn1 CHAR(30), partColumn2 VARCHAR(30)) LOCATION 'hdfs://hadoop3:8020/tmp/hive_external/';
   ```
   
   Inserted some values in beeline:
   
   ```
   insert into table testtable1 values ("column1_v1", "partcolumn1_v1", "partcolumn2_v1"), ("column1_v2", "partcolumn1_v2", "partcolumn2_v2");
   ```
   
   Started Spark in the hiveserver2 container:
   ```
   ./bin/spark-shell --conf spark.sql.hive.metastore.version=4.0.0 --conf spark.sql.hive.metastore.jars="/opt/hive/lib/*"
   ```
   
   Ran the query:
   ```
   scala> sql("select * from testtable1 where partcolumn1 = 'partcolumn1_v1' and partcolumn2 = 'partcolumn2_v1'").show
   Hive Session ID = 6846fe0e-968a-474d-afec-4f67b3a2a274
   +----------+--------------------+--------------+
   |   column1|         partcolumn1|   partcolumn2|
   +----------+--------------------+--------------+
   |column1_v1|partcolumn1_v1   ...|partcolumn2_v1|
   +----------+--------------------+--------------+
   ```
   
   And checked the HMS calls in the metastore container, in the file `/tmp/hive/hive.log`:
   ```
   ...
   2023-09-22T21:06:34,293  INFO [Metastore-Handler-Pool: Thread-1356] HiveMetaStore.audit: ugi=hive       ip=172.30.0.5   cmd=source:172.30.0.5 get_partitions_by_filter : tbl=hive.default.testtable1
   ...
   ```
   
   The log contains the expected `get_partitions_by_filter` call.
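   
   The manual log check could be scripted roughly as follows. This is a minimal sketch and not part of the patch: the function name `hms_used_filter_pushdown` is hypothetical, and it assumes the audit line format shown in the log excerpt above (the call name followed by `tbl=<catalog>.<db>.<table>` on the same line).
   
   ```shell
   # Hypothetical helper (not part of the patch): succeeds if the metastore
   # audit log contains a get_partitions_by_filter call for the given table.
   # Assumes the call name and "tbl=<name>" appear on the same log line.
   # Note: this is a plain substring match, so "t1" would also match "t10".
   hms_used_filter_pushdown() {
     log_file="$1"
     table="$2"
     grep -F "get_partitions_by_filter" "$log_file" | grep -qF "tbl=$table"
   }
   ```
   
   For the setup above it would be called as `hms_used_filter_pushdown /tmp/hive/hive.log hive.default.testtable1`, returning a zero exit status when the pushdown call was logged.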
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore
URL: https://github.com/apache/spark/pull/43064




[GitHub] [spark] pan3793 commented on a diff in pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.
pan3793 commented on code in PR #43064:
URL: https://github.com/apache/spark/pull/43064#discussion_r1335221474


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/package.scala:
##########
@@ -131,8 +131,16 @@ package object client {
         "org.pentaho:pentaho-aggdesigner-algorithm",
         "org.apache.hive:hive-vector-code-gen"))
 
+    // Since HIVE-14496, Hive.java uses calcite-core
+    case object v4_0 extends HiveVersion("4.0.0",
+      extraDeps = Seq("org.apache.derby:derby:10.14.2.0"),
+      exclusions = Seq("org.apache.calcite:calcite-druid",
+        "org.apache.curator:*",
+        "org.pentaho:pentaho-aggdesigner-algorithm",

Review Comment:
   Hive 4.0-beta1 depends on Calcite 1.25. Since CALCITE-1474 (Calcite 1.11), `org.pentaho:pentaho-aggdesigner` (available in Conjars, which has already been sunset) was replaced by `net.hydromatic:aggdesigner` (available in Maven Central), so I think this exclusion is no longer needed; simply removing it would be fine.





Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1817590589

   Is there any update for Apache Hive 4.0, @attilapiros ?




Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1821771472

   Thank you for the updates and the link, @attilapiros .




[GitHub] [spark] attilapiros commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1741525251

   @dongjoon-hyun 
   Regarding Hive 4.0 there is the [Test with the TPC-DS benchmark](https://issues.apache.org/jira/browse/HIVE-26654) to be done but when the release is out I will update this PR.




[GitHub] [spark] attilapiros commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1732139192

   @dongjoon-hyun 
   
   Thanks!
   
   > Are you using the current beta-1?
   
   Yes.
   
   > Is there a timeline for Hive 4.0 GA?
   
   I will ask around, but as far as I know they still have some blockers.
   
   > Although I know that you filed this as a Bug for some old releases, I believe this PR should be a subtask for Apache Spark 4.0.0 because there are no existing Spark users with the Apache Hive 4.0.0 Metastore.
   
   Sorry, that was a mistake of mine. Thanks for fixing that in Jira.
   
   




[GitHub] [spark] dongjoon-hyun commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1732140284

   Thank you. And, if you are fine with Apache Spark 4.0, that's great! I was worried. 😄 




[GitHub] [spark] dongjoon-hyun commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1741575567

   Thank you so much for keeping us up-to-date, @attilapiros !
   
   > Regarding Hive 4.0 there is the [Test with the TPC-DS benchmark](https://issues.apache.org/jira/browse/HIVE-26654) to be done but when the release is out I will update this PR.
   
   




[GitHub] [spark] HyukjinKwon commented on pull request #43064: [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1733000616

   cc @wangyum too




Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]

Posted by "attilapiros (via GitHub)" <gi...@apache.org>.
attilapiros commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1821770267

   > Is there any update for Apache Hive 4.0, @attilapiros ?
   
   @dongjoon-hyun they still have some more issues to solve (as far as I can see, performance issues with some TPC-DS queries):
   https://lists.apache.org/thread/3okjgw3y6tso7l2rg3hhy8lccp6d6mmy




Re: [PR] [SPARK-45265][SQL][WIP] Supporting Hive 4.0 metastore [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #43064:
URL: https://github.com/apache/spark/pull/43064#issuecomment-1972199715

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

