
Posted to reviews@spark.apache.org by "eubnara (via GitHub)" <gi...@apache.org> on 2024/02/28 06:35:13 UTC

[PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

eubnara opened a new pull request, #45309:
URL: https://github.com/apache/spark/pull/45309

### What changes were proposed in this pull request?

Make `spark-sql` and `spark-shell` able to access Iceberg tables through HiveCatalog.
To access an Iceberg table with HiveCatalog through `spark-sql` or `spark-shell`, the user should specify additional configuration:

```
$ spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.hadoop_prod=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.hadoop_prod.type=hive \
--conf spark.sql.catalog.hadoop_prod.uri=thrift://hms1.example.com:9083,thrift://hms2.example.com:9083 \
--conf spark.hadoop.iceberg.engine.hive.enabled=true \
--conf spark.jars=hdfs:///some/path/to/iceberg-spark-runtime-3.2_2.12-1.4.3.jar \
--conf spark.hadoop.hive.aux.jars.path=hdfs:///some/path/to/iceberg-hive-runtime-1.4.3.jar \
--conf spark.security.credentials.hive.enabled=true
```

### Why are the changes needed?

`spark-sql` and `spark-shell` cannot access an Iceberg table with HiveCatalog because no HIVE_DELEGATION_TOKEN is available.

### Does this PR introduce _any_ user-facing change?

If a user specifies `--conf spark.security.credentials.hive.enabled=true`, Spark obtains a HIVE_DELEGATION_TOKEN even when the deploy mode is not "cluster".

### How was this patch tested?

Manually tested on an on-premise internal cluster with Hadoop 3.3.4, Iceberg 1.4.3, and Spark 3.2.3.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "eubnara (via GitHub)" <gi...@apache.org>.

eubnara commented on PR #45309:
URL: https://github.com/apache/spark/pull/45309#issuecomment-1969212746

   Even with this patch, `insert into` is broken (`describe extended` and `select * from` queries work).
   Maybe https://issues.apache.org/jira/browse/SPARK-30885 is related?
   
   ```
   Driver stacktrace:
           at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
           at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
           at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
           at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
           at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
           at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
           at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
           at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
           at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
           at scala.Option.foreach(Option.scala:407)
           at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
           at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
           at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
           at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
           at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
           at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
           at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
           at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:228)
           ... 61 more
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
           at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:274)
           at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:132)
           at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:105)
           at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)
           at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:146)
           at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
           at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$17(FileFormatWriter.scala:239)
           at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
           at org.apache.spark.scheduler.Task.run(Task.scala:131)
           at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1492)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   Caused by: java.lang.NullPointerException
           at org.apache.iceberg.mr.hive.TezUtil$TaskAttemptWrapper.<init>(TezUtil.java:105)
           at org.apache.iceberg.mr.hive.TezUtil.taskAttemptWrapper(TezUtil.java:78)
           at org.apache.iceberg.mr.hive.HiveIcebergOutputFormat.writer(HiveIcebergOutputFormat.java:73)
           at org.apache.iceberg.mr.hive.HiveIcebergOutputFormat.getHiveRecordWriter(HiveIcebergOutputFormat.java:58)
           at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
           at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
           ... 14 more
   ```



Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #45309:
URL: https://github.com/apache/spark/pull/45309#issuecomment-1968395911

   IMO this is an Iceberg-side issue. In addition to the case you listed above, access to multiple Kerberized HMS instances should be considered, e.g. when the Spark built-in HMS and the Iceberg HMS are different, or when more than one Iceberg Hive catalog is configured.
   
   +cc @pvary @szehon-ho @sunchao
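
To make the multi-HMS point concrete, here is a minimal Python sketch that lists the metastore endpoints declared by `type=hive` catalogs in a set of Spark configuration entries. It only parses the `spark.sql.catalog.<name>.type` and `spark.sql.catalog.<name>.uri` keys used in the PR description. The helper name `hive_catalog_uris` is hypothetical and is not part of Spark or Iceberg; it just shows that each configured catalog may point at a different HMS that would need its own token.

```python
import re

# Matches keys such as "spark.sql.catalog.hadoop_prod.type".
_CATALOG_KEY = re.compile(r"^spark\.sql\.catalog\.([^.]+)\.(type|uri)$")


def hive_catalog_uris(conf):
    """Map each catalog declared with type=hive to its list of metastore URIs.

    Hypothetical helper for illustration only; not part of Spark or Iceberg.
    """
    props = {}
    for key, value in conf.items():
        match = _CATALOG_KEY.match(key)
        if match:
            name, prop = match.groups()
            props.setdefault(name, {})[prop] = value
    # Keep only catalogs whose type is "hive"; split comma-separated URIs.
    return {
        name: [u.strip() for u in p.get("uri", "").split(",") if u.strip()]
        for name, p in props.items()
        if p.get("type") == "hive"
    }


conf = {
    "spark.sql.catalog.hadoop_prod": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.hadoop_prod.type": "hive",
    "spark.sql.catalog.hadoop_prod.uri": "thrift://hms1.example.com:9083,thrift://hms2.example.com:9083",
    "spark.sql.catalog.other.type": "hadoop",  # not a Hive catalog, so it is skipped
}
print(hive_catalog_uris(conf))
```

With the configuration from the PR description this yields a single catalog with two URIs; adding a second `type=hive` catalog would add a second entry, which is the situation a per-catalog token provider would have to handle.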
   



Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "pan3793 (via GitHub)" <gi...@apache.org>.

pan3793 commented on PR #45309:
URL: https://github.com/apache/spark/pull/45309#issuecomment-1968353484

   `HiveDelegationTokenProvider` takes care of token refresh for the Spark built-in HMS client. Iceberg uses its own HMS client implementation and should handle token refresh itself.
   
   As an example, Apache Kyuubi implements a Hive Connector based on the Spark DSv2 API that can connect to multiple HMSs, and implements `KyuubiHiveConnectorDelegationTokenProvider` to take care of token refresh for the HMS clients it manages: https://github.com/apache/kyuubi/pull/4560



Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "eubnara (via GitHub)" <gi...@apache.org>.

eubnara closed pull request #45309: [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell
URL: https://github.com/apache/spark/pull/45309



Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "eubnara (via GitHub)" <gi...@apache.org>.

eubnara commented on PR #45309:
URL: https://github.com/apache/spark/pull/45309#issuecomment-1969269354

   Oh! I finally figured out why it fails.
   I should not use the iceberg-hive-runtime jar with spark-sql or spark-shell.
   I also forgot to qualify the database and table with the catalog name in my queries.
   
   ```
   SELECT * FROM prod.db.table; -- correct
   SELECT * FROM db.table; -- wrong
   ```
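
The difference between the two queries above can be sketched with a small Python model of how a dotted table name is split into catalog, namespace, and table. This is a simplified illustration, not Spark's actual resolver: the function `resolve` is hypothetical, it ignores backtick quoting, and it assumes the current catalog is the default session catalog `spark_catalog`.

```python
def resolve(name, catalogs, current_catalog="spark_catalog"):
    """Split a dotted table name into (catalog, namespace, table).

    Simplified model: the first part is treated as a catalog only when it
    matches a configured catalog name; otherwise the whole name is resolved
    in the current catalog.
    """
    parts = name.split(".")
    if len(parts) > 1 and parts[0] in catalogs:
        return parts[0], parts[1:-1], parts[-1]
    return current_catalog, parts[:-1], parts[-1]


catalogs = {"prod"}  # names configured via spark.sql.catalog.<name>
print(resolve("prod.db.table", catalogs))  # ('prod', ['db'], 'table')
print(resolve("db.table", catalogs))       # ('spark_catalog', ['db'], 'table')
```

Under this model the unqualified `db.table` never reaches the Iceberg catalog. It goes to the session catalog instead, which is consistent with the `HiveFileFormat` frames in the earlier stack trace.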



Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "eubnara (via GitHub)" <gi...@apache.org>.

eubnara commented on PR #45309:
URL: https://github.com/apache/spark/pull/45309#issuecomment-1968379376

   Thanks for the reply.
   So with `spark-sql` or `spark-shell`, is it impossible to use Iceberg with HiveCatalog? Is only Iceberg with HadoopCatalog supported?



Re: [PR] [SPARK-47197] Failed to connect HiveMetastore when using iceberg with HiveCatalog on spark-sql or spark-shell [spark]

Posted by "eubnara (via GitHub)" <gi...@apache.org>.

eubnara commented on PR #45309:
URL: https://github.com/apache/spark/pull/45309#issuecomment-1968419075

   Thanks for the explanation. I think I need to review the Spark and Iceberg code more...

