Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2021/01/26 20:30:00 UTC

[jira] [Commented] (HUDI-281) HiveSync failure through Spark when useJdbc is set to false

    [ https://issues.apache.org/jira/browse/HUDI-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272398#comment-17272398 ] 

sivabalan narayanan commented on HUDI-281:
------------------------------------------

[~uditme]: is this still a valid issue? 

> HiveSync failure through Spark when useJdbc is set to false
> -----------------------------------------------------------
>
>                 Key: HUDI-281
>                 URL: https://issues.apache.org/jira/browse/HUDI-281
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Hive Integration, Spark Integration, Usability
>            Reporter: Udit Mehrotra
>            Priority: Major
>              Labels: bug-bash-0.6.0
>
> Table creation with Hive sync through Spark fails when I set *useJdbc* to *false*. For now, I had to modify the code to set *useJdbc* to *false*, as there is no *DataSourceOption* through which I can specify this field when running Hudi code.
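> To illustrate the gap, here is a minimal sketch of a Hudi write with Hive sync enabled (a hypothetical example; the *hoodie.datasource.hive_sync.\** keys are from *DataSourceWriteOptions*, and the table/field names are made up):
> {noformat}
> import org.apache.spark.sql.SaveMode
>
> // df is an existing DataFrame holding the records to upsert.
> df.write.format("org.apache.hudi")
>   .option("hoodie.table.name", "my_table")
>   .option("hoodie.datasource.write.recordkey.field", "id")
>   .option("hoodie.datasource.write.partitionpath.field", "dt")
>   .option("hoodie.datasource.hive_sync.enable", "true")
>   .option("hoodie.datasource.hive_sync.table", "my_table")
>   // There is no hoodie.datasource.hive_sync.* option to flip
>   // HiveSyncConfig.useJdbc to false, which is why the code had to be
>   // modified instead.
>   .mode(SaveMode.Append)
>   .save("s3://bucket/path/my_table")
> {noformat}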
> Here is the failure:
> {noformat}
> java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.session.SessionState.start(Lorg/apache/hudi/org/apache/hadoop_hive/conf/HiveConf;)Lorg/apache/hadoop/hive/ql/session/SessionState;
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLs(HoodieHiveClient.java:527)
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQLUsingHiveDriver(HoodieHiveClient.java:517)
>   at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:507)
>   at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
>   at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:132)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:96)
>   at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
>   at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:235)
>   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229){noformat}
> I was expecting this to fail through Spark, because *hive-exec* is not shaded inside *hudi-spark-bundle*, while *HiveConf* is shaded and relocated. This *SessionState* comes from the spark-hive jar, and it obviously does not accept the relocated *HiveConf*.
> We at *EMR* are running into the same problem when trying to integrate with the Glue Catalog. For this, we have to create the Hive metastore client through *Hive.get(conf).getMsc()* instead of how it is being done now, so that alternate metastore implementations can get created. However, because *hive-exec* is not shaded while *HiveConf* is relocated, we run into the same issues there.
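> A minimal sketch of that approach (not the actual *HoodieHiveClient* code; the method is spelled *getMSC()* in Hive's *org.apache.hadoop.hive.ql.metadata.Hive* API):
> {noformat}
> import org.apache.hadoop.hive.conf.HiveConf
> import org.apache.hadoop.hive.metastore.IMetaStoreClient
> import org.apache.hadoop.hive.ql.metadata.Hive
>
> def createMetaStoreClient(conf: HiveConf): IMetaStoreClient = {
>   // Hive.get() honors hive.metastore.client.factory.class, so alternate
>   // metastore client implementations (e.g. a Glue Catalog client factory)
>   // can be plugged in, instead of hard-wiring HiveMetaStoreClient.
>   Hive.get(conf).getMSC()
> }
> {noformat}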
> Shading *hive-exec* is not recommended either, because it is itself an uber jar that shades a lot of things, all of which would end up in the *hudi-spark-bundle* jar. We would not want to head down that route. That is why we would suggest considering removing any shading of Hive libraries.
> We could add a *Maven Profile* to control the shading, but the unshaded build would have to be the default; otherwise the default build will still fail when *useJdbc* is set to false, and again later when we commit the *Glue Catalog* changes. A sketch of such a profile follows.
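> A hypothetical POM sketch (the profile id is illustrative, not from the actual *hudi-spark-bundle* pom; the relocation pattern matches the one visible in the stack trace above):
> {noformat}
> <profile>
>   <id>shade-hive</id>
>   <activation>
>     <!-- Off by default, so the default build ships unshaded Hive classes. -->
>     <activeByDefault>false</activeByDefault>
>   </activation>
>   <build>
>     <plugins>
>       <plugin>
>         <groupId>org.apache.maven.plugins</groupId>
>         <artifactId>maven-shade-plugin</artifactId>
>         <configuration>
>           <relocations>
>             <relocation>
>               <pattern>org.apache.hadoop.hive</pattern>
>               <shadedPattern>org.apache.hudi.org.apache.hadoop_hive</shadedPattern>
>             </relocation>
>           </relocations>
>         </configuration>
>       </plugin>
>     </plugins>
>   </build>
> </profile>
> {noformat}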



--
This message was sent by Atlassian Jira
(v8.3.4#803005)