Posted to commits@hudi.apache.org by "pete91z (via GitHub)" <gi...@apache.org> on 2023/03/07 23:52:46 UTC

[GitHub] [hudi] pete91z opened a new issue, #8118: [SUPPORT] error in run_sync_tool.sh

pete91z opened a new issue, #8118:
URL: https://github.com/apache/hudi/issues/8118

   **Describe the problem you faced**
   
   Attempting to use run_sync_tool.sh to Hive-sync a Hudi table gives the following error:
   
   ./run_sync_tool.sh --jdbc-url jdbc:hive2:\/\/smaster:10000 --user hive --pass 'pass' --base-path 's3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi_2/' --partitioned-by location --database persis --table tempstream_hudi
   setting hadoop conf dir
   Running Command : java -cp /usr/local/spark/lib/hive-metastore-2.3.9.jar:::/usr/local/spark/lib/hive-exec-2.3.9-core.jar::/usr/local/spark/lib/hive-jdbc-2.3.9.jar::/usr/local/spark/lib/jackson-annotations-2.13.4.jar:/usr/local/spark/lib/jackson-core-2.13.4.jar:/usr/local/spark/lib/jackson-core-asl-1.9.13.jar:/usr/local/spark/lib/jackson-databind-2.13.4.1.jar:/usr/local/spark/lib/jackson-dataformat-yaml-2.13.4.jar:/usr/local/spark/lib/jackson-datatype-jsr310-2.13.4.jar:/usr/local/spark/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/spark/lib/jackson-module-scala_2.12-2.13.4.jar::/usr/local/spark/share/hadoop/common/*:/usr/local/spark/share/hadoop/mapreduce/*:/usr/local/spark/share/hadoop/hdfs/*:/usr/local/spark/share/hadoop/common/lib/*:/usr/local/spark/share/hadoop/hdfs/lib/*:/usr/local/spark/etc/hadoop:/home/spark_331/hudi-0.12.2/hudi-sync/hudi-hive-sync/../../packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.12.2.jar org.apache.hudi.hive.HiveSyncTool --jdbc-url jdbc:hive2://smaster:10000 --user hive --pass pass --base-path s3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi/ --partitioned-by location --database persis --table tempstream_hudi
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing tempstream_hudi
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
   	at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:359)
   Caused by: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot persist persis.tempstream_hudi into Hive metastore as table property keys may not start with 'spark.sql.': [spark.sql.sources.provider, spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts, spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0]
   
   (The full stacktrace is in the **Stacktrace** section below.)
   
   **To Reproduce**
   
   from pyspark.sql.functions import col
   from pyspark.sql.types import StringType, IntegerType, StructType, StructField, TimestampType, FloatType, ArrayType
   t= [{"id":"rlk2ljk24jt-dlkgj24t0rg","location":1,"reading":45,"record_ts": "2023-03-06T13:27:45Z"},{"id":"sdlkgj24230-drlgkj4","location":2,"reading":67,"record_ts":"2023-03-06T16:45:23Z"},{"id":"lkgj2434j-gt4l5k4kl5hj","location":3,"reading":15,"record_ts":"2023-03-06T12:45:33Z"}]
   df=spark.createDataFrame(t)
   df=df.withColumn("record_ts",df.record_ts.cast("timestamp")
   df=df.withColumn("location",df.location.cast("int"))
   df.printSchema()
   tableName='tempstream_hudi'
   hudi_options = {
     'hoodie.table.name': tableName,
     'hoodie.datasource.write.recordkey.field': 'id',
     'hoodie.datasource.write.partitionpath.field': 'location',
     'hoodie.datasource.write.table.name': tableName,
     'hoodie.datasource.write.operation': 'insert',
     'hoodie.datasource.write.precombine.field': 'record_ts',
     'hoodie.upsert.shuffle.parallelism': 2,
     'hoodie.insert.shuffle.parallelism': 2,
     'hoodie.index.type':'SIMPLE'}
   basePath='s3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi_2'
   df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
   
   Re-reading the table to confirm the write:
   
   >>> df=spark.read.format("hudi").load("s3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi_2")
   23/03/07 15:05:50 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   23/03/07 15:05:50 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   >>> df.show()                                                                   
   +-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+-------+--------------------+--------+
   |_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|                  id|reading|           record_ts|location|
   +-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+-------+--------------------+--------+
   |  20230307150228278|20230307150228278...|rlk2ljk24jt-dlkgj...|                     1|4f8364ea-5c68-4ca...|rlk2ljk24jt-dlkgj...|     45|2023-03-06T13:27:45Z|       1|
   |  20230307150228278|20230307150228278...|lkgj2434j-gt4l5k4...|                     3|c15ea7b1-220d-4ef...|lkgj2434j-gt4l5k4...|     15|2023-03-06T12:45:33Z|       3|
   |  20230307150228278|20230307150228278...| sdlkgj24230-drlgkj4|                     2|c8cebcce-c048-41f...| sdlkgj24230-drlgkj4|     67|2023-03-06T16:45:23Z|       2|
   +-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+-------+--------------------+--------+
   
   >>> 
   
   Run the run_sync_tool.sh script :
   
   ./run_sync_tool.sh --jdbc-url jdbc:hive2:\/\/smaster:10000 --user hive --pass 'pass' --base-path 's3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi_2/' --partitioned-by location --database persis --table tempstream_hudi
   
   
   **Expected behavior**
   
   I was expecting the table to be created and registered in the metastore database.
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Spark version : 3.3.1
   
   * Hive version : 2.3.9 (jar files only)
   
   * Hadoop version : 3.3.2
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   $JAVA_HOME/bin/java -version
   java version "1.8.0_361"
   Java(TM) SE Runtime Environment (build 1.8.0_361-b09)
   Java HotSpot(TM) 64-Bit Server VM (build 25.361-b09, mixed mode)
   
   **Additional context**
   
   The Hive metastore DB is MySQL. Hive and Hadoop are not installed per se; I am using the bundled jar files that come with Spark to carry out the Hive metastore operations:
   
   [jars]$ ls *hive*
   hive-beeline-2.3.9.jar    hive-jdbc-2.3.9.jar         hive-service-rpc-3.1.2.jar   hive-shims-scheduler-2.3.9.jar  spark-hive-thriftserver_2.12-3.3.1.jar
   hive-cli-2.3.9.jar        hive-llap-common-2.3.9.jar  hive-shims-0.23-2.3.9.jar    hive-storage-api-2.7.2.jar
   hive-common-2.3.9.jar     hive-metastore-2.3.9.jar    hive-shims-2.3.9.jar         hive-vector-code-gen-2.3.9.jar
   hive-exec-2.3.9-core.jar  hive-serde-2.3.9.jar        hive-shims-common-2.3.9.jar  spark-hive_2.12-3.3.1.jar
   [jars]$ ls *hadoop*
   dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar  hadoop-client-api-3.3.2.jar      hadoop-shaded-guava-1.1.1.jar           parquet-hadoop-1.12.2.jar
   hadoop-aws-3.3.2.jar                                   hadoop-client-runtime-3.3.2.jar  hadoop-yarn-server-web-proxy-3.3.2.jar
   
   Where a certain directory structure is expected by the tool, I have symlinked the files.
   
   I am also using the Spark thrift server for connectivity (running on port 10000).
   
   SPARK_HOME=/usr/local/spark
   HADOOP_HOME=/usr/local/spark
   HIVE_HOME=/usr/local/spark
   
   
   Writes to the metastore for a non-Hudi dataframe / table work fine:
   
   from pyspark.sql.functions import col
   from pyspark.sql.types import StringType, IntegerType, StructType, StructField, TimestampType, FloatType, ArrayType
   spark.catalog.setCurrentDatabase("persis")
   t= [{"id":"rlk2ljk24jt-dlkgj24t0rg","location":1,"reading":45,"record_ts": "2023-03-06T13:27:45Z"},{"id":"sdlkgj24230-drlgkj4","location":2,"reading":67,"record_ts":"2023-03-06T16:45:23Z"},{"id":"lkgj2434j-gt4l5k4kl5hj","location":3,"reading":15,"record_ts":"2023-03-06T12:45:33Z"}]
   df=spark.createDataFrame(t)
   df=df.withColumn("record_ts",df.record_ts.cast("timestamp")).withColumn("location",df.location.cast("int"))
   df.write.format("parquet").partitionBy("location").option("path","s3a://<bucket>/Airflow/DEV/LANDING/tempstream_3").saveAsTable("tempstream_3")
   spark.sql("select * from tempstream_3").show(10,False)
   
   >>> spark.sql("select * from tempstream_3").show(10,False)
   +-----------------------+-------+-------------------+--------+
   |id                     |reading|record_ts          |location|
   +-----------------------+-------+-------------------+--------+
   |rlk2ljk24jt-dlkgj24t0rg|45     |2023-03-06 13:27:45|1       |
   |lkgj2434j-gt4l5k4kl5hj |15     |2023-03-06 12:45:33|3       |
   |sdlkgj24230-drlgkj4    |67     |2023-03-06 16:45:23|2       |
   +-----------------------+-------+-------------------+--------+
   
   
   **Stacktrace**
   
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing tempstream_hudi
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
   	at org.apache.hudi.hive.HiveSyncTool.main(HiveSyncTool.java:359)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE IF NOT EXISTS `persis`.`tempstream_hudi`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `id` string, `reading` int, `record_ts` bigint) PARTITIONED BY (`location` int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ('hoodie.query.as.ro.table'='false','path'='s3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi/') STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi/' TBLPROPERTIES('spark.sql.sources.schema.partCol.0'='location','spark.sql.sources.schema.numParts'='1','spark.sql.sources.schema.numPartCols'='1','spark.sql.sources.provider'='hudi','spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"reading","type":"integer","nullable":true,"metadata":{}},{"name":"record_ts","type":"timestamp","nullable":true,"metadata":{}},{"name":"location","type":"integer","nullable":true,"metadata":{}}]}')
   	at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:70)
   	at org.apache.hudi.hive.ddl.QueryBasedDDLExecutor.createTable(QueryBasedDDLExecutor.java:92)
   	at org.apache.hudi.hive.HoodieHiveSyncClient.createTable(HoodieHiveSyncClient.java:188)
   	at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:279)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:217)
   	at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:154)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:142)
   	... 1 more
   Caused by: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot persist persis.tempstream_hudi into Hive metastore as table property keys may not start with 'spark.sql.': [spark.sql.sources.provider, spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts, spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0]
   	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:230)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
   	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:230)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:225)
   	at java.security.AccessController.doPrivileged(Native Method)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:239)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.spark.sql.AnalysisException: Cannot persist persis.tempstream_hudi into Hive metastore as table property keys may not start with 'spark.sql.': [spark.sql.sources.provider, spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts, spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0]
   	at org.apache.spark.sql.hive.HiveExternalCatalog.verifyTableProperties(HiveExternalCatalog.scala:137)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:249)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101)
   	at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:244)
   	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
   	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:373)
   	at org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:169)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
   	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
   	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
   	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
   	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
   	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
   	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:622)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
   	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
   	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:291)
   	... 16 more
   
   	at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)
   	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
   	at org.apache.hudi.hive.ddl.JDBCExecutor.runSQL(JDBCExecutor.java:68)
   	... 7 more
   




[GitHub] [hudi] ad1happy2go commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1538015308

   @pete91z Were you able to test it out?




[GitHub] [hudi] pete91z commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "pete91z (via GitHub)" <gi...@apache.org>.
pete91z commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1511358660

   Hi, my attention has been a bit diverted to other issues lately, but I should be able to re-test this week and update here. Thanks!
   




[GitHub] [hudi] danny0405 commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1464806799

   > Caused by: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot persist persis.tempstream_hudi into Hive metastore as table property keys may not start with 'spark.sql.': [spark.sql.sources.provider, spark.sql.sources.schema.partCol.0, spark.sql.sources.schema.numParts, spark.sql.sources.schema.numPartCols, spark.sql.sources.schema.part.0]
   at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.
   
   Not sure why the Hive sync went through the Hive thrift server; shouldn't it connect to the HMS directly? Did you try the hms mode instead?
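   
   For hms mode, the invocation would look roughly like the sketch below (flag names per the 0.12.x sync tool docs; the thrift URI is a placeholder, and hms mode needs a reachable Hive metastore thrift service):
   
   ./run_sync_tool.sh \
     --metastore-uris 'thrift://smaster:9083' \
     --sync-mode hms \
     --base-path 's3a://<bucket>/Airflow/DEV/LANDING/tempstream_hudi_2/' \
     --partitioned-by location \
     --database persis \
     --table tempstream_hudi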




[GitHub] [hudi] ad1happy2go commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1506828156

   @pete91z Did you get a chance to try the HMS option? Are you still facing this issue?




[GitHub] [hudi] pete91z commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "pete91z (via GitHub)" <gi...@apache.org>.
pete91z commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1463613215

   The workaround I'm using at the moment is to create the table in spark-sql, omitting the TBLPROPERTIES clause:
   
   CREATE EXTERNAL TABLE IF NOT EXISTS persis.tempstream_hudi( _hoodie_commit_time string, _hoodie_commit_seqno string, _hoodie_record_key string, _hoodie_partition_path string, _hoodie_file_name string, id string, reading bigint, record_ts string) PARTITIONED BY (location int) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' WITH SERDEPROPERTIES ('hoodie.query.as.ro.table'='false','path'='s3a:///Airflow/DEV/LANDING/tempstream_hudi_2/') STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a:///Airflow/DEV/LANDING/tempstream_hudi_2/';
   
   then adding the partitions manually:
   
   alter table tempstream_hudi add if not exists partition(location=1) LOCATION 's3a:///Airflow/DEV/LANDING/tempstream_hudi_2/1';
   
   alter table tempstream_hudi add if not exists partition(location=2) LOCATION 's3a:///Airflow/DEV/LANDING/tempstream_hudi_2/2';
   
   alter table tempstream_hudi add if not exists partition(location=3) LOCATION 's3a:///Airflow/DEV/LANDING/tempstream_hudi_2/3';
   
   HOWEVER - accessing the table via thrift / Hive metastore is not Hudi-aware, and a select query returns rows from all files (and therefore potentially duplicates), so I have to use window functions to show only the latest row versions.
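   
   The window-function dedup looks roughly like this (a sketch, not the verbatim query; it keeps the row with the latest _hoodie_commit_time per _hoodie_record_key):
   
   SELECT id, reading, record_ts, location
   FROM (
     SELECT t.*,
            -- rank the file versions of each record, newest commit first
            ROW_NUMBER() OVER (PARTITION BY _hoodie_record_key
                               ORDER BY _hoodie_commit_time DESC) AS rn
     FROM persis.tempstream_hudi t
   ) ranked
   WHERE rn = 1;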




[GitHub] [hudi] pete91z commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "pete91z (via GitHub)" <gi...@apache.org>.
pete91z commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1465355639

   Thanks for your reply. I'm not actually running the Hive Metastore service; the metastore DB for Spark is configured in a MySQL database, and the Spark thrift server runs on port 10000. I assumed that running the script with JDBC against the thrift server on port 10000 meant it would connect via JDBC and execute the CREATE SQL statement (it does try this, but fails with the above error; I can see it in the thrift log). I'll try the HMS option and let you know how that goes. Ultimately, what I'm trying to achieve is to add the Hudi table to the metastore DB catalog (similar to using CREATE TABLE in Spark SQL or dataframe.write.saveAsTable).
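   
   For reference, a rough sketch of doing the same registration at write time via the datasource hive-sync options (option names per the Hudi 0.12.x docs; the metastore URI is a placeholder and assumes a running metastore service), reusing df, hudi_options and basePath from the reproduction above:
   
   hive_sync_options = {
       'hoodie.datasource.hive_sync.enable': 'true',
       # 'hms' syncs through the metastore thrift service rather than JDBC
       'hoodie.datasource.hive_sync.mode': 'hms',
       # placeholder URI for the metastore service
       'hoodie.datasource.hive_sync.metastore.uris': 'thrift://smaster:9083',
       'hoodie.datasource.hive_sync.database': 'persis',
       'hoodie.datasource.hive_sync.table': 'tempstream_hudi',
       'hoodie.datasource.hive_sync.partition_fields': 'location',
   }
   df.write.format("hudi").options(**hudi_options).options(**hive_sync_options).mode("append").save(basePath)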




[GitHub] [hudi] pete91z commented on issue #8118: [SUPPORT] error in run_sync_tool.sh

Posted by "pete91z (via GitHub)" <gi...@apache.org>.
pete91z commented on issue #8118:
URL: https://github.com/apache/hudi/issues/8118#issuecomment-1539868435

   I re-tested last week, but I am still getting errors. The current workaround is to make the required DDL changes manually in spark-sql. Can I just double-check that what I'm trying to do with the sync tool is supported? I am not running a full Hive Metastore system, but I do have a Hive Metastore repo DB (running in MySQL). I am using the native Hive jars that come with Spark, and I start up the thrift server, which runs on port 10000.

