Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/01/07 05:33:12 UTC

[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-755894869


   > It is indeed a MOR table. Can you check your driver logs? You might find some exceptions around registering the _rt table. You can look for logs around the log message
   > 
   > "Trying to sync hoodie table "
   
   The error is as follows: there is no SQL creating the _rt table, only the _ro table. (Two sketches after the log below reconstruct the step that fails and the writer configuration.)
   ----------------------------------------------------------------------------------------------------------------------------------------------
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 371
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 337
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 404
   21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on bigdatadev03:18850 in memory (size: 73.0 KB, free: 2.5 GB)
   21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on bigdatadev02:6815 in memory (size: 73.0 KB, free: 3.5 GB)
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 333
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 418
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 385
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 410
   21/01/07 13:23:08 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:08 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
   21/01/07 13:23:10 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
   21/01/07 13:23:10 INFO ObjectStore: Initialized ObjectStore
   21/01/07 13:23:11 INFO HiveMetaStore: Added admin role in metastore
   21/01/07 13:23:11 INFO HiveMetaStore: Added public role in metastore
   21/01/07 13:23:11 INFO HiveMetaStore: No user is added in admin role, since config is empty
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_all_databases
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_all_databases	
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=default pat=*
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*	
   21/01/07 13:23:11 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=dw pat=*
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_functions: db=dw pat=*	
   21/01/07 13:23:11 INFO HiveSyncTool: Trying to sync hoodie table api_trade_ro with base path /data/stream/mor/api_trade of type MERGE_ON_READ
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_table : db=ads tbl=api_trade_ro	
   21/01/07 13:23:11 INFO HoodieHiveClient: Found the last compaction commit as Option{val=null}
   21/01/07 13:23:11 INFO HoodieHiveClient: Found the last delta commit Option{val=[20210107132154__deltacommit__COMPLETED]}
   21/01/07 13:23:12 INFO HoodieHiveClient: Reading schema from /data/stream/mor/api_trade/dt=2021-01/350a9a01-538c-4a7e-8c17-09d2cdc85073-0_0-20-85_20210107132154.parquet
   21/01/07 13:23:12 INFO HiveSyncTool: Hive table api_trade_ro is not found. Creating it
   21/01/07 13:23:12 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, `kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, `current_time` string, `kafka_key` string, `kafka_value` string,  `modified` string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/data/stream/mor/api_trade'
   21/01/07 13:23:12 INFO HoodieHiveClient: Executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, `kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, `current_time` string, `kafka_key` string, `kafka_value` string,  `modified` string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/data/stream/mor/api_trade'
   21/01/07 13:24:09 INFO HiveSyncTool: Schema sync complete. Syncing partitions for api_trade_ro
   21/01/07 13:24:09 INFO HiveSyncTool: Last commit time synced was found to be null
   21/01/07 13:24:09 INFO HoodieHiveClient: Last commit time synced is not known, listing all partitions in /data/stream/mor/api_trade,FS :DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-398057260_1, ugi=root (auth:SIMPLE)]]
   21/01/07 13:24:09 INFO HiveSyncTool: Storage partitions scan complete. Found 1
   21/01/07 13:24:09 INFO HiveMetaStore: 0: get_partitions : db=ads tbl=api_trade_ro
   21/01/07 13:24:09 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_partitions : db=ads tbl=api_trade_ro	
   21/01/07 13:24:10 INFO HiveSyncTool: New Partitions [dt=2021-01]
   21/01/07 13:24:10 INFO HoodieHiveClient: Adding partitions 1 to table api_trade_ro
   21/01/07 13:24:10 INFO HoodieHiveClient: Executing SQL ALTER TABLE `ads`.`api_trade_ro` ADD IF NOT EXISTS   PARTITION (`dt`='2021-01') LOCATION '/data/stream/mor/api_trade/dt=2021-01' 
   21/01/07 13:24:33 INFO HiveSyncTool: Changed Partitions []
   21/01/07 13:24:33 INFO HoodieHiveClient: No partitions to change for api_trade_ro
   21/01/07 13:24:33 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_table : db=ads tbl=api_trade_ro	
   21/01/07 13:24:33 ERROR HiveSyncTool: Got runtime exception when hive syncing
   org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210107132154
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:658)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:128)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:91)
   	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
   	at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
   	at com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:196)
   	at com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:163)
   	at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
   	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
   	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
   Caused by: NoSuchObjectException(message:ads.api_trade_ro table not found)
   	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
   	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
   	at com.sun.proxy.$Proxy41.get_table(Unknown Source)
   	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
   	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
   	at com.sun.proxy.$Proxy42.getTable(Unknown Source)
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:654)
   	... 48 more
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Shutting down the object store...
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=Shutting down the object store...	
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Metastore shutdown complete.
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=Metastore shutdown complete.	
   21/01/07 13:24:33 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> ads, hoodie.insert.shuffle.parallelism -> 10, path -> /data/stream/mor/api_trade, hoodie.datasource.write.precombine.field -> modified, hoodie.datasource.hive_sync.partition_fields -> dt, hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> api_trade, hoodie.index.type -> GLOBAL_BLOOM, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable -> true, hoodie.datasource.write.recordkey.field -> id, hoodie.table.name -> api_trade, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://0.0.0.0:10000, hoodie.datasource.write.table.type -> MERGE_ON_READ, hoodie.datasource.write.hive_style_partitioning -> true, hoodie.datasource.query.type -> snapshot, hoodie.bloom.index.update.partition.path -
   ----------------------------------------------------------------------------------------------------------------------------------------------
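   For context on the stack trace above: after creating the _ro table over JDBC, HiveSyncTool reads the table back through a Hive metastore client and stamps the last synced commit time as a table property, and it is that getTable call that throws NoSuchObjectException here. A rough Scala sketch of the failing step, assuming the standard Hive metastore client API and the property key Hudi uses for this stamp (this is not Hudi's exact code):

   import org.apache.hadoop.hive.conf.HiveConf
   import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

   object HiveSyncSketch {
     def updateLastCommitTimeSynced(db: String, tbl: String, commit: String): Unit = {
       val client = new HiveMetaStoreClient(new HiveConf())
       // Throws NoSuchObjectException when this client's metastore cannot see the
       // table -- for example, if the CREATE TABLE ran over JDBC against one
       // HiveServer2 (jdbc:hive2://0.0.0.0:10000 in the options below) while this
       // HiveConf points at a different (here, apparently embedded) metastore.
       val table = client.getTable(db, tbl)
       // "last_commit_time_sync" is, to my understanding, the property key Hudi uses.
       table.getParameters.put("last_commit_time_sync", commit)
       client.alter_table(db, tbl, table)
       client.close()
     }
   }

   Since the sync aborts at this step for api_trade_ro, that would explain why the log never reaches a "Trying to sync hoodie table api_trade_rt" line and no _rt CREATE statement appears.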
   
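   For reference, a minimal Scala sketch of the streaming write that produces the options in the DefaultSource line above (the Hudi option keys and values shown are a subset copied from that line; the Kafka source, projection, and checkpoint location are placeholders, not taken from the actual job):

   import org.apache.spark.sql.{DataFrame, SparkSession}

   object ApiTradeStreamSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("ApiTradeStreamSketch").getOrCreate()

       // Placeholder source; the real job parses Kafka records into the columns
       // listed in the CREATE TABLE statement above (id, modified, dt, ...).
       val source = spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092") // placeholder
         .option("subscribe", "api_trade")                 // placeholder topic
         .load()

       source.writeStream
         .option("checkpointLocation", "/tmp/api_trade_ckpt") // placeholder
         .foreachBatch { (batch: DataFrame, batchId: Long) =>
           // Matches the path seen in the stack trace: a batch write to Hudi from
           // inside foreachBatch, with hive sync enabled on each micro-batch.
           batch.write
             .format("hudi")
             .option("hoodie.table.name", "api_trade")
             .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
             .option("hoodie.datasource.write.operation", "upsert")
             .option("hoodie.datasource.write.recordkey.field", "id")
             .option("hoodie.datasource.write.precombine.field", "modified")
             .option("hoodie.datasource.write.hive_style_partitioning", "true")
             .option("hoodie.index.type", "GLOBAL_BLOOM")
             .option("hoodie.datasource.hive_sync.enable", "true")
             .option("hoodie.datasource.hive_sync.database", "ads")
             .option("hoodie.datasource.hive_sync.table", "api_trade")
             .option("hoodie.datasource.hive_sync.partition_fields", "dt")
             .option("hoodie.datasource.hive_sync.partition_extractor_class",
               "org.apache.hudi.hive.MultiPartKeysValueExtractor")
             .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://0.0.0.0:10000")
             .mode("append")
             .save("/data/stream/mor/api_trade")
         }
         .start()
         .awaitTermination()
     }
   }

   With these options, hive sync should register both api_trade_ro and api_trade_rt once it completes without the exception above.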

